> Big dirs are really slow to read at once. If user wants to read items one by one like here

The problem is that readdir doesn't read directory entries one at a time.
When you call readdir on an open DIR * for the first time, libc calls the getdents syscall, requesting a whole batch of dentries at once (it passes a 32768-byte buffer on my box).
Subsequent readdir calls are then virtually free and involve no syscall/IO at all, until you hit the last cached dentry; at that point another getdents is performed, and so on until the end of the directory.
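
To make the buffering concrete, here is a toy model in Python. It is only a sketch of the idea, not the actual libc code: the class name, the `batch_size` parameter and the `syscalls` counter are all made up for illustration, and the "directory" is just an in-memory list.

```python
from collections import deque


class BufferedDirReader:
    """Toy model of readdir-style buffering (hypothetical, not libc's code)."""

    def __init__(self, entries, batch_size):
        self._entries = list(entries)  # stands in for the on-disk directory
        self._batch_size = batch_size  # stands in for the getdents buffer capacity
        self._pos = 0                  # how far the simulated kernel has read
        self._buffer = deque()         # entries fetched but not yet handed out
        self.syscalls = 0              # number of simulated getdents calls

    def _getdents(self):
        """Simulated syscall: fetch the next batch of entries (empty at EOF)."""
        self.syscalls += 1
        batch = self._entries[self._pos:self._pos + self._batch_size]
        self._pos += len(batch)
        return batch

    def readdir(self):
        """Return the next entry, or None at end of directory."""
        if not self._buffer:
            # Only this path would touch the kernel; every other call is
            # served straight from the in-memory buffer.
            self._buffer.extend(self._getdents())
            if not self._buffer:
                return None
        return self._buffer.popleft()


reader = BufferedDirReader((f"file{i}" for i in range(1000)), batch_size=100)
names = []
while (name := reader.readdir()) is not None:
    names.append(name)
print(len(names), "entries read with", reader.syscalls, "simulated getdents calls")
```

In this model, the number of simulated syscalls grows with the number of batches rather than with the number of entries, which is the point being made about readdir.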

> Also, dir_cache in kernel used more effectively.

You mean the dcache? Could you elaborate?

> also, forgot... memory usage on big directories using list is a pain.

This would indeed be a good reason. Do you have numbers?
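
One possible way to get such numbers is sketched below. It assumes tracemalloc is an acceptable proxy (it only tracks memory allocated through Python's allocators), and the helper name `listdir_peak_memory`, the file count and the file naming scheme are arbitrary choices for illustration.

```python
import os
import tempfile
import tracemalloc


def listdir_peak_memory(path):
    """Return (entry count, peak traced bytes) for one os.listdir(path) call."""
    tracemalloc.start()
    try:
        names = os.listdir(path)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return len(names), peak


# Build a throwaway directory with many empty files (count chosen arbitrarily).
with tempfile.TemporaryDirectory() as tmp:
    n_files = 2000
    for i in range(n_files):
        open(os.path.join(tmp, f"f{i:06d}"), "w").close()
    count, peak = listdir_peak_memory(tmp)
    print(f"{count} entries, peak traced memory: {peak} bytes")
```

Running it with increasing values of `n_files` would show how the size of the returned list scales with the directory size.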

> A generator listdir() geared towards performance should probably be able to work in batches, e.g. read 100 entries at once and buffer them in some internal storage (that might mean use readdir_r()).

That's exactly what readdir is doing :-)

> Bonus points if it doesn't release the GIL around each individual entry, but also batches that.

Yes, since only the readdir calls that actually trigger a getdents can block (roughly one per buffer-full of entries), that could be a nice optimization (I have no idea of the potential gain, though).

> Big dirs are really slow to read at once.

Are you using EXT3?
There are records of performance issues with getdents on EXT2/3 filesystems, see:
and this nice post by Linus:

Could you provide the output of "strace -ttT python <test script>" (and also the time spent in os.listdir)?
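
In case it helps, a minimal test script could look something like this. It is just a sketch: `TARGET` is a placeholder to be replaced with the path of the big directory, and `time_listdir` is a hypothetical helper name.

```python
import os
import time

# Placeholder: replace with the path of the large directory being tested.
TARGET = "."


def time_listdir(path):
    """Return (entry count, elapsed seconds) for one os.listdir(path) call."""
    start = time.perf_counter()
    names = os.listdir(path)
    elapsed = time.perf_counter() - start
    return len(names), elapsed


count, elapsed = time_listdir(TARGET)
print(f"os.listdir({TARGET!r}): {count} entries in {elapsed:.6f} s")
```

Run under strace, the getdents lines in the trace can then be compared against the total time reported by the script.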
