There is no os.listdir() equivalent returning generator instead of list #55615
Comments
Big dirs are really slow to read at once. If the user wants to read items one by one like here: |
Do you have a proof for that claim? How big, and how really slow?
Also, how long does use(i) take, and what reduction (in percent) In short, I'm skeptical that there is an actual problem to be solved here. |
also, forgot... memory usage on big directories using list is a pain. This is the same thing as range() and xrange(). Why not add os.xlistdir()? P.S. |
A generator listdir() geared towards performance should probably be able to work in batches, e.g. read 100 entries at once and buffer them in some internal storage (that might mean use readdir_r()). Bonus points if it doesn't release the GIL around each individual entry, but also batches that. |
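The batching idea above can be sketched in pure Python. This is only an illustration, not the patch discussed in this issue: it assumes Python 3.6+ and relies on os.scandir(), the function this thread eventually led to. The helper name iter_dir_batches and the default batch size of 100 are hypothetical.

```python
import os
from itertools import islice


def iter_dir_batches(path, batch_size=100):
    """Yield lists of at most batch_size entry names from path.

    Hypothetical helper: the directory is consumed lazily, so at most one
    batch of names is held in memory at a time.
    """
    with os.scandir(path) as entries:
        while True:
            # islice pulls the next batch_size entries from the iterator.
            batch = [entry.name for entry in islice(entries, batch_size)]
            if not batch:
                return
            yield batch
```

A caller would loop over the batches and process each list of names before the next one is read.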
The problem is that readdir doesn't read a directory entry one at a time.
You mean the dcache ? Could you elaborate ?
This would indeed be a good reason. Do you have numbers ?
That's exactly what readdir is doing :-)
Yes, since only one in 2**15 readdir calls actually blocks, that could be a nice optimization (I've no idea of the potential gain though).
Are you using EXT3 ? Could you provide the output of an "strace -ttT python <test script>" (and also the time spent in os.listdir) ? |
Generator listdir() could be useful if I have a directory with several million files and I want to process just a hundred. |
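A minimal sketch of that use case, assuming Python 3.6+ and os.scandir() (which did not exist when this comment was written). The helper name first_entries is hypothetical. Only the first limit entries are pulled from the iterator; no list of the whole directory is built in Python.

```python
import os
from itertools import islice


def first_entries(path, limit=100):
    """Return the names of at most limit entries from path.

    Hypothetical helper: iteration stops after limit entries, and the
    order of entries is whatever the operating system returns.
    """
    with os.scandir(path) as entries:
        return [entry.name for entry in islice(entries, limit)]
```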
Glibc's readdir() and readdir_r() already do caching, so getdents() syscall is called only once on my '/etc' directory. Should we include another caching level in xlistdir() function? |
Why not create a generator like this? DIR *d;
struct dirent *entry, *e;
entry = malloc(offsetof(struct dirent, d_name) + pathconf(dirpath, _PC_NAME_MAX) + 1);
if (!entry)
raise Exception();
if (!(d = opendir(dirpath)))
{
free(entry);
raise IOException();
} for (;;) ------------------ |
There has been discussion of this before, but it must have been on one of the lists, (possibly py3k list) as searching tracker for 'listdir generator' only returns this. I believe I pointed out then that Microsoft C (also) has (did once) a 'nextdir' function. It's been so long that I forget details. I thought then and still do that listdir should (have) change (d) like range, map, and filter did, for same reasons. |
The reasons that applied to map and range don't apply to listdir(). The |
Can't we simply add os.xlistdir() leaving listdir() as is? |
-1 on going back through blah/xblah all over again. |
Only if an advantage can be demonstrated in a realistic application. |
Originally, I wanted the return value of listdir to be changed from a list to a generator. But then I thought about compatibility: it will break some code. For example, in SimpleHTTPServer:
list = os.listdir(path)
list.sort(key=lambda a: a.lower())
will not work. |
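The breakage comes from calling the list-only method sort() on the return value. As a sketch of code that would not care whether it receives a list or a generator, the built-in sorted() accepts any iterable. The helper name sorted_names and the sample file names are made up for illustration.

```python
def sorted_names(names):
    """Return names sorted case-insensitively; accepts any iterable."""
    return sorted(names, key=str.lower)


sample = ["b.txt", "A.txt", "c.txt"]  # made-up names for illustration
from_list = sorted_names(sample)
from_generator = sorted_names(name for name in sample)
```

Both calls produce the same sorted list, at the cost of materializing every name, so this keeps compatibility but gives up the memory benefit of a generator.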
We could, but someone must:
|
http://pastebin.com/NCGmfF49 - here's a kind of test (cached and uncached) |
Can you please be more explicit? What's the application in which you
This isn't really convincing - the test looks at all files, so it isn't
This is not a real-world application - there is no actual processing done. BTW, can you publish your xlistdir implementation somewhere? |
Tests show a 10 times smaller memory footprint during directory listing: 25 MB against 286 MB on a directory with 800K entries. |
http://lwn.net/Articles/216948/ To prove that readdir is a bad thing with a large number of items in a directory. Well, EXT4 has fixed some issues (http://ext2.sourceforge.net/2005-ols/paper-html/node3.html). But what about locking in the Linux kernel (VFS and ext4) code? Also, some conservative Linux distributions still use ext3. |
That's all independent of the issue at hand. Whether or not getdents is The issue at hand is whether xlistdir actually provides any advantages I stand by my claim that
b) If there is some real-world processing of the files (e.g. There are also good reasons *not* to add xlistdir, primarily to |
My benchmarks show that xlistdir() gives only a memory usage advantage on large directories. No speed gain so far - maybe my patch is wrong. |
I would regard this as Type: resource usage, instead of performance. Given enough RAM, loading the whole directory at once will likely be faster. The downsides of os.listdir: b) Using it in a GUI basically requires you to use threads if you may run into a dir with many files. Especially on a slow filesystem (network). Because you won't regain control until the whole thing is read. I would like to have an iterator version as well, but I also dislike another function (especially the "x" prefix). How about adding a keyword argument to select iterator behaviour? |
Changing the return type based on an argument is generally frowned upon |
This depends somewhat on the operating system. On Unix, doing os.stat
Hmm. In a GUI, you would typically want to sort the file names by
I still would like to see a demonstrable improvement in a real-world |
A directory with st_nlink==2 can contain any number of non-directory files. And one subdirectory if this directory is root. |
I'd like to take care of this at Python. At least for posix (someone else can deal with the windows side if they want). I just stumbled upon an extension module at work that someone wrote specifically because os.listdir consumed way too much ram by building up a huge list on large directories with tons of long filenames that it needed to process. (when your process is in charge of keeping that directory in check... the inability to process it once it grows too large simply isn't acceptable) |
IIRC Nick Coghlan had put a bit of work into this a few months ago as an |
Fair enough. I'm cool with scandir().
Yes, you're right. I "solved" this in BetterWalk with the solution you propose of returning a stat_result object with the fields it could get "for free" set, and the others set to None. So on Linux, you'd get a stat_result with only st_mode set (or None for DT_UNKNOWN), and all the other fields None. However -- st_mode is the one you're most likely to use, usually looking just for whether it's a file or directory. So calling code would look something like this:
files = []
dirs = []
for name, st in scandir(path):
    if st.st_mode is None:
        st = os.stat(os.path.join(path, name))
    if stat.S_ISDIR(st.st_mode):
        dirs.append(name)
    else:
        files.append(name)
Meaning you'd get the speed improvements 99% of the time (when st_mode was set), but if st_mode is None, you can call stat and handle errors and whatnot yourself.
Agreed. This is in the OS module after all, and there's tons of stuff that's OS-dependent in there. However, I think that doing something like the above, we can make it usable and performant on both Linux and Windows for use cases like walking directory trees.
The Windows scan directory functions (FindFirstFile/FindNextFile) return a *full* stat (or at least, as much info as you get from a stat in Windows). We *could* map them to a common type -- but I'm suggesting that common type might as well be "stat_result with None meaning not present". That way users don't have to learn a completely new type.
We could document any platform-specific stuff, and places where users could get bitten. But can you give me an example of where the stat_result-with-st_mode-or-None approach falls over completely? |
I think os.scandir is a case where we *want* a low level call that exposes everything we can retrieve efficiently about the directory entries given the underlying platform - not everything written in Python is written to be portable, especially when it comes to scripts rather than applications (e.g. given where I work, I write a fair bit of code that is Fedora/RHEL specific, and if that code happens to work anywhere else it's just a bonus rather than being of any specific value to me). This may mean that we just return an "info" object for each item, where the available info is explicitly platform specific. Agreed it can be an actual stat object, though. os.walk then becomes the cross-platform abstraction built on top of the low level scandir call (splitting files from directories is probably about all we can do consistently cross-platform without per-entry stat calls). |
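The split of files from directories mentioned above can be sketched with the os.scandir() API that later shipped in Python 3.5 (with context-manager support from 3.6). That API returns DirEntry objects, not the (name, stat) tuples discussed at this point in the thread, so this is an illustration of the idea and not the design being proposed here. The helper name split_entries is hypothetical.

```python
import os


def split_entries(path):
    """Return (dirs, files): names of the entries directly inside path.

    Hypothetical helper. DirEntry.is_dir() can usually answer from the
    information the directory scan already returned, and only falls back
    to a stat call when the platform did not provide the file type.
    """
    dirs, files = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    return dirs, files
```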
Well, that's easy:
size = 0
for name, st in scandir(path):
    if stat.S_ISREG(st.st_mode):
        size += st.st_size
Well, the nice thing is that we don't have to create yet another info We can probably use the DTTOIF macro to convert d_type to st_mode. |
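For readers unfamiliar with DTTOIF, here is a rough Python rendering of what the macro does in glibc: it shifts the d_type value left by 12 bits so that it lands in the file-type bits of st_mode. The DT_* constants below are the Linux/glibc values from dirent.h, written out by hand as an assumption because the stat module does not expose them. The helper names dttoif and mode_or_none are hypothetical.

```python
import stat

# d_type values as defined in <dirent.h> on Linux/glibc (assumed here;
# other platforms may differ or may not provide d_type at all).
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8
DT_LNK = 10


def dttoif(d_type):
    """Mirror glibc's DTTOIF: move d_type into the file-type bits of st_mode."""
    return d_type << 12


def mode_or_none(d_type):
    """Return a file-type-only st_mode, or None when the type is unknown."""
    return None if d_type == DT_UNKNOWN else dttoif(d_type)
```

The resulting value carries only the file type, with no permission bits, so it is enough for checks like stat.S_ISDIR() but not a full st_mode.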
I really like scandir() -> (name: str, stat: stat structure using None for I expect that this API to optimize use cases like:
But as usual, a benchmark on a real platform would be more convincing. Filtering entries in os.listdir() or os.scandir() would be faster (than |
Yeah, I very much agree with what Nick says -- we really want a way to expose what the platform provides. It's less important (though still the ideal), to expose that in a platform-independent way. Today the only way to get access to opendir/readdir on Linux and FindFirst/Next on Windows is by using a bunch of messy (and slowish) ctypes code. And yes, os.walk() would be the main cross-platform abstraction built on top of this. Charles gave this example of code that would fall over:
size = 0
for name, st in scandir(path):
    if stat.S_ISREG(st.st_mode):
        size += st.st_size
I don't see it, though. In this case you need both .st_mode and .st_size, so a caller would check that those are not None, like so:
size = 0
for name, st in scandir(path):
    if st.st_mode is None or st.st_size is None:
        st = os.stat(os.path.join(path, name))
    if stat.S_ISREG(st.st_mode):
        size += st.st_size
One line of extra code for the caller, but a big performance gain in most cases. Stinner said, "But as usual, a benchmark on a real platform would be more convincing". Here's a start: https://github.com/benhoyt/betterwalk#benchmarks -- but it's not nearly as good as it gets yet, because those figures are still using the ctypes version. I've got a C version that's half-finished, and on Windows it makes os.walk() literally 10x the speed of the default version. Not sure about Linux/opendir/readdir yet, but I intend to do that too. |
size = 0
for name, st in scandir(path):
    if st.st_mode is None or st.st_size is None:
        st = os.stat(os.path.join(path, name))
    if stat.S_ISREG(st.st_mode):
        size += st.st_size
It would be safer to use the dir_fd parameter when supported, but I don't |
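As a sketch of the dir_fd suggestion, the loop can stat each name relative to an open directory descriptor where the platform supports it, falling back to joined paths otherwise. This uses os.listdir() instead of the scandir() prototype from the thread, and the function name total_regular_size is hypothetical. Listing is still done by path here, so this only narrows the window for races and does not remove it.

```python
import os
import stat


def total_regular_size(path):
    """Sum st_size over the regular files directly inside path."""
    # os.supports_dir_fd tells us whether os.stat accepts dir_fd here.
    use_dir_fd = os.stat in os.supports_dir_fd
    dir_fd = os.open(path, os.O_RDONLY) if use_dir_fd else None
    try:
        size = 0
        for name in os.listdir(path):
            if use_dir_fd:
                # Resolve name relative to the already-open directory.
                st = os.stat(name, dir_fd=dir_fd)
            else:
                st = os.stat(os.path.join(path, name))
            if stat.S_ISREG(st.st_mode):
                size += st.st_size
        return size
    finally:
        if dir_fd is not None:
            os.close(dir_fd)
```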
Well, that's precisely the point. Now, if I'm the only one who finds this trick dangerous and ugly, you |
Don't worry, it sometimes happens :-) |
I don't think that's true in general, or true of how other Python APIs work. For instance, many APIs return a "file-like object", and you can only do certain things on that object, depending on what the documentation says, or what EAFP gets you. Some file-like objects don't support seek/tell, some don't support close, etc. I've seen plenty of walk-like-a-duck checks like this:
if hasattr(f, 'close'):
    f.close()
Anyway, my point boils down to:
|
Yes, I'm fully aware of duck-typing ;-) Please bring this up on python-dev. |
Actually I'm thinking this duck may only have a beak. Instead of a bunch of
|
Good idea. Thread started: http://mail.python.org/pipermail/python-dev/2013-May/126119.html |
Gregory, did you make any progress on this? |
Indeed! I'd like to see the feature in 3.4 so I can remove my own hack from our code base. |
I haven't had a chance to look at this since May. It'd still be a great addition. |
For reference the current state of things for this is the proposal in: With a prototype using a ctypes based implementation as proof of concept in https://github.com/benhoyt/scandir. A combination of that interface plus my existing scandir patch (-gps02) could be created for the final implementation. As 3.4beta1 happens tonight, this isn't going to make 3.4 so I'm bumping this to 3.5. I really like the proposed design outlined above. |
I'm with Martin and the other respondents who think this shouldn't be done. Without compelling timings, this smacks of feature creep. The platform specific issues may create an on-going maintenance problem. The feature itself is prone to misuse, leaving hard-to-find race condition bugs in its wake. |
Maybe the development should start outside the Python stdlib, on a project on
Raymond, there are very compelling timings/benchmarks for this -- not so much the original issue here (generator vs list, that's not really an issue) but having a scandir() function that returns the stat-like info from the OS so you don't need extra stat calls. This speeds up os.walk() by 7-20 times on Windows and 4-5 times on Linux. See more at: https://github.com/benhoyt/scandir#benchmarks I've written a draft PEP that I've sent to the PEP editors (if you're interested, it's at https://github.com/benhoyt/scandir/blob/master/PEP.txt). If any of the PEP editors are listening here ... would love some feedback on that at some stage. :-) Victor -- development has started outside the stdlib here: https://github.com/benhoyt/scandir and PyPI module here: https://pypi.python.org/pypi/scandir Both are being used by various people. |
"I've written a draft PEP that I've sent to the PEP editors (if you're interested, it's at https://github.com/benhoyt/scandir/blob/master/PEP.txt). If any of the PEP editors are listening here ... would love some feedback on that at some stage. :-)" Oh you wrote a PEP? Great! I pushed it to the PEP repository. It should be online in a few hours: PEP editors are still useful if you want direct access to the Mercurial repository to modify your PEP yourself. If you have a PEP, it's time to send it to the python-dev mailing list. Don't attach it to your mail, but copy the PEP into the body of your email for easier inline comments in replies. |
Thanks! Will post the PEP to python-dev in the next day or two. |
I suggest a pass through python-ideas first. python-ideas feedback tends to |
Nick -- sorry, already posted to python-dev before seeing your latest. However, I think it's the right place, as there's already been a fair bit of hashing this idea and API out on python-ideas first and then also python-dev. See links in the PEP here: http://legacy.python.org/dev/peps/pep-0471/#previous-discussion |
I haven't really followed, but now that the PEP is accepted, what is the progress on this one? |
Yes, PEP-471 has been accepted, and I've got a mostly-finished C implementation of os.scandir() for CPython 3.5, as well as tests and docs. If you want a sneak preview, see posixmodule_scandir*.c, test/test_scandir.py, and os.rst here: https://github.com/benhoyt/scandir It's working well on Windows, but the Linux version has a couple of tiny issues yet (core dumps ;-). Given that os.scandir() will solve this issue (as well as the bigger performance problem due to listdir throwing away file type info), can we close this issue and open another one to track the implementation of os.scandir() / PEP-471? |
This makes sense. Can you do it? |
Okay, I've opened http://bugs.python.org/issue22524, but I don't have the permissions to close this one, so could someone with bugs.python.org superpowers please do that? |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.