classification
Title: iglob() has misleading documentation (does indeed store names internally)
Type: enhancement
Stage:
Components: Documentation
Versions: Python 2.7

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Gumnos, docs@python, gvanrossum, r.david.murray, roysmith, steven.daprano
Priority: normal
Keywords:

Created on 2014-08-08 01:09 by roysmith, last changed 2016-01-11 20:41 by gvanrossum.

Messages (6)
msg225048 - Author: Roy Smith (roysmith) Date: 2014-08-08 01:09
For background, see:

https://mail.python.org/pipermail/python-list/2014-August/676291.html

In a nutshell, the iglob() docs say, "Return an iterator which yields the same values as glob() without actually storing them all simultaneously."  The problem is that, internally, it calls os.listdir(), which *does* build the entire list of names for a directory in memory, defeating the whole purpose of iglob().

I recognize that iglob() is not going to get fixed in 2.7, but at least the documentation should be updated to point out that it doesn't really do what it says it does.  Or rather, it doesn't really not do what it says it doesn't :-)
msg225050 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-08-08 01:59
I agree that the documentation could be improved, but it's not really *wrong*. Consider a glob like "spam/[abc]/*.txt". What iglob does is conceptually closer to:

(1) generate the list of files matching "spam/a/*.txt" and yield them;
(2) generate the list of files matching "spam/b/*.txt" and yield them;
(3) generate the list of files matching "spam/c/*.txt" and yield them

rather than:

(1) generate the list of files matching "spam/a/*.txt";
(2) append the files matching "spam/b/*.txt";
(3) append the files matching "spam/c/*.txt";
(4) finally yield them

(see the source code here: http://hg.python.org/cpython/file/3.4/Lib/glob.py ). I think the documentation is trying to say that iglob doesn't *always* store all the matching files, without implying that it *never* stores all the matching files. I can't think of a clean way to explain that, so a doc patch is welcome.
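
The difference between the two step lists above can be sketched roughly as follows. This is only an illustration, not the code in glob.py: the helper names `eager_matches` and `lazy_matches` are hypothetical, and the sketch assumes the directories have already been expanded from the pattern.

```python
import fnmatch
import os


def eager_matches(dirs, pattern):
    """Second variant: gather matches from every directory, then hand back the lot."""
    results = []
    for d in dirs:
        for name in os.listdir(d):
            if fnmatch.fnmatch(name, pattern):
                results.append(os.path.join(d, name))
    return results


def lazy_matches(dirs, pattern):
    """First variant: yield matches directory by directory."""
    for d in dirs:
        # os.listdir() still returns the complete list of names for this
        # one directory, so a single huge directory is held in memory at once.
        for name in os.listdir(d):
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(d, name)
```

In the lazy variant, peak memory is bounded by the largest single directory rather than by the total number of matches, which is the distinction the documentation is trying to draw.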
msg225051 - Author: Roy Smith (roysmith) Date: 2014-08-08 02:28
The thread that led to this started out with the use case of a directory that had 200k files in it.  If I ran iglob() on that and discovered that it had internally generated a list of all 200k names in memory at the same time, I would be pretty darn surprised, based on what the docs say now.

We're shooting for the principle of least astonishment here.
msg225069 - Author: Roy Smith (roysmith) Date: 2014-08-08 13:03
How about something like this:

Note: The current iglob() implementation is optimized for the case of many files distributed in a large directory tree.  Internally, it iterates over the directory tree, and stores all the names from each directory at once.  This will lead to pathologically inefficient behavior when any individual directory has a large number of files in it.
msg225070 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-08-08 13:22
IMO the documentation isn't *wrong*, just misleading :)

What it is saying is that *your program* doesn't have to store the full list returned by iglob before being able to use it (i.e., iglob doesn't return a list).  It says nothing about what resources are used internally, other than an implied contract that there is *some* efficiency gain over calling glob, which, as explained above, there is.  The fact that the implementation uses lots of memory if any single directory is large is then a performance bug, which can theoretically be fixed in 3.5 using scandir.

The reason iglob was introduced, if you check the revision history, is that glob used to call itself recursively for each sub-directory, which meant it held *all* of the files in *all* of the scanned tree in memory at one time.  It is literally true that the difference between glob and iglob is that with iglob your program doesn't have to store the full list of matches from all subdirectories.  However, talking about "your program" is not something we typically do in the Python docs; it is implied.

Perhaps in 2.7/3.4 we can mention in the module docs that at most one directory's worth of data will be held in memory during the globbing process, but it feels a little weird to document an implementation detail like that.  Still, if someone can come up with improved wording for the docs, we can add it.
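
From the caller's side, the point that iglob doesn't return a list can be shown with a small sketch. The helper name `first_matches` is hypothetical; only `glob.iglob()` and `itertools.islice()` are standard library calls.

```python
import glob
import itertools


def first_matches(pattern, limit):
    """Return at most `limit` paths matching `pattern`.

    glob.iglob() returns an iterator, so the caller can stop early
    without ever holding the complete list of matches itself.
    """
    return list(itertools.islice(glob.iglob(pattern), limit))
```

This says nothing about what the implementation holds internally while producing each match; it only shows that the calling program never needs the full result list.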
msg258010 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-11 20:41
Once http://bugs.python.org/issue25596 (switching glob to use scandir) is solved, this issue can be closed IMO.
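
For readers unfamiliar with scandir, a minimal sketch of why it helps with a single very large directory. This is not the patch from issue25596; the function name `iter_matching_names` is hypothetical, and the `with` form of `os.scandir()` assumes Python 3.6 or later.

```python
import fnmatch
import os


def iter_matching_names(directory, pattern):
    """Yield paths in `directory` whose names match `pattern`.

    os.scandir() returns an iterator of directory entries, so Python
    does not first build a complete list of names as os.listdir() does.
    """
    with os.scandir(directory) as entries:  # context manager: Python 3.6+
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.path
```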
History
Date                 User            Action  Args
2016-01-11 20:41:40  gvanrossum      set     nosy: + gvanrossum; messages: + msg258010
2014-08-08 13:22:33  r.david.murray  set     nosy: + r.david.murray; messages: + msg225070
2014-08-08 13:03:05  roysmith        set     messages: + msg225069
2014-08-08 02:28:02  roysmith        set     messages: + msg225051
2014-08-08 01:59:25  steven.daprano  set     nosy: + steven.daprano; messages: + msg225050
2014-08-08 01:24:52  Gumnos          set     nosy: + Gumnos
2014-08-08 01:09:01  roysmith        create