
iglob() has misleading documentation (does indeed store names internally) #66363

Open
roysmith mannequin opened this issue Aug 8, 2014 · 20 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes docs Documentation in the Doc dir type-feature A feature request or enhancement

Comments

@roysmith
Mannequin

roysmith mannequin commented Aug 8, 2014

BPO 22167
Nosy @gvanrossum, @stevendaprano, @bitdancer, @serhiy-storchaka, @akulakov
PRs
  • bpo-22167: Updated glob.iglob doc with details on memory use #25767
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2014-08-08.01:09:01.123>
    labels = ['3.7', '3.8', '3.9', '3.10', 'type-feature', 'docs']
    title = 'iglob() has misleading documentation (does indeed store names internally)'
    updated_at = <Date 2021-06-21.19:36:46.362>
    user = 'https://bugs.python.org/roysmith'

    bugs.python.org fields:

    activity = <Date 2021-06-21.19:36:46.362>
    actor = 'andrei.avk'
    assignee = 'docs@python'
    closed = False
    closed_date = None
    closer = None
    components = ['Documentation']
    creation = <Date 2014-08-08.01:09:01.123>
    creator = 'roysmith'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 22167
    keywords = ['patch']
    message_count = 18.0
    messages = ['225048', '225050', '225051', '225069', '225070', '258010', '312257', '312275', '370489', '370491', '370492', '370499', '370505', '370544', '371613', '371615', '371687', '396283']
    nosy_count = 9.0
    nosy_names = ['gvanrossum', 'roysmith', 'steven.daprano', 'r.david.murray', 'docs@python', 'Gumnos', 'serhiy.storchaka', 'Roger Erens', 'andrei.avk']
    pr_nums = ['25767']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue22167'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10']

    Linked PRs

    @roysmith
    Mannequin Author

    roysmith mannequin commented Aug 8, 2014

    For background, see:

    https://mail.python.org/pipermail/python-list/2014-August/676291.html

In a nutshell, the iglob() docs say, "Return an iterator which yields the same values as glob() without actually storing them all simultaneously." The problem is that, internally, it calls os.listdir(), which apparently *does* store the entire list in memory, defeating the whole purpose of iglob().

    I recognize that iglob() is not going to get fixed in 2.7, but at least the documentation should be updated to point out that it doesn't really do what it says it does. Or rather, it doesn't really not do what it says it doesn't :-)
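For illustration, the lazy behaviour a reader might expect from the docs can be sketched with os.scandir(), which returns an iterator, instead of os.listdir(), which materializes the whole listing. (iglob_lazy is a hypothetical helper, not the actual glob implementation.)

```python
import fnmatch
import os

def iglob_lazy(dirname, pattern):
    # Hypothetical sketch, not the real iglob(): os.scandir() yields
    # directory entries one at a time, whereas os.listdir() builds the
    # entire list of names in memory before returning.
    with os.scandir(dirname) as it:
        for entry in it:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.name
```

With this sketch, memory use stays constant no matter how many entries the directory holds.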

    @roysmith roysmith mannequin assigned docspython Aug 8, 2014
    @roysmith roysmith mannequin added docs Documentation in the Doc dir type-feature A feature request or enhancement labels Aug 8, 2014
    @stevendaprano
    Member

    I agree that the documentation could be improved, but it's not really *wrong*. Consider a glob like "spam/[abc]/*.txt". What iglob does is conceptually closer to:

    (1) generate the list of files matching "spam/a/*.txt" and yield them;
    (2) generate the list of files matching "spam/b/*.txt" and yield them;
    (3) generate the list of files matching "spam/c/*.txt" and yield them

    rather than:

    (1) generate the list of files matching "spam/a/*.txt";
    (2) append the files matching "spam/b/*.txt";
    (3) append the files matching "spam/c/*.txt";
    (4) finally yield them

    (see the source code here: http://hg.python.org/cpython/file/3.4/Lib/glob.py ). I think the documentation is trying to say that iglob doesn't *always* store all the matching files, without implying that it *never* stores all the matching files. I can't think of a clean way to explain that, so a doc patch is welcome.
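The per-directory behaviour described above can be modelled roughly as follows (a toy model, not the real glob source; iglob_model is a hypothetical name):

```python
import fnmatch
import os

def iglob_model(dirnames, pattern):
    # Toy model of the behaviour described above: each directory's
    # listing is materialized, and its matches are yielded, before the
    # next directory is examined.  At most one directory's worth of
    # names is held in memory at a time.
    for d in dirnames:
        names = os.listdir(d)  # one directory's worth at a time
        for name in fnmatch.filter(names, pattern):
            yield os.path.join(d, name)
```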

    @roysmith
    Mannequin Author

    roysmith mannequin commented Aug 8, 2014

    The thread that led to this started out with the use case of a directory that had 200k files in it. If I ran iglob() on that and discovered that it had internally generated a list of all 200k names in memory at the same time, I would be pretty darn surprised, based on what the docs say now.

    We're shooting for principle of least astonishment here.

    @roysmith
    Mannequin Author

    roysmith mannequin commented Aug 8, 2014

    How about something like this:

    Note: The current iglob() implementation is optimized for the case of many files distributed in a large directory tree. Internally, it iterates over the directory tree, and stores all the names from each directory at once. This will lead to pathologically inefficient behavior when any individual directory has a large number of files in it.

    @bitdancer
    Member

    IMO the documentation isn't *wrong*, just misleading :)

What it is saying is that *your program* doesn't have to store the full list returned by iglob before being able to use it (i.e., iglob doesn't return a list). It says nothing about what resources are used internally, other than an implied contract that there is *some* efficiency over calling glob; which, as explained above, there is. The fact that the implementation uses lots of memory if any single directory is large is then a performance bug, which can theoretically be fixed in 3.5 using scandir.

    The reason iglob was introduced, if you check the revision history, is that glob used to call itself recursively for each sub-directory, which meant it held *all* of the files in *all* of the scanned tree in memory at one time. It is literally true that the difference between glob and iglob is that with iglob your program doesn't have to store the full list of matches from all subdirectories, but talking about "your program" is not something we typically do in python docs, it is implied.

    Perhaps in 2.7/3.4 we can mention in the module docs that at most one directory's worth of data will be held in memory during the globbing process, but it feels a little weird to document an implementation detail like that. Still, if someone can come up with improved wording for the docs, we can add it.

    @gvanrossum
    Member

    Once http://bugs.python.org/issue25596 (switching glob to use scandir) is solved this issue can be closed IMO.

    @RogerErens
    Mannequin

    RogerErens mannequin commented Feb 16, 2018

    http://bugs.python.org/issue25596 has been closed...

    @serhiy-storchaka
    Member

    Unfortunately bpo-25596 didn't change anything about this issue. iglob() still stores the names (actually DirEntry objects) of all files in a directory before it starts yielding the first of them. Otherwise we could exceed the limit of open file descriptors.

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes labels May 31, 2020
    @gvanrossum
    Member

    Serhiy, what do you mean by "otherwise we could run out of file descriptors"? I looked a bit at the code; there are different kinds of algorithms involved for different forms of patterns, and the code also takes vastly different paths for recursive matches.

    I found one bit of code that looks like it could be improved, with some effort: _glob1(). This constructs a list of all files in one directory and then filters them. That could be a problem if there are e.g. 100_000 files in one directory. To fix, we could implement fnmatch.ifilter(), which would be like fnmatch.filter() but use yield name instead of result.append(name); then _glob1() could be rewritten as follows (untested):

    def _glob1(dirname, pattern, dironly):
        names = _iterdir(dirname, dironly)
        if not _ishidden(pattern):
            yield from fnmatch.ifilter((x for x in names if not _ishidden(x)), pattern)
        else:
            yield from fnmatch.ifilter(names, pattern)

    Thoughts? Would this increase the number of open file descriptors in some edge case?
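The proposed fnmatch.ifilter() could be sketched like this (hypothetical; the real fnmatch.filter() also normalizes case and compiles the pattern once up front, while this sketch calls fnmatch.fnmatch() per name for brevity):

```python
import fnmatch

def ifilter(names, pattern):
    # Hypothetical fnmatch.ifilter(): like fnmatch.filter(), but yields
    # matches lazily instead of appending them to a result list.
    for name in names:
        if fnmatch.fnmatch(name, pattern):
            yield name
```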

    @serhiy-storchaka
    Member

    Yes, for the pattern 'a*/b*/c*' you will have an open file descriptor for every component with metacharacters:

        for a in scandir('.'):
            if fnmatch(a.name, 'a*'):
                for b in scandir(a.path):
                    if fnmatch(b.name, 'b*'):
                        for c in scandir(b.path):
                            if fnmatch(c.name, 'c*'):
                                yield c.path

    You can have hundreds of nested directories. That looks acceptable, because by default the limit on the number of file descriptors is 1024 on Linux. But imagine you run a server that handles tens of requests simultaneously. Some or all of them will fail, and not just return an error, but return an incorrect result, because all OSError exceptions, including "Too many open files", are silenced in glob().

    Also, none of these file descriptors will be closed until you finish the iteration or, in case of error, until the garbage collector closes them (because interrupted generators tend to create reference loops).

    So it is vital to close the file descriptor before you open other file descriptors in the recursion.

    @gvanrossum
    Member

    Hm, yeah.

    Perhaps we can add some buffering so that for directories with a small
    number of files (say < 1000) the FD is closed before recursing into it,
    while for large directories it keeps the FD open until the last block of
    1000 DirEntries is read? It's unlikely that someone has a tree that's both
    deep *and* wide at many levels, since that would mean an inordinate number
    of files indeed.
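The buffering idea might be sketched like this (hypothetical helper; the block size is illustrative). If a directory holds fewer entries than one block, the scandir iterator is exhausted, and its descriptor closed, before the first entry is yielded, so callers can recurse without accumulating open descriptors; only for large directories does the descriptor stay open across yields:

```python
import itertools
import os

def iter_entries(path, block=1000):
    # Read directory entries in blocks of `block`.  Small directories
    # are fully consumed (closing the FD) before anything is yielded;
    # large directories keep the FD open until the last block is read.
    with os.scandir(path) as it:
        while True:
            chunk = list(itertools.islice(it, block))
            if not chunk:
                return
            yield from chunk
```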

    @serhiy-storchaka
    Member

    This is an interesting idea, but I'm afraid it will complicate the code too much, destroying the remnants of the initial elegant design. I am going to try experimenting with this idea after implementing other features, so it will not block them.

    For now we need to document that iglob() may collect some names internally before starting to emit them. If my English were better I would do it myself.

    @gvanrossum
    Member

    I hope some volunteer will submit a doc PR.

    In the meantime, throwing out one more idea: perhaps my first idea, to make
    _glob1() an iterator, could work if we only do it for leaves of the
    pattern, so for the a*/b*/c* example, only for the c* part.

    @serhiy-storchaka
    Member

    Brilliant idea! I played with it yesterday, and it is easy to generalize it to work with a*/b*/c*/d/e/f and to "use not more than N simultaneously opened file descriptors per glob iterator". The only problem is if we want to use this idea with recursive "**" without losing performance. It is just a technical problem which can be solved by adding more code. Although I need to do some benchmarks.

    And I need to test how this idea will work with bpo-38144.

    @gvanrossum
    Member

    How's this going?

    @serhiy-storchaka
    Member

    I am going to add the bpo-38144 feature first. Then maybe implement a dir_fd based optimization.

    @gvanrossum
    Member

    Sounds good.

    FWIW, and totally off-topic, I find it annoying that pathlib's .glob() method supports ** in patterns, but its cousin .match() does not:

    >>> p = pathlib.Path("Lib/test/support/os_helper.py")
    >>> p.match("Lib/**/*.py")
    False
    >>>

    @akulakov
    Contributor

    I have put up a PR here: #25767

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 2, 2024
    @serhiy-storchaka
    Member

    #117472 makes iglob() less eager. It no longer stores the content of the directory if that is not needed. As a result, it starts yielding results earlier (the difference depends on the directory size), but the total iteration time increases.

    For example (the first line is main, the second line is the PR):

    $ ./python -m timeit -s 'from glob import iglob' 'next(iglob("Lib/*.py"))'
    2000 loops, best of 5: 184 usec per loop
    5000 loops, best of 5: 80.1 usec per loop
    $ ./python -m timeit -s 'from glob import glob' 'glob("Lib/*.py")'
    1000 loops, best of 5: 286 usec per loop
    1000 loops, best of 5: 303 usec per loop
    $ ./python -m timeit -s 'from glob import iglob' 'next(iglob("Lib/*/*.py"))'
    1000 loops, best of 5: 274 usec per loop
    1000 loops, best of 5: 199 usec per loop
    $ ./python -m timeit -s 'from glob import glob' 'glob("Lib/*/*.py")'
    100 loops, best of 5: 2.17 msec per loop
    100 loops, best of 5: 2.22 msec per loop
    

    @serhiy-storchaka
    Member

    It works not only when the last path component contains wildcards, but also when a literal last path component follows wildcard components.

    $ ./python -m timeit -s 'from glob import iglob' 'next(iglob("Lib/*/__pycache__"))'
    2000 loops, best of 5: 147 usec per loop
    5000 loops, best of 5: 89.9 usec per loop
    $ ./python -m timeit -s 'from glob import glob' 'glob("Lib/*/__pycache__")'
    1000 loops, best of 5: 308 usec per loop
    1000 loops, best of 5: 346 usec per loop
    
