
classification
Title: pathlib.Path.glob's generator is not a real generator
Type: performance
Stage:
Components: IO
Versions: Python 3.11
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Elijah Rippeth, pitrou, serhiy.storchaka, terry.reedy
Priority: normal
Keywords:

Created on 2021-05-07 15:46 by Elijah Rippeth, last changed 2022-04-11 14:59 by admin.

Messages (3)
msg393190 - Author: Elijah Rippeth (Elijah Rippeth) Date: 2021-05-07 15:46
I have a directory with hundreds of thousands of text files. I wanted to explore one file, so I wrote the following code expecting it to happen basically instantaneously because of how generators work:

```python
from pathlib import Path

base_dir = Path("/path/to/lotta/files/")
files = base_dir.glob("*.txt")            # returns immediately
first_file = next(files)                  # does not return immediately
```

To my surprise, this took a long time to finish, even though I expected `next` on a generator to be O(1).

A colleague pointed me to the following code: https://github.com/python/cpython/blob/adcd2205565f91c6719f4141ab4e1da6d7086126/Lib/pathlib.py#L431

I assume the call to `list` is there to "freeze" a potentially changing directory, since `scandir` relies on `os.stat`, but it incurs a huge penalty and makes the generator return type a bit disingenuous. In any case, I think this is bug-worthy in some sense.
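
The behaviour described above can be reproduced without pathlib. The following is a minimal sketch, in which `source`, `eager_numbers`, and `lazy_numbers` are hypothetical names invented for illustration: a generator function that builds a list before its first `yield` still returns a generator object immediately, but the first `next` pays for the whole list.

```python
calls = []  # records every item the source produces, to measure work done

def source(n):
    """Stand-in for an expensive iterator such as a directory scan."""
    for i in range(n):
        calls.append(i)
        yield i

def eager_numbers(n):
    # Materializes the whole source before yielding anything,
    # similar in spirit to calling list() on a scandir iterator.
    items = list(source(n))
    for item in items:
        yield item

def lazy_numbers(n):
    # Yields items as the source produces them.
    for item in source(n):
        yield item

gen = eager_numbers(1000)   # returns immediately; no work has been done yet
assert calls == []
first = next(gen)           # consumes all 1000 source items
eager_work = len(calls)

calls.clear()
first_lazy = next(lazy_numbers(1000))  # consumes a single source item
lazy_work = len(calls)
```

Both functions are generators in the language sense; only the amount of work hidden behind the first `next` differs.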
msg393562 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2021-05-13 07:42
The reason is different. The scandir() iterator should be closed before we go recursively deep in the directory tree. Otherwise we can reach the limit of open file descriptors (especially if several glob()s are called in parallel). See issue22167.
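
The constraint described here can be sketched as follows. This is an illustration under stated assumptions, not the actual pathlib code: `walk_closing_first` is a hypothetical helper that reads each directory's entries into a list and closes the `os.scandir()` iterator before descending, so no more than one directory handle is open at a time, however deep the tree is.

```python
import os
import tempfile

def walk_closing_first(top):
    """Yield file paths under `top`, closing each scandir iterator before recursing."""
    with os.scandir(top) as it:
        entries = list(it)  # exhaust the iterator while the directory is open
    # The directory handle is closed here, before any recursion happens.
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            yield from walk_closing_first(entry.path)
        else:
            yield entry.path

# Small demonstration tree: root/a/b/c/file.txt
with tempfile.TemporaryDirectory() as root:
    deep = os.path.join(root, "a", "b", "c")
    os.makedirs(deep)
    with open(os.path.join(deep, "file.txt"), "w") as fh:
        fh.write("hello")
    found = [os.path.relpath(p, root) for p in walk_closing_first(root)]
```

The trade-off is that the first result from a directory becomes available only after that directory has been read in full, which is the latency the original report observed.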
msg394125 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2021-05-21 16:32
I agree that from the outside it seems slightly bizarre to make an internal list to implement a function documented as returning an iterator. However, list(scandir) was added by Serhiy in #26032 with the comment that it made globbing 1.5-4 times faster. That speedup applies, of course, only if one runs the iterator to completion, as is the normal use.

For your presented use case, I suggest something like the following:

next(f for f in os.scandir(path) if os.path.splitext(f)[1] == '.txt')
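
A self-contained sketch of this suggestion, assuming a throwaway temporary directory in place of `path`. The file names are made up for the demonstration, and `next` is given a default of `None` so that a directory with no match does not raise `StopIteration`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as path:
    # Create a few empty files; only one has the .txt suffix.
    for name in ("notes.txt", "image.png", "data.csv"):
        with open(os.path.join(path, name), "w") as fh:
            fh.write("")

    # os.scandir() yields entries lazily, so the search stops at the first match.
    with os.scandir(path) as it:
        first = next(
            (entry for entry in it if os.path.splitext(entry.name)[1] == ".txt"),
            None,
        )
    first_name = first.name if first is not None else None
```

Using `os.scandir()` as a context manager closes the directory handle promptly even though the iterator is abandoned after the first match.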
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:59:45 | admin | set | github: 88235 |
| 2021-05-21 16:32:39 | terry.reedy | set | nosy: + terry.reedy; messages: + msg394125; versions: - Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10 |
| 2021-05-13 07:42:27 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg393562 |
| 2021-05-13 04:01:19 | rhettinger | set | nosy: + pitrou |
| 2021-05-11 18:01:08 | Elijah Rippeth | set | versions: + Python 3.7, Python 3.8, Python 3.9, Python 3.10, Python 3.11 |
| 2021-05-07 15:46:22 | Elijah Rippeth | create | |