classification
Title: Deterministic globbing.
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.9
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: brandtbucher, gvanrossum, mrabarnett, njs, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2019-11-11 02:31 by brandtbucher, last changed 2019-11-12 23:02 by gvanrossum. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 17105 closed brandtbucher, 2019-11-11 02:31
Messages (10)
msg356344 - (view) Author: Brandt Bucher (brandtbucher) * (Python triager) Date: 2019-11-11 02:31
This has been discussed before, but we now have examples in the news of glob's non-deterministic behavior causing some real headaches for hundreds of people in the scientific community. After some cursory discussion (https://discuss.python.org/t/a-code-glitch-may-have-caused-errors-in-more-than-100-published-studies/2583) it was suggested that this issue would at least be worth revisiting, given the damage... bad programming practices aside.

The attached patch guarantees glob/iglob traversal order across all platforms, and modifies all of the existing tests to check for this.

One reason this was rejected twice before (in both issue 21748, four years ago, and issue 30461, two years ago) was the argument that the user could always just call "sorted" on the returned list if they wanted ordered results. Aside from losing the laziness of iglob, this solution also loses the traversal order, since recursive globs use a breadth-first search. In the attached patch, iglob is still just as lazy as ever, and the traversal is unchanged for anyone using a platform that already returns sorted results.
msg356347 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-11-11 03:41
I saw the article as well, but think auto-sorting would have just provided a thin mask over their serious data pipeline bugs.

Also, this isn't the only pathway to seeing file lists:  os.listdir, fnmatch, etc.

I say that we leave it alone and not unnecessarily introduce a breadth-first, recursive sort.  The reasons for the previous rejections still apply.  For the most part our tools that access os services only provide a thin pass-through and tend to neither promise nor do anything extra (even SQL only gives sorted data when the user explicitly requests ORDER By -- the could have provided a default sort but choose not to).
msg356351 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2019-11-11 04:26
> I saw the article as well, but think auto-sorting would have just provided a thin mask over their serious data pipeline bugs.

This seems like an inappropriately elitist attitude. I'm sure their code has bugs; so does mine and yours. But in fact they did test their code thoroughly, demonstrated that it produced correct results on their system, and it would have produced correct results for their downstream users too if Python's 'glob' had behaved consistently across platforms. Python isn't just for folks who can "git gud" to meet your arbitrary standards.

In addition, auto-sorting has a number of ergonomic benefits, beyond this one case: it makes output more deterministic, makes it easy to estimate progress given just a list of filenames being processed ("hmm, this script has been running for ten minutes and it's only on the Cs, I guess this will take a few hours"), and so on.

> Also, this isn't the only pathway to seeing file lists:  os.listdir, fnmatch, etc.

os.listdir is a genuinely low-level OS-level tool, as indicated by the name, versus 'glob' which is designed as a convenient interface for high-level scripting.

fnmatch doesn't deal with either files or lists, so I'm not sure why you're bringing it up here.

> The reasons for the previous rejections still apply.

The reasons I see in the previous issues are:

- Sorting adds overhead
- It's hard to implement, esp. if iglob uses scandir internally
- iglob can't sort, so glob should be consistent
- it's easy to sort after the fact

But these are all false.

- The overhead added by sorting is negligible. In some informal benchmarks I measured it as ~4% extra CPU time in the worst case, if you're (a) running on Linux with it's ultra-optimized VFS caching, (b) all the directories are already cached in RAM, (c) you're not actually doing anything with the files except calling glob() and then discarding the result. Of course, it's always possible my benchmarks missed an important case; if you have benchmarks that show otherwise please post them so we can discuss.

- The implementation is trivial, as shown from the PR – literally just replacing two calls to 'list' with calls to 'sorted'.

- iglob being incremental doesn't actually pose any obstacles to sorting. Even if it's using scandir(), it still has to load each individual directory list into memory in one batch, to avoid the risk of leaking file descriptors.

- It's actually not trivial to sort filenames in the natural order afterwards, because you have to split each filename into components. It's also asymptotically slower than doing the sort as-you-go, O(total-files log total-files), versus O(total-files log maximum-individual-directory).

Given that there's at least one extremely clear case where not-sorting caused substantial harm, that there are lots of other minor cases where it's a nice win, and that the costs are negligible, sorting seems like the obviously correct default.
msg356352 - (view) Author: Brandt Bucher (brandtbucher) * (Python triager) Date: 2019-11-11 04:27
I disagree somewhat with the assessment that glob provides "thin" access to OS services. It is composed of a few low-level utilities, but it exposes them through what I consider to be a fairly high-level, abstract, friendly interface that (aside from ordering, and some symlink stuff) is surprisingly cross-platform.

I originally felt that this was "their own fault", since the behavior is well-documented and the data pipeline was so susceptible to error. But I started thinking more about the fact that the bad globbing made it into peer-reviewed cancer research over a hundred times. Python is a friendly language and glob is a friendly library, so it would be natural for someone without formal engineering experience to reach for both. I'm becoming more and more surprised that glob exposes this type of quirk, for trivial gain.

I respect your opinion though. Another option would be to leave the patch in, but not document the new traversal order. That way we rid ourselves of a class of user errors without exposing implementation details like the type of search used.
msg356356 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-11 09:51
If sort the output of glob(), we should sort also the output of Path.glob(), listdir(), and maybe walk().
msg356370 - (view) Author: Brandt Bucher (brandtbucher) * (Python triager) Date: 2019-11-11 17:00
I'm not sure listdir and walk should be sorted, for the reasons Raymond mentioned. And I was actually surprised to learn today that pathlib.Path.glob doesn't use the glob module internally. Was there some reasoning behind this decision? Separator handling, perhaps?

Either way, these seem to be out of the scope of this issue... not every change needs to be a repo-wide omnibus! I think incremental improvements like the proposed one are often fine.
msg356384 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-11-11 20:27
> This seems like an inappropriately elitist attitude.

Please drop the personal attacks and abusive tone.  The tracker is for technical discussions.
msg356385 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-11-11 20:43
ISTM that if glob had been sorted, it would have only thinly masked the data pipeline issues that were encountered.  The code would still have been easily broken by someone renaming, adding, or removing a file.  In their case, it was the semantic context of the files where the application order was important, the actual filenames were incidental to the bug.

Also having user depend on sorting may lead to other bugs as well.  1) Sorts in one version of Python but not another, 2) Sorts in some functions but not others, and 3) The sort is text based so that file1, file2, ..., file11, file12 don't sort as expected.

FWIW, I didn't put a strong -1.  However, the "sorting of discovered children" is a warning sign that we're doing too much.
msg356391 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-11-11 21:48
I could also add: would sorting be case-sensitive or case-insensitive? Windows is case-insensitive, Linux is case-sensitive.
msg356504 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2019-11-12 23:02
Let's not do this, for all the reasons brought up.
History
Date User Action Args
2019-11-12 23:02:14gvanrossumsetstatus: open -> closed

nosy: + gvanrossum
messages: + msg356504

resolution: wont fix
stage: patch review -> resolved
2019-11-11 21:48:13mrabarnettsetnosy: + mrabarnett
messages: + msg356391
2019-11-11 20:43:51rhettingersetnosy: rhettinger, njs, serhiy.storchaka, brandtbucher
messages: + msg356385
2019-11-11 20:27:19rhettingersetmessages: + msg356384
2019-11-11 17:00:47brandtbuchersetmessages: + msg356370
2019-11-11 09:51:20serhiy.storchakasetmessages: + msg356356
2019-11-11 04:37:29xtreaksetnosy: + serhiy.storchaka
2019-11-11 04:27:17brandtbuchersetmessages: + msg356352
2019-11-11 04:26:19njssetmessages: + msg356351
2019-11-11 03:41:44rhettingersetnosy: + rhettinger
messages: + msg356347
2019-11-11 02:31:58brandtbuchersetkeywords: + patch
stage: patch review
pull_requests: + pull_request16611
2019-11-11 02:31:05brandtbuchercreate