Issue 38764: Deterministic globbing.

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/82945

classification

Title:	Deterministic globbing.
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.9

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	brandtbucher, gvanrossum, mrabarnett, njs, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2019-11-11 02:31 by brandtbucher, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 17105	closed	brandtbucher, 2019-11-11 02:31

Messages (10)
msg356344 - (view)	Author: Brandt Bucher (brandtbucher) *	Date: 2019-11-11 02:31
This has been discussed before, but we now have examples in the news of glob's non-deterministic behavior causing some real headaches for hundreds of people in the scientific community. After some cursory discussion (https://discuss.python.org/t/a-code-glitch-may-have-caused-errors-in-more-than-100-published-studies/2583) it was suggested that this issue would at least be worth revisiting, given the damage... bad programming practices aside. The attached patch guarantees glob/iglob traversal order across all platforms, and modifies all of the existing tests to check for this. One reason this was rejected twice before (in both issue 21748, four years ago, and issue 30461, two years ago) was the argument that the user could always just call "sorted" on the returned list if they wanted ordered results. Aside from losing the laziness of iglob, this solution also loses the traversal order, since recursive globs use a breadth-first search. In the attached patch, iglob is still just as lazy as ever, and the traversal is unchanged for anyone using a platform that already returns sorted results.
msg356347 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-11-11 03:41
I saw the article as well, but think auto-sorting would have just provided a thin mask over their serious data pipeline bugs. Also, this isn't the only pathway to seeing file lists: os.listdir, fnmatch, etc. I say that we leave it alone and not unnecessarily introduce a breadth-first, recursive sort. The reasons for the previous rejections still apply. For the most part our tools that access os services only provide a thin pass-through and tend to neither promise nor do anything extra (even SQL only gives sorted data when the user explicitly requests ORDER By -- the could have provided a default sort but choose not to).
msg356351 - (view)	Author: Nathaniel Smith (njs) *	Date: 2019-11-11 04:26
> I saw the article as well, but think auto-sorting would have just provided a thin mask over their serious data pipeline bugs. This seems like an inappropriately elitist attitude. I'm sure their code has bugs; so does mine and yours. But in fact they did test their code thoroughly, demonstrated that it produced correct results on their system, and it would have produced correct results for their downstream users too if Python's 'glob' had behaved consistently across platforms. Python isn't just for folks who can "git gud" to meet your arbitrary standards. In addition, auto-sorting has a number of ergonomic benefits, beyond this one case: it makes output more deterministic, makes it easy to estimate progress given just a list of filenames being processed ("hmm, this script has been running for ten minutes and it's only on the Cs, I guess this will take a few hours"), and so on. > Also, this isn't the only pathway to seeing file lists: os.listdir, fnmatch, etc. os.listdir is a genuinely low-level OS-level tool, as indicated by the name, versus 'glob' which is designed as a convenient interface for high-level scripting. fnmatch doesn't deal with either files or lists, so I'm not sure why you're bringing it up here. > The reasons for the previous rejections still apply. The reasons I see in the previous issues are: - Sorting adds overhead - It's hard to implement, esp. if iglob uses scandir internally - iglob can't sort, so glob should be consistent - it's easy to sort after the fact But these are all false. - The overhead added by sorting is negligible. In some informal benchmarks I measured it as ~4% extra CPU time in the worst case, if you're (a) running on Linux with it's ultra-optimized VFS caching, (b) all the directories are already cached in RAM, (c) you're not actually doing anything with the files except calling glob() and then discarding the result. Of course, it's always possible my benchmarks missed an important case; if you have benchmarks that show otherwise please post them so we can discuss. - The implementation is trivial, as shown from the PR – literally just replacing two calls to 'list' with calls to 'sorted'. - iglob being incremental doesn't actually pose any obstacles to sorting. Even if it's using scandir(), it still has to load each individual directory list into memory in one batch, to avoid the risk of leaking file descriptors. - It's actually not trivial to sort filenames in the natural order afterwards, because you have to split each filename into components. It's also asymptotically slower than doing the sort as-you-go, O(total-files log total-files), versus O(total-files log maximum-individual-directory). Given that there's at least one extremely clear case where not-sorting caused substantial harm, that there are lots of other minor cases where it's a nice win, and that the costs are negligible, sorting seems like the obviously correct default.
msg356352 - (view)	Author: Brandt Bucher (brandtbucher) *	Date: 2019-11-11 04:27
I disagree somewhat with the assessment that glob provides "thin" access to OS services. It is composed of a few low-level utilities, but it exposes them through what I consider to be a fairly high-level, abstract, friendly interface that (aside from ordering, and some symlink stuff) is surprisingly cross-platform. I originally felt that this was "their own fault", since the behavior is well-documented and the data pipeline was so susceptible to error. But I started thinking more about the fact that the bad globbing made it into peer-reviewed cancer research over a hundred times. Python is a friendly language and glob is a friendly library, so it would be natural for someone without formal engineering experience to reach for both. I'm becoming more and more surprised that glob exposes this type of quirk, for trivial gain. I respect your opinion though. Another option would be to leave the patch in, but not document the new traversal order. That way we rid ourselves of a class of user errors without exposing implementation details like the type of search used.
msg356356 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-11-11 09:51
If sort the output of glob(), we should sort also the output of Path.glob(), listdir(), and maybe walk().
msg356370 - (view)	Author: Brandt Bucher (brandtbucher) *	Date: 2019-11-11 17:00
I'm not sure listdir and walk should be sorted, for the reasons Raymond mentioned. And I was actually surprised to learn today that pathlib.Path.glob doesn't use the glob module internally. Was there some reasoning behind this decision? Separator handling, perhaps? Either way, these seem to be out of the scope of this issue... not every change needs to be a repo-wide omnibus! I think incremental improvements like the proposed one are often fine.
msg356384 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-11-11 20:27
> This seems like an inappropriately elitist attitude. Please drop the personal attacks and abusive tone. The tracker is for technical discussions.
msg356385 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-11-11 20:43
ISTM that if glob had been sorted, it would have only thinly masked the data pipeline issues that were encountered. The code would still have been easily broken by someone renaming, adding, or removing a file. In their case, it was the semantic context of the files where the application order was important, the actual filenames were incidental to the bug. Also having user depend on sorting may lead to other bugs as well. 1) Sorts in one version of Python but not another, 2) Sorts in some functions but not others, and 3) The sort is text based so that file1, file2, ..., file11, file12 don't sort as expected. FWIW, I didn't put a strong -1. However, the "sorting of discovered children" is a warning sign that we're doing too much.
msg356391 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2019-11-11 21:48
I could also add: would sorting be case-sensitive or case-insensitive? Windows is case-insensitive, Linux is case-sensitive.
msg356504 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2019-11-12 23:02
Let's not do this, for all the reasons brought up.

History
Date	User	Action	Args
2022-04-11 14:59:23	admin	set	github: 82945
2019-11-12 23:02:14	gvanrossum	set	status: open -> closed nosy: + gvanrossum messages: + msg356504 resolution: wont fix stage: patch review -> resolved
2019-11-11 21:48:13	mrabarnett	set	nosy: + mrabarnett messages: + msg356391
2019-11-11 20:43:51	rhettinger	set	nosy: rhettinger, njs, serhiy.storchaka, brandtbucher messages: + msg356385
2019-11-11 20:27:19	rhettinger	set	messages: + msg356384
2019-11-11 17:00:47	brandtbucher	set	messages: + msg356370
2019-11-11 09:51:20	serhiy.storchaka	set	messages: + msg356356
2019-11-11 04:37:29	xtreak	set	nosy: + serhiy.storchaka
2019-11-11 04:27:17	brandtbucher	set	messages: + msg356352
2019-11-11 04:26:19	njs	set	messages: + msg356351
2019-11-11 03:41:44	rhettinger	set	nosy: + rhettinger messages: + msg356347
2019-11-11 02:31:58	brandtbucher	set	keywords: + patch stage: patch review pull_requests: + pull_request16611
2019-11-11 02:31:05	brandtbucher	create