classification
Title: glob.glob should explicitly note that results aren't sorted
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.8, Python 3.7, Python 3.6, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Ben FrantzDale, cheryl.sabella, docs@python, eryksun, rhettinger, serhiy.storchaka, terry.reedy
Priority: normal Keywords: easy, patch

Created on 2018-04-13 18:38 by Ben FrantzDale, last changed 2018-11-04 14:51 by mdk. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 6587 merged Elena.Oat, 2018-04-24 13:34
Messages (11)
msg315254 - (view) Author: Ben FrantzDale (Ben FrantzDale) Date: 2018-04-13 18:38
The sortedness of glob.glob's output is platform-dependent. While the docs do not mention sorting, and so are strictly correct, if you are on a platform where its output is sorted, it's easy to believe that the output is always sorted.

I propose we add a Note, maybe next to "Note: Using the “**” pattern in large directory trees may consume an inordinate amount of time.", that says "Note: While the output of glob.glob may be sorted on some architectures, ordering is not guaranteed. Use `sorted(glob.glob(...))` if ordering is important."

This wrong assumption burned us when scripts inexplicably stopped working on OSX High Sierra.
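
For illustration, the workaround proposed above could look like the following minimal sketch. The helper name `sorted_glob` is hypothetical (it is not part of the glob module); it just wraps `glob.glob` in the built-in `sorted`, which yields ordinal (code point) ordering whatever the file system returns.

```python
import glob
import os
import tempfile


def sorted_glob(pattern, recursive=False):
    """Return glob matches in a deterministic (code point) order.

    Hypothetical helper for illustration; not part of the glob module.
    """
    return sorted(glob.glob(pattern, recursive=recursive))


# Demonstration in a throwaway directory; files are created out of order.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "c.txt", "a.txt"):
        open(os.path.join(tmp, name), "w").close()
    matches = [os.path.basename(p) for p in sorted_glob(os.path.join(tmp, "*.txt"))]
    print(matches)
```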
msg315259 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-04-13 19:34
This seems reasonable.  I would like it to be part of the regular text rather than appearing as a big .. note entry, which can be visually distracting from the core functionality.
msg315273 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2018-04-13 23:55
> The sortedness of glob.glob's output is platform-dependent.

It's typically file-system dependent (e.g. NTFS, FAT, ISO9660, UDF) -- at least on Windows. NTFS and ISO9660 store directories in sorted order based on the filename (Unicode or ASCII ordinal sort).
msg315275 - (view) Author: Ben FrantzDale (Ben FrantzDale) Date: 2018-04-14 00:09
Fascinating. That seems like an even wilder gotcha: It sounds like a script assuming sorted results would work in one directory (on one filesystem) but not on another. Or even weirder, if I had a mounted scratch partition, the script could work until I (or a sys admin) mounts a larger drive with a different filesystem on the same mountpoint. Yikes! Either way, this gotcha seems worth mentioning explicitly.
msg315545 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-04-21 00:36
How about adding a sentence to the end of the first paragraph?

 glob.glob(pathname, *, recursive=False)

    Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).  Whether or not the results are sorted depends on the file system.
msg315701 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-24 13:41
Are there such notes in the descriptions of os.listdir(), os.scandir(), os.walk(), os.fwalk() and the corresponding Path methods? If we explicitly document the sorting, this should be done for all file-enumerating functions.
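
Since the same caveat applies to the other enumeration functions, here is a rough sketch of how a caller might impose an order on each of them. All four wrapper names are hypothetical; only the underlying `os.listdir`, `os.scandir`, `os.walk` and `Path.iterdir` calls are real. The `os.walk` wrapper relies on the documented behavior that modifying `dirnames` in place affects traversal when walking top-down.

```python
import os
import tempfile
from pathlib import Path


def listdir_sorted(path):
    """Names from os.listdir, sorted by code point."""
    return sorted(os.listdir(path))


def scandir_sorted(path):
    """DirEntry objects from os.scandir, sorted by entry name."""
    with os.scandir(path) as entries:
        return sorted(entries, key=lambda entry: entry.name)


def iterdir_sorted(path):
    """Path objects from Path.iterdir, sorted using Path comparison."""
    return sorted(Path(path).iterdir())


def walk_sorted(top):
    """Top-down os.walk that visits directories and files in sorted order."""
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # in-place sort controls which subdirectory is visited next
        yield dirpath, dirnames, sorted(filenames)


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("z", "y"):
        os.mkdir(os.path.join(tmp, sub))
        open(os.path.join(tmp, sub, "f.txt"), "w").close()
    for name in ("b.txt", "a.txt"):
        open(os.path.join(tmp, name), "w").close()

    listed = listdir_sorted(tmp)
    scanned = [entry.name for entry in scandir_sorted(tmp)]
    iterated = [p.name for p in iterdir_sorted(tmp)]
    walked = [os.path.relpath(d, tmp) for d, _, _ in walk_sorted(tmp)]
    print(listed, walked)
```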
msg315702 - (view) Author: Ben FrantzDale (Ben FrantzDale) Date: 2018-04-24 14:15
Great point. Looks like the phrase is "in arbitrary order" in the docs for
those (both 2.7 and 3), which is better than saying nothing. I'd still
prefer a bit more specificity about the potential gotcha since "arbitrary"
seems a lot less deterministic than "some file systems will give you sorted
order, some won't".

msg315710 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-04-24 17:35
I agree that anything that has the same FS-determined sorted or not behavior should get the same note, for the same reason.  Ben, can you test?  Eryk, can you enlighten us further?

PS: Ben, when responding by email, please delete the quote, as it is duplicate noise on the web page.
msg315748 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2018-04-25 15:47
As I said, some file systems such as NTFS and ISO 9660 (or Joliet) store directories in lexicographically sorted order. NTFS does this using a b-tree and case-insensitive comparison, which helps the driver efficiently implement filtering a directory listing using a pattern such as "spam*eggs?.txt". (Filtering of a directory listing at the syscall level is peculiar to Windows and not supported by Python.)
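
For comparison, pattern filtering in Python happens in user space after the directory has been listed. A small sketch with the standard `fnmatch` module, using made-up file names, shows the same pattern applied to a list of names and then sorted explicitly:

```python
import fnmatch

# Made-up names standing in for a directory listing in arbitrary order.
names = ["spameggsx.txt", "spam.txt", "spam1eggs2.txt", "eggs.txt"]

# "*" matches any run of characters, "?" matches exactly one character.
matched = sorted(fnmatch.filter(names, "spam*eggs?.txt"))
print(matched)
```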

I like the phrase "arbitrary order". I don't think it's wise for an application to ever depend on the order. Also, we usually want natural-language collation for display purposes (e.g. spam2.txt should come before spam10.txt), so we have to sort the result regardless of the file system.
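
As a rough illustration of the display-ordering point, the sketch below uses a hypothetical `natural_key` function that splits names into text and number chunks. It is a simplification, not true locale-aware collation, but it does place spam2.txt before spam10.txt.

```python
import re

_DIGITS = re.compile(r"(\d+)")


def natural_key(name):
    """Sort key that compares digit runs as integers (hypothetical helper).

    re.split with a capturing group alternates text and digit chunks,
    so odd positions always hold the digit runs.
    """
    parts = _DIGITS.split(name)
    return [int(part) if index % 2 else part.casefold()
            for index, part in enumerate(parts)]


names = ["spam10.txt", "spam2.txt", "spam1.txt"]
ordinal_order = sorted(names)
natural_order = sorted(names, key=natural_key)
print(ordinal_order)
print(natural_order)
```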
msg316071 - (view) Author: Ben FrantzDale (Ben FrantzDale) Date: 2018-05-02 14:03
I looked into it a bit more, using Python 2.7 on macOS High Sierra with APFS (Encrypted) and a FAT32 thumb drive. I have a directory that glob.glob('/Volumes/thumb/tmp/*') shows as sorted. I cp -r that to /tmp with bash; glob.glob('/tmp/tmp/*') is now not sorted. I then cp -r /tmp/tmp /Volumes/thumb/tmp1. Then glob.glob('/Volumes/thumb/tmp/*') shows a different order, but if I cp -r /Volumes/thumb/tmp/ /Volumes/thumb/tmp2 then glob.glob('/Volumes/thumb/tmp2/*') is sorted by file name just like glob.glob('/Volumes/thumb/tmp/*'). I'm not sure what that's saying other than that glob.glob can return things out of order on FAT32. It appears that glob.glob's ordering agrees with that of ls -f ("unsorted").
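
An experiment like the one above could be scripted along these lines. This is only a sketch: `glob_order_report` is a hypothetical helper, and the temporary directory stands in for whatever mount point is being tested. Whether the result is sorted depends on the file system, so the script reports the order rather than assuming it.

```python
import glob
import os
import tempfile


def glob_order_report(directory):
    """Report the order glob.glob returns for `directory` (hypothetical helper)."""
    # glob.escape guards against special characters in the directory name itself.
    pattern = os.path.join(glob.escape(directory), "*")
    names = [os.path.basename(p) for p in glob.glob(pattern)]
    return {
        "glob_order": names,
        "glob_is_sorted": names == sorted(names),
    }


# Demonstration: files are created out of order in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("c.txt", "a.txt", "b.txt"):
        open(os.path.join(tmp, name), "w").close()
    report = glob_order_report(tmp)
    print(report)
```

To test a specific volume, pass its path (for example a directory on the thumb drive) instead of the temporary directory.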
msg316084 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2018-05-02 18:10
FAT inserts a new file entry in a directory at the first available position. (If it's a long filename, this could be up to 21 contiguous dirents for a combined long/short dirent set.) This means a directory listing is usually in the same order that files were added. One caveat is that dirents for deleted files may be reused once there are no more unused entries available in a cluster. (I'd expect this depends on the implementation. Also, this is less likely with a long filename, since it needs a large-enough contiguous block of dirents.) Given a volume with a 4 KiB cluster size, sans overhead there are 127 32-byte dirents in a cluster.

I used to have an MP3 player that used FAT32 and only played files in directory order, so I had to re-sort directories on disk after adding files. In Ubuntu Linux, I see there's a "fatsort" package that implements this. There's probably a build available for macOS.
History
Date User Action Args
2018-11-04 14:51:19 mdk set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2018-05-02 18:10:08 eryksun set messages: + msg316084
2018-05-02 14:03:26 Ben FrantzDale set messages: + msg316071
2018-04-25 15:47:39 eryksun set messages: + msg315748
2018-04-24 17:35:31 terry.reedy set messages: + msg315710
2018-04-24 14:15:41 Ben FrantzDale set messages: + msg315702
2018-04-24 13:41:32 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg315701
2018-04-24 13:34:46 Elena.Oat set keywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request6287
2018-04-21 00:36:55 terry.reedy set nosy: + terry.reedy
messages: + msg315545
2018-04-14 00:09:48 Ben FrantzDale set messages: + msg315275
2018-04-13 23:55:52 eryksun set nosy: + eryksun
messages: + msg315273
2018-04-13 19:35:33 rhettinger set nosy: + cheryl.sabella
2018-04-13 19:34:47 rhettinger set assignee: docs@python
components: + Documentation, - Library (Lib)
versions: - Python 3.4, Python 3.5
keywords: + easy
nosy: + rhettinger, docs@python
messages: + msg315259
stage: needs patch
2018-04-13 18:38:46 Ben FrantzDale create