msg315254 - (view) |
Author: Ben FrantzDale (Ben FrantzDale) |
Date: 2018-04-13 18:38 |
The sortedness of glob.glob's output is platform-dependent. While the docs do not mention sorting, and so are strictly correct, if you are on a platform where its output is sorted, it's easy to believe that the output is always sorted.
I propose we add a note, maybe next to "Note: Using the “**” pattern in large directory trees may consume an inordinate amount of time.", that says "Note: While the output of glob.glob may be sorted on some architectures, ordering is not guaranteed. Use `sorted(glob.glob(...))` if ordering is important."
This wrong assumption burned us when scripts inexplicably stopped working on OSX High Sierra.
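For illustration, a minimal sketch of the workaround the proposed note describes. The helper name `sorted_glob` and the throwaway file names are assumptions made for this example; they are not part of the `glob` module.

```python
import glob
import os
import tempfile


def sorted_glob(pattern, recursive=False):
    """Hypothetical helper: return glob matches in a deterministic order.

    glob.glob() makes no ordering promise, so the sort is done explicitly.
    """
    return sorted(glob.glob(pattern, recursive=recursive))


# Demonstration in a throwaway directory, with files created out of order.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.txt", "c.txt"):
        open(os.path.join(tmp, name), "w").close()

    matches = sorted_glob(os.path.join(tmp, "*.txt"))
    base_names = [os.path.basename(path) for path in matches]

print(base_names)  # ['a.txt', 'b.txt', 'c.txt']
```

Sorting the returned list costs little next to the directory scan itself, and it removes the dependence on whatever order the file system happens to produce.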
|
msg315259 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2018-04-13 19:34 |
This seems reasonable. I would like it to be part of the regular text rather than appearing as a big ".. note::" entry, which can be visually distracting from the core functionality.
|
msg315273 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2018-04-13 23:55 |
> The sortedness of glob.glob's output is platform-dependent.
It's typically file-system dependent (e.g. NTFS, FAT, ISO9660, UDF) -- at least on Windows. NTFS and ISO9660 store directories in sorted order based on the filename (Unicode or ASCII ordinal sort).
|
msg315275 - (view) |
Author: Ben FrantzDale (Ben FrantzDale) |
Date: 2018-04-14 00:09 |
Fascinating. That seems like an even wilder gotcha: It sounds like a script assuming sorted results would work in one directory (on one filesystem) but not on another. Or even weirder, if I had a mounted scratch partition, the script could work until I (or a sys admin) mounts a larger drive with a different filesystem on the same mountpoint. Yikes! Either way, this gotcha seems worth mentioning explicitly.
|
msg315545 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2018-04-21 00:36 |
How about adding a sentence to the end of the first paragraph? For example:
glob.glob(pathname, *, recursive=False)
Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell). Whether or not the results are sorted depends on the file system.
|
msg315701 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2018-04-24 13:41 |
Are there such notes in the descriptions of os.listdir(), os.scandir(), os.walk(), os.fwalk() and the corresponding Path methods? If we explicitly document the sorting, this should be done for all file-enumerating functions.
|
msg315702 - (view) |
Author: Ben FrantzDale (Ben FrantzDale) |
Date: 2018-04-24 14:15 |
Great point. Looks like the phrase is "in arbitrary order" in the docs for
those (both 2.7 and 3), which is better than saying nothing. I'd still
prefer a bit more specificity about the potential gotcha since "arbitrary"
seems a lot less deterministic than "some file systems will give you sorted
order, some won't".
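As a rough sketch of what "arbitrary order" means in practice for the other enumerating functions, the snippet below imposes an explicit order on os.listdir(), os.scandir() and os.walk(). The wrapper name `walk_sorted` and the directory layout are assumptions made for this example only.

```python
import os
import tempfile


def walk_sorted(top):
    """Hypothetical wrapper: like os.walk(top), but in sorted order."""
    for dirpath, dirnames, filenames in os.walk(top):
        # With the default topdown=True, sorting dirnames in place
        # also controls the order in which os.walk descends.
        dirnames.sort()
        filenames.sort()
        yield dirpath, dirnames, filenames


with tempfile.TemporaryDirectory() as tmp:
    for sub in ("zeta", "alpha"):
        os.mkdir(os.path.join(tmp, sub))
        for name in ("2.txt", "1.txt"):
            open(os.path.join(tmp, sub, name), "w").close()

    # os.listdir() and os.scandir() return entries in arbitrary order,
    # so sort the names explicitly.
    listed = sorted(os.listdir(tmp))
    with os.scandir(tmp) as entries:
        scanned = sorted(entry.name for entry in entries)

    walked = [
        (os.path.relpath(dirpath, tmp), filenames)
        for dirpath, _dirnames, filenames in walk_sorted(tmp)
    ]

print(listed)
print(scanned)
print(walked)
```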
|
msg315710 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2018-04-24 17:35 |
I agree that anything with the same file-system-determined ordering behavior should get the same note, for the same reason. Ben, can you test? Eryk, can you enlighten us further?
PS: Ben, when responding by email, please delete the quote, as it is duplicate noise on the web page.
|
msg315748 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2018-04-25 15:47 |
As I said, some file systems such as NTFS and ISO 9660 (or Joliet) store directories in lexicographically sorted order. NTFS does this using a b-tree and case-insensitive comparison, which helps the driver efficiently implement filtering a directory listing using a pattern such as "spam*eggs?.txt". (Filtering of a directory listing at the syscall level is peculiar to Windows and not supported by Python.)
I like the phrase "arbitrary order". I don't think it's wise for an application to ever depend on the order. Also, we usually want natural-language collation for display purposes (e.g. spam2.txt should come before spam10.txt), so we have to sort the result regardless of the file system.
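To make the spam2.txt / spam10.txt point concrete, here is a minimal natural-sort sketch. The key function `natural_key` is an assumption written for this example; it is not a standard-library function and it ignores locale-specific collation rules.

```python
import re


def natural_key(name):
    """Hypothetical sort key: compare digit runs as integers.

    re.split() with a capturing group alternates text and digit runs,
    so odd indices are always the digit runs.
    """
    parts = re.split(r"(\d+)", name)
    return [int(part) if index % 2 else part.casefold()
            for index, part in enumerate(parts)]


file_names = ["spam10.txt", "spam2.txt", "spam1.txt"]

plain = sorted(file_names)                     # code point order
natural = sorted(file_names, key=natural_key)  # numeric-aware order

print(plain)    # ['spam1.txt', 'spam10.txt', 'spam2.txt']
print(natural)  # ['spam1.txt', 'spam2.txt', 'spam10.txt']
```

Either way the application sorts the result itself, which is why the order returned by the file system should not matter.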
|
msg316071 - (view) |
Author: Ben FrantzDale (Ben FrantzDale) |
Date: 2018-05-02 14:03 |
I looked into it a bit more, with Python 2.7 on macOS High Sierra on APFS (Encrypted) with a FAT32 thumb drive. I have a directory that glob.glob('/Volumes/thumb/tmp/*') shows as sorted. I cp -r that to /tmp with bash; glob.glob('/tmp/tmp/*') is now not sorted. I then cp -r /tmp/tmp /Volumes/thumb/tmp1. Then glob.glob('/Volumes/thumb/tmp/*') shows a different order, but if I cp -r /Volumes/thumb/tmp/ /Volumes/thumb/tmp2 then glob.glob('/Volumes/thumb/tmp2/*') is sorted by file name just like glob.glob('/Volumes/thumb/tmp/*'). I'm not sure what that's saying other than that glob.glob can return things out of order on FAT32. It appears that glob.glob's ordering agrees with that of ls -f ("unsorted").
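A check like the one described above can be scripted. This is only a sketch: the helper names `is_sorted` and `report_order` are made up for this example, and the result depends entirely on the file system under the pattern, so no expected output is shown.

```python
import glob


def is_sorted(paths):
    """Return True if every path compares <= the path that follows it."""
    return all(a <= b for a, b in zip(paths, paths[1:]))


def report_order(pattern):
    """Hypothetical helper: summarize whether glob's raw output is sorted."""
    matches = glob.glob(pattern)
    return {"count": len(matches), "sorted": is_sorted(matches)}


# Example call; substitute a pattern on the volume you want to test.
print(report_order("*"))
```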
|
msg316084 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2018-05-02 18:10 |
FAT inserts a new file entry in a directory at the first available position. (If it's a long filename, this could be up to 21 contiguous dirents for a combined long/short dirent set.) This means a directory listing is usually in the same order that files were added. One caveat is that dirents for deleted files may be reused once there are no more unused entries available in a cluster. (I'd expect this depends on the implementation. Also, this is less likely with a long filename, since it needs a large-enough contiguous block of dirents.) Given a volume with a 4 KiB cluster size, sans overhead there are 127 32-byte dirents in a cluster.
I used to have an MP3 player that used FAT32 and only played files in directory order, so I had to re-sort directories on disk after adding files. In Ubuntu Linux, I see there's a "fatsort" package that implements this. There's probably a build available for macOS.
|
|
Date | User | Action | Args |
2022-04-11 14:58:59 | admin | set | github: 77456 |
2018-11-04 14:51:19 | mdk | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved |
2018-05-02 18:10:08 | eryksun | set | messages: + msg316084 |
2018-05-02 14:03:26 | Ben FrantzDale | set | messages: + msg316071 |
2018-04-25 15:47:39 | eryksun | set | messages: + msg315748 |
2018-04-24 17:35:31 | terry.reedy | set | messages: + msg315710 |
2018-04-24 14:15:41 | Ben FrantzDale | set | messages: + msg315702 |
2018-04-24 13:41:32 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg315701 |
2018-04-24 13:34:46 | Elena.Oat | set | keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request6287 |
2018-04-21 00:36:55 | terry.reedy | set | nosy: + terry.reedy; messages: + msg315545 |
2018-04-14 00:09:48 | Ben FrantzDale | set | messages: + msg315275 |
2018-04-13 23:55:52 | eryksun | set | nosy: + eryksun; messages: + msg315273 |
2018-04-13 19:35:33 | rhettinger | set | nosy: + cheryl.sabella |
2018-04-13 19:34:47 | rhettinger | set | assignee: docs@python; components: + Documentation, - Library (Lib); versions: - Python 3.4, Python 3.5; keywords: + easy; nosy: + rhettinger, docs@python; messages: + msg315259; stage: needs patch |
2018-04-13 18:38:46 | Ben FrantzDale | create | |