There is no os.listdir() equivalent returning generator instead of list #55615

Closed
socketpair mannequin opened this issue Mar 5, 2011 · 77 comments
Labels: stdlib (Python modules in the Lib dir) · type-feature (A feature request or enhancement)

Comments

@socketpair
Mannequin

socketpair mannequin commented Mar 5, 2011

BPO 11406
Nosy @loewis, @Yhg1s, @rhettinger, @terryjreedy, @gpshead, @ncoghlan, @pitrou, @vstinner, @giampaolo, @tiran, @tjguk, @merwok, @Trundle, @benhoyt, @Bluehorn, @ethanfurman, @socketpair, @MojoVampire
Superseder
  • bpo-22524: PEP 471 implementation: os.scandir() directory scanning function
  • Files
  • issue11406-gps01.diff: Implements os.scandir(path='.') that iterates
  • issue11406-gps02.diff: implements os.scandir(path='.') that iterates
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2014-09-30.14:47:15.649>
    created_at = <Date 2011-03-05.10:17:58.318>
    labels = ['type-feature', 'library']
    title = 'There is no os.listdir() equivalent returning generator instead of list'
    updated_at = <Date 2014-09-30.14:47:15.647>
    user = 'https://github.com/socketpair'

    bugs.python.org fields:

    activity = <Date 2014-09-30.14:47:15.647>
    actor = 'ncoghlan'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2014-09-30.14:47:15.649>
    closer = 'ncoghlan'
    components = ['Library (Lib)']
    creation = <Date 2011-03-05.10:17:58.318>
    creator = 'socketpair'
    dependencies = []
    files = ['29568', '29638']
    hgrepos = []
    issue_num = 11406
    keywords = ['patch']
    message_count = 77.0
    messages = ['130111', '130112', '130114', '130125', '130129', '130167', '130256', '130286', '130391', '130392', '130414', '130415', '130426', '130428', '130440', '130467', '130485', '130530', '130545', '130552', '130554', '130713', '130714', '130715', '155407', '183610', '183611', '183612', '183617', '183618', '183620', '185167', '185168', '185169', '185174', '185180', '185673', '185674', '185676', '185679', '185723', '188254', '188256', '188260', '188293', '188298', '188403', '188425', '188431', '188432', '188433', '188434', '188443', '188468', '188478', '188492', '188493', '188495', '188500', '188566', '188896', '200757', '200758', '200800', '204083', '221618', '221634', '221645', '221647', '221648', '221659', '221661', '227879', '227882', '227929', '227934', '227936']
    nosy_count = 22.0
    nosy_names = ['loewis', 'twouters', 'rhettinger', 'terry.reedy', 'gregory.p.smith', 'ncoghlan', 'pitrou', 'vstinner', 'giampaolo.rodola', 'christian.heimes', 'tim.golden', 'eric.araujo', 'Trundle', 'benhoyt', 'torsten', 'nvetoshkin', 'neologix', 'abacabadabacaba', 'ethan.furman', 'nailor', 'socketpair', 'josh.r']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = '22524'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue11406'
    versions = ['Python 3.5']

    @socketpair
    Mannequin Author

    socketpair mannequin commented Mar 5, 2011

Big dirs are really slow to read at once. If the user wants to read items one by one, like here:
--------------
for i in os.listdir():
    use(i)
--------------
having a generator will improve performance, as big directories are often very fragmented on disk. Also, the kernel's dir_cache is used more effectively.
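
For illustration, the lazy pattern being asked for here is essentially what os.scandir() (added later via PEP 471, see the end of this thread) provides; a minimal sketch on current Python:

    import os

    def use(name):
        print(name)  # stand-in for the caller's real per-entry work

    # Entries arrive lazily; the full listing is never built as a Python list.
    with os.scandir(".") as it:
        for entry in it:
            use(entry.name)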

    @socketpair socketpair mannequin added performance Performance or resource usage stdlib Python modules in the Lib dir labels Mar 5, 2011
    @loewis
    Mannequin

    loewis mannequin commented Mar 5, 2011

    Big dirs are really slow to read at once.

Do you have proof for that claim? How big, and how slow?

for i in os.listdir():
    use(i)

    Also, how long does use(i) take, and what reduction (in percent)
    can you gain from listdir iterating?

    In short, I'm skeptical that there is an actual problem to be solved here.

    @socketpair
    Mannequin Author

    socketpair mannequin commented Mar 5, 2011

    also, forgot... memory usage on big directories using list is a pain.

This is the same thing as range() and xrange(). Why not add os.xlistdir()?

    P.S.
    Numerical answers will be available later.

    @pitrou
    Member

    pitrou commented Mar 5, 2011

    A generator listdir() geared towards performance should probably be able to work in batches, e.g. read 100 entries at once and buffer them in some internal storage (that might mean use readdir_r()). Bonus points if it doesn't release the GIL around each individual entry, but also batches that.
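
A rough pure-Python illustration of that batching idea; fetch_batch here is a hypothetical low-level helper (a real implementation would be C code wrapping readdir()/readdir_r()), not an existing API:

    from typing import Callable, Iterator, List, Optional

    def batched_listdir(fetch_batch: Callable[[int], Optional[List[str]]],
                        batch_size: int = 100) -> Iterator[str]:
        # fetch_batch(n) returns up to n names, or None/[] at end of directory.
        # A C implementation would release the GIL once per batch, not per entry.
        while True:
            batch = fetch_batch(batch_size)
            if not batch:
                return
            yield from batch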

    @neologix
    Mannequin

    neologix mannequin commented Mar 5, 2011

    Big dirs are really slow to read at once. If user wants to read items one by one like here

    The problem is that readdir doesn't read a directory entry one at a time.
When you call readdir on an open DIR * for the first time, the libc calls the getdents syscall, requesting a whole bunch of dentries at a time (32768 on my box).
    Then, the subsequent readdir calls are virtually free, and don't involve any syscall/IO at all (that is, until you hit the last cached dent, and then another getdents is performed until end of directory).

    Also, dir_cache in kernel used more effectively.

    You mean the dcache ? Could you elaborate ?

    also, forgot... memory usage on big directories using list is a pain.

    This would indeed be a good reason. Do you have numbers ?

    A generator listdir() geared towards performance should probably be able to work in batches, e.g. read 100 entries at once and buffer them in some internal storage (that might mean use readdir_r()).

    That's exactly what readdir is doing :-)

    Bonus points if it doesn't release the GIL around each individual entry, but also batches that.

    Yes, since only one in 2**15 readdir call actually blocks, that could be a nice optimization (I've no idea of the potential gain though).

    Big dirs are really slow to read at once.

    Are you using EXT3 ?
    There are records of performance issues with getdents on EXT2/3 filesystems, see:
    http://lwn.net/Articles/216948/
    and this nice post by Linus:
    https://lkml.org/lkml/2007/1/7/149

    Could you provide the output of an "strace -ttT python <test script>" (and also the time spent in os.listdir) ?

    @nvetoshkin
    Mannequin

    nvetoshkin mannequin commented Mar 6, 2011

A generator listdir() could be useful if I have a directory with several million files and I want to process just a hundred.

    @nvetoshkin
    Mannequin

    nvetoshkin mannequin commented Mar 7, 2011

Glibc's readdir() and readdir_r() already do caching, so the getdents() syscall is called only once for my '/etc' directory. Should we include another caching level in the xlistdir() function?
On the other hand, we don't know anything about caches at glibc's level, i.e. we can't tell if our next call to readdir() will result in a syscall or even I/O (we could possibly release the GIL for that).

    @socketpair
    Mannequin Author

    socketpair mannequin commented Mar 7, 2011

Glibc's readdir() and readdir_r() already do caching

Yes, but glibc's readdir is the C analogue of a Python generator. We do not need to create a cache for already-cached values.
I think it's OK to build the Python generator on top of readdir (instead of getdents).

Why not create a generator like this?
    (pseudocode)
    ------------------

DIR *d;
struct dirent *entry, *e;

entry = malloc(offsetof(struct dirent, d_name) + pathconf(dirpath, _PC_NAME_MAX) + 1);
if (!entry)
    raise Exception();
if (!(d = opendir(dirpath))) {
    free(entry);
    raise IOException();
}

for (;;) {
    if (readdir_r(d, entry, &e)) {
        closedir(d);
        free(entry);
        raise IOException();
    }
    if (!e)
        break;            /* end of directory */
    yield e->d_name;
}
closedir(d);
free(entry);

    ------------------

    @terryjreedy
    Member

There has been discussion of this before, but it must have been on one of the lists (possibly the py3k list), as searching the tracker for 'listdir generator' only returns this issue.

I believe I pointed out then that Microsoft C (also) has (or once had) a 'nextdir' function. It's been so long that I forget the details.

I thought then and still do that listdir should (have) change(d) like range, map, and filter did, for the same reasons.

    @loewis
    Mannequin

    loewis mannequin commented Mar 9, 2011

I thought then and still do that listdir should (have) change(d) like range, map, and filter did, for the same reasons.

    The reasons that applied to map and range don't apply to listdir(). The
    cost of materializing the list in the former cases may be significant,
    but will be marginal in the latter case.

    @nvetoshkin
    Mannequin

    nvetoshkin mannequin commented Mar 9, 2011

    Can't we simply add os.xlistdir() leaving listdir() as is?

    @briancurtin
    Member

    -1 on going back through blah/xblah all over again.

    @loewis
    Mannequin

    loewis mannequin commented Mar 9, 2011

    Can't we simply add os.xlistdir() leaving listdir() as is?

Only if an advantage can be demonstrated in a realistic application.

    @socketpair
    Mannequin Author

    socketpair mannequin commented Mar 9, 2011

    -1 on going back through blah/xblah all over again.

Originally, I wanted the return value of listdir to be changed from a list to a generator. But then I thought about compatibility: it would break some code. For example, SimpleHTTPServer does:

list = os.listdir(path)
list.sort(key=lambda a: a.lower())

which would no longer work.
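
For what it's worth, the iterator-friendly spelling of that snippet would be something like:

    import os

    path = "."  # whatever directory SimpleHTTPServer is serving
    names = sorted(os.listdir(path), key=str.lower)  # works for a list or any iterable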

    @pitrou
    Member

    pitrou commented Mar 9, 2011

    Can't we simply add os.xlistdir() leaving listdir() as is?

    We could, but someone must:

    1. provide a patch
    2. demonstrate a significant improvement in some real-world situation

    @nvetoshkin
    Mannequin

    nvetoshkin mannequin commented Mar 9, 2011

We could, but someone must:

1. provide a patch
  While working on a straightforward patch for Linux, I had to do a lot of copy-paste work. posixmodule.c is quite a mess already :(
2. demonstrate a significant improvement in some real-world situation
  Suppose we have a directory with several million files and a cron script which must process just a bunch of them at a time. There's no need to gather them all.
  As mmarkk mentioned, readdir already provides generator-style access to the directory contents; it would be nice to provide such an API in Python (see the sketch below).

http://pastebin.com/NCGmfF49 - here's a kind of test (cached and uncached)
http://pastebin.com/tTKRTiNc - here's a testcase for batch processing of directory contents (first is xlistdir(), second - listdir()), both uncached.
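
A self-contained stand-in for that "process just a bunch of them" use case (the pastebin links above are external), written against the generator-style API that eventually landed as os.scandir():

    import os
    from itertools import islice

    def handle(name):
        print("processing", name)   # stand-in for the real per-file work

    def process_some(path, limit=100):
        # Read at most `limit` entries; the rest of the directory is never touched.
        with os.scandir(path) as it:
            for entry in islice(it, limit):
                handle(entry.name)

    process_some(".")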

    @loewis
    Mannequin

    loewis mannequin commented Mar 10, 2011

> a cron script which must process just a bunch of them at a time.
> There's no need to gather them all.

    Can you please be more explicit? What's the application in which you
    have several millions of files in a directory? What's the task that
    the processing script needs to perform?

    http://pastebin.com/NCGmfF49 - here's a kind of test (cached and uncached)

    This isn't really convincing - the test looks at all files, so it isn't
    clear why xlistdir should do any better than listdir. And indeed, with
    a cold cache, xlistdir is slower (IIUC).

http://pastebin.com/tTKRTiNc - here's a testcase for batch processing of directory contents (first is xlistdir(), second - listdir()), both uncached.

    This is not a real-world application - there is no actual processing done.

    BTW, can you publish your xlistdir implementation somewhere?

    @nvetoshkin
    Mannequin

    nvetoshkin mannequin commented Mar 10, 2011

    BTW, can you publish your xlistdir implementation somewhere?
    http://pastebin.com/Qnni5HBa

Tests show a 10 times smaller memory footprint during directory listing: 25 MB against 286 MB on a directory with 800K entries.
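
A rough way to reproduce that kind of comparison on current Python (tracemalloc only sees Python-level allocations, so the absolute numbers will differ from the figures above):

    import os
    import tracemalloc

    path = "."  # point this at a large directory

    tracemalloc.start()
    names = os.listdir(path)                     # builds the whole list up front
    list_peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()

    tracemalloc.start()
    with os.scandir(path) as it:                 # lazy: one entry alive at a time
        count = sum(1 for _ in it)
    iter_peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()

    print(f"listdir peak: {list_peak} B for {len(names)} names; scandir peak: {iter_peak} B")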

    @socketpair
    Mannequin Author

    socketpair mannequin commented Mar 11, 2011

    http://lwn.net/Articles/216948/
    Why kernel.org is slow

To prove that readdir is a bad thing with a large number of items in a directory.

Well, EXT4 has fixed some issues (http://ext2.sourceforge.net/2005-ols/paper-html/node3.html), but what about locking in the Linux kernel (VFS and ext4) code?

Also, some conservative Linux systems still use ext3.

    @loewis
    Mannequin

    loewis mannequin commented Mar 11, 2011

    http://lwn.net/Articles/216948/
    Why kernel.org is slow

    That's all independent of the issue at hand. Whether or not getdents is
    slow doesn't really matter here because we need to call getdents,
    anyway: it's the only API to get at the directory contents, iterative
    or not.

    The issue at hand is whether xlistdir actually provides any advantages
    to a real application, and that cannot be answered by looking at
    benchmarking results that benchmarked the kernel. The *only* way to
    determine whether xlistdir will help is to measure it in a real
    application.

    I stand by my claim that
    a) in cases where you use listdir, you typically have to consider
    all file names in the directory eventually. The total number
    of getdents calls to be made doesn't change when you switch from
    listdir to xlistdir (except in non-realistic micro-benchmarks).

The cases where you don't need to look at all file names are
    typically dealt with by knowing the file name in advance
    (or perhaps a few alternative spellings it may have), and
    hence you don't use listdir at all (but stat/open).
    

    b) If there is some real-world processing of the files (e.g.
    at least to look at the file modification time), this processing
    (and the IO that goes along with it) by far outweigh the cost
    of reading the directory. So even if you could speed up listdir
    by making it iterative, the overall gain would be very small.

There are also good reasons *not* to add xlistdir, primarily to
avoid user confusion. If xlistdir were added, people would run
off and change all uses of listdir "because it is faster",
and then have to deal with backwards compatibility, even though
in most applications a single getdents call will fetch the entire
directory contents just fine (and hence there is *no* change
with xlistdir, except that the list, which only uses a few dozen
bytes, is not created).

    @nvetoshkin
    Mannequin

    nvetoshkin mannequin commented Mar 11, 2011

My benchmarks show that xlistdir() gives only a memory usage advantage on large directories. No speed gain so far - maybe my patch is wrong.

    @Bluehorn
    Mannequin

    Bluehorn mannequin commented Mar 12, 2011

    I would regard this as Type: resource usage, instead of performance. Given enough RAM, loading the whole directory at once will likely be faster.

    The downsides of os.listdir:
a) You can't get a peek at the files read so far; it's all or nothing. I only wanted to know whether a directory is empty, and I had to read the whole thing just to throw it away (maybe I missed another library function?).

b) Using it in a GUI basically requires threads if you may run into a dir with many files, especially on a slow (network) filesystem, because you won't regain control until the whole thing is read.

    I would like to have an iterator version as well, but I also dislike another function (especially the "x" prefix). How about adding a keyword argument to select iterator behaviour?

    @pitrou
    Member

    pitrou commented Mar 12, 2011

    I would like to have an iterator version as well, but I also dislike
    another function (especially the "x" prefix). How about adding a
    keyword argument to select iterator behaviour?

Changing the return type based on an argument is generally frowned upon
    so, if anything, I think it would be better to have a separate function
    (whatever its name).

    @loewis
    Mannequin

    loewis mannequin commented Mar 13, 2011

    The downsides of os.listdir: a) You can't get a peek at the files so
    far, it's all or nothing. I only wanted to know if a directory is
    empty and I have to read the whole thing just to throw it away (maybe
    I missed another library function?)

This depends somewhat on the operating system. On Unix, doing os.stat
on the directory, then looking at st_nlink, will tell you whether
the directory is empty (st_nlink is 2 on an empty directory).

    b) Using it in a GUI basically requires you to use threads if you may
    run into a dir with many files. Especially on a slow filesystem
    (network). Because you won't regain control until the whole thing is
    read.

    Hmm. In a GUI, you would typically want to sort the file names by
    some criterion, which typically requires you to read all files
    (even if you then only display some of them).

    I would like to have an iterator version as well, but I also dislike
    another function (especially the "x" prefix). How about adding a
    keyword argument to select iterator behaviour?

    I still would like to see a demonstrable improvement in a real-world
    application.

    @serhiy-storchaka
    Member

    On Unix, doing os.stat
    on the directory, then looking on st_nlink, will tell you whether
    the directory is empty (st_nlink is 2 on an empty directory).

A directory with st_nlink == 2 can contain any number of non-directory files, and one subdirectory if this directory is the root.
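
For the "is this directory empty?" question above, a check that reads at most one entry is possible with the os.scandir() API this issue eventually led to; a minimal sketch:

    import os

    def is_dir_empty(path):
        # Stops after the first entry instead of listing the whole directory.
        with os.scandir(path) as it:
            return next(it, None) is None

    print(is_dir_empty("/tmp"))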

    @gpshead
    Member

    gpshead commented Mar 6, 2013

I'd like to take care of this in Python, at least for POSIX (someone else can deal with the Windows side if they want).

I just stumbled upon an extension module at work that someone wrote specifically because os.listdir consumed way too much RAM by building up a huge list on large directories with tons of long filenames that it needed to process. (When your process is in charge of keeping that directory in check... the inability to process it once it grows too large simply isn't acceptable.)

    @gpshead gpshead self-assigned this Mar 6, 2013
    @tjguk
    Member

    tjguk commented Mar 6, 2013

    IIRC Nick Coghlan had put a bit of work into this a few months ago as an
    external module with a view to seeing if it got traction before putting
    anything into the stdlib. Might be worth pinging him, or looking to see
    what he'd done. Can't remember the keywords to search for, I'm afraid.
    Something like "directory walker"

    @benhoyt
    Mannequin

    benhoyt mannequin commented May 5, 2013

    I find iterdir_stat() ugly :-) I like the scandir name, which has some precedent with POSIX.

    Fair enough. I'm cool with scandir().

    scandir() cannot return (name, stat), because on POSIX, readdir() only returns d_name and d_type (the type of the entry): to return a stat, we would have to call stat() on each entry, which would defeat the performance gain.

    Yes, you're right. I "solved" this in BetterWalk with the solution you propose of returning a stat_result object with the fields it could get "for free" set, and the others set to None.

    So on Linux, you'd get a stat_result with only st_mode set (or None for DT_UNKNOWN), and all the other fields None. However -- st_mode is the one you're most likely to use, usually looking just for whether it's a file or directory. So calling code would look something like this:

    files = []
    dirs = []
    for name, st in scandir(path):
        if st.st_mode is None:
            st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)

Meaning you'd get the speed improvements 99% of the time (when st_mode was set), but if st_mode is None, you can call stat and handle errors and whatnot yourself.

    That's why scandir would be a rather low-level call, whose main user would be walkdir, which only needs to know the entry time and not the whole stat result.

    Agreed. This is in the OS module after all, and there's tons of stuff that's OS-dependent in there. However, I think that doing something like the above, we can make it usable and performant on both Linux and Windows for use cases like walking directory trees.

    Also, I don't know which information is returned by the readdir equivalent on Windows, but if we want a consistent API, we have to somehow map d_type and Windows's returned type to a common type, like DT_FILE, DT_DIRECTORY, etc (which could be an enum).

    The Windows scan directory functions (FindFirstFile/FindNextFile) return a *full* stat (or at least, as much info as you get from a stat in Windows). We *could* map them to a common type -- but I'm suggesting that common type might as well be "stat_result with None meaning not present". That way users don't have to learn a completely new type.

    The other approach would be to return a dummy stat object with only st_mode set, but that would be kind of a hack to return a dummy stat result with only part of the attributes set (some people will get bitten by this).

We could document any platform-specific stuff, and the places where users could get bitten. But can you give me an example of where the stat_result-with-st_mode-or-None approach falls over completely?

    @ncoghlan
    Contributor

    ncoghlan commented May 5, 2013

    I think os.scandir is a case where we *want* a low level call that exposes everything we can retrieve efficiently about the directory entries given the underlying platform - not everything written in Python is written to be portable, especially when it comes to scripts rather than applications (e.g. given where I work, I write a fair bit of code that is Fedora/RHEL specific, and if that code happens to work anywhere else it's just a bonus rather than being of any specific value to me).

    This may mean that we just return an "info" object for each item, where the available info is explicitly platform specific. Agreed it can be an actual stat object, though.

os.walk then becomes the cross-platform abstraction built on top of the low-level scandir call (splitting files from directories is probably about all we can do consistently cross-platform without per-entry stat calls).
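
A minimal sketch of such a scandir-based walk (simplified: none of the topdown/onerror/followlinks handling of the real os.walk):

    import os

    def simple_walk(top):
        # The d_type returned by the directory read usually answers is_dir()
        # without an extra stat() call.
        dirs, files = [], []
        with os.scandir(top) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.name)
                else:
                    files.append(entry.name)
        yield top, dirs, files
        for name in dirs:
            yield from simple_walk(os.path.join(top, name))

    for dirpath, dirnames, filenames in simple_walk("."):
        print(dirpath, len(dirnames), len(filenames))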

    @neologix
    Mannequin

    neologix mannequin commented May 5, 2013

    We could document any platform-specific stuff, and places you'd users could get bitten. But can you give me an example of where the stat_result-with-st_mode-or-None approach falls over completely?

    Well, that's easy:

    size = 0
    for name, st in scandir(path):
        if stat.S_ISREG(st.st_mode):
            size += st.st_size

    Agreed it can be an actual stat object, though.

Well, the nice thing is that we don't have to create yet another info
object; the downside is that it can be tricky, see above.

    We can probably use the DTTOIF macro to convert d_type to st_mode.
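
For reference, glibc defines DTTOIF(d) as (d) << 12, so the mapping can be sketched as follows (the d_type constants below are the usual Linux values; DT_UNKNOWN is 0):

    import stat

    DT_DIR, DT_REG, DT_LNK = 4, 8, 10   # from <dirent.h> on Linux

    def dttoif(d_type):
        # Convert a dirent d_type into the file-type bits of st_mode.
        return d_type << 12

    assert dttoif(DT_DIR) == stat.S_IFDIR
    assert dttoif(DT_REG) == stat.S_IFREG
    assert dttoif(DT_LNK) == stat.S_IFLNK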

    @vstinner
    Member

    vstinner commented May 5, 2013

    I really like scandir() -> (name: str, stat: stat structure using None for
    unknown fields).

I expect this API to optimize use cases like:

• glob.glob("*.jpg") in a big directory with few JPEG pictures
• os.walk(".") in a directory with many files: should reduce the number of
  stat() calls to zero on most platforms

But as usual, a benchmark on a real platform would be more convincing.

Filtering entries in os.listdir() or os.scandir() would be faster (than
filtering their output), but it is hard to design an API to filter on arbitrary
fields (name, file type, size, ...), especially because the underlying C
functions do not provide the required information. A generator is closer to
the Python design and more natural.
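
A sketch of that "filter in Python on top of a generator" approach for the glob-like case:

    import fnmatch
    import os

    def iter_matching(path, pattern):
        # Lazily yield names matching a glob-style pattern; nothing is listed up front.
        with os.scandir(path) as it:
            for entry in it:
                if fnmatch.fnmatch(entry.name, pattern):
                    yield entry.name

    for name in iter_matching(".", "*.jpg"):
        print(name)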

    @benhoyt
    Mannequin

    benhoyt mannequin commented May 5, 2013

    Yeah, I very much agree with what Nick says -- we really want a way to expose what the platform provides. It's less important (though still the ideal), to expose that in a platform-independent way. Today the only way to get access to opendir/readdir on Linux and FindFirst/Next on Windows is by using a bunch of messy (and slowish) ctypes code. And yes, os.walk() would be the main cross-platform abstraction built on top of this.

    Charles gave this example of code that would fall over:

    size = 0
    for name, st in scandir(path):
        if stat.S_ISREG(st.st_mode):
            size += st.st_size

    I don't see it, though. In this case you need both .st_mode and .st_size, so a caller would check that those are not None, like so:

    size = 0
    for name, st in scandir(path):
        if st.st_mode is None or st.st_size is None:
            st = os.stat(os.path.join(path, name))
        if stat.S_ISREG(st.st_mode):
            size += st.st_size

    One line of extra code for the caller, but a big performance gain in most cases.

Stinner said, "But as usual, a benchmark on a real platform would be more convincing". Here's a start: https://github.com/benhoyt/betterwalk#benchmarks -- but it's not nearly as good as it gets yet, because those figures are still using the ctypes version. I've got a C version that's half-finished, and on Windows it makes os.walk() literally 10x the speed of the default version. Not sure about Linux/opendir/readdir yet, but I intend to do that too.

    @vstinner
    Member

    vstinner commented May 5, 2013

    size = 0
    for name, st in scandir(path):
        if st.st_mode is None or st.st_size is None:
            st = os.stat(os.path.join(path, name))
        if stat.S_ISREG(st.st_mode):
            size += st.st_size

It would be safer to use the dir_fd parameter when supported, but I don't
think that os.scandir() should care about this problem. A higher-level
API like pathlib, walkdir & co. should be able to reuse the *at() C
functions using the dir_fd parameter.
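
A sketch of the dir_fd idea (os.stat accepts dir_fd only where the platform provides fstatat() and raises NotImplementedError elsewhere):

    import os
    import stat

    path = "."                                # illustrative
    fd = os.open(path, os.O_RDONLY)           # file descriptor for the directory itself
    try:
        for name in os.listdir(fd):           # os.listdir also accepts a directory fd
            st = os.stat(name, dir_fd=fd)     # *at()-style: resolved relative to fd
            if stat.S_ISREG(st.st_mode):
                print(name, st.st_size)
    finally:
        os.close(fd)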

    @neologix
    Mannequin

    neologix mannequin commented May 6, 2013

    Charles gave this example of code that would fall over:

size = 0
for name, st in scandir(path):
    if stat.S_ISREG(st.st_mode):
        size += st.st_size

    I don't see it, though. In this case you need both .st_mode and .st_size, so a caller would check that those are not None, like so:

    Well, that's precisely the point.
    A normal "caller" would never expect a stat object to be partially
    populated: if a function has a prototype returning a stat object, then
    I definitely expect it to be a regular stat object, with all the
    fields guaranteed by POSIX set (st_size, st_ino, st_dev...). By
    returning a dummy stat object, you break the stat interface, and I'm
    positive this *will* puzzle users and introduce errors.

    Now, if I'm the only one who finds this trick dangerous and ugly, you
    can go ahead, but I stand by my claim that it's definitely a bad idea
(between this and the explicit Enum value assignment, I feel somewhat
    lost lately :-)

    @pitrou
    Member

    pitrou commented May 6, 2013

(between this and the explicit Enum value assignment, I feel somewhat
    lost lately :-)

    Don't worry, it sometimes happens :-)

    @benhoyt
    Mannequin

    benhoyt mannequin commented May 6, 2013

    A normal "caller" would never expect a stat object to be partially populated: if a function has a prototype returning a stat object, then I definitely expect it to be a regular stat object, with all the fields guaranteed by POSIX set (st_size, st_ino, st_dev...).

I don't think that's true in general, or true of how other Python APIs work. For instance, many APIs return a "file-like object", and you can only do certain things on that object, depending on what the documentation says, or what EAFP gets you. Some file-like objects don't support seek/tell, some don't support close, etc. I've seen plenty of walk-like-a-duck checks like this:

    if hasattr(f, 'close'):
        f.close()

    Anyway, my point boils down to:

    • scandir() is a new function, so there aren't old trends or things that will break
• we clearly document it as returning a tuple of (name, st), where st is a "stat-like object" whose individual fields are None if they couldn't be determined for free with the directory scanning
    • in fact, that's kind of the point of the "st" object in this function, so the example could be the one I gave above where you call os.stat() if either of the fields you want is None
    • if that's clear in the documentation (of this new function) and the first example shows you exactly how it's meant to be used, I think that's pretty sane and sensible...

    @neologix
    Mannequin

    neologix mannequin commented May 6, 2013

    I don't think that's true in general, or true of how other Python APIs work. For instance, many APIs return a "file-like object", and you can only do certain things on that object, depending on what the documentation says, or what EAFP gets you. Some file-like object don't support seek/tell, some don't support close, etc. I've seen plenty of walk-like-a-duck checks like this:

Yes, I'm fully aware of duck-typing ;-)
But here, you're saying that "a duck has a beak, but it *may* have
legs, a tail, etc".
It just looks wrong to me on so many levels.

    Please bring this up on python-dev.

    @gpshead
    Member

    gpshead commented May 6, 2013

Actually I'm thinking this duck may only have a beak. Instead of a bunch of
fields of None I'd prefer just not having that attribute defined on the
object. I consider the OS-specific "stat-like" info from reading a
directory to be so OS-specific that I'd rather not let someone be confused
by it if it were returned up to a higher-level caller. It's not a
stat.
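
A sketch contrasting the two conventions under discussion (the function names here are illustrative, not a proposed API):

    import os
    import stat

    def size_if_regular_none_style(path, name, st):
        # st is a stat-like object whose unknown fields are set to None.
        if st.st_mode is None or st.st_size is None:
            st = os.stat(os.path.join(path, name))      # fall back to a real stat()
        return st.st_size if stat.S_ISREG(st.st_mode) else 0

    def size_if_regular_attr_style(path, name, info):
        # info only has the attributes the OS actually returned.
        if not (hasattr(info, "st_mode") and hasattr(info, "st_size")):
            info = os.stat(os.path.join(path, name))
        return info.st_size if stat.S_ISREG(info.st_mode) else 0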

    @benhoyt
    Mannequin

    benhoyt mannequin commented May 11, 2013

    Please bring this up on python-dev.

    Good idea. Thread started: http://mail.python.org/pipermail/python-dev/2013-May/126119.html

    @neologix
    Mannequin

    neologix mannequin commented Oct 21, 2013

    Gregory, did you make any progress on this?
    I think it would be a nice addition.

    @tiran
    Member

    tiran commented Oct 21, 2013

    Indeed! I'd like to see the feature in 3.4 so I can remove my own hack from our code base.

    @gpshead
    Member

    gpshead commented Oct 21, 2013

    I haven't had a chance to look at this since May. It'd still be a great addition.

    @gpshead
    Member

    gpshead commented Nov 23, 2013

    For reference the current state of things for this is the proposal in:
    https://mail.python.org/pipermail/python-dev/2013-May/126196.html

    With a prototype using a ctypes based implementation as proof of concept in https://github.com/benhoyt/scandir.

    A combination of that interface plus my existing scandir patch (-gps02) could be created for the final implementation.

As 3.4beta1 happens tonight, this isn't going to make 3.4, so I'm bumping this to 3.5. I really like the proposed design outlined above.

    @rhettinger
    Contributor

    I'm with Martin and the other respondents who think this shouldn't be done.

Without compelling timings, this smacks of feature creep. The platform-specific issues may create an ongoing maintenance problem. The feature itself is prone to misuse, leaving hard-to-find race condition bugs in its wake.

    @vstinner
    Member

Maybe the development should start outside the Python stdlib, as a project on PyPI.

    @benhoyt
    Mannequin

    benhoyt mannequin commented Jun 26, 2014

    Raymond, there are very compelling timings/benchmarks for this -- not so much the original issue here (generator vs list, that's not really an issue) but having a scandir() function that returns the stat-like info from the OS so you don't need extra stat calls. This speeds up os.walk() by 7-20 times on Windows and 4-5 times on Linux. See more at: https://github.com/benhoyt/scandir#benchmarks

    I've written a draft PEP that I've sent to the PEP editors (if you're interested, it's at https://github.com/benhoyt/scandir/blob/master/PEP.txt). If any of the PEP editors are listening here ... would love some feedback on that at some stage. :-)

    Victor -- development has started outside the stdlib here: https://github.com/benhoyt/scandir and PyPI module here: https://pypi.python.org/pypi/scandir Both are being used by various people.

    @vstinner
    Member

    "I've written a draft PEP that I've sent to the PEP editors (if you're interested, it's at https://github.com/benhoyt/scandir/blob/master/PEP.txt). If any of the PEP editors are listening here ... would love some feedback on that at some stage. :-)"

    Oh you wrote a PEP? Great! I pushed it to the PEP repository. It should be online in a few hours:
    http://legacy.python.org/dev/peps/pep-0471/

PEP editors are still useful if you want direct access to the Mercurial repository so you can modify your PEP yourself.

If you have a PEP, it's time to send it to the python-dev mailing list. Don't attach it to your mail, but copy the PEP into the body of your email for easier inline comments in replies.

    @benhoyt
    Mannequin

    benhoyt mannequin commented Jun 26, 2014

    Thanks! Will post the PEP to python-dev in the next day or two.

    @ncoghlan
    Contributor

    I suggest a pass through python-ideas first. python-ideas feedback tends to
    be more oriented towards "is this proposal as persuasive as it could be?",
    while python-dev is more aimed at the "is this a good idea or not?" yes/no
    question. (python-ideas feedback naturally includes some of the latter as
    well, but there's a lot more "I'm not sure I agree with the idea itself,
    but I agree it's worth discussing further" feedback than is common on
    python-dev)

    @benhoyt
    Mannequin

    benhoyt mannequin commented Jun 26, 2014

    Nick -- sorry, already posted to python-dev before seeing your latest. However, I think it's the right place, as there's already been a fair bit of hashing this idea and API out on python-ideas first and then also python-dev. See links in the PEP here: http://legacy.python.org/dev/peps/pep-0471/#previous-discussion

    @pitrou
    Member

    pitrou commented Sep 30, 2014

    I haven't really followed, but now that the PEP is accepted, what is the progress on this one?

    @benhoyt
    Mannequin

    benhoyt mannequin commented Sep 30, 2014

    Yes, PEP-471 has been accepted, and I've got a mostly-finished C implementation of os.scandir() for CPython 3.5, as well as tests and docs. If you want a sneak preview, see posixmodule_scandir*.c, test/test_scandir.py, and os.rst here: https://github.com/benhoyt/scandir

It's working well on Windows, but the Linux version still has a couple of tiny issues (core dumps ;-).

    Given that os.scandir() will solve this issue (as well as the bigger performance problem due to listdir throwing away file type info), can we close this issue and open another one to track the implementation of os.scandir() / PEP-471?

    @pitrou
    Member

    pitrou commented Sep 30, 2014

    Given that os.scandir() will solve this issue (as well as the bigger performance problem due to listdir throwing away file type info), can we close this issue and open another one to track the implementation of os.scandir() / PEP-471?

    This makes sense. Can you do it?

    @benhoyt
    Mannequin

    benhoyt mannequin commented Sep 30, 2014

    Okay, I've opened http://bugs.python.org/issue22524, but I don't have the permissions to close this one, so could someone with bugs.python.org superpowers please do that?

    @ncoghlan
    Contributor

    This approach has been rejected in favour of the accepted PEP-471 proposal to add os.scandir() (issue bpo-22524)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022