classification
Title: Make os.walk and os.fwalk yield namedtuple instead of tuple
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.6
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: ethan.furman, giampaolo.rodola, loewis, palaviv, rhettinger, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2016-04-26 13:45 by palaviv, last changed 2017-04-11 02:02 by rhettinger. This issue is now closed.

Files
File name: os-walk-result-namedtuple.patch
Uploaded: palaviv, 2016-04-26 13:45
Messages (13)
msg264285 - Author: Aviv Palivoda (palaviv) * Date: 2016-04-26 13:45
I am suggesting that os.walk and os.fwalk yield a named tuple instead of the regular tuple they currently yield.
The use case for this change can be seen in the next example:

def walk_wrapper(walk_it):
    for dir_entry in walk_it:
        if dir_entry[0] == "aaa":
            yield dir_entry

Because walk_it can be either os.walk or os.fwalk I need to access dir_entry via index.

My change will allow me to change this function to:

def walk_wrapper(walk_it):
    for dir_entry in walk_it:
        if dir_entry.dirpath == "aaa":
            yield dir_entry

Which is clearer and more readable.
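
A minimal sketch of the idea, not the attached patch. The names WalkResult and named_walk are hypothetical, and walk_wrapper takes an extra wanted argument for illustration; the field names are the ones used in the os.walk documentation. It shows that a named tuple can be read by attribute while still behaving like the tuple callers already expect.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical result type; the field names follow the os.walk() documentation.
WalkResult = namedtuple("WalkResult", ["dirpath", "dirnames", "filenames"])


def named_walk(top, **kwargs):
    """Wrap os.walk() so that each item is a WalkResult (illustration only)."""
    for dirpath, dirnames, filenames in os.walk(top, **kwargs):
        # dirnames is passed through as the same list object, so in-place
        # pruning by the caller still affects the traversal.
        yield WalkResult(dirpath, dirnames, filenames)


def walk_wrapper(walk_it, wanted):
    """Filter entries by attribute instead of by index."""
    for dir_entry in walk_it:
        if dir_entry.dirpath == wanted:
            yield dir_entry


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "aaa"))
    os.makedirs(os.path.join(tmp, "bbb"))
    wanted = os.path.join(tmp, "aaa")
    matches = list(walk_wrapper(named_walk(tmp), wanted))
```

Each entry in matches is still a tuple, so matches[0][0] and matches[0].dirpath refer to the same value.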
msg264288 - Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2016-04-26 13:58
Quick review of patch looks good.  I'll try to look it over more closely later.
msg264418 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-04-28 06:42
Classes are normally named with CamelCase.  Also, "walk_result" or "WalkResult" seems like an odd name that doesn't really fit.   DirEntry or DirInfo is a better match (see the OP's example, "for dir_entry in walk_it: ...")

The "versionchanged" should be a "versionadded".

The docs should use "named tuple" instead of "namedtuple".  The former is the generic term used in the glossary to describe the instances.  The latter is the factory function that creates a new tuple subclass.

The attribute descriptions for the docs are pretty good.  They should also be applied as actual docstrings in the code.

The docs and code for fwalk() need to be harmonized with walk() so that the tuple fields use the same names:  change (root, dirs, files) to (dirpath, dirnames, filenames).
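
As an illustration of the naming and docstring points above, here is a small sketch. DirInfo is only one of the names floated in this message, and the docstring wording is made up for the example; it is not what the patch contains.

```python
from collections import namedtuple

# Hypothetical CapWords name; DirInfo is one of the names suggested above.
DirInfo = namedtuple("DirInfo", ["dirpath", "dirnames", "filenames"])

# Field docstrings are writable (since Python 3.5), so the attribute
# descriptions from the docs can also live in the code.
DirInfo.__doc__ = "Hypothetical result of one os.walk() step."
DirInfo.dirpath.__doc__ = "Path to the directory being visited."
DirInfo.dirnames.__doc__ = "Names of the subdirectories in dirpath."
DirInfo.filenames.__doc__ = "Names of the non-directory files in dirpath."

info = DirInfo("top", ["sub"], ["a.txt"])
```

With this in place, help(DirInfo) shows the per-field descriptions alongside the class docstring.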
msg264421 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-28 08:00
Sorry, but I disagree with Raymond on many points.

> Classes are normally named with CamelCase.  Also, "walk_result" or "WalkResult" seems like an odd name that doesn't really fit.   DirEntry or DirInfo is a better match (see the OP's example, "for dir_entry in walk_it: ...")

See "stat_result", "statvfs_result", "waitid_result", "uname_result", and "times_result". DirEntry is already used in the os module. And if this feature is accepted, separate types are needed for walk() and fwalk() results.

> The "versionchanged" should be a "versionadded".

os.walk() is not new. Only its result is changed. Class "walk_result" can be tagged with "versionadded", but I'm not sure there is a need to document it separately. The documentation of the os module is already too large. "uname_result" and "times_result" are not documented.

> The docs and code for fwalk() need to be harmonized with walk() so that the tuple fields use the same names:  change (root, dirs, files) to (dirpath, dirnames, filenames).

(root, dirs, files) is shorter than (dirpath, dirnames, filenames) and these names were used with os.walk() and os.fwalk() for years.

In general, I have doubts about this feature.

1. There is a little backward incompatibility. At least pickle is not backward compatible, and I guess other serialization methods are not either.

2. os.walk() and os.fwalk() are intended to be used in a for loop with immediate unpacking of the result tuple:

    for root, dirs, files in os.walk(...):
        ...

Adding a named tuple doesn't add any benefit for the common case.

In the OP's case, you can either use an fwalk-based implementation of walk (issue15200):

    def fwalk_as_walk(*args, **kwargs):
        for x in os.fwalk(*args, **kwargs):
            yield x[:-1]

or just ignore the rest of tuple items:

    for root, *_ in walk_it:
        ...

3. Using namedtuple is slower and consumes more memory than using tuple. Even for FS-related operations like os.walk() this can matter. A lot of code is optimized for exact tuples; with namedtuple this optimization is lost.

4. New names (dirpath, dirnames, filenames) are questionable. Why not use underscores (dir_names)? "dir" in dirpath refers to the directory currently being processed, but "dir" in dirnames refers to its subdirectories. Currently you are free to use short names (root, dirs, files) from examples or whatever you prefer, but with namedtuple you are stuck with the standard names forever. There are no names that satisfy everybody.

5. Third-party walk-like iterators generate tuples, so you can't use attribute access in too general code.
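
The two workarounds from point 2 can be tried without touching the file system. The sketch below uses made-up stand-in data for the shapes involved (3-tuples as yielded by os.walk(), 4-tuples with a trailing directory file descriptor as yielded by os.fwalk()); the helper name dirpaths is hypothetical.

```python
def dirpaths(walk_it):
    """Yield the first item of every entry, whatever its length."""
    for root, *_ in walk_it:
        yield root


# Stand-ins for the shapes discussed above (made-up data, no file system
# access): os.walk() yields 3-tuples, os.fwalk() yields 4-tuples whose last
# item is a directory file descriptor.
walk_like = [("top", ["sub"], ["a.txt"]), ("top/sub", [], [])]
fwalk_like = [("top", ["sub"], ["a.txt"], 3), ("top/sub", [], [], 4)]

walk_roots = list(dirpaths(walk_like))
fwalk_roots = list(dirpaths(fwalk_like))

# Slicing off the trailing descriptor gives walk-shaped entries, as in
# fwalk_as_walk above.
trimmed = [entry[:-1] for entry in fwalk_like]
```

Because the helper relies only on positional unpacking, it also accepts plain tuples from third-party walk-like iterators (point 5).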
msg264478 - Author: Aviv Palivoda (palaviv) * Date: 2016-04-29 08:42
In regard to Raymond's points, I agree with Serhiy's comments.

As for Serhiy's doubts:

> 3. Using namedtuple is slower and consumes more memory than using tuple. Even for FS-related operations like os.walk() this can matter. A lot of code is optimized for exact tuples; with namedtuple this optimization is lost.

I did some testing on my own PC:
./python -m timeit -s "from os import walk"  "for x in walk('Lib'): pass"

Regular tuple: 7.53 msec
Named tuple: 7.66 msec

> 4. New names (dirpath, dirnames, filenames) are questionable. Why not use underscores (dir_names)? "dir" in dirpath refers to the directory currently being processed, but "dir" in dirnames refers to its subdirectories. Currently you are free to use short names (root, dirs, files) from examples or whatever you prefer, but with namedtuple you are stuck with the standard names forever. There are no names that satisfy everybody.

I agree that no names will satisfy everybody, but I think the names that are currently in the documentation are the most obvious choice.

As for points 1, 2 and 5, this feature doesn't break any of the old walk API.

One more point I would like input on is the testing. I can remove the walk method from the WalkTests and FwalkTests classes and use the new named tuple attributes in the tests. Do you think it's better, or should we keep the tests with the old API (access using indexes)?
msg264503 - Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2016-04-29 14:45
I'm not clear on what you are asking, but regardless we should have both the old (by-index) tests and new by-attribute tests.
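
A rough sketch of what keeping both kinds of tests could look like. WalkResult and WalkResultTests are hypothetical stand-ins, not the classes in the patch or in the existing test suite, and the entry is constructed by hand rather than produced by os.walk().

```python
import unittest
from collections import namedtuple

# Hypothetical stand-in for the type the patch would return.
WalkResult = namedtuple("WalkResult", ["dirpath", "dirnames", "filenames"])


class WalkResultTests(unittest.TestCase):
    def setUp(self):
        self.entry = WalkResult("top", ["sub"], ["a.txt"])

    def test_by_index(self):
        # Old API: positional access and unpacking must keep working.
        self.assertEqual(self.entry[0], "top")
        root, dirs, files = self.entry
        self.assertEqual((root, dirs, files), ("top", ["sub"], ["a.txt"]))

    def test_by_attribute(self):
        # New API: the same values are reachable by field name.
        self.assertEqual(self.entry.dirpath, "top")
        self.assertEqual(self.entry.dirnames, ["sub"])
        self.assertEqual(self.entry.filenames, ["a.txt"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(WalkResultTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```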
msg264506 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-04-29 16:09
https://www.python.org/dev/peps/pep-0008/#class-names -- "Class names should normally use the CapWords convention."

Examples:
---------
crypt.py
6:from collections import namedtuple as _namedtuple
13:class _Method(_namedtuple('_Method', 'name ident salt_chars total_size')):

difflib.py
34:from collections import namedtuple as _namedtuple
36:Match = _namedtuple('Match', 'a b size')

dis.py
163:_Instruction = collections.namedtuple("_Instruction",
280:    Generates a sequence of Instruction namedtuples giving the details of each

doctest.py
107:from collections import namedtuple
109:TestResults = namedtuple('TestResults', 'failed attempted')

functools.py
21:from collections import namedtuple
345:_CacheInfo = namedtuple("CacheInfo", ["hits", "misses", "maxsize", "currsize"])

inspect.py
51:from collections import namedtuple, OrderedDict
323:Attribute = namedtuple('Attribute', 'name kind defining_class object')
968:Arguments = namedtuple('Arguments', 'args, varargs, varkw')
1008:ArgSpec = namedtuple('ArgSpec', 'args varargs keywords defaults')
1032:FullArgSpec = namedtuple('FullArgSpec',
1124:ArgInfo = namedtuple('ArgInfo', 'args varargs keywords locals')
1317:ClosureVars = namedtuple('ClosureVars', 'nonlocals globals builtins unbound')
1372:Traceback = namedtuple('Traceback', 'filename lineno function code_context index')
1412:FrameInfo = namedtuple('FrameInfo', ('frame',) + Traceback._fields)

nntplib.py
159:GroupInfo = collections.namedtuple('GroupInfo',
162:ArticleInfo = collections.namedtuple('ArticleInfo',

No doubt, there are exceptions to the rule in the standard library, which is less consistent than we might like:  "stat_result".  That said, stat_result is a structseq and many C type names are old or violate the rules (list vs List, etc).   New named tuples should follow PEP 8 and use the CapWords convention unless there is a strong reason not to in a particular case.
msg264507 - Author: Aviv Palivoda (palaviv) * Date: 2016-04-29 16:26
Thanks for the response, Ethan. I think that I will leave the tests as they are in the current patch.

> No doubt, there are exceptions to the rule in the standard library, which is less consistent than we might like:  "stat_result".  That said, stat_result is a structseq and many C type names are old or violate the rules (list vs List, etc).   New named tuples should follow PEP 8 and use the CapWords convention unless there is a strong reason not to in a particular case.

I actually thought we should keep consistency with other "result"-like objects. I can see your point that new named tuples should follow PEP 8, and DirEntry is an example of a new "result" class that follows PEP 8.
What names do you suggest? Maybe DirInfo and FDirInfo?
msg291346 - Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-04-08 21:40
Should we be concerned about performance? Accessing a namedtuple value is almost 4x slower than accessing a plain tuple [1], and os.walk() may iterate hundreds of times.

[1] http://stackoverflow.com/questions/2646157/what-is-the-fastest-to-access-struct-like-object-in-python
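
One way to check the field-access cost on a given interpreter is a timeit micro-benchmark along these lines. This is only a sketch: WalkResult is a hypothetical type, the iteration count is arbitrary, and the numbers depend on the Python version and machine, so no results are claimed here.

```python
import timeit
from collections import namedtuple

# Hypothetical result type, used only for the measurement.
WalkResult = namedtuple("WalkResult", ["dirpath", "dirnames", "filenames"])

plain = ("top", ["sub"], ["a.txt"])
named = WalkResult(*plain)

number = 200_000  # arbitrary repetition count
index_time = timeit.timeit("entry[0]", globals={"entry": plain}, number=number)
attr_time = timeit.timeit("entry.dirpath", globals={"entry": named}, number=number)

print(f"index access:     {index_time / number * 1e9:.1f} ns per lookup")
print(f"attribute access: {attr_time / number * 1e9:.1f} ns per lookup")
```

This isolates the lookup itself; it says nothing about how that cost compares with the directory scanning that os.walk() performs for each entry.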
msg291352 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-09 06:08
I would expect that the field access time is inconsequential compared to just about every other aspect of os.walk().
msg291354 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-09 07:28
namedtuple's attribute access was optimized in recent years. In 3.7 it is 30% faster than in 3.4, so now it is only 3x slower than a plain tuple. On the other hand, os.walk() and os.fwalk() were optimized too. In 3.7 they are up to 3.5x faster than in 3.4 (with hot caches). I didn't make measurements, but I expect that using namedtuples with os.walk() can decrease its performance at least by a few percent.

My main concern is that this feature will increase the complexity of the documentation of the os module (very little) and may encourage writing less clear code (but this is just my own preference; others may find the new style clearer).
msg291355 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-09 07:29
s/at least/at most/
msg291402 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-10 00:55
There doesn't seem to be a consensus that the proposal is a net win.  Serhiy made a persuasive argument that the added complexity isn't worth it.

I'll leave this open for a day or two so that anyone else can make their case.  Otherwise, I'll mark this as closed/rejected.
History
Date User Action Args
2017-04-11 02:02:11 rhettinger set status: open -> closed
    resolution: rejected
    stage: resolved
2017-04-10 00:56:10 rhettinger set messages: - msg291388
2017-04-10 00:55:45 rhettinger set messages: + msg291402
2017-04-09 19:03:04 rhettinger set assignee: rhettinger
    messages: + msg291388
2017-04-09 07:29:02 serhiy.storchaka set messages: + msg291355
2017-04-09 07:28:29 serhiy.storchaka set messages: + msg291354
2017-04-09 06:08:56 rhettinger set messages: + msg291352
2017-04-08 21:40:52 giampaolo.rodola set nosy: + giampaolo.rodola
    messages: + msg291346
2016-05-04 13:09:28 ppperry set title: os.walk and os.fwalk yield namedtuple instead of tuple -> Make os.walk and os.fwalk yield namedtuple instead of tuple
2016-04-29 16:26:58 palaviv set messages: + msg264507
2016-04-29 16:09:11 rhettinger set messages: + msg264506
2016-04-29 14:45:22 ethan.furman set messages: + msg264503
2016-04-29 08:42:25 palaviv set messages: + msg264478
2016-04-28 08:00:48 serhiy.storchaka set nosy: + serhiy.storchaka
    messages: + msg264421
2016-04-28 06:42:47 rhettinger set nosy: + rhettinger
    messages: + msg264418
2016-04-26 13:58:00 ethan.furman set messages: + msg264288
2016-04-26 13:53:40 ethan.furman set nosy: + ethan.furman
2016-04-26 13:45:06 palaviv create