classification
Title: pathlib.Path.parents negative indexing is wrong for absolute paths
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.11, Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: josh.r, p-ganssle
Priority: normal Keywords:

Created on 2021-10-08 18:11 by josh.r, last changed 2021-10-08 21:19 by josh.r.

Messages (4)
msg403488 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2021-10-08 18:11
At least on PosixPath (not currently able to try on Windows to check WindowsPath, but from a quick code check I think it'll behave the same way), the negative indexing added in #21041 is implemented incorrectly for absolute paths. Passing either -1 or -2 will return a path representing the root, '/' for PosixPath (which should only be returned for -1), and passing an index of -3 or beyond returns the value expected for that index + 1, e.g. -3 gets the result expected for -2, -4 gets the result for -3, etc. And for the negative index that should be equivalent to index 0, you end up with an IndexError.

The underlying problem appears to be that absolute paths (at least, those created from a string) are represented in self._parts with the root '/' included (redundantly, since self._root has it too), so all the actual components of the path are offset by one.

This does not affect slicing (slicing is implemented using range and slice.indices to perform normalization from negative to positive indices, so it never indexes with a negative index).

Example:

>>> from pathlib import Path
>>> p = Path('/1/2/3')
>>> p._parts
['/', '1', '2', '3']
>>> p.parents[:]
(PosixPath('/1/2'), PosixPath('/1'), PosixPath('/'))
>>> p.parents[-1]
PosixPath('/')
>>> p.parents[-1]._parts  # Still behaves normally as self._root is still '/'
[]
>>> p.parents[-2]
PosixPath('/')
>>> p.parents[-2]._parts
['/']
>>> p.parents[-3]
PosixPath('/1')
>>> p.parents[-4]
Traceback (most recent call last):
    ...
IndexError: -4

It looks like the underlying problem is that the negative indexing code doesn't account for the possibility of '/' being in _parts and behaving as a component separate from the directory/files in the path. Frankly, it's a little odd that _parts includes '/' at all (Path has a ._root/.root attribute that stores it too, and even when '/' isn't in the ._parts/.parts, the generated complete path includes it because of ._root), but it looks like the docs guaranteed that behavior in their examples.

It looks like one of two options must be chosen:

1. Fix the negative indexing code to account for absolute paths, and ensure absolute paths store '/' in ._parts consistently (it should not be possible to get two identical Paths, one of which includes '/' in _parts, one of which does not, which is possible with the current negative indexing bug; not sure if there are any documented code paths that might produce this warped sort of object outside of the buggy .parents), or

2. Make no changes to the negative indexing code, but make absolute paths *never* store the root as the first element of _parts (.parts can prepend self._drive/self._root on demand to match documentation). This probably involves more changes (lots of places assume _parts includes the root, e.g. the _PathParents class's own __len__ method raises a ValueError when called on the warped object returned by p.parents[-1], because it adjusts for the root, and the lack of one means it returns a length of -1).

I think #1 is probably the way to go. I believe all that would require is to add:

if idx < 0:
    return self.__getitem__(len(self) + idx)

just before:

return self._pathcls._from_parsed_parts(self._drv, self._root, self._parts[:-idx - 1])

so it never tries to use a negative idx directly (it has to occur after the check for valid index in [-len(self), len(self) so very negative indices don't recurse until they become positive).

This takes advantage of _PathParents's already adjusting the reported length for the presence of drive/root, keeping the code simple; the alternative I came up with that doesn't recurse changes the original return line:

return self._pathcls._from_parsed_parts(self._drv, self._root, self._parts[:-idx - 1])

to:

adjust = idx >= 0 or not (self._drv or self._root)
return self._pathcls._from_parsed_parts(self._drv, self._root, self._parts[:-idx - adjust])

which is frankly terrible, even if it's a little faster.
msg403493 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2021-10-08 19:08
This is a great bug report, but for anyone else who gets a bit lost in the details, here's the core of the issue:

>>> p = Path("/1/2")
>>> q = Path("1/2")

>>> p.parents[-1]  # This is correct
PosixPath('/')
>>> q.parents[-1]
PosixPath('.')

>>> p.parents[-2]  # Should be PosixPath('/1')
PosixPath('/')
>>> q.parents[-2]
PosixPath('1')

>>> p.parents[-3]  # Should be PosixPath('/1/2')
PosixPath('/1')
>>> q.parents[-3]
PosixPath('1/2')

I think a refactoring where '/' doesn't appear in ._parts would be a good idea if we can get past Chesterton's Fence and determine that this was indeed not a deliberate design decision (or at least one whose concerns no longer apply), but at least in the short term, I agree that transforming negative indexes into positive indices is the right, expedient thing to do.

We'll definitely want to make sure that we're careful about bad indices (and add relevant tests), though, since it would be easy to get weird behavior where too-large negative indexes start "wrapping around" (e.g. p.parents[-4] with len(p._parents) == 3 → p.parents[-1]).
msg403497 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2021-10-08 19:45
"We'll definitely want to make sure that we're careful about bad indices ... since it would be easy to get weird behavior where too-large negative indexes start 'wrapping around'"

When I noticed the problem, I originally thought "Hey, the test for a negative index can come *before* the range check and save some work for negative indices". Then I realized, while composing this bug report, that that would make p.parents[-4] with len(p.parents) == 3 → p.parents[-1] as you said, and die with a RecursionError for p.parents[-3000] or so. I'm going to ignore the possibility I'm sleep-deprived and/or sloppy, and assume a lot of good programmers would think to make that "optimization" and accidentally introduce new bugs. :-) So yeah, all the tests.
msg403507 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2021-10-08 21:19
On the subject of sleep-deprived and/or sloppy, just realized:

return self.__getitem__(len(self) + idx)

should really just be:

idx += len(self)

no need to recurse.
History
Date User Action Args
2021-10-08 21:19:51josh.rsetmessages: + msg403507
2021-10-08 19:45:21josh.rsetmessages: + msg403497
2021-10-08 19:08:20p-gansslesettype: behavior
messages: + msg403493
2021-10-08 18:41:16p-gansslesetnosy: + p-ganssle
2021-10-08 18:11:18josh.rcreate