This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: pathlib.(Pure)WindowsPaths can compare equal but refer to different files
Type: security Stage:
Components: Library (Lib), Windows Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benrg, paul.moore, pitrou, r.david.murray, steve.dower, tim.golden, zach.ware
Priority: normal Keywords:

Created on 2018-01-21 21:04 by benrg, last changed 2022-04-11 14:58 by admin.

Messages (9)
msg310384 - (view) Author: (benrg) Date: 2018-01-21 21:04
(Pure)WindowsPath uses str.lower to fold paths for comparison and hashing. This doesn't match the case folding of actual Windows file systems. There exist WindowsPath objects that compare and hash equal, but refer to different files. For example, the strings

  '\xdf' (sharp S) and '\u1e9e' (capital sharp S)
  '\u01c7' (LJ) and '\u01c8' (Lj)
  '\u0130' (I with dot) and 'i\u0307' (i followed by combining dot)
  'K' and '\u212a' (Kelvin sign)

are equal under str.lower folding but are distinct file names on NTFS volumes on my Windows 7 machine. There are hundreds of other such pairs.

I think this is very bad. The reverse (paths that compare unequal but refer to the same file) is probably unavoidable and is expected by programmers. But paths that compare equal should never be unequal to the OS.

How to fix this:

Unfortunately, there is no correct way to case fold Windows paths. The FAT, NTFS, and exFAT drivers on my machine all have different behavior. (The examples above work on all three, except for 'K' and '\u212a', which are equivalent on FAT volumes.) NTFS stores its case-folding map on each volume in the hidden $UpCase file, so even different NTFS volumes on the same machine can have different behavior. The contents of $UpCase have changed over time as Windows is updated to support new Unicode versions. NTFS and NFS (and possibly WebDAV) also support full case sensitivity when used with Interix/SUA and Cygwin, though this requires disabling system-wide case insensitivity via the registry.

I think that pathlib should either give up on case folding entirely, or should fold very conservatively, treating WCHARs as equivalent only if they're equivalent on all standard file systems on all supported Windows versions.

If pathlib folds case at all, there should be a solution for people who need to interoperate with Cygwin or SUA tools on a case-sensitive machine, but I suppose they can just use PosixPath.
msg310477 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-01-23 07:27
Arguably, a WindowsPath instance only represents the *path* and not the file located by the path. So the programmer has to take just as much responsibility as if they were using plain strings, except there are some conveniences added to make those strings easier to manage.

I don't see anything in the docs suggesting that a Path instances have file-identity (that is, two Path objects are always equal if they refer to the same file). Even the "resolve()" method doesn't pretend to get the exact filename.

The best option for true comparison is to use stat() and compare st_inode and st_dev. Lacking that, doing a good enough job of case folding is better than ignoring it, as most workarounds are likely to do a worse job (e.g., they won't even correct for "A"=="B\..\A", let alone casing).

If you have a specific suggestion for how comparison could be improved here without having to go to kernel mode, feel free to make it. Unfortunately, your two suggestions here are not workable.

(And yes, PurePosixPath and "import posixpath" are the right way to handle case-sensitive paths explicitly, which is why they're available on all platforms.)
msg310479 - (view) Author: (benrg) Date: 2018-01-23 07:55
This bug is about paths that compare *equal*, but refer to *different* files. I agree that the opposite is not much of a problem (and I said so in the original comment).

The reason I classified this as a security bug is that Python scripts using pathlib on Windows could be vulnerable in certain cases to an attacker that can choose file names. For example, the order in which paths are added to a set or dict could affect which of two files is seen by the script. If different parts of the script add files in different orders - which would normally be safe - the result could be similar to a TOCTTOU race.

I don't disagree that "doing a good enough job of case folding is better than ignoring it." I just think that pathlib should not case-fold strings that Windows filesystems don't.
msg310480 - (view) Author: (benrg) Date: 2018-01-23 08:15
I don't know whether this clarifies it at all, but if x and y are Path objects, and x == y, I would expect also x.exists() == y.exists(), and x.read_bytes() == y.read_bytes(), and so on, unless there is a race condition. I think all programmers expect that if x == y, then they refer to the same file. This is not true currently.
msg310481 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-01-23 08:20
(FWIW, I don't think your "security" argument is going to be very convincing, as this problem has been around for far too long to be treated as suddenly urgent. But up to you.)

My fear is that if PureWindowsPath stops handling the >90% of cases it currently does (by making "Path"!="path"), then we'll see developers implement their own workarounds, most likely by replacing:

    my_set.add(p)

with:

    my_dict[str(p).lower()] = p

This doesn't improve anything, and actually regresses it as I pointed out, so removing case folding cannot be the answer.

That leaves finding a better way to obtain hashes and comparisons for PureWindowsPath instances (note that all the functionality that appears to be at issue is inherited by WindowsPath, so there's presumably nothing to fix there). My initial guess would be that some combination of locale and flags to LcMapStringEx will give us a better sort key than .lower(), though I don't know what that combination is. The $UpCase file is not accessible directly, so that isn't any help.

I have heard that .upper() is considered to be "more accurate" for NTFS than .lower(), but have no confirmation. Calling stat is not an option for PureWindowsPath, so I'm running out of ideas on how to improve it.

Perhaps the best answer is to add a note to the docs warning about this and suggesting not using pathlib if you are concerned? And document the stat() method, or maybe offer some way to integrate with filecmp?
msg310483 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-01-23 08:26
> I think all programmers expect that if x == y, then they refer to the same file. This is not true currently.

Perhaps, but you could equally say that they expect that if x != y then they refer to different files, which is also not true. So do we optimise for the false positive or the false negative? It really depends on what operation you are doing, and so it has to be left in the hands of the developer.

Currently, we have a cheap comparison that minimises both without preventing either. The only option for PureWindowsPath is to prevent false positives (== always means same file), at the cost of causing so many false negatives that I suspect most developers would simply roll their own equality comparison. That's not the status quo, and especially for a behaviour that has been shipping for a few years without significant problems, changing the status quo that dramatically requires a more compelling case than has been presented so far.

If there are ways we can reduce the false positives without increasing the false negatives, then let's do it. But pragmatically, we should balance both, and leave the "perfect" comparisons to other methods.
msg310484 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-01-23 08:30
Just tested something that I'd assumed and it turned out I was wrong:

>>> p1 = PureWindowsPath(r"C:\a\b\..\c")
>>> p2 = PureWindowsPath(r"C:\a\c")
>>> p1 == p2
False
>>> p1, p2
(PureWindowsPath('C:/a/b/../c'), PureWindowsPath('C:/a/c'))

So PureWindowsPath already doesn't collapse '..' elements (as they could change meaning in the presence of symlinks), so that's a point against the "what if everyone started doing str(p1).lower() == str(p2).lower() instead" argument.
msg310510 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-01-23 17:16
Maybe we could at least mention the issue (and perhaps link to samefile) in the "General Properties" section?
msg310512 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-01-23 17:18
Could WindowsPath (as opposed to PureWindowsPath) call samefile as part of the equality test?  (I'm not familiar enough with pathlib to know if that is a nonsense question.)
History
Date User Action Args
2022-04-11 14:58:56adminsetgithub: 76793
2018-01-23 17:18:35r.david.murraysetmessages: + msg310512
2018-01-23 17:16:40r.david.murraysetnosy: + r.david.murray
messages: + msg310510
2018-01-23 08:30:53steve.dowersetmessages: + msg310484
2018-01-23 08:27:00steve.dowersetmessages: + msg310483
2018-01-23 08:20:06steve.dowersetmessages: + msg310481
2018-01-23 08:15:07benrgsetmessages: + msg310480
2018-01-23 07:55:57benrgsetnosy: + pitrou
type: enhancement -> security
messages: + msg310479
2018-01-23 07:27:01steve.dowersettype: security -> enhancement
messages: + msg310477
versions: + Python 3.8, - Python 3.4, Python 3.5, Python 3.6
2018-01-21 21:04:41benrgcreate