Title: os.path.normcase() is inconsistent with Windows file system
Type: behavior Stage:
Components: Library (Lib), Unicode, Windows Versions: Python 3.10, Python 3.9, Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, ezio.melotti, paul.moore, sogom, steve.dower, tim.golden, zach.ware
Priority: normal Keywords:

Created on 2020-12-16 12:44 by sogom, last changed 2021-03-09 20:27 by vstinner.

Messages (2)
msg383163 - Author: (sogom) Date: 2020-12-16 12:44
On Windows file systems, U+03A9 (Greek capital letter Omega) and U+2126 (Ohm sign) are distinct characters. In fact, two separate files "\u03A9.txt" and "\u2126.txt" can exist side by side in the same folder. However, os.path.normcase() maps both U+03A9 and U+2126 to U+03C9 (Greek small letter omega), so the two names become indistinguishable after normalization.
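
The collision can be reproduced on any platform with plain str.lower(). This is only a minimal sketch: lower_normcase() is a hypothetical stand-in that assumes normcase() in the affected versions amounts to swapping slashes and calling str.lower().

```python
import unicodedata

OMEGA = "\u03A9"  # GREEK CAPITAL LETTER OMEGA
OHM = "\u2126"    # OHM SIGN


def lower_normcase(name):
    # Hypothetical stand-in (assumption): swap slashes, then str.lower().
    return name.replace("/", "\\").lower()


for ch in (OMEGA, OHM):
    mapped = lower_normcase(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(mapped):04X}")

# Two names that Windows treats as different files normalize identically.
print(lower_normcase(OMEGA + ".txt") == lower_normcase(OHM + ".txt"))
```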

According to MSDN, NTFS file names are compared with CompareStringOrdinal(). The same document also says "the function maps case using the operating system *uppercasing* table." However, I ran an experiment and found that, at least in the Basic Multilingual Plane, "lowercase two strings by means of LCMapStringEx() and then wcscmp the two" always gives the same result as "compare the two strings with CompareStringOrdinal()". Although MSDN does not state this explicitly, its description of LCMAP_LINGUISTIC_CASING implies that the casing rules match the file system's unless LCMAP_LINGUISTIC_CASING is used.

Therefore, I believe that os.path.normcase() should probably call LCMapStringEx(), passing LOCALE_NAME_INVARIANT as the first argument and LCMAP_LOWERCASE as the second.
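
A rough ctypes sketch of that call, for experimentation only. invariant_lower() is a hypothetical helper, not an existing os.path or _winapi function, and the str.lower() fallback off Windows is an assumption added purely so the snippet runs elsewhere; it does not reproduce the Windows casing table.

```python
import sys

# Values from the Windows SDK headers.
LOCALE_NAME_INVARIANT = ""    # the invariant locale is named by an empty string
LCMAP_LOWERCASE = 0x00000100


def invariant_lower(s):
    """Lowercase s with LCMapStringEx() on Windows (hypothetical helper)."""
    if sys.platform != "win32":
        # Assumption for portability only: no LCMapStringEx() here.
        return s.lower()

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    lcmap = kernel32.LCMapStringEx
    lcmap.restype = ctypes.c_int
    lcmap.argtypes = [
        wintypes.LPCWSTR,  # lpLocaleName
        wintypes.DWORD,    # dwMapFlags
        wintypes.LPCWSTR,  # lpSrcStr
        ctypes.c_int,      # cchSrc
        wintypes.LPWSTR,   # lpDestStr
        ctypes.c_int,      # cchDest
        ctypes.c_void_p,   # lpVersionInformation (NULL)
        ctypes.c_void_p,   # lpReserved (NULL)
        wintypes.LPARAM,   # sortHandle (0)
    ]

    # Size the buffer in UTF-16 code units, plus one for the terminating null.
    size = len(s.encode("utf-16-le", "surrogatepass")) // 2 + 1
    buf = ctypes.create_unicode_buffer(size)
    # cchSrc = -1 means the source is null-terminated; the null is copied too.
    if not lcmap(LOCALE_NAME_INVARIANT, LCMAP_LOWERCASE, s, -1,
                 buf, size, None, None, 0):
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.value


print(invariant_lower("README.TXT"))
```

On Windows, comparing invariant_lower("\u03A9") with invariant_lower("\u2126") should show whether the two characters stay distinct under this mapping.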
msg384012 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-12-29 15:48
> "lowercase two strings by means of LCMapStringEx() and then wcscmp
> the two" always gives the same result as "compare the two strings 
> with CompareStringOrdinal()"

For checking case-insensitive equality, it shouldn't matter whether names are converted to uppercase or lowercase when using invariant non-linguistic casing. It's based on symmetric mappings between pairs of uppercase and lowercase codes, which avoids problems such as 'ϴ' (U+03F4) and 'Θ' (U+0398) both lowercasing as 'θ' (U+03B8), or 'ß' uppercasing as 'SS'.
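
For comparison, Python's own str.lower() and str.upper() use Unicode casing rather than the symmetric Windows mapping, so both problems mentioned above can be seen directly. A small illustration (the variable names are mine):

```python
THETA_SYMBOL = "\u03F4"  # 'ϴ' GREEK CAPITAL THETA SYMBOL
CAPITAL_THETA = "\u0398" # 'Θ' GREEK CAPITAL LETTER THETA
SHARP_S = "\u00DF"       # 'ß' LATIN SMALL LETTER SHARP S

# Two different uppercase letters collapse to the same lowercase letter.
print(THETA_SYMBOL.lower() == CAPITAL_THETA.lower())

# The mapping is not symmetric: lowercasing then uppercasing U+03F4
# yields U+0398, not the original character.
round_trip = THETA_SYMBOL.lower().upper()
print(f"U+{ord(round_trip):04X}")

# Uppercasing can also change the length of the string.
print(SHARP_S.upper())
```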

That said, when sorting filenames, you need to use LCMAP_UPPERCASE in order to match the case-insensitive sort order of Windows. For example, 'Ÿ' (U+0178) is greater than 'Ŷ' (U+0176), but their respective lowercase forms compare the other way: 'ÿ' (U+00FF) is less than 'ŷ' (U+0177). In particular, if you have an NTFS directory with two files named 'ÿ' and 'ŷ', the listing will be ['ŷ', 'ÿ'], i.e. in uppercase order. (An NTFS directory is stored on disk as a b-tree sorted by uppercase filenames.)
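
The ordering difference can be sketched with sorted(). Here str.upper() and str.lower() are only stand-ins for the Windows casing tables (an assumption that holds for these two characters, not in general):

```python
names = ["\u00FF", "\u0177"]  # 'ÿ' and 'ŷ'

# Keys are 'Ÿ' (U+0178) and 'Ŷ' (U+0176): 'ŷ' sorts first.
by_upper = sorted(names, key=str.upper)

# Keys are 'ÿ' (U+00FF) and 'ŷ' (U+0177): 'ÿ' sorts first.
by_lower = sorted(names, key=str.lower)

print(by_upper)  # matches the NTFS listing order described above
print(by_lower)  # the opposite order
```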

For the implementation, _winapi.LCMapStringEx and related constants could be added.
History
Date                 User      Action  Args
2021-03-09 20:27:48  vstinner  set     nosy: - vstinner
2021-03-09 15:06:51  eryksun   link    issue43397 superseder
2021-03-09 15:02:02  eryksun   set     nosy: + ezio.melotti, vstinner; components: + Library (Lib), Unicode; versions: + Python 3.8, Python 3.10
2020-12-29 15:48:41  eryksun   set     nosy: + eryksun; messages: + msg384012
2020-12-16 12:44:26  sogom     create