
classification
Title: os.path.normcase() is inconsistent with Windows file system
Type: behavior
Stage: patch review
Components: Library (Lib), Unicode, Windows
Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: asaka, eryksun, ezio.melotti, paul.moore, sogom, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords: patch

Created on 2020-12-16 12:44 by sogom, last changed 2022-04-11 14:59 by admin.

Pull Requests
PR 32010 (status: open), linked by asaka on 2022-03-20 15:47
Messages (2)
msg383163 - Author: (sogom) Date: 2020-12-16 12:44
On the Windows file system, U+03A9 (Greek capital letter Omega) and U+2126 (Ohm sign) are treated as different characters. In fact, two distinct files "\u03A9.txt" and "\u2126.txt" can exist side by side in the same folder. However, os.path.normcase() maps both U+03A9 and U+2126 to U+03C9 (Greek small letter omega), so the two names become indistinguishable after normalization.
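The collision can be reproduced on any platform with a few lines of Python. This is only a minimal sketch: it assumes that str.lower() is a fair stand-in for the lowercasing step that os.path.normcase() performed on Windows in the Python versions listed above, and it calls str.lower() directly so that it does not depend on the platform or on later changes to normcase().

```python
# Two names that, per the report, can coexist in one Windows folder.
omega_name = "\u03a9.txt"  # GREEK CAPITAL LETTER OMEGA
ohm_name = "\u2126.txt"    # OHM SIGN

# Assumption: str.lower() stands in for the lowercasing done by normcase().
folded_omega = omega_name.lower()
folded_ohm = ohm_name.lower()

print("names equal before folding:", omega_name == ohm_name)
print("names equal after folding: ", folded_omega == folded_ohm)
```

If the two folded names compare equal, any code that uses the folded form as a dictionary key or set member will conflate the two files.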

According to MSDN, CompareStringOrdinal() is used to compare NTFS file names: https://docs.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications#sort-strings-ordinally . That document also says "the function maps case using the operating system *uppercasing* table." However, I ran an experiment and found that, at least in the Basic Multilingual Plane, "lowercase two strings by means of LCMapStringEx() and then wcscmp the two" always gives the same result as "compare the two strings with CompareStringOrdinal()". Although MSDN does not state this explicitly ( https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringex ), the description of LCMAP_LINGUISTIC_CASING on that page implies that the casing rules match the file system's unless LCMAP_LINGUISTIC_CASING is used.

Therefore, I believe that os.path.normcase() should probably call LCMapStringEx(), with the first argument LOCALE_NAME_INVARIANT and the second argument LCMAP_LOWERCASE.
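For illustration, the proposed call could be prototyped from pure Python with ctypes. This is a rough sketch, not the patch under review: the helper name invariant_lower is hypothetical, the constant values are taken from the Windows SDK headers as I remember them, and the function deliberately raises OSError on non-Windows platforms.

```python
import ctypes
import sys

LOCALE_NAME_INVARIANT = ""    # the invariant locale is named by an empty string
LCMAP_LOWERCASE = 0x00000100  # assumed value from the Windows SDK headers


def invariant_lower(s):
    """Lowercase s via LCMapStringEx (hypothetical helper, Windows only)."""
    if sys.platform != "win32":
        raise OSError("LCMapStringEx is only available on Windows")
    if not s:
        return s  # the API does not accept a zero-length source string

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    lcmap = kernel32.LCMapStringEx
    lcmap.argtypes = [
        ctypes.c_wchar_p,  # lpLocaleName
        ctypes.c_uint32,   # dwMapFlags
        ctypes.c_wchar_p,  # lpSrcStr
        ctypes.c_int,      # cchSrc
        ctypes.c_wchar_p,  # lpDestStr
        ctypes.c_int,      # cchDest
        ctypes.c_void_p,   # lpVersionInformation
        ctypes.c_void_p,   # lpReserved
        ctypes.c_ssize_t,  # sortHandle
    ]
    lcmap.restype = ctypes.c_int

    # Lengths are counted in UTF-16 code units, not code points.
    n_src = len(s.encode("utf-16-le", "surrogatepass")) // 2

    # First call: ask for the required destination size.
    n_dest = lcmap(LOCALE_NAME_INVARIANT, LCMAP_LOWERCASE, s, n_src,
                   None, 0, None, None, 0)
    if n_dest == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Second call: perform the mapping into a buffer of that size.
    dest = ctypes.create_unicode_buffer(n_dest)
    written = lcmap(LOCALE_NAME_INVARIANT, LCMAP_LOWERCASE, s, n_src,
                    dest, n_dest, None, None, 0)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return dest[:written]
```

On Windows, comparing invariant_lower("\u03a9") with invariant_lower("\u2126") should show whether the two characters stay distinct, which is the behavior the experiment above describes.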
msg384012 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-12-29 15:48
> "lowercase two strings by means of LCMapStringEx() and then wcscmp
> the two" always gives the same result as "compare the two strings 
> with CompareStringOrdinal()"

For checking case-insensitive equality, it shouldn't matter whether names are converted to uppercase or lowercase when using invariant non-linguistic casing. It's based on symmetric mappings between pairs of uppercase and lowercase codes, which avoids problems such as 'ϴ' (U+03F4) and 'Θ' (U+0398) both lowercasing as 'θ' (U+03B8), or 'ß' uppercasing as 'SS'.
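The asymmetric cases mentioned above can be seen with Python's own str.lower() and str.upper(). Note the assumption: these methods follow Python's Unicode case mappings, not the Windows invariant casing table, so the snippet only illustrates the kind of problem that symmetric pair mappings avoid.

```python
theta_symbol = "\u03f4"  # GREEK CAPITAL THETA SYMBOL
theta = "\u0398"         # GREEK CAPITAL LETTER THETA
sharp_s = "\u00df"       # LATIN SMALL LETTER SHARP S

# Two different capitals share one lowercase form.
lowered_symbol = theta_symbol.lower()
lowered_theta = theta.lower()
print("same lowercase form:", lowered_symbol == lowered_theta)

# Round-tripping through lowercase does not recover the original capital.
round_trip = lowered_symbol.upper()
print("round trip preserved:", round_trip == theta_symbol)

# Uppercasing can change the length of the string.
upper_sharp_s = sharp_s.upper()
print("length before/after upper():", len(sharp_s), len(upper_sharp_s))
```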

That said, when sorting filenames, you need to use LCMAP_UPPERCASE in order to match the case-insensitive sort order of Windows. For example, 'Ÿ' (U+0178) is greater than 'Ŷ' (U+0176), but their respective lowercase forms compare the other way: 'ÿ' (U+00FF) is less than 'ŷ' (U+0177). In particular, if you have an NTFS directory with two files named 'ÿ' and 'ŷ', the listing will be ['ŷ', 'ÿ'], i.e. in uppercase order. (An NTFS directory is stored on disk as a b-tree sorted by uppercase filenames.)
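The ordering difference is easy to check by sorting the two names with each key. As an assumption, str.upper() and str.lower() are used here in place of LCMAP_UPPERCASE and LCMAP_LOWERCASE; they agree for these two characters, but they are not the Windows tables.

```python
names = ["\u00ff", "\u0177"]  # 'ÿ' (U+00FF) and 'ŷ' (U+0177)

# Sorting by the lowercase form keeps U+00FF first.
by_lower = sorted(names, key=str.lower)

# Sorting by the uppercase form compares U+0178 with U+0176,
# so 'ŷ' comes first, as in the directory listing described above.
by_upper = sorted(names, key=str.upper)

print("by lowercase:", by_lower)
print("by uppercase:", by_upper)
```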

For the implementation, _winapi.LCMapStringEx and related constants could be added.
History
Date User Action Args
2022-04-11 14:59:39 admin set github: 86824
2022-03-20 15:47:09 asaka set keywords: + patch; nosy: + asaka; pull_requests: + pull_request30098; stage: patch review
2021-03-09 20:27:48 vstinner set nosy: - vstinner
2021-03-09 15:06:51 eryksun link issue43397 superseder
2021-03-09 15:02:02 eryksun set nosy: + ezio.melotti, vstinner; components: + Library (Lib), Unicode; versions: + Python 3.8, Python 3.10
2020-12-29 15:48:41 eryksun set nosy: + eryksun; messages: + msg384012
2020-12-16 12:44:26 sogom create