Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os.path.normcase() is inconsistent with Windows file system #86824

Closed
sogom mannequin opened this issue Dec 16, 2020 · 10 comments
Closed

os.path.normcase() is inconsistent with Windows file system #86824

sogom mannequin opened this issue Dec 16, 2020 · 10 comments
Labels
3.10 only security fixes 3.11 bug and security fixes 3.12 bugs and security fixes OS-windows stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@sogom
Copy link
Mannequin

sogom mannequin commented Dec 16, 2020

BPO 42658
Nosy @pfmoore, @tjguk, @ezio-melotti, @zware, @eryksun, @zooba, @aisk
PRs
  • bpo-42658: Using LCMapStringEx in ntpath.normcase #32010
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2020-12-16.12:44:26.513>
    labels = ['type-bug', '3.8', 'OS-windows', '3.10', 'library', 'expert-unicode', '3.9']
    title = 'os.path.normcase() is inconsistent with Windows file system'
    updated_at = <Date 2022-03-20.15:47:09.896>
    user = 'https://bugs.python.org/sogom'

    bugs.python.org fields:

    activity = <Date 2022-03-20.15:47:09.896>
    actor = 'asaka'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'Unicode', 'Windows']
    creation = <Date 2020-12-16.12:44:26.513>
    creator = 'sogom'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 42658
    keywords = ['patch']
    message_count = 2.0
    messages = ['383163', '384012']
    nosy_count = 8.0
    nosy_names = ['paul.moore', 'tim.golden', 'ezio.melotti', 'zach.ware', 'eryksun', 'steve.dower', 'asaka', 'sogom']
    pr_nums = ['32010']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue42658'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @sogom
    Copy link
    Mannequin Author

    sogom mannequin commented Dec 16, 2020

    On Windows file system, U+03A9 (Greek capital letter Omega) and U+2126 (Ohm sign) are distinguished. In fact, two distinct files "\u03A9.txt" and "\u2126.txt" can exist side by side in the same folder. But os.path.normcase() transforms both U+03A9 and U+2126 to U+03C9 (Greek small letter omega).

    MSDN reads they use CompareStringOrdinal() to compare NTFS file names: https://docs.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications#sort-strings-ordinally . This document also says "the function maps case using the operating system *uppercasing* table." But I made an experiment and found that at least in the Basic Multilingual Plane, "lowercase two strings by means of LCMapStringEx() and then wcscmp the two" always gives the same result as "compare the two strings with CompareStringOrdinal()". Though this fact is not explicitly mentioned in MSDN https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringex , the description of LCMAP_LINGUISTIC_CASING in this page implies that casing rules conform to file system's unless LCMAP_LINGUISTIC_CASING is used.

    Therefore, I believe that os.path.normcase() should probably call LCMapStringEx(), with the first argument LOCALE_NAME_INVARIANT and the second argument LCMAP_LOWERCASE.

    @sogom sogom mannequin added 3.9 only security fixes OS-windows type-bug An unexpected behavior, bug, or error labels Dec 16, 2020
    @eryksun
    Copy link
    Contributor

    eryksun commented Dec 29, 2020

    "lowercase two strings by means of LCMapStringEx() and then wcscmp
    the two" always gives the same result as "compare the two strings
    with CompareStringOrdinal()"

    For checking case-insensitive equality, it shouldn't matter whether names are converted to uppercase or lowercase when using invariant non-linguistic casing. It's based on symmetric mappings between pairs of uppercase and lowercase codes, which avoids problems such as 'ϴ' (U+03F4) and 'Θ' (U+0398) both lowercasing as 'θ' (U+03B8), or 'ß' uppercasing as 'SS'.

    That said, when sorting filenames, you need to use LCMAP_UPPERCASE in order to match the case-insensitive sort order of Windows. For example, 'Ÿ' (U+0178) is greater than 'Ŷ' (U+0176), but -- respectively lowercase -- 'ÿ' (U+00FF) is less than 'ŷ' (U+0177). In particular, if you have an NTFS directory with two files named 'ÿ' and 'ŷ', the listing will be ['ŷ', 'ÿ'] -- in uppercase order. (An NTFS directory is stored on disk as a b-tree sorted by uppercase filenames.)

    For the implementation, _winapi.LCMapStringEx and related constants could be added.

    @eryksun eryksun added stdlib Python modules in the Lib dir topic-unicode 3.8 only security fixes 3.10 only security fixes labels Mar 9, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @zooba zooba added 3.11 bug and security fixes 3.12 bugs and security fixes and removed 3.9 only security fixes 3.8 only security fixes labels Jun 6, 2022
    @zooba
    Copy link
    Member

    zooba commented Jun 6, 2022

    Fix is committed for 3.12, but needs some help backporting. I can't get to it today, so if anyone wants to do it, go ahead. I'll try and get to it later this week.

    @zooba
    Copy link
    Member

    zooba commented Jun 8, 2022

    I started backporting to 3.11, but the memory leak is back in the test_embed tests. I don't see where it could be, and it doesn't occur in 3.12, so we need to fix it before backporting.

    @aisk
    Copy link
    Member

    aisk commented Jun 9, 2022

    I started backporting to 3.11, but the memory leak is back in the test_embed tests. I don't see where it could be, and it doesn't occur in 3.12, so we need to fix it before backporting.

    The leak is caused by calling PyUnicode_AsUnicodeAndSize with empty string. Since PyUnicode_AsUnicodeAndSize is removed in 3.12, should we fix it in former versions?

    @zooba
    Copy link
    Member

    zooba commented Jun 9, 2022

    Who's calling AsUnicodeAndSize? Is it inside one of the parameter converters? If so, we should definitely fix it in earlier versions.

    @zooba
    Copy link
    Member

    zooba commented Jun 9, 2022

    I updated #93591 to use PyUnicode_AsWideCharString instead, which avoids the leak. Presumably we're (at least potentially) leaking elsewhere, but it doesn't seem to have come up, so less urgent to track it down. I did look through, and it wasn't anything real obvious.

    @serhiy-storchaka
    Copy link
    Member

    The leak is caused by calling PyUnicode_AsUnicodeAndSize with empty string.

    It is very strange. I see nothing suspicious in the code. Does it only leak for empty strings? Functions in the os module use this converter to convert path arguments, so it should be easy to reproduce. Also it is used in winreg.SetValue() and indirectly in other functions which use "u" or "Z" format units in PyArg_Parse*() functions.

    @zooba
    Copy link
    Member

    zooba commented Aug 14, 2023

    I didn't see anything suspicious either, so maybe it's just a false positive in the test. Either way, the code is gone and AC just hasn't been updated yet for null-embedded strings, so it doesn't really matter.

    @serhiy-storchaka
    Copy link
    Member

    Closing?

    @zooba zooba closed this as completed Aug 15, 2023
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes 3.11 bug and security fixes 3.12 bugs and security fixes OS-windows stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants