Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames #54029

Closed
vstinner opened this issue Sep 10, 2010 · 17 comments
Closed
Labels
OS-windows stdlib Python modules in the Lib dir topic-unicode

Comments

@vstinner
Copy link
Member

BPO 9820
Nosy @loewis, @vstinner, @bitdancer
Files
  • listdir_windows_bytes.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-13.19:11:21.703>
    created_at = <Date 2010-09-10.10:42:58.310>
    labels = ['invalid', 'library', 'expert-unicode', 'OS-windows']
    title = "Windows : os.listdir(b'.') doesn't raise an\terrorfor unencodablefilenames"
    updated_at = <Date 2010-09-13.19:11:21.702>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-09-13.19:11:21.702>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-13.19:11:21.703>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Unicode', 'Windows']
    creation = <Date 2010-09-10.10:42:58.310>
    creator = 'vstinner'
    dependencies = []
    files = ['18836']
    hgrepos = []
    issue_num = 9820
    keywords = ['patch']
    message_count = 17.0
    messages = ['115995', '115996', '116002', '116043', '116209', '116210', '116220', '116228', '116230', '116232', '116233', '116235', '116295', '116296', '116299', '116300', '116340']
    nosy_count = 3.0
    nosy_names = ['loewis', 'vstinner', 'r.david.murray']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue9820'
    versions = ['Python 3.2']

    @vstinner
    Copy link
    Member Author

    In Python 3.2, mbcs encoding (default filesystem encoding on Windows) is now strict: raise an error on unencodable/undecodable characters/bytes. But os.listdir(b'.') encodes unencodable bytes as b'?'.

    Example:

    >>> os.mkdir('listdir')
    >>> open('listdir\\xxx-\u0363', 'w').close()
    >>> filename = os.listdir(b'listdir')[0]
    >>> filename
    b'xxx-?'
    >>> open(filename, 'r').close()
    IOError: [Errno 22] Invalid argument: 'xxx-?'

    os.listdir(b'listdir') should raise an error (and not ignore the filename or replaces unencodable characters by b'?').

    I think that we should list the directory using the wide character API (FindFirstFileW) but encode the filename using PyUnicode_EncodeFSDefault() if the directory name type is bytes, instead of using the ANSI API (FindFirstFileA).

    @vstinner vstinner added stdlib Python modules in the Lib dir topic-unicode OS-windows labels Sep 10, 2010
    @vstinner
    Copy link
    Member Author

    I found this bug while trying to find an unencodable filename for bpo-9819 (TESTFN_UNDECODABLE).

    Anyway, the bytes API should be avoided on Windows since Windows native filename type is unicode.

    @vstinner
    Copy link
    Member Author

    os.listdir(b'listdir') should raise an error (and not ignore
    the filename or replaces unencodable characters by b'?').

    To avoid the error, a solution is to support the PEP-383 on Windows (for the mbcs encoding). I opened a separated issue for that: bpo-9821.

    But support PEP-383 will not fix this issue because the current implementation of listdir(b'.') doesn't use the Python codec, but use raw bytes filenames (use the ANSI API).

    @vstinner
    Copy link
    Member Author

    Patch:

    • Remove the bytes version of listdir(): reuse the unicode version but converts the filename to bytes using PyUnicode_EncodeFSDefault() if the directory name is not unicode
    • use Py_XDECREF(d) instead of Py_DECREF(d) at the end (because d=NULL on error)
    • use Py_CLEAR(d) instead of "Py_DECREF(d); d=NULL;"
    • remove "char namebuf[MAX_PATH+5]" buffer (use less stack memory)

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Sep 12, 2010

    What do you gain with this patch? (i.e. what is its advantage?)

    @vstinner
    Copy link
    Member Author

    What do you gain with this patch? (i.e. what is its advantage?)

    You know directly that os.listdir(bytes) is unable to encode the filename, instead of manipulate an invalid filename (b'?') and get the error later (when you use the filename: open, copy, delete, ... the file).

    It's the same idea than str+bytes raises an error on Python3: get the error earlier instead of store invalid data and get the error to late.

    Anywy, on Windows, it's not a good idea to manipulate bytes filenames. So it's also a way to encourage people to migrate their applications to unicode on Windows.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Sep 12, 2010

    You know directly that os.listdir(bytes) is unable to encode the
    filename, instead of manipulate an invalid filename (b'?') and get
    the error later (when you use the filename: open, copy, delete, ...
    the file).

    Ok. Then I'm -1 on the patch: you can't know whether the application
    actually wants to open the file. Perhaps it only wants to display the
    file names, or perhaps it only wants to open some of the files, or
    only traverse into subdirectories.

    For backwards compatibility, I recommend to leave things as they are.
    FindFirst/NextFileA will also do some other interesting conversions,
    such as the best-fit conversion (which the "mbcs" code doesn't do
    (anymore?)).

    Windows has explicit A and W versions, and Python has explicit A
    and W types, so it's IMO best to pair them in the natural way
    (even if that means code duplication).

    Anywy, on Windows, it's not a good idea to manipulate bytes
    filenames. So it's also a way to encourage people to migrate their
    applications to unicode on Windows.

    Only if people run into the issue (which few people will). People
    which *do* run into the issue will likely get an error either
    way, which will teach them their lesson :-)

    @loewis loewis mannequin changed the title Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames Sep 12, 2010
    @vstinner
    Copy link
    Member Author

    FindFirst/NextFileA will also do some other interesting conversions,
    such as the best-fit conversion (which the "mbcs" code doesn't do
    (anymore?)).

    About mbcs, mbcs codec of Python 3.1 is like .encode('mbcs', 'replace') and
    .decode('mbcs', 'ignore') of Python 3.2 (see issue bpo-850997). By default
    (strict error handler), it now raises errors on undecodable byte sequence and
    unencodable character, whereas Python 3.1 just ignores the error handler.

    PyUnicode_EncodeFSDefault / PyUnicode_DecodeFSDefault uses the strict error
    handler.

    I just added a note about mbcs in Doc/whatsnew/3.2.rst: r84750.

    @vstinner vstinner changed the title Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames Sep 12, 2010
    @vstinner
    Copy link
    Member Author

    It remembers me the discussion of the issue bpo-3187. About unencodable filenames,
    Guido proposed to ignore them or to use errors="replace", and wrote "Failing
    the entire os.listdir() call is not acceptable". (... long discussion ...) And
    finally, os.listdir() ignored undecodable filenames on UNIX/BSD.

    Then you introduced the genious PEP-383 (utf8b then renamed surrogateescape)
    and os.listdir() now raises an error if the PyUnicode_FromEncodedObject(v,
    Py_FileSystemDefaultEncoding, "surrogateescape") fails... which doesn't occur
    because of undecodable byte sequence, but for other reasons like a memory
    error.

    About Windows, os.listdir(str) never fails, but my question is about
    os.listdir(bytes). Should os.listdir(bytes) returns invalid filenames (encoded
    with "mbcs+replace", filenames not usable to open, rename or delete the file) or
    just ignore them?

    Ok. Then I'm -1 on the patch: you can't know whether the application
    actually wants to open the file. Perhaps it only wants to display the
    file names, or perhaps it only wants to open some of the files, or
    only traverse into subdirectories.

    For backwards compatibility, I recommend to leave things as they are.
    FindFirst/NextFileA will also do some other interesting conversions,
    such as the best-fit conversion (which the "mbcs" code doesn't do
    (anymore?)).

    "it only wants to open some of the files" is the typical reason for which I
    hate Python2 and its implicit conversion between bytes and characters: it
    works in most cases, but it fails "sometimes". The problem is to define (and
    explain) "sometimes".

    The typical use case of listing a directory is a file chooser. On Windows using
    the bytes API, it works in most cases, but it fails if the user picks the
    "wrong" file (name with "?"). That's the problem I would like to address.

    --

    Ignore unencodable filenames solution is compatible with the "traverse into
    subdirectories" case. And it does also keep backward compatibility (except
    that unencodable files are hidden, which is a least problem I think).

    --

    I proposed to raise an error on unencodable filename. I changed my mind after
    reading your answer and the discussion on bpo-3187. My patch breaks compatibility
    and users don't bother to unencodable filenames. Eg. glob("*.mp3") should not
    fail if the directory contains a temporary unencodable filename ("xxx.tmp").

    @vstinner
    Copy link
    Member Author

    FindFirst/NextFileA will also do some other interesting conversions,
    such as the best-fit conversion (which the "mbcs" code doesn't do
    (anymore?)).

    If we choose to keep this behaviour, I will have to revert my commit on mbcs
    codec to be consistent with os.listdir(). Or at least patch
    PyUnicode_EncodeFSDefault and os.fsencode() (use replace error handler) and
    PyUnicode_DecodeFSDefault and os.fsdecode() (use igrore error handler).

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Sep 12, 2010

    About Windows, os.listdir(str) never fails, but my question is about
    os.listdir(bytes). Should os.listdir(bytes) returns invalid filenames (encoded
    with "mbcs+replace", filenames not usable to open, rename or delete the file) or
    just ignore them?

    I see nothing wrong with returning incorrect file names.

    "it only wants to open some of the files" is the typical reason for which I
    hate Python2 and its implicit conversion between bytes and characters: it
    works in most cases, but it fails "sometimes". The problem is to define (and
    explain) "sometimes".

    Notice that this doesn't change with the patch. It will *still* work
    sometimes, and fail sometimes. In fact, for most users and most
    applications, it will never fail - *even with your patch applied*.

    Ignore unencodable filenames solution is compatible with the "traverse into
    subdirectories" case. And it does also keep backward compatibility (except
    that unencodable files are hidden, which is a least problem I think).

    I fail to see why removing incorrect file names from the result list is
    any better than keeping them. The result list will be incorrect either way.

    In one case (files skipped), the user will not see the file in the
    selection dialog, even though he knows its there and explorer shows it
    just fine. So he thinks there must be a bug.

    In the other case, it displays a non-sensical file name. Again, the user
    thinks there is a bug - plus if you click on the file, you get some
    error message (hopefully, the application will catch the exception -
    the directory may also have changed in-between, so a missing file
    error must be recovered from).

    So it's a user-visible bug in either case, but if the incorrect file
    name is included, it's slightly more obvious that something is wrong.

    @loewis loewis mannequin changed the title Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames Sep 12, 2010
    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Sep 12, 2010

    If we choose to keep this behaviour, I will have to revert my commit on mbcs
    codec to be consistent with os.listdir(). Or at least patch
    PyUnicode_EncodeFSDefault and os.fsencode() (use replace error handler) and
    PyUnicode_DecodeFSDefault and os.fsdecode() (use igrore error handler).

    I think trying to emulate, in Python, what the *A functions do is
    futile. IIUC, disables WC_NO_BEST_FIT_CHARS, and may do other stuff
    which apparently is undocumented.

    However, I fail to see the relationship to this issue. Having the MBCS
    codec support strict mode is a good thing.

    @vstinner
    Copy link
    Member Author

    I fail to see why removing incorrect file names from the result
    list is any better than keeping them. The result list will
    be incorrect either way.

    It depends if you focus on displaying the content of the directory, or on processing files and directories. If you focus on display, yes, missing files an be seen as a bug. But if you walk into directories (use cases: os.walk(), replace a text pattern in all files (~os.glob), ...), and the function raises an error (because a directory or a file name is invalid) is worse. I mean the user have to rename all unencodable names, or the devfeloper have to patch its application to catch IOError and ignore the specific IOError(22).

    If listdir() ignores unencodable names, os.walk() doesn't fail, but it misses some subdirectories and files.

    --

    Another (worse?) idea: deny bytes path for os.listdir(), but I suppose that we will not like the idea ;-)

    @vstinner
    Copy link
    Member Author

    I think trying to emulate, in Python, what the *A functions
    do is futile.

    My problem is that some functions will use mbcs in strict mode (functions using PyUnicode_EncodeFSDefault): raise UnicodeEncodeError, and other will use mbcs in replace mode (functions using Windows functions in ANSI mode): raise IOError (or other error depending on the function). It's inconsistent. We should try to keep the same behaviour for all functions.

    Examples of functions using (indirectly) PyUnicode_EncodeFSDefault to encode unicode filenames: bz2.BZ2File() and _ssl.SSLContext.load_cert_chain().

    @bitdancer
    Copy link
    Member

    After the decision to ignore undecodable file names in os.listdir but before PEP-383 there was a long discussion on python-dev (in which I was a participant) about how horrible just ignoring the undecodable filenames was. This applies *especially* to the os.walk case, where some files would be mysteriously skipped and it wouldn't be obvious why.
    Or even obvious that they'd been skipped, in some cases. The biggest issue was that the developer would likely never see the problem since the bulk of developers don't encounter encoding issues, so it would be the poor end user who would be confronted with the mystery, with no clues as to the cause or solution.

    The conclusion of that particular thread was that Guido approved adding warning messages for filenames that were undecodable, but otherwise leaving os.listdir unchanged. Fortunately Martin came up with PEP-383, which solved the underlying problem in a better way.

    So, I don't think that skipping the undecodable names is good, unless you generate a warning. In that thread I started out advocating raising an error, but in this case as Martin points out that would be a backward compatibility issue. Returning the munged filenames and having the error show up when the broken filename is used seems OK to me, even if imperfect. When the user sees the problem, they report it to the developer as a bug, who hopefully changes his code to use strings.

    Adding warning messages would probably be useless at best and annoying at worst on Windows. Maybe we could add a pseudo deprecation warning (ie: aimed at developers, silent by default) that says "don't use listdir with bytes on windows"?

    @bitdancer
    Copy link
    Member

    But in the case of BZ2File and ssl.SSLContext.load_cert_chain(), isn't it the case that they are trying to open the files? So producing an early error about the decoding problem makes sense. Are there any functions other than listdir where the decoded filenames are not necessarily immediately used to manipulate the files?

    @vstinner
    Copy link
    Member Author

    • ignore unencodable filenames is not a good idea
    • raise an error on unencodable filenames breaks backward compatibility
    • I don't think that emit a warning will change anything

    Even if I don't like mbcs+replace (current behaviour of os.listdir(bytes)), I now agree that it is the less worst solution. I close this issue as invalid.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    OS-windows stdlib Python modules in the Lib dir topic-unicode
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants