under Windows, os.path.abspath returns non-ASCII bytes paths as question marks #57456

ubershmekel · 2011-10-22T23:45:24Z

BPO	13247
Nosy	@loewis, @terryjreedy, @atsuoishimoto, @vstinner, @tjguk, @zware, @serhiy-storchaka, @zooba
Files	os_mbcs.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2014-06-22.23:29:17.370>
created_at = <Date 2011-10-22.23:45:23.895>
labels = ['invalid', 'type-bug', 'library', 'OS-windows']
title = 'under Windows, os.path.abspath returns non-ASCII bytes paths as question marks'
updated_at = <Date 2014-06-22.23:29:17.369>
user = 'https://bugs.python.org/ubershmekel'

bugs.python.org fields:

activity = <Date 2014-06-22.23:29:17.369>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2014-06-22.23:29:17.370>
closer = 'vstinner'
components = ['Library (Lib)', 'Windows']
creation = <Date 2011-10-22.23:45:23.895>
creator = 'ubershmekel'
dependencies = []
files = ['23521']
hgrepos = []
issue_num = 13247
keywords = ['patch']
message_count = 23.0
messages = ['146204', '146222', '146322', '146396', '146397', '146403', '146405', '146406', '146407', '146418', '146422', '146424', '146427', '146428', '146451', '146462', '146465', '177612', '220006', '220157', '221203', '221226', '221325']
nosy_count = 11.0
nosy_names = ['loewis', 'terry.reedy', 'ishimoto', 'vstinner', 'tim.golden', 'ubershmekel', 'BreamoreBoy', 'python-dev', 'zach.ware', 'serhiy.storchaka', 'steve.dower']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue13247'
versions = ['Python 3.1', 'Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

ubershmekel · 2011-10-22T23:45:24Z

For Python 2:

    Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32
    >>> os.path.abspath('.')
    'C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????'
    >>> os.path.abspath(u'.')
    u'C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5'

For Python 3:

    Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
    >>> os.path.abspath('.')
    'C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5'
    >>> os.path.abspath(b'.')
    b'C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????'

The returned path with question marks is completely useless. It would be better for Python to raise an error than to return the question marks. Another option is to try to get the ASCII version of the path; I believe Windows has one.

vstinner · 2011-10-23T09:19:47Z

abspath() is implemented using nt._getfullpathname() which calls GetFullPathNameA().

> The returned path with question marks is completely useless.

Can you open the file using such filename? If no, I agree that the result is useless.

> It's better that python throw an error than return the question marks.

Python is currently a thin wrapper around the Windows API. Windows doesn't treat a filename containing question marks as an error.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa364963%28v=vs.85%29.aspx

Python could perhaps use GetFullPathNameW() and encode the filename manually using its strict MBCS codec. The MBCS codec has been strict since Python 3.2: it raises a UnicodeEncodeError if the string cannot be encoded.
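The two behaviours being compared here can be sketched in pure Python. This is only an illustration, not CPython's implementation: `cp1252` is an assumed stand-in for the ANSI code page so that the snippet runs on any platform (on Windows the relevant codec would be `mbcs`), and `encode_lossy` / `encode_strict` are hypothetical helper names.

```python
# Assumption: cp1252 stands in for the ANSI code page so the sketch runs on
# any platform; on Windows the relevant codec is "mbcs".
ANSI_CODEC = "cp1252"


def encode_lossy(path: str) -> bytes:
    """Mimic the ANSI API result: unencodable characters become b'?'."""
    return path.encode(ANSI_CODEC, errors="replace")


def encode_strict(path: str) -> bytes:
    """Raise UnicodeEncodeError instead of silently replacing characters."""
    return path.encode(ANSI_CODEC, errors="strict")


# Same directory name as in the report above.
hebrew_dir = "C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5"

print(encode_lossy(hebrew_dir))

try:
    encode_strict(hebrew_dir)
except UnicodeEncodeError as exc:
    print("strict encoding failed:", exc.reason)
```

With the lossy variant the caller gets a path that looks valid but cannot be opened; with the strict variant the failure surfaces at the point where the information is lost.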

ubershmekel · 2011-10-24T19:51:39Z

An example error with abspath and bytes input:

    >>> os.path.abspath('.')
    'C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5'
    >>> os.path.abspath(b'.')
    b'C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????'
    >>> os.listdir(os.path.abspath(b'.'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????/*.*'
    >>>

I couldn't follow the implementation; I got stuck because I couldn't locate the definition of os.getcwdb, so I couldn't join you for that part. Here's another possible solution:

    >>> win32api.GetFullPathName('.')
    'C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5'
    >>> win32api.GetShortPathName(win32api.GetFullPathName('.'))
    'C:\\Users\\yuv\\Desktop\\YUVDES~1\\5F30~1'

The short path is ASCII, but the problem is that not all Windows file systems have 8.3 filenames [1]. So I think your suggestion is the best solution.

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx#short_vs._long_names

vstinner · 2011-10-25T19:51:27Z

os.getcwdb() (GetCurrentDirectoryA) and os.listdir(bytes) (FindNextFileA & co) encode filenames using WideCharToMultiByte() in default mode (flags=0): unencodable characters are replaced by question marks. Such filenames cannot be used; for example, open() fails with OSError(22, "invalid argument: '?'").

The attached patch changes os.getcwdb() and os.listdir(bytes) to use the native Windows API (the wide character API) with the Python MBCS codec in strict mode (error handler "strict"), so that the user is notified directly that the filename cannot be encoded.

The patch only changes the behaviour for filenames that cannot be encoded to the ANSI code page; such filenames are rare.

vstinner · 2011-10-25T20:00:09Z

os_mbcs.patch adds _Py_EncodeCodePage() to encode directly wchar_t* filenames without having to create a temporary Unicode object.

The patch removes HAVE_MBCS because the MBCS codec is now always needed by posixmodule.c. Anyway, I don't see why MultiByteToWideChar() and WideCharToMultiByte() would not be available on Windows.

atsuoishimoto · 2011-10-25T23:32:10Z

-1 from me.

I hate to see Unicode exceptions here. It would be yet another source of mysterious Unicode exceptions. Programmers and users would be confused by the error message. If you make such characters an error, Python should raise an OSError or similar.
File names with '?' are fine for displaying information to users. Not all file names need to be usable for opening files.
I don't think filenames that cannot be encoded in the ANSI code page are rare enough to be ignored. I use the Japanese edition of Windows, but I sometimes receive files with Chinese or German names.

Or, in some cases, I have to change the code page with the 'chcp 437' command to run a console application made for an American environment. I seldom run such applications these days, though.

vstinner · 2011-10-26T00:12:43Z

On 26/10/2011 01:32, Atsuo Ishimoto wrote:

> I don't think filenames that cannot be encoded in the ANSI code page are rare enough to be ignored.

The issue is about being notified of encoding errors.
Currently, unencodable characters are silently replaced and you don't
know whether the filename is valid or not. If a UnicodeEncodeError is raised,
you will be notified, and so you have to fix the problem.

Anyway, you must use the Unicode API on Windows. If you use the Unicode
API, filenames are no longer encoded and code pages are no longer used, so
bye bye Unicode errors!

The Windows bytes API is just kept for backward compatibility. More
details in my email to python-dev:
http://mail.python.org/pipermail/python-dev/2011-October/114203.html
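As a rough illustration of the "use the Unicode API" advice, the sketch below passes `str` paths throughout, so no code page is involved on the Python side. The directory name is a made-up example, and the snippet assumes the temporary directory lives on a file system that can store that name.

```python
import os
import tempfile

# Hypothetical directory name, used only for this illustration.
name = "\u05e9\u05dc\u05d5\u05dd"

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, name))

    # str in, str out: the name comes back intact rather than as '????'.
    entries = os.listdir(tmp)
    full_path = os.path.abspath(os.path.join(tmp, entries[0]))

    # The returned path is directly usable for further calls.
    usable = os.path.isdir(full_path)

print(entries, usable)
```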

terryjreedy · 2011-10-26T00:47:38Z

The doc says "All functions accepting path or file names accept both bytes and string objects, and result in an object of the same type, if a path or file name is returned." It does that now (the encoding assumed or produced for bytes is not specified). It says nothing about raising exceptions in certain situations. So this is a feature change request, one that would likely break existing code.

Users can test for invalid returned paths with "'?' in returned_path", though I admit that the use of '?' as a glob, regex, and url special char makes it a bad choice of error char.
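A minimal sketch of such a check follows. `has_replacement_marks` is a hypothetical helper, not an existing `os` function. It skips the `\\?\` extended-length prefix, which legitimately contains a question mark, and it cannot distinguish a replaced character from any other stray `?`, so it is a heuristic only.

```python
# The Windows extended-length prefix \\?\ legitimately contains a '?'.
EXTENDED_PREFIX = b"\\\\?\\"


def has_replacement_marks(path: bytes) -> bool:
    """Heuristic: report whether a bytes path contains '?' characters.

    '?' is not allowed in Windows file names, so outside the extended-length
    prefix it most likely marks a character that could not be encoded.
    """
    if path.startswith(EXTENDED_PREFIX):
        path = path[len(EXTENDED_PREFIX):]
    return b"?" in path


print(has_replacement_marks(b"C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????"))
print(has_replacement_marks(b"\\\\?\\C:\\Users\\yuv\\Desktop"))
```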

atsuoishimoto · 2011-10-26T02:18:43Z

On Wed, Oct 26, 2011 at 9:12 AM, STINNER Victor <report@bugs.python.org> wrote:

> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> On 26/10/2011 01:32, Atsuo Ishimoto wrote:
> > - I don't think filenames that cannot be encoded in the ANSI code page are rare enough to be ignored.
>
> The issue is about being notified of encoding errors.

This patch solves nothing; it just raises an exception. It can break
existing code. Also, I don't think it is worth adding weird behavior to
the Python std lib. I'd be surprised if the *Byte* API raised a
UnicodeEncodeError.

> Anyway, you must use the Unicode API on Windows. If you use the Unicode
> API, filenames are no longer encoded and code pages are no longer used, so
> bye bye Unicode errors!

Agreed. So I would like to suggest not adding unnecessary
complexity to the Byte API.

ubershmekel · 2011-10-26T06:36:32Z

I use Python a lot with Hebrew, and many websites have internationalization, which may involve Unicode paths. I agree that saying "Unicode paths are rare" is inaccurate.

If the current situation isn't fixed, though, you just can't use the resulting path for almost anything. Do you have a use case, Ishimoto?

Windows XP and up implement paths as Unicode, which means that a bytes API doesn't even make sense unless Python does some encoding and decoding for you. E.g. Python can use the Unicode APIs internally and return UTF-8 encoded bytes. But you couldn't use these paths outside of Python. The fact is you shouldn't be doing os.path.abspath(b'.') on Windows to begin with.

ubershmekel · 2011-10-26T07:02:40Z

Another option btw is to use utf-16, which will work but it's a bit ugly as well:

    >>> os.listdir(os.path.abspath(u'.').encode('utf-16'))
    []
    >>> os.path.abspath(u'.')
    u'C:\\Users\\alon\\Desktop\\\u05e9\u05dc\u05d5\u05dd'
    >>> os.path.abspath(u'.').encode('utf-16')
    '\xff\xfeC\x00:\x00\\\x00U\x00s\x00e\x00r\x00s\x00\\\x00a\x00l\x00o\x00n\x00\\\x00D\x00e\x00s\x00k\x00t\x00o\x00p\x00\\\x00\xe9\x05\xdc\x05\xd5\x05\xdd\x05'
    >>> os.listdir(os.path.abspath(u'.').encode('utf-16'))
    []

Tested on python 2.7, but you know what I mean.

atsuoishimoto · 2011-10-26T08:01:33Z

On Wed, Oct 26, 2011 at 3:36 PM, Yuval Greenfield
<report@bugs.python.org> wrote:

> If the current situation isn't fixed though - you just can't use the resulting path for almost anything. Do you have a use case Ishimoto?

I don't have a use case. But does raising UnicodeEncodeError fix
the problem? It could break existing code, and I don't see much
difference from the WindowsError caused by the broken file names.

> The fact is you shouldn't be doing os.path.abspath(b'.') in windows to begin with.

Agreed. So I think adding a Windows-specific check to the Byte API does not
improve the situation, but increases the complexity of the std lib.

ubershmekel · 2011-10-26T09:08:30Z

It won't break existing code. Ignoring this problem here only moves the exception to whenever the data returned is first used.

Any code this fix "breaks" is already broken.

vstinner · 2011-10-26T10:10:27Z

> Yuval Greenfield <ubershmekel@gmail.com> added the comment:
> Another option btw is to use utf-16

UTF-8, UTF-16 or any encoding other than the ANSI code page is not an
option. The Windows bytes API expects filenames encoded to the ANSI code page.
os.listdir() would raise an error (unknown directory) or return an empty list
instead of the contents of the directory.
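To make the mismatch concrete, the sketch below compares the bytes a code-page API would expect with the UTF-16 bytes for the same ASCII-only path. `cp1252` is again an assumed stand-in for the ANSI code page, and `utf-16-le` is used instead of `utf-16` only to get a deterministic result without a byte order mark.

```python
# Assumption: cp1252 stands in for the ANSI code page ("mbcs" on Windows).
path = "C:\\Users\\alon\\Desktop"

ansi_bytes = path.encode("cp1252")
utf16_bytes = path.encode("utf-16-le")

print(ansi_bytes)
print(utf16_bytes)

# Every ASCII character gains a NUL byte in UTF-16, so the byte string no
# longer spells the same path when interpreted in the ANSI code page.
print(b"\x00" in utf16_bytes, len(ansi_bytes), len(utf16_bytes))
```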

terryjreedy · 2011-10-26T18:43:33Z

Yuval, you are assuming that *no one* who uses the os byte APIs on Windows is either checking for '?' in returned paths or catching later exceptions. With Google code search, I did find one instance where someone tests paths for '?' after encoding with the file system encoding. It was not an instance of os.xxx output, but it is the same idea.

In any case,

- Our experience is that any change will affect someone. I was the victim of a 'harmless' micro change introduced in 3.1.2 (an intentional violation of the bugfix-only rule in bugfix releases -- and the last that I know of ;-).
- The change will introduce an incompatibility between 3.2- and 3.3+.

The justification that mitigates the above is that there is little reason to request os bytes returns. By the same reasoning, the change is hardly worth bothering with as there should be little to no benefit in real code. So I am +-0 on the change.

python-dev · 2011-10-26T23:39:22Z

New changeset 2cad20e2e588 by Victor Stinner in branch 'default':
Close bpo-13247: Add cp65001 codec, the Windows UTF-8 (CP_UTF8)
http://hg.python.org/cpython/rev/2cad20e2e588

vstinner · 2011-10-26T23:43:35Z

Oops, I specified the wrong issue number in my changeset 2cad20e2e588, it's the issue bpo-13216.

serhiy-storchaka · 2012-12-16T16:36:04Z

See also bpo-16656, where another approach was proposed (Unicode names returned from the bytes API if the result is not encodable). Actually, I now think that there is no right solution to this issue.

BreamoreBoy · 2014-06-08T00:32:00Z

I've read this entire issue and can't see that much can be done, so I suggest it be closed as "won't fix", as has already happened for bpo-16656, which msg177609 suggested is a duplicate of this issue.

vstinner · 2014-06-10T09:28:50Z

> I've read this entire issue and can't see that much can be done

My patch can be applied in Python 3.5 to notify users immediately that filenames cannot be encoded to the ANSI code page. Anyway, bytes filenames have been deprecated (they emit a DeprecationWarning) in the os module on Windows since Python 3.3.

BreamoreBoy · 2014-06-21T21:10:47Z

Can someone do a patch review please, it's way over my head, and set the stage and versions as appropriate.

loewis · 2014-06-22T07:27:26Z

I'm -1 on the patch. The string currently returned might be useless, but the fundamental problem is that using bytes for filenames on Windows just isn't sufficient for all cases. Microsoft has chosen to return question marks in the API, and Python should return them as the system vendor did.

Another alternative would be to switch to UTF-8 as the file system encoding on Windows, but that change might be too incompatible.

vstinner · 2014-06-22T23:29:17Z

Ok to keep calls to ANSI versions of the Windows API when bytes filenames are used (so get question marks on encoding errors).

> Another alternative would be to switch to UTF-8 as the file system encoding on Windows, but that change might be too incompatible.

On Linux, I tried to have more than one "OS" encoding and it was a big fail (search for the "PYTHONFSENCODING" env var in Python history). It introduced many new tricky issues. In short, Python should use the same "OS encoding" *everywhere*. Since there are many places where Python doesn't control the encoding, we must use the same encoding as the OS. For example, os.listdir(b'.') uses the ANSI code page. If you concatenate two strings, one encoded to UTF-8 and the other encoded to the ANSI code page, you will at least see mojibake, and your operation will probably fail (e.g. unable to open the file).

I mean that forcing an encoding *everywhere* is a losing battle. There are too many external functions using the locale encoding on UNIX and the ANSI code page on Windows. Not only in the C library; think also of OpenSSL, just to give you one example.
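The concatenation problem described above can be reproduced with a small sketch. The path components are invented for the example, and `cp1252` is an assumed stand-in for the ANSI code page.

```python
# Assumption: cp1252 stands in for the ANSI code page ("mbcs" on Windows).
ANSI_CODEC = "cp1252"

# Hypothetical path components, each encoded by a different layer.
dir_part = "caf\u00e9".encode("utf-8")                   # b'caf\xc3\xa9'
file_part = "\\r\u00e9sum\u00e9.txt".encode(ANSI_CODEC)  # b'\\r\xe9sum\xe9.txt'

mixed = dir_part + file_part

# Interpreted as ANSI, the UTF-8 half turns into mojibake.
as_ansi = mixed.decode(ANSI_CODEC)
print(as_ansi)

# Interpreted as UTF-8, the ANSI half is not even valid.
try:
    mixed.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)
```

Neither interpretation recovers the intended name, which is why a single encoding has to be used for every bytes path in a process.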

Anyway, bytes filenames have been deprecated since Python 3.3, so it's maybe time to stop using them!

--

Another alternative is to completely drop support for bytes filenames on Windows in Python 3.5. But I expect that too many applications would just fail. It's too early for such a disruptive change.

So I'm just closing the issue as "not a bug", because Python just follows the vendor choice (Microsoft decided to use funny question marks :-)).

ubershmekel mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Oct 22, 2011

terryjreedy added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Oct 26, 2011

python-dev mannequin closed this as completed Oct 26, 2011

vstinner reopened this Oct 26, 2011

pitrou added the OS-windows label Dec 16, 2012

pitrou changed the title ~~os.path.abspath returns unicode paths as question marks~~ under Windows, os.path.abspath returns non-ASCII bytes paths as question marks Dec 16, 2012

pitrou added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Dec 16, 2012

vstinner closed this as completed Jun 22, 2014

vstinner added the invalid label Jun 22, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022
