under Windows, os.path.abspath returns non-ASCII bytes paths as question marks #57456
Comments
For Python 2:
Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32
>>> os.path.abspath('.')
'C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????'
>>> os.path.abspath(u'.')
u'C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5'
For Python 3:
Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
>>> os.path.abspath('.')
'C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5'
>>> os.path.abspath(b'.')
b'C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????'
The returned path with question marks is completely useless. It would be better for Python to raise an error than to return the question marks. Another option is to try to get the ASCII version of the path; I believe Windows has one. |
abspath() is implemented using nt._getfullpathname(), which calls GetFullPathNameA().
Can you open the file using such a filename? If not, I agree that the result is useless.
Python is currently a thin wrapper on the Windows API. Windows doesn't consider a filename with question marks to be an error. http://msdn.microsoft.com/en-us/library/windows/desktop/aa364963%28v=vs.85%29.aspx Python could perhaps use GetFullPathNameW() and encode the filename manually using its strict MBCS codec. The MBCS codec has been strict since Python 3.2: it raises a UnicodeEncodeError if the string cannot be encoded. |
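The difference between the lossy and the strict encoding modes discussed in this comment can be sketched in a few lines. This is only an illustration, not the proposed patch: it assumes `cp1252` as a stand-in for the Windows-only `mbcs` codec so that it runs on any platform, and the helper name `encode_filename` is hypothetical.

```python
# Stand-in for the ANSI code page (assumption); on Windows the codec would be "mbcs".
ANSI_CODEC = "cp1252"


def encode_filename(name, errors="strict"):
    """Encode a text filename to bytes using the stand-in ANSI codec."""
    return name.encode(ANSI_CODEC, errors)


# The Hebrew directory name from the report above.
hebrew_name = "\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5"

# Lossy mode: every unencodable character is replaced by b"?".
lossy = encode_filename(hebrew_name, errors="replace")

# Strict mode: the caller learns immediately that the name cannot be encoded.
try:
    encode_filename(hebrew_name)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

With the strict handler the failure surfaces at the point of encoding rather than later, when the question-mark path is passed to another function.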
An example error with abspath and bytes input:
>>> os.path.abspath('.')
'C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5'
>>> os.path.abspath(b'.')
b'C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????'
>>> os.listdir(os.path.abspath(b'.'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\Users\\yuv\\Desktop\\YuvDesktop\\??????/*.*'
I couldn't follow the implementation; I got stuck because I couldn't locate the definition of os.getcwdb, so I couldn't join you for that part. Here's another possible solution:
>>> win32api.GetFullPathName('.')
'C:\\Users\\yuv\\Desktop\\YuvDesktop\\\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5'
>>> win32api.GetShortPathName(win32api.GetFullPathName('.'))
'C:\\Users\\yuv\\Desktop\\YUVDES~1\\5F30~1'
The short path is ASCII, but the problem is that not all Windows file systems have 8.3 filenames [1]. So I think your suggestion is the best solution.
[1] http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx#short_vs._long_names |
os.getcwdb() (GetCurrentDirectoryA) and os.listdir(bytes) (FindNextFileA & co) encode filenames using WideCharToMultiByte() in default mode (flags=0): unencodable characters are replaced by question marks. Such filenames cannot be used; open() fails with OSError(22, "invalid argument: '?'"), for example. The attached patch changes os.getcwdb() and os.listdir(bytes) to use the native Windows API (the wide character API) with the Python MBCS codec in strict mode (error handler "strict"), so that the user is notified directly that the filename cannot be encoded. The patch only changes the behaviour for filenames that are not encodable to the ANSI code page; such filenames are rare. |
os_mbcs.patch adds _Py_EncodeCodePage() to encode wchar_t* filenames directly, without having to create a temporary Unicode object. The patch removes HAVE_MBCS because the MBCS codec is now always needed by posixmodule.c. Anyway, I don't see why MultiByteToWideChar() and WideCharToMultiByte() would not be available on Windows. |
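The approach described in the patch (ask the system for text names, then encode them strictly) can be approximated in pure Python. The following is a rough sketch under stated assumptions, not the C patch itself: the function name `listdir_bytes_strict` is hypothetical, and the `encoding` parameter defaults to `cp1252` as a portable stand-in for the ANSI code page.

```python
import os


def listdir_bytes_strict(path=".", encoding="cp1252"):
    """Return directory entries as bytes, failing loudly on unencodable names.

    `encoding` is an assumed stand-in for the ANSI code page.
    """
    # With a str argument, os.listdir() returns str names.
    names = os.listdir(path)
    # The "strict" handler raises UnicodeEncodeError instead of emitting b"?".
    return [name.encode(encoding, "strict") for name in names]
```

A caller that cannot handle the exception can still catch `UnicodeEncodeError` and decide what to do with the directory, which is not possible once the name has been turned into question marks.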
-1 from me.
Or, in some cases, I have to change the code page with the 'chcp 437' command to run a console application made for an American environment. I seldom run such applications these days, though. |
On 26/10/2011 01:32, Atsuo Ishimoto wrote:
The issue is about being able to be notified of encoding errors. Anyway, you must use the Unicode API on Windows. If you use the Unicode The Windows bytes API is just kept for backward compatibility. More |
The doc says "All functions accepting path or file names accept both bytes and string objects, and result in an object of the same type, if a path or file name is returned." It does that now (the encoding assumed or produced for bytes is not specified). It says nothing about raising exceptions in certain situations. So this is a feature change request, one that would likely break existing code. Users can test for invalid returned paths with "'?' in returned_path", though I admit that the use of '?' as a glob, regex, and URL special character makes it a bad choice of error character. |
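The "'?' in returned_path" test mentioned in this comment could look roughly like the sketch below. The helper name `looks_lossy` is hypothetical, and the check is only a heuristic: it relies on '?' not being a valid character in Windows file names, and it skips the extended-length prefix `\\?\`, which legitimately contains a question mark.

```python
def looks_lossy(path):
    """Heuristically detect replacement question marks in a bytes path."""
    # The extended-length path prefix \\?\ contains a legitimate "?".
    prefix = b"\\\\?\\"
    if path.startswith(prefix):
        path = path[len(prefix):]
    # Any remaining "?" cannot be part of a real Windows file name.
    return b"?" in path
```

For example, `looks_lossy(os.path.abspath(b'.'))` would flag the path from the original report, while a plain ASCII path passes.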
On Wed, Oct 26, 2011 at 9:12 AM, STINNER Victor <report@bugs.python.org> wrote:
This patch solves nothing; it just raises an exception. It can break
Agreed. So I would suggest not adding unnecessary |
I use Python a lot with Hebrew, and many websites have internationalization, which may involve Unicode paths. I agree that saying "unicode paths are rare" is inaccurate. If the current situation isn't fixed, though, you just can't use the resulting path for almost anything. Do you have a use case, Ishimoto? Windows XP and up implement paths as Unicode, which means that a bytes API doesn't even make sense unless Python does some encoding and decoding for you. E.g. Python can use the Unicode APIs internally and return UTF-8 encoded bytes. But you couldn't use these paths outside of Python. The fact is you shouldn't be doing os.path.abspath(b'.') on Windows to begin with. |
Another option btw is to use utf-16, which will work but it's a bit ugly as well:
>>> os.listdir(os.path.abspath(u'.').encode('utf-16'))
[]
>>> os.path.abspath(u'.')
u'C:\\Users\\alon\\Desktop\\\u05e9\u05dc\u05d5\u05dd'
>>> os.path.abspath(u'.').encode('utf-16')
'\xff\xfeC\x00:\x00\\\x00U\x00s\x00e\x00r\x00s\x00\\\x00a\x00l\x00o\x00n\x00\\\x00D\x00e\x00s\x00k\x00t\x00o\x00p\x00\\\x00\xe9\x05\xdc\x05\xd5\x05\xdd\x05'
>>> os.listdir(os.path.abspath(u'.').encode('utf-16'))
[]
Tested on python 2.7, but you know what I mean. |
On Wed, Oct 26, 2011 at 3:36 PM, Yuval Greenfield
I don't have a use case. But does raising UnicodeEncodeError fix
Agreed. So I think adding a Windows-specific check to the bytes API does not |
It won't break existing code. Ignoring this problem here only moves the exception to whenever the data returned is first used. Any code this fix "breaks" is already broken. |
UTF-8, UTF-16 or any encoding different than the ANSI code page are not an |
Yuval, you are assuming that *no one* who uses the os byte APIs on Windows is either checking for '?' in returned paths or catching later exceptions. With Google code search, I did find one instance where someone tests paths for '?' after encoding with the file system encoding. It was not an instance of os.xxx output, but it is the same idea. In any case,
The justification that mitigates the above is that there is little reason to request os bytes returns. By the same reasoning, the change is hardly worth bothering with as there should be little to no benefit in real code. So I am +-0 on the change. |
New changeset 2cad20e2e588 by Victor Stinner in branch 'default': |
Oops, I specified the wrong issue number in my changeset 2cad20e2e588, it's the issue bpo-13216. |
See also bpo-16656, where another approach was proposed (Unicode names returned from the bytes API if the result is not encodable). Actually, I now think that there is no right solution to this issue. |
My patch can be applied in Python 3.5 to notify users immediately that filenames cannot be encoded to the ANSI code page. Anyway, bytes filenames have been deprecated (they emit a DeprecationWarning) in the os module on Windows since Python 3.3. |
Can someone do a patch review please, it's way over my head, and set the stage and versions as appropriate. |
I'm -1 on the patch. The string currently returned might be useless, but the fundamental problem is that using bytes for filenames on Windows just isn't sufficient for all cases. Microsoft has chosen to return question marks in the API, and Python should return them as the system vendor did. Another alternative would be to switch to UTF-8 as the file system encoding on Windows, but that change might be too incompatible. |
OK to keep calls to the ANSI versions of the Windows API when bytes filenames are used (and so get question marks on encoding errors).
On Linux, I tried to have more than one "OS" encoding and it was a big failure (search for the "PYTHONFSENCODING" env var in Python history). It introduced many new tricky issues. In short, Python should use the same "OS encoding" *everywhere*. Since there are many places where Python doesn't control the encoding, we must use the same encoding as the OS. For example, os.listdir(b'.') uses the ANSI code page. If you concatenate two strings, one encoded to UTF-8 and the other encoded to the ANSI code page, you will at least see mojibake, and your operation will probably fail (e.g. unable to open the file). I mean that forcing an encoding *everywhere* is a losing battle. There are too many external functions using the locale encoding on UNIX and the ANSI code page on Windows. Not only in the C library; think also of OpenSSL, just to give you one example. Anyway, bytes filenames have been deprecated since Python 3.3, so maybe it's time to stop using them! -- Another alternative is to completely drop support for bytes filenames on Windows in Python 3.5. But I expect that too many applications would just fail. It's too early for such a disruptive change. So I'm just closing the issue as "not a bug", because Python just follows the vendor's choice (Microsoft decided to use funny question marks :-)). |
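The mojibake argument in this closing comment can be made concrete with a short sketch. It assumes `cp1255` (the Hebrew Windows code page) as the ANSI code page and reuses the directory name from the UTF-16 example earlier in the thread; the variable names are purely illustrative.

```python
# Hebrew directory name taken from the earlier example in this thread.
name = "\u05e9\u05dc\u05d5\u05dd"

# One path component encoded as UTF-8 (e.g. by a Python-level convention)...
utf8_part = "C:\\Users\\alon\\".encode("utf-8") + name.encode("utf-8")
# ...and one encoded to the assumed ANSI code page (e.g. from os.listdir(b'.')).
ansi_part = name.encode("cp1255")

# Concatenating the two yields a byte string that is valid in neither encoding.
mixed = utf8_part + b"\\" + ansi_part

try:
    mixed.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

The same name produces different byte sequences under the two encodings, so the combined path no longer decodes as UTF-8, and any API expecting a single encoding will misinterpret part of it.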