Python3 doesn't support locale different than utf8 and a non-ASCII path (POSIX) #52857
Comments
Python3 is unable to start (bootstrap failure) on a POSIX system if the locale encoding is different than utf8 and the Python path (standard library path where the encoding module is stored) contains a non-ASCII character. (Windows and Mac OS X are not affected by this issue because the file system encoding is hardcoded.)
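A minimal sketch of the kind of mismatch involved, assuming a hypothetical install path and ISO-8859-1 as the non-UTF-8 locale encoding. It does not reproduce the interpreter bootstrap; it only shows what happens to a path string when the bytes are decoded with a different encoding than the one used to encode them.

```python
# Hypothetical install path containing a non-ASCII character (assumption).
path = "/opt/py3k\u00e9/lib"

# Bytes as they would be produced with the UTF-8 encoding.
raw = path.encode("utf-8")

# Decoding with another encoding succeeds but yields a different string,
# so a lookup with the resulting name would target the wrong directory.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding that was used to encode restores the path.
right = raw.decode("utf-8")

print(ascii(wrong))
print(ascii(right))
```

With an ASCII locale encoding the same decode step would raise UnicodeDecodeError instead of returning a wrong string.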
=> The error occurs because the path is not encoded and decoded with the same encoding. We cannot encode a path with the locale encoding, because we need find_module() to load the encoding codec, and loading the codec needs find_module()... (bootstrap error :-)) We should decode the path using a fixed encoding (e.g. ASCII or utf-8), use the same encoding to encode paths in find_module(), and then re-encode the paths of all objects storing filenames:
The error occurs at an early stage of Py_InitializeEx(), so the object list is limited and we control this list (e.g. site is not loaded yet). Related issues:
We could have a separate list storing the original bytes form of sys.path; this list would be used by find_module() as long as Py_FileSystemDefaultEncoding isn't initialized.
Or find_module() could use wcstombs() as long as Py_FileSystemDefaultEncoding is NULL.
I have a patch implementing most of the points described in my first message. I have to rework it before submitting it. The patch depends on other issues, and I prefer to fix all related issues first.
Let's try something: pyunicode_asencodefsdefault.patch adds a PyUnicode_EncodeFSDefault() function to unify how a unicode string is converted to bytes. It falls back to UTF-8 if Py_FileSystemDefaultEncoding is not set (it should be ASCII, not UTF-8) and uses the surrogateescape error handler.
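The behaviour described for that function can be approximated in pure Python. This is only a sketch: `encode_fs` and `decode_fs` are hypothetical helpers, not the C API from the patch, and the UTF-8 fallback mirrors the comment above rather than any final decision. The point is that surrogateescape lets undecodable bytes survive a decode/encode round trip.

```python
def encode_fs(name, encoding=None):
    """Hypothetical helper: encode a filename, escaping lone surrogates."""
    if encoding is None:
        # Fallback when no file system encoding is known (assumption).
        encoding = "utf-8"
    return name.encode(encoding, "surrogateescape")


def decode_fs(raw, encoding=None):
    """Hypothetical helper: decode bytes, keeping undecodable bytes."""
    if encoding is None:
        encoding = "utf-8"
    return raw.decode(encoding, "surrogateescape")


# ISO-8859-1 bytes for a path; 0xE9 is not valid UTF-8 in this position.
raw = b"/opt/py3k\xe9/lib"

# The undecodable byte becomes the lone surrogate U+DCE9.
name = decode_fs(raw)
print(ascii(name))

# Encoding with the same codec and handler gives back the original bytes.
assert encode_fs(name) == raw
```

The round trip only works when both directions use the same encoding and the same error handler.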
I opened a separate issue for the new function PyUnicode_EncodeFSDefault(): bpo-8715.
See also bpo-4352.
If I understood correctly, this issue is a regression introduced by r67055 (to fix bpo-4213). Read: http://bugs.python.org/issue4213#msg75387 See also r67057 (issue bpo-3723).
After looking deeply into bpo-4352, I figured out that a true separation of the filesystem default encoding and the utf8 Python namespace is really too complicated.
As I wrote, I have a huge patch somewhere on my hard drive fixing this issue. But I don't want to publish it because it's really huge. I prefer to fix the problem step by step. I fixed most related issues: see the dependency list of bpo-8242. I will publish the big patch shortly.
I'm skeptical about surrogates, particularly for that problem.
asvetlov> I'm skeptical about surrogates particularly for that
It's not exclusive. We can use surrogates on POSIX and then convert to bytes at the system calls, and use the unicode version of the Windows API. In both cases, filenames are unicode.
asvetlov> Conversions utf-8 -> mbcs -> utf8 will loose encoding
On Windows, Python3 *does* convert unicode to bytes with the mbcs encoding in the import machinery. I tested, and Python3 has the same problem on Windows with undecodable filenames as Python3 on Unix. E.g. add the "\u0809" character (a random non-encodable character) to the Python directory name: Python3 doesn't start if the code page cannot encode/decode it.
To fix all OSes (Windows and POSIX), the Python3 import machinery should not convert filenames to bytes but manipulate unicode characters, and only convert filenames to bytes on POSIX at the last moment (at system calls).
-- The mbcs codec ignores the error handler: it replaces unknown characters with "?" by default, see bpo-850997.
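The loss caused by "?" replacement can be illustrated without a Windows machine. This sketch assumes cp1252 as a stand-in for the active code page (the mbcs codec itself only exists on Windows) and requests the "replace" handler explicitly to imitate the behaviour described above.

```python
# Directory name containing the character used in the test above.
name = "py3k\u0809"

# Assumption: cp1252 stands in for the Windows ANSI code page.
encoding = "cp1252"

# A strict conversion refuses the character outright.
try:
    name.encode(encoding)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A replacing conversion succeeds but substitutes "?".
lossy = name.encode(encoding, "replace")

# Decoding the bytes again cannot restore the original name.
roundtrip = lossy.decode(encoding)

print(strict_failed, lossy, ascii(roundtrip))
```

Either outcome breaks a path lookup: the strict conversion fails early, and the replacing one points at a directory that does not exist.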
I think that bpo-8988 is a duplicate of this issue.
See also bpo-3080.
I posted a patch to fix this issue: see bpo-9425.
This will have to wait until after alpha1, as well.
The Unicode import system won't be put in place before 3.2a2, deferring.
Deferring once again.
Status of this issue, 5 months later: most tests pass, except test_gc, test_gdb, test_runpy, test_sys, test_wsgiref and test_zipimport. Said differently, 95% of the task (or more?) is done. It's possible to run Python installed in a non-ascii directory with any locale (I tested ascii, iso-8859-1 and utf-8).
Updated list of failing tests with py3k and a non-ascii path:
Possible reasons:
r85655 fixed the test_gdb failure. The test_runpy failure appears to be linked to the test_zipimport problems.
r85659 + r85662 + r85663 fixed test_httpservers.
Victor, can you paste or attach the error for email? My MSDN subscription has expired so I can't set up to test it myself (I've submitted the renewal, but who knows how long it will take to process :)
It doesn't appear to be related to the path name (same failure with a "py3ké" or "py3k" directory name), so I opened bpo-10134.
Starting at r85691, the full test suite of Python 3.2 passes with the ASCII, ISO-8859-1 and UTF-8 locale encodings in a non-ascii directory. The work on this issue is done.