Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX) #52857

vstinner · 2010-05-04T13:30:49Z

BPO	8611
Nosy	@brettcannon, @birkenfeld, @pitrou, @vstinner, @bitdancer, @asvetlov
Dependencies	bpo-9425: Rewrite import machinery to work with unicode paths

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2010-10-19.01:02:50.824>
created_at = <Date 2010-05-04.13:30:49.120>
labels = ['interpreter-core', 'expert-unicode', 'release-blocker']
title = "Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)"
updated_at = <Date 2010-10-19.01:02:50.823>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-10-19.01:02:50.823>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-10-19.01:02:50.824>
closer = 'vstinner'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2010-05-04.13:30:49.120>
creator = 'vstinner'
dependencies = ['9425']
files = []
hgrepos = []
issue_num = 8611
keywords = ['patch']
message_count = 26.0
messages = ['104932', '104934', '104935', '104941', '105241', '105723', '106097', '106103', '106154', '106159', '106337', '106474', '108569', '109025', '112031', '112119', '115637', '115944', '118324', '118908', '118967', '118976', '118979', '118991', '118997', '119098']
nosy_count = 7.0
nosy_names = ['brett.cannon', 'georg.brandl', 'pitrou', 'vstinner', 'Arfrever', 'r.david.murray', 'asvetlov']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue8611'
versions = ['Python 3.2']

vstinner · 2010-05-04T13:30:48Z

Python3 is unable to start (bootstrap failure) on a POSIX system if the locale encoding is not utf8 and the Python path (the standard library path where the encodings module is stored) contains a non-ASCII character. (Windows and Mac OS X are not affected by this issue because the file system encoding is hardcoded.)

Py_FileSystemDefaultEncoding == NULL
calculate_path(): sys.path is filled with directory names decoded with the locale encoding
find_module() encodes each path using PyUnicode_AsEncodedString(..., Py_FileSystemDefaultEncoding, NULL): this uses the "utf-8" encoding because Py_FileSystemDefaultEncoding is NULL

=> error because the path is not encoded and decoded with the same encoding
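The mismatch described above can be reproduced with plain byte strings. This is only a minimal sketch: the path `/opt/py3k\xe9/lib` and the choice of ISO-8859-1 as the locale encoding are made up for illustration, and the two steps merely imitate what calculate_path() and find_module() are described as doing.

```python
# Bytes of a directory name as stored on disk (hypothetical path, 0xE9 is "é" in ISO-8859-1).
raw_path = b"/opt/py3k\xe9/lib"

# Step 1: like calculate_path(), decode with the locale encoding (assumed ISO-8859-1 here).
decoded = raw_path.decode("iso-8859-1")

# Step 2: like find_module(), encode with UTF-8 because the file system encoding is not set yet.
reencoded = decoded.encode("utf-8")

print(raw_path)               # b'/opt/py3k\xe9/lib'
print(reencoded)              # b'/opt/py3k\xc3\xa9/lib'
print(reencoded == raw_path)  # False: this name does not exist on disk
```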

We cannot encode a path with the locale encoding because we need find_module() to load the encoding codec, and loading the codec needs find_module()... (bootstrap error :-))

We should decode the path using a fixed encoding (eg. ASCII or utf-8), use the same encoding to encode paths in find_module(), and then reencode paths of all objects storing filenames:

sys.path list items
sys.modules dict keys
sys.modules values: each module has __file__ and/or __path__ attributes
all code objects (co_filename)
(maybe some others?)

The error occurs in an early stage of Py_InitializeEx(), so the object list is limited and we control this list (eg. site is not loaded yet).
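The "decode with a fixed encoding, then reencode later" idea can be sketched in Python as follows. The helper name `redecode`, the sample paths and the use of the surrogateescape error handler (to keep the round trip lossless) are assumptions of this sketch, not part of the proposal; it only covers a plain list such as sys.path, not modules or code objects.

```python
def redecode(name, bootstrap_encoding, locale_encoding):
    """Hypothetical helper: recover the original bytes, then decode them
    with the real locale encoding once its codec can be loaded."""
    raw = name.encode(bootstrap_encoding, "surrogateescape")
    return raw.decode(locale_encoding, "surrogateescape")

# Raw directory names as the OS returns them (made-up paths).
raw_entries = [b"/opt/py3k\xe9/lib", b"/opt/py3k\xe9/lib/plat-linux"]

# Early bootstrap: decode with a fixed encoding; undecodable bytes become lone surrogates.
bootstrap_paths = [p.decode("utf-8", "surrogateescape") for p in raw_entries]

# Later: the locale codec (assumed ISO-8859-1 here) is available, so fix the list.
fixed_paths = [redecode(p, "utf-8", "iso-8859-1") for p in bootstrap_paths]

print([ascii(p) for p in bootstrap_paths])
print([ascii(p) for p in fixed_paths])
```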

Related issues:

bpo-8610: "Python3/POSIX: errors if file system encoding is None"
bpo-8242: "Improve support of PEP-383 (surrogates) in Python3: meta-issue"

pitrou · 2010-05-04T13:36:43Z

We could have a separate list storing the original bytes form of sys.path; this list would be used by find_module() as long as Py_FileSystemDefaultEncoding isn't initialized.

pitrou · 2010-05-04T13:39:42Z

Or find_module() could use wcstombs() as long as Py_FileSystemDefaultEncoding is NULL.

vstinner · 2010-05-04T14:16:15Z

I have a patch implementing most of the points described in my first message. I have to rework it before submitting it. The patch depends on other issues, and I prefer to first fix all related issues.

vstinner · 2010-05-07T22:35:59Z

Let's try with something: pyunicode_asencodefsdefault.patch adds a PyUnicode_EncodeFSDefault() function to make the conversion from unicode to bytes uniform. It falls back to UTF-8 if Py_FileSystemDefaultEncoding is not set (it should be ASCII, not UTF-8) and uses the surrogateescape error handler.
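A rough Python model of the behaviour described for this function might look like the sketch below. The name `encode_fs_default` and its `fs_encoding` parameter are hypothetical stand-ins for the C function and for Py_FileSystemDefaultEncoding; the real patch is C code and may differ in details.

```python
def encode_fs_default(text, fs_encoding=None):
    """Hypothetical model: encode with the file system encoding,
    falling back to UTF-8 while that encoding is still unset (None)."""
    encoding = fs_encoding if fs_encoding is not None else "utf-8"
    # surrogateescape turns lone surrogates U+DC80..U+DCFF back into the raw bytes 0x80..0xFF.
    return text.encode(encoding, "surrogateescape")

# A name that was decoded from undecodable bytes keeps its original byte.
name = b"py3k\xe9".decode("utf-8", "surrogateescape")
print(encode_fs_default(name))                        # b'py3k\xe9'

# Once the file system encoding is known, it is used instead of the fallback.
print(encode_fs_default("py3k\u00e9", "iso-8859-1"))  # b'py3k\xe9'
```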

vstinner · 2010-05-14T16:57:24Z

I opened a separate issue for the new function PyUnicode_EncodeFSDefault(): bpo-8715.

vstinner · 2010-05-19T20:45:53Z

See also r67057 (issue bpo-3723).

asvetlov · 2010-05-20T14:10:27Z

After looking deeply into bpo-4352, I figured out that a true separation of the filesystem default encoding from the utf-8 Python namespace is really too complicated.
For example, the import call chain converts the module name from utf-8 to the filesystem encoding in import.c:find_module. After that, the converted name is used by PyImport_ExecCodeModule* as a utf-8 name while it actually has the filesystem encoding. That problem cannot be solved by a "five-line patch", and Martin von Löwis suggested that I stop making potentially dangerous big changes to import.c in the Python 3.1 beta.
I like the importlib way (maybe with a C implementation as a next step) as the "true way" to reorganize the Python import machinery, but unfortunately Cannon has no time for that. From my perspective only a big refactoring can solve the encoding issues (and we can use the excellent io implementation to open utf-8 named files on Windows using native unicode functions). We need to cleanly split 'module names' from 'filesystem paths'.
Maybe pure Python importing is not easy, I am not sure. But reorganizing the current 'import spaghetti' is required. importlib (and PEP-302) introduced a nice way to do that.
I would like to volunteer for this task, and I feel I have enough knowledge to implement it and cover it with tests on at least Windows and Linux (Mac OS is not a big problem either). But I need a mentor (Pitrou, Cannon, you are welcome) to get it done, done cleanly and stably, and done in a reasonable time.

vstinner · 2010-05-20T15:22:33Z

As I wrote, I have a huge patch somewhere on my hard drive fixing this issue. But I don't want to publish it because it's really huge. I prefer to fix the problem step by step. I fixed most related issues: see the dependency list of bpo-8242. I will publish the big patch shortly.

asvetlov · 2010-05-23T17:14:57Z

I'm skeptical about surrogates particularly for that problem.
From my perspective the only solution is to use the native unicode support of the Windows file operation functions.
Conversions utf-8 -> mbcs -> utf-8 will lose encoding information thanks to the tricky Microsoft mbcs encoding scheme.
If I'm wrong please correct me.

vstinner · 2010-05-25T20:55:20Z

asvetlov> I'm skeptical about surrogates particularly for that
asvetlov> problem. From my perspective the solution is only to use
asvetlov> native unicode support for windows file operation functions.

It's not exclusive. We can use surrogates on POSIX and then convert to bytes at the system calls, and use the unicode version of the Windows API. In both cases, filenames are unicode.

asvetlov> Conversions utf-8 -> mbcs -> utf8 will lose encoding
asvetlov> information thanks to tricky Microsoft mbcs encoding schema.
asvetlov> If I'm wrong please correct me.

On Windows, Python3 *does* convert unicode to bytes with the mbcs encoding in the import machinery. I tested, and Python3 has the same problem on Windows with non-decodable filenames as Python3 on Unix. E.g. add the "\u0809" character (a random non-encodable character) to the Python directory name: Python3 doesn't start if the code page cannot encode/decode it.

To fix all OSes (Windows and POSIX), the Python3 import machinery should not convert filenames to bytes but manipulate unicode characters, and only convert filenames to bytes on POSIX at the last moment (at system calls).
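In today's Python the "unicode everywhere, bytes only at the system call" approach can be illustrated with os.fsencode() and os.fsdecode(), which were added in Python 3.2 and are not mentioned in this thread. The helper names and paths below are hypothetical, and the sketch only illustrates the principle, not the actual import machinery.

```python
import os

def join_unicode(directory, name):
    """Hypothetical helper: path manipulation stays in str, no encoding involved."""
    return directory + "/" + name

def to_syscall_bytes(path):
    """Convert to bytes only at the boundary, using the file system encoding."""
    return os.fsencode(path)

module_path = join_unicode("/opt/py3k", "os.py")
print(to_syscall_bytes(module_path))  # b'/opt/py3k/os.py'

if os.name == "posix":
    # On POSIX, undecodable bytes survive the str round trip thanks to surrogateescape.
    raw = b"/opt/py3k\xe9/os.py"
    print(os.fsencode(os.fsdecode(raw)) == raw)
```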

--

mbcs codec ignores the error handler: it replaces unknown characters by "?" by default, see bpo-850997.

vstinner · 2010-06-24T23:54:54Z

I think that bpo-8988 is a duplicate of this issue.

vstinner · 2010-06-30T23:20:35Z

See also bpo-3080.

vstinner · 2010-07-30T00:23:56Z

I posted a patch to fix this issue: see bpo-9425.

birkenfeld · 2010-07-31T07:55:45Z

This will have to wait until after alpha1, as well.

birkenfeld · 2010-09-05T08:15:44Z

The Unicode import system won't be put in place before 3.2a2, deferring.

vstinner · 2010-09-09T12:50:10Z

See also bpo-9713 (Py_CompileString fails on non decode-able paths) and bpo-9738 (Document the encoding of functions bytes arguments of the C API).

birkenfeld · 2010-10-10T09:32:58Z

Deferring once again.

vstinner · 2010-10-17T00:31:21Z

Status of this issue, 5 months later: most tests pass except test_gc test_gdb test_runpy test_sys test_wsgiref test_zipimport. Said differently, 95% of the task (or more?) is done. It's possible to run Python installed in a non-ascii directory with any locale (I tested ascii, iso-8859-1 and utf-8).

vstinner · 2010-10-17T19:20:55Z

Updated list of failing tests with py3k and a non-ascii path:

Linux, LANG=C: test_gc test_gdb test_runpy test_zipimport
Windows: test_email test_httpservers test_zipimport

Possible reasons:

test_httpservers (CGIHTTPServerTestCase.setUp): test should be skipped if sys.executable is not pure ASCII (and it's not possible to create ASCII path using a symlink)
test_zipimport: zipimport uses utf-8 (in strict mode) for the prefix, instead of the filesystem encoding
test_gc (test_get_count): "The following two tests are fragile: ..." :-/
test_gdb: libpython doesn't support surrogates in paths
test_email: issue with the end of line (\n vs \r\n?)
test_runpy: ?

vstinner · 2010-10-17T20:03:11Z

r85655 fixed test_gdb failure.

test_runpy failure looks to be linked to test_zipimport problems.

vstinner · 2010-10-17T20:19:12Z

r85659 + r85662 + r85663 fixed test_httpservers.

bitdancer · 2010-10-17T23:45:46Z

Victor, can you paste or attach the error for email? My MSDN subscription has expired so I can't set up to test it myself (I've submitted the renewal, but who knows how long it will take to process :)

vstinner · 2010-10-18T03:49:57Z

bitdancer> Victor, can you paste or attach the error for email?

It doesn't look to be related to the path name (same failure with "py3ké" or "py3k" directory name), so I opened bpo-10134.

vstinner · 2010-10-19T01:02:51Z

Starting at r85691, the full test suite of Python 3.2 passes with ASCII, ISO-8859-1 and UTF-8 locale encodings in a non-ascii directory. The work on this issue is done.

vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels May 4, 2010

ncoghlan added the release-blocker label Jun 28, 2010

birkenfeld added deferred-blocker release-blocker and removed release-blocker deferred-blocker labels Jul 31, 2010

birkenfeld added deferred-blocker release-blocker and removed release-blocker deferred-blocker labels Sep 5, 2010

birkenfeld added deferred-blocker release-blocker and removed release-blocker deferred-blocker labels Oct 10, 2010

vstinner closed this as completed Oct 19, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022
