Improve support of PEP 383 (surrogates) in Python3: meta-issue #52489

vstinner · 2010-03-27T01:12:36Z

BPO	8242
Nosy	@loewis, @vstinner
Dependencies	bpo-7606: test_xmlrpc fails with non-ascii path
	bpo-8092: utf8, backslashreplace and surrogates
	bpo-8383: pickle is unable to encode unicode surrogates
	bpo-8390: tarfile: use surrogates for undecode fields
	bpo-8391: os.execvpe() doesn't support surrogates in env
	bpo-8393: subprocess: support undecodable current working directory on POSIX OS
	bpo-8394: ctypes.dlopen() doesn't support surrogates
	bpo-8412: os.system() doesn't support surrogates nor bytes
	bpo-8467: subprocess: surrogates of the error message (Python implementation on non-Windows)
	bpo-8468: bz2: support surrogates in filename, and bytes/bytearray filename
	bpo-8477: _ssl: support surrogates in filenames, and bytes/bytearray filenames
	bpo-8485: Don't accept bytearray as filenames, or simplify the API
Files	surrogates-7.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2010-07-29.22:27:45.277>
created_at = <Date 2010-03-27.01:12:36.241>
labels = ['invalid', 'expert-unicode']
title = 'Improve support of PEP 383 (surrogates) in Python3: meta-issue'
updated_at = <Date 2010-07-29.22:27:45.275>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-07-29.22:27:45.275>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-07-29.22:27:45.277>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2010-03-27.01:12:36.241>
creator = 'vstinner'
dependencies = ['7606', '8092', '8383', '8390', '8391', '8393', '8394', '8412', '8467', '8468', '8477', '8485']
files = ['17002']
hgrepos = []
issue_num = 8242
keywords = ['patch']
message_count = 13.0
messages = ['101815', '101816', '101818', '102960', '103104', '103550', '103662', '103663', '103671', '103697', '104933', '112019', '112020']
nosy_count = 2.0
nosy_names = ['loewis', 'vstinner']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue8242'
versions = ['Python 3.1', 'Python 3.2']

vstinner · 2010-03-27T01:12:32Z

If the full path to the python3 binary contains a non-ASCII character and the file system encoding is ASCII, Python fails with:
---
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
ImportError: No module named encodings.utf_8
Abandon
---

The file system encoding is set to ASCII if there is no locale (e.g. LANG=C).

The problem is that the command line arguments, especially argv[0], are stored in a wchar_t* string using surrogates to represent undecodable bytes.
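To make the mechanism concrete, here is a minimal sketch of how PEP 383's surrogateescape error handler represents undecodable bytes when the encoding is ASCII. The path is a made-up example, not one from this report.

```python
# Hypothetical path containing "é" encoded as UTF-8 (bytes 0xC3 0xA9).
raw = b"/opt/caf\xc3\xa9/bin/python3"

# With the ASCII codec, each undecodable byte 0xNN is mapped to the lone
# surrogate U+DC00 + 0xNN instead of raising UnicodeDecodeError.
text = raw.decode("ascii", "surrogateescape")
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same error handler restores the original bytes.
roundtrip = text.encode("ascii", "surrogateescape")

# Without the handler, the lone surrogates cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Any code path that encodes such a string without the surrogateescape handler fails, which is the kind of failure this report describes for the import machinery.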

Attached patch fixes calculate_path() and import functions to support surrogates. Details:

Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(), because its value is required to encode Unicode strings to bytes using surrogates
Rename char2wchar() to _Py_char2wchar(); the function is no longer static. Also create the function _Py_wchar2char()
Escape surrogates (reimplement the surrogateescape decoder) in the calculate_path() subfunctions (_wstat, _wgetcwd, _Py_wreadlink)
Use the surrogateescape error handler in find_module(), NullImporter_init() and zipimporter_init()
Write a "fast path" (I don't know the right term: is it a hack?) for the utf-8 encoding with the surrogateescape error handler in PyUnicode_AsEncodedObject() and PyUnicode_AsEncodedString(): required because these functions are called before the codecs module is initialized
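As an illustration of what "reimplementing the surrogateescape decoder" by hand involves, here is a rough Python sketch for the ASCII case. The function names are hypothetical and this is not the code from the patch, which is written in C against wchar_t* strings.

```python
def decode_ascii_surrogateescape(data: bytes) -> str:
    """Decode bytes as ASCII, escaping bytes >= 0x80 as U+DC80..U+DCFF."""
    chars = []
    for byte in data:
        if byte < 0x80:
            chars.append(chr(byte))
        else:
            # Undecodable byte: store it as a lone low surrogate.
            chars.append(chr(0xDC00 + byte))
    return "".join(chars)


def encode_ascii_surrogateescape(text: str) -> bytes:
    """Inverse operation: turn escaped surrogates back into raw bytes."""
    out = bytearray()
    for index, ch in enumerate(text):
        code = ord(ch)
        if code < 0x80:
            out.append(code)
        elif 0xDC80 <= code <= 0xDCFF:
            out.append(code - 0xDC00)
        else:
            raise UnicodeEncodeError(
                "ascii", text, index, index + 1, "ordinal not in range(128)"
            )
    return bytes(out)
```

The point of a hand-written version is that it needs no codec lookup, so it can run before the codec machinery is available during startup.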

The patch is a work in progress: there are some FIXMEs (I don't know if the string should be encoded/decoded using surrogates or not).

I only tested ASCII and UTF-8 file system encodings. I don't know if we can support more encodings. Python has few builtin encodings. Other encodings are implemented in Python: we have to import them, but we need the codec to import a module, so...

I don't think that Windows is affected by this issue because it has a better API for unicode filenames and command line arguments, and most patched functions are surrounded by #ifndef WINDOWS ... #endif

vstinner · 2010-03-27T01:17:33Z

If I understood correctly, my patch is also required to import a module having a non-ASCII full path if the file system encoding is ASCII.

vstinner · 2010-03-27T01:51:06Z

> Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(),
> because its value is required to encode unicode using surrogates to bytes

Oh, it doesn't work: get_codeset() returns NULL, because the codec registry is empty when get_codeset() is called (with my patch).

vstinner · 2010-04-12T17:14:23Z

New patch fixing more issues about undecodable filenames.

TODO:

Remove assert(PyBytes_Check(opath)); from NullImporter_init() and zipimporter_init()
Fix setup_context() (_warnings.c)
Reencode module filenames if the system default encoding changes
Lib/unittest/runner.py and Lib/test/test_subprocess.py contain hacks to fix tests. These might need to be rewritten
Fix the 3 "FIXME: use _Py_char2wchar" in getpath.c

I restored code setting the system encoding.

The patch also fixes _posixsubprocess.fork_exec() to support an undecodable current working directory.
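For readers following along from Python code, the conversion that such fixes rely on can be sketched with os.fsdecode() and os.fsencode(), which were added in Python 3.2 and so are not part of this patch. The file name below is a made-up example, and the round trip shown assumes a POSIX system, where the file system error handler is surrogateescape.

```python
import os

# Hypothetical file name containing a byte that is invalid in UTF-8 and ASCII.
raw_name = b"report-\xff.txt"

# os.fsdecode() decodes with the file system encoding and its error handler;
# on POSIX an undecodable byte becomes a lone surrogate, not an exception.
name = os.fsdecode(raw_name)

# os.fsencode() performs the inverse conversion before the bytes are handed
# back to the operating system.
restored = os.fsencode(name)
```

On POSIX the round trip is lossless whatever the locale encoding is, which is what lets a str path or working directory stand in for arbitrary bytes.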

vstinner · 2010-04-14T00:34:03Z

New version of the patch: all tests pass except for 3 (test_ftplib, test_pep3120, test_traceback).

vstinner · 2010-04-18T23:29:48Z

I committed the platform.py patch as r80166 (trunk) and r80167 (py3k), but quickly reverted it because the patch on trunk broke Python bootstrap. The patch might be applied, but only on py3k and with more tests (ensure that it doesn't break bootstrap on any OS) :-)

vstinner · 2010-04-20T00:25:10Z

Updated patch:

Some parts have been applied in other issues
Remove assert(PyBytes_Check(x)): support PyByteArray type
use PyErr_Format() instead of sprintf+PyErr_SetString in tokenizer.c
don't convert message to byte and then back to unicode in err_input(): keep the unicode object

vstinner · 2010-04-20T00:28:19Z

$ diffstat ~/surrogates-7.patch
 Doc/library/tarfile.rst     |   15 +--
 Include/moduleobject.h      |    1
 Lib/platform.py             |   12 +-
 Lib/subprocess.py           |    2
 Lib/tarfile.py              |   14 --
 Lib/test/regrtest.py        |    5 -
 Lib/test/test_import.py     |    5 +
 Lib/test/test_reprlib.py    |    4
 Lib/test/test_subprocess.py |    4
 Lib/test/test_tarfile.py    |    4
 Lib/test/test_urllib.py     |    8 +
 Lib/test/test_urllib2.py    |    4
 Lib/test/test_xml_etree.py  |    6 +
 Lib/traceback.py            |   10 +-
 Lib/unittest/runner.py      |    4
 Modules/_ctypes/callproc.c  |   12 +-
 Modules/_ssl.c              |   10 +-
 Modules/_tkinter.c          |    6 -
 Modules/getpath.c           |  100 ++++++++++++++++++--
 Modules/main.c              |   46 +++++

loewis · 2010-04-20T05:45:56Z

I haven't reviewed the patch in detail yet, but it seems to me that it fixes independent issues. -1000 on that. One problem, one bug report in the tracker, one commit.

If this issue is about the import machinery not working anymore if there is a non-ASCII character in the path, then why the heck does it touch posixmodule.c????

As for modules that have non-ASCII characters in their module name: this is, again, an unrelated issue (ISTM), so if you want to deal with it, please create a new issue.

vstinner · 2010-04-20T12:20:43Z

> I haven't reviewed the patch in detail yet, but it seems to me that
> it fixes independent issues.

Right. First I only wanted to fix the import machinery, but then I fixed a lot of "independent" issues to test the patch on import. All fixes are related to surrogates. I'm splitting the big patch into small parts: see the dependency list of this issue.

I will open a new issue for the import machinery. But this patch requires extra changes which are now discussed in new issues.

> (...) why the heck does it touch posixmodule.c?

I opened issue bpo-8391 for this change: "os.execvpe() doesn't support surrogates in env".

vstinner · 2010-05-04T13:32:11Z

I opened a different issue to use surrogates in the Python module path: bpo-8611, but the issue is not specific to surrogates ("Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)").

vstinner · 2010-07-29T22:26:24Z

I created a new svn branch for my work on import in unicode. I will open a new issue, so I am closing this one.

vstinner · 2010-07-29T22:27:45Z

Remove dependency on bpo-6697 to be able to close this issue.

vstinner added the topic-unicode label Mar 27, 2010

vstinner changed the title ~~Support surrogates in import ; install Python in a non-ASCII directory~~ Improve support of PEP 383 (surrogates) in Python3: meta-issue Apr 20, 2010

vstinner closed this as completed Jul 29, 2010

vstinner added the invalid label Jul 29, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022
