classification
Title: Improve support of PEP 383 (surrogates) in Python3: meta-issue
Type: Stage:
Components: Unicode Versions: Python 3.1, Python 3.2
process
Status: closed Resolution: not a bug
Dependencies: 7606 8092 8383 8390 8391 8393 8394 8412 8467 8468 8477 8485 Superseder:
Assigned To: Nosy List: loewis, vstinner
Priority: normal Keywords: patch

Created on 2010-03-27 01:12 by vstinner, last changed 2010-07-29 22:27 by vstinner. This issue is now closed.

Files
File name Uploaded Description
surrogates-7.patch vstinner, 2010-04-20 00:25
Messages (13)
msg101815 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-27 01:12
If the full path to the python3 binary contains a non-ASCII character and the file system encoding is ASCII, Python fails with:
---
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
ImportError: No module named encodings.utf_8
Abandon
---

The file system encoding is set to ASCII if there is no locale (e.g. LANG=C).

The problem is that the command line arguments, especially argv[0], are stored in a wchar_t* string, using surrogates to represent undecodable bytes.
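
To illustrate the surrogate mechanism mentioned above: PEP 383 maps each undecodable byte 0xNN to the lone surrogate U+DCNN, so the original bytes can be recovered later. Here is a minimal Python-level sketch, assuming an ASCII file system encoding; the path is made up for the example and is not taken from this report.

```python
# Hypothetical install path containing the UTF-8 bytes for "é" (0xC3 0xA9).
raw_path = b"/opt/caf\xc3\xa9/bin/python3"

# Decoding as ASCII with the surrogateescape handler never fails:
# each byte >= 0x80 becomes the lone surrogate U+DC00 + byte.
decoded = raw_path.decode("ascii", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
restored = decoded.encode("ascii", "surrogateescape")

# Without the handler, the lone surrogates cannot be encoded.
try:
    decoded.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(decoded), restored == raw_path, strict_ok)
```

Any code that encodes such a string without the surrogateescape handler fails in the same way as the strict call here.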

Attached patch fixes calculate_path() and import functions to support surrogates. Details:

 * Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(), because its value is required to encode unicode using surrogates to bytes
 * Rename char2wchar() to _Py_char2wchar(), since the function is no longer static; and create the function _Py_wchar2char()
 * Escape surrogates (reimplement surrogateescape decoder) in calculate_path() subfunctions (_wstat, _wgetcwd, _Py_wreadlink)
 * Use surrogateescape error handler in find_module(), NullImporter_init() and zipimporter_init()
 * Write a "fastpath" (I don't know the right term: is it a hack?) for the utf-8 encoding with the surrogateescape error handler in PyUnicode_AsEncodedObject() and PyUnicode_AsEncodedString(): required because these functions are called before the codecs module is initialized
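
The idea of reimplementing the surrogateescape handler by hand, as in the list above, can be sketched in Python. This is only an illustration of the byte-to-surrogate mapping under the assumption of an ASCII encoding: the patch itself works in C on wchar_t* strings, and the names escape_decode() and escape_encode() are hypothetical, not functions from the patch.

```python
def escape_decode(data: bytes) -> str:
    """Decode bytes as ASCII, mapping each byte >= 0x80 to U+DC00 + byte."""
    chars = []
    for byte in data:
        if byte < 0x80:
            chars.append(chr(byte))
        else:
            # Undecodable byte: store it as a lone low surrogate.
            chars.append(chr(0xDC00 + byte))
    return "".join(chars)


def escape_encode(text: str) -> bytes:
    """Inverse of escape_decode(): turn U+DC80..U+DCFF back into raw bytes."""
    out = bytearray()
    for index, ch in enumerate(text):
        code = ord(ch)
        if code < 0x80:
            out.append(code)
        elif 0xDC80 <= code <= 0xDCFF:
            # Escaped byte: recover the original value.
            out.append(code - 0xDC00)
        else:
            raise UnicodeEncodeError(
                "ascii", text, index, index + 1, "ordinal not in range(128)"
            )
    return bytes(out)


sample = b"lib/caf\xc3\xa9/encodings"
assert escape_decode(sample) == sample.decode("ascii", "surrogateescape")
assert escape_encode(escape_decode(sample)) == sample
```

Because the mapping needs no codec lookup, a hand-written version like this can run before the codecs machinery is available, which is the constraint the list describes.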

The patch is a work in progress: there are some FIXMEs left (I don't know if the string should be encoded/decoded using surrogates or not).

I only tested ASCII and UTF-8 file system encodings. I don't know if we can support more encodings. Python has few builtin encodings. Other encodings are implemented in Python: we have to import them, but we need the codec to import a module, so...

I don't think that Windows is affected by this issue because it has a better API for unicode filenames and command line arguments, and most patched functions are surrounded by #ifndef WINDOWS ... #endif
msg101816 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-27 01:17
If I understood correctly, my patch is also required to import a module having a non-ASCII full path if the file system encoding is ASCII.
msg101818 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-27 01:51
> Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(),
> because its value is required to encode unicode using surrogates to bytes

Oh, it doesn't work: get_codeset() returns NULL, because the codec registry is empty when get_codeset() is called (with my patch).
msg102960 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-12 17:14
New patch fixing more issues about undecodable filenames.

 Lib/test/test_subprocess.py |    4 -
 Lib/unittest/runner.py      |    4 +
 Modules/_posixsubprocess.c  |   21 ++++++++--
 Modules/getpath.c           |   90 +++++++++++++++++++++++++++++++++++++++-----
 Modules/posixmodule.c       |    5 +-
 Modules/python.c            |    6 +-
 Modules/zipimport.c         |   11 ++++-
 Objects/fileobject.c        |    6 +-
 Objects/unicodeobject.c     |   22 ++++++++--
 Parser/tokenizer.c          |   14 ++++--
 Python/_warnings.c          |    7 +++
 Python/ast.c                |   10 +++-
 Python/ceval.c              |    2
 Python/errors.c             |    2
 Python/import.c             |   37 +++++++++++++-----
 Python/traceback.c          |   38 ++++++++++++++----
 16 files changed, 225 insertions(+), 54 deletions(-)

TODO:
 - Remove assert(PyBytes_Check(opath)); from NullImporter_init() and zipimporter_init()
 - Fix setup_context() (_warnings.c)
 - Reencode module filenames if the system default encoding changes
 - Lib/unittest/runner.py and Lib/test/test_subprocess.py contain hacks to fix tests. They might need to be rewritten
 - Fix the 3 "FIXME: use _Py_char2wchar" in getpath.c

I restored code setting the system encoding.

The patch also fixes _posixsubprocess.fork_exec() to support an undecodable current working directory.
msg103104 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-14 00:34
New version of the patch: all tests pass except for 3 (test_ftplib, test_pep3120, test_traceback).
msg103550 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-18 23:29
I committed the platform.py patch as r80166 (trunk) and r80167 (py3k), but quickly reverted it because the patch on trunk broke the Python bootstrap. The patch might be applied again, but only on py3k and with more tests (to ensure that it doesn't break the bootstrap on any OS) :-)
msg103662 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 00:25
Updated patch:
 - Some parts have been applied in other issues
 - Remove assert(PyBytes_Check(x)): support PyByteArray type
 - use PyErr_Format() instead of sprintf+PyErr_SetString in tokenizer.c
 - don't convert message to byte and then back to unicode in err_input(): keep the unicode object
msg103663 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 00:28
$ diffstat ~/surrogates-7.patch
 Doc/library/tarfile.rst     |   15 +--
 Include/moduleobject.h      |    1
 Lib/platform.py             |   12 +-
 Lib/subprocess.py           |    2
 Lib/tarfile.py              |   14 --
 Lib/test/regrtest.py        |    5 -
 Lib/test/test_import.py     |    5 +
 Lib/test/test_reprlib.py    |    4
 Lib/test/test_subprocess.py |    4
 Lib/test/test_tarfile.py    |    4
 Lib/test/test_urllib.py     |    8 +
 Lib/test/test_urllib2.py    |    4
 Lib/test/test_xml_etree.py  |    6 +
 Lib/traceback.py            |   10 +-
 Lib/unittest/runner.py      |    4
 Modules/_ctypes/callproc.c  |   12 +-
 Modules/_ssl.c              |   10 +-
 Modules/_tkinter.c          |    6 -
 Modules/getpath.c           |  100 ++++++++++++++++++--
 Modules/main.c              |   46 +++++----
 Modules/posixmodule.c       |   18 ++-
 Modules/pyexpat.c           |   11 +-
 Modules/zipimport.c         |  210 ++++++++++++++++++++++++++++++++------------
 Objects/codeobject.c        |    7 +
 Objects/exceptions.c        |   49 ++++++----
 Objects/fileobject.c        |    6 -
 Objects/moduleobject.c      |   22 +++-
 Objects/unicodeobject.c     |   22 +++-
 Parser/tokenizer.c          |   18 ++-
 Python/_warnings.c          |   26 ++++-
 Python/ast.c                |   10 +-
 Python/bltinmodule.c        |   33 ++++--
 Python/ceval.c              |    4
 Python/compile.c            |   12 ++
 Python/errors.c             |    4
 Python/import.c             |   88 ++++++++++++------
 Python/pythonrun.c          |   39 ++++----
 Python/traceback.c          |   39 ++++++--
 38 files changed, 625 insertions(+), 265 deletions(-)
msg103671 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-04-20 05:45
I haven't reviewed the patch in detail yet, but it seems to me that it fixes independent issues. -1000 on that. One problem, one bug report in the tracker, one commit.

If this issue is about the import machinery not working anymore if there is a non-ASCII character in the path, then why the heck does it touch posixmodule.c????

As for modules that have non-ASCII characters in their module name: this is, again, an unrelated issue (ISTM), so if you want to deal with it, please create a new issue.
msg103697 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 12:20
> I haven't reviewed the patch in detail yet, but it seems to me that
> it fixes independent issues.

Right. At first I only wanted to fix the import machinery, but then I fixed a lot of "independent" issues in order to test the patch on import. All fixes are related to surrogates. I'm splitting the big patch into small parts: see the dependency list of this issue.

I will open a new issue for the import machinery. But this patch requires extra changes which are now discussed in new issues.

> (...) why the heck does it
> touch posixmodule.c?

I opened issue #8391 for this change: "os.execvpe() doesn't support surrogates in env".
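
As a rough illustration of what supporting surrogates in env involves: each key and value has to be encoded back to bytes with the surrogateescape handler before it reaches the exec call. The sketch below is not the code from issue #8391; encode_environ() is a hypothetical helper, and the UTF-8 default encoding and the DATA_DIR variable are assumptions made for the example.

```python
def encode_environ(env, encoding="utf-8"):
    """Return a list of b"KEY=VALUE" entries, escaping surrogates to bytes."""
    entries = []
    for key, value in env.items():
        key_bytes = key.encode(encoding, "surrogateescape")
        value_bytes = value.encode(encoding, "surrogateescape")
        entries.append(key_bytes + b"=" + value_bytes)
    return entries


# A value that was decoded from undecodable bytes (0xFF 0xFE are not valid UTF-8).
env = {"DATA_DIR": b"/srv/\xff\xfe".decode("utf-8", "surrogateescape")}
entries = encode_environ(env)
print(entries)
```

A strict encode of the same value would raise UnicodeEncodeError, which is the kind of failure the issue title describes.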
msg104933 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-04 13:32
I opened a different issue to use surrogates in Python module path: #8611, but the issue is not specific to surrogates ("Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)").
msg112019 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-29 22:26
I created a new svn branch for my work on import in unicode. I will open a new issue and so I close this one.
msg112020 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-29 22:27
Remove dependency on #6697 to be able to close this issue.
History
Date User Action Args
2010-07-29 22:27:45 vstinner set status: open -> closed
    resolution: not a bug
    dependencies: - Check that _PyUnicode_AsString() result is not NULL
    messages: + msg112020
2010-07-29 22:26:23 vstinner set messages: + msg112019
2010-05-04 13:32:11 vstinner set messages: + msg104933
2010-04-23 21:00:33 vstinner set dependencies: + Check that _PyUnicode_AsString() result is not NULL
2010-04-23 11:38:00 vstinner set dependencies: + Don't accept bytearray as filenames, or simplify the API
2010-04-20 23:38:50 vstinner set dependencies: + _ssl: support surrogates in filenames, and bytes/bytearray filenames
2010-04-20 12:20:43 vstinner set messages: + msg103697
    title: Support surrogates in import ; install Python in a non-ASCII directory -> Improve support of PEP 383 (surrogates) in Python3: meta-issue
2010-04-20 12:12:56 vstinner set dependencies: + bz2: support surrogates in filename, and bytes/bytearray filename
2010-04-20 12:03:16 vstinner set dependencies: + subprocess: surrogates of the error message (Python implementation on non-Windows)
2010-04-20 11:16:16 vstinner set dependencies: + utf8, backslashreplace and surrogates
2010-04-20 05:45:57 loewis set messages: + msg103671
2010-04-20 00:28:19 vstinner set messages: + msg103663
2010-04-20 00:27:55 vstinner set files: - surrogates-6.patch
2010-04-20 00:25:29 vstinner set files: + surrogates-7.patch
    messages: + msg103662
2010-04-18 23:29:48 vstinner set messages: + msg103550
2010-04-18 23:27:35 vstinner set dependencies: + tarfile: use surrogates for undecode fields
2010-04-16 01:14:54 vstinner set dependencies: + pickle is unable to encode unicode surrogates
2010-04-16 01:10:46 vstinner set dependencies: + os.system() doesn't support surrogates nor bytes
2010-04-14 01:18:17 vstinner set files: - surrogates-5.patch
2010-04-14 01:16:36 vstinner set dependencies: + ctypes.dlopen() doesn't support surrogates
2010-04-14 01:09:07 vstinner set dependencies: + subprocess: support undecodable current working directory on POSIX OS
2010-04-14 00:34:31 vstinner set files: + surrogates-6.patch
    messages: + msg103104
2010-04-14 00:02:40 vstinner set dependencies: + os.execvpe() doesn't support surrogates in env
2010-04-13 23:37:47 vstinner set dependencies: + test_xmlrpc fails with non-ascii path
2010-04-12 17:20:54 vstinner set files: - surrogates_bootstrap-4.patch
2010-04-12 17:14:28 vstinner set files: + surrogates-5.patch
    messages: + msg102960
2010-03-27 13:39:06 pitrou set nosy: + loewis
2010-03-27 01:51:05 vstinner set messages: + msg101818
2010-03-27 01:17:33 vstinner set messages: + msg101816
2010-03-27 01:12:36 vstinner create