Improve support of PEP 383 (surrogates) in Python3: meta-issue #52489

Closed
vstinner opened this issue Mar 27, 2010 · 13 comments

Comments

@vstinner
Member

BPO 8242
Nosy @loewis, @vstinner
Dependencies
  • bpo-7606: test_xmlrpc fails with non-ascii path
  • bpo-8092: utf8, backslashreplace and surrogates
  • bpo-8383: pickle is unable to encode unicode surrogates
  • bpo-8390: tarfile: use surrogates for undecode fields
  • bpo-8391: os.execvpe() doesn't support surrogates in env
  • bpo-8393: subprocess: support undecodable current working directory on POSIX OS
  • bpo-8394: ctypes.dlopen() doesn't support surrogates
  • bpo-8412: os.system() doesn't support surrogates nor bytes
  • bpo-8467: subprocess: surrogates of the error message (Python implementation on non-Windows)
  • bpo-8468: bz2: support surrogates in filename, and bytes/bytearray filename
  • bpo-8477: _ssl: support surrogates in filenames, and bytes/bytearray filenames
  • bpo-8485: Don't accept bytearray as filenames, or simplify the API
Files
  • surrogates-7.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-07-29.22:27:45.277>
    created_at = <Date 2010-03-27.01:12:36.241>
    labels = ['invalid', 'expert-unicode']
    title = 'Improve support of PEP 383 (surrogates) in Python3: meta-issue'
    updated_at = <Date 2010-07-29.22:27:45.275>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-07-29.22:27:45.275>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-07-29.22:27:45.277>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2010-03-27.01:12:36.241>
    creator = 'vstinner'
    dependencies = ['7606', '8092', '8383', '8390', '8391', '8393', '8394', '8412', '8467', '8468', '8477', '8485']
    files = ['17002']
    hgrepos = []
    issue_num = 8242
    keywords = ['patch']
    message_count = 13.0
    messages = ['101815', '101816', '101818', '102960', '103104', '103550', '103662', '103663', '103671', '103697', '104933', '112019', '112020']
    nosy_count = 2.0
    nosy_names = ['loewis', 'vstinner']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue8242'
    versions = ['Python 3.1', 'Python 3.2']

    @vstinner
    Member Author

    If the full path to the python3 binary contains a non-ASCII character and the file system encoding is ASCII, Python fails with:
    ---
    Could not find platform independent libraries <prefix>
    Could not find platform dependent libraries <exec_prefix>
    Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
    Fatal Python error: Py_Initialize: can't initialize sys standard streams
    ImportError: No module named encodings.utf_8
    Abandon
    ---

    The file system encoding is set to ASCII if there is no locale (eg. LANG=C).

    The problem is that the command line arguments, especially argv[0], are stored in wchar_t* strings that use surrogates to represent undecodable bytes.
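
To illustrate the mechanism described above, here is a minimal Python-level sketch of how the surrogateescape error handler from PEP 383 represents undecodable bytes as lone surrogates. The path is a made-up example, not one taken from this report, and the sketch does not reproduce the C code in the patch.

```python
# Hypothetical path containing one non-ASCII byte (0xE9).
raw = b"/opt/py\xe9/bin/python3"

# With an ASCII file system encoding, strict decoding fails...
try:
    raw.decode("ascii")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# ...while surrogateescape maps each undecodable byte 0xNN to U+DCNN.
text = raw.decode("ascii", "surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes.
restored = text.encode("ascii", "surrogateescape")
```

Any code that later encodes such a string without the same error handler fails, which is the kind of failure this issue is about.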

    Attached patch fixes calculate_path() and import functions to support surrogates. Details:

    • Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(), because its value is required to encode unicode using surrogates to bytes
    • Rename char2wchar() to _Py_char2wchar(), since the function is no longer static; and create the function _Py_wchar2char()
    • Escape surrogates (reimplement surrogateescape decoder) in calculate_path() subfunctions (_wstat, _wgetcwd, _Py_wreadlink)
    • Use surrogateescape error handler in find_module(), NullImporter_init() and zipimporter_init()
    • Write a "fastpath" (I don't know the right term: is it a hack?) for the utf-8 encoding with the surrogateescape error handler in PyUnicode_AsEncodedObject() and PyUnicode_AsEncodedString(): required because these functions are called before the codecs module is initialized
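
As a rough Python-level illustration of the error handler named in the list above (not of the C fast path itself), the sketch below shows that strict UTF-8 refuses to encode a lone surrogate, while surrogateescape turns it back into the original byte. The string is an invented example.

```python
# Invented example: U+DCE9 stands for the undecodable byte 0xE9.
name = "caf\udce9"

# Strict UTF-8 encoding rejects lone surrogates.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogateescape converts U+DCE9 back into the single byte 0xE9.
encoded = name.encode("utf-8", "surrogateescape")
print(strict_ok, encoded)
```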

    The patch is a work in progress: there are some FIXMEs (I don't know whether the string should be encoded/decoded using surrogates or not).

    I only tested ASCII and UTF-8 file system encodings. I don't know if we can support more encodings. Python has few builtin encodings. Other encodings are implemented in Python: we have to import them, but we need the codec to import a module, so...
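
The bootstrapping concern above can be made concrete with a small sketch. It assumes a standard CPython installation, where a codec such as ISO-8859-15 lives in the `encodings` package as an ordinary module and is therefore found through the import system. The choice of ISO-8859-15 is only an example.

```python
import codecs
import importlib.util

# Locate the module that implements the codec (assumed example: ISO-8859-15).
spec = importlib.util.find_spec("encodings.iso8859_15")
print(spec.name, spec.origin)

# Looking the codec up by name goes through the codec registry,
# which imports that module on first use.
info = codecs.lookup("iso8859-15")
print(info.name)
```

If the import system itself needs that codec to encode or decode a module path, the lookup cannot complete, which is the circular dependency hinted at above.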

    I don't think that Windows is affected by this issue because it has a better API for unicode filenames and command line arguments, and most patched functions are surrounded by #ifndef WINDOWS ... #endif

    @vstinner
    Member Author

    If I understood correctly, my patch is also required to import a module having a non-ASCII full path if the file system encoding is ASCII.

    @vstinner
    Member Author

    > Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(),
    > because its value is required to encode unicode using surrogates to bytes

    Oh, it doesn't work: get_codeset() returns NULL, because the codec registry is still empty when get_codeset() is called (with my patch).

    @vstinner
    Member Author

    New patch fixing more issues about undecodable filenames.

     Lib/test/test_subprocess.py |    4 -
     Lib/unittest/runner.py      |    4 +
     Modules/_posixsubprocess.c  |   21 ++++++++--
     Modules/getpath.c           |   90 +++++++++++++++++++++++++++++++++++++++-----
     Modules/posixmodule.c       |    5 +-
     Modules/python.c            |    6 +-
     Modules/zipimport.c         |   11 ++++-
     Objects/fileobject.c        |    6 +-
     Objects/unicodeobject.c     |   22 ++++++++--
     Parser/tokenizer.c          |   14 ++++--
     Python/_warnings.c          |    7 +++
     Python/ast.c                |   10 +++-
     Python/ceval.c              |    2
     Python/errors.c             |    2
     Python/import.c             |   37 +++++++++++++-----
     Python/traceback.c          |   38 ++++++++++++++----
     16 files changed, 225 insertions(+), 54 deletions(-)

    TODO:

    • Remove assert(PyBytes_Check(opath)); from NullImporter_init() and zipimporter_init()
    • Fix setup_context() (_warnings.c)
    • Reencode module filenames if the system default encoding changes
    • Lib/unittest/runner.py and Lib/test/test_subprocess.py contain hacks to fix tests. They may need to be rewritten
    • Fix the 3 "FIXME: use _Py_char2wchar" in getpath.c

    I restored the code that sets the system encoding.

    The patch also fixes _posixsubprocess.fork_exec() to support an undecodable current working directory.
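
As a hedged illustration of what an "undecodable current working directory" looks like from Python code, the sketch below uses os.fsdecode() and os.fsencode(), which were added in Python 3.2, after this patch was written. It assumes a POSIX system, where both functions use the file system encoding with the surrogateescape error handler. The directory name is hypothetical.

```python
import os

# Hypothetical directory name containing a byte that is invalid in UTF-8.
raw_cwd = b"/tmp/build-\xff"

# Bytes that cannot be decoded become lone surrogates in the str form...
as_text = os.fsdecode(raw_cwd)
print(ascii(as_text))

# ...and encoding the str form gives back the original bytes on POSIX.
round_trip = os.fsencode(as_text)
```

A function that accepts the str form and encodes it this way can pass the original bytes to the operating system unchanged.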

    @vstinner
    Member Author

    New version of the patch: all tests pass except for 3 (test_ftplib, test_pep3120, test_traceback).

    @vstinner
    Member Author

    I committed the platform.py patch as r80166 (trunk) and r80167 (py3k), but quickly reverted it because the patch on trunk broke the Python bootstrap. The patch might still be applied, but only on py3k and with more tests (to ensure that it doesn't break bootstrap on any OS) :-)

    @vstinner
    Member Author

    Updated patch:

    • Some parts have been applied in other issues
    • Remove assert(PyBytes_Check(x)): support PyByteArray type
    • use PyErr_Format() instead of sprintf+PyErr_SetString in tokenizer.c
    • don't convert the message to bytes and then back to unicode in err_input(): keep the unicode object

    @vstinner
    Member Author

    $ diffstat ~/surrogates-7.patch
     Doc/library/tarfile.rst     |   15 +--
     Include/moduleobject.h      |    1
     Lib/platform.py             |   12 +-
     Lib/subprocess.py           |    2
     Lib/tarfile.py              |   14 --
     Lib/test/regrtest.py        |    5 -
     Lib/test/test_import.py     |    5 +
     Lib/test/test_reprlib.py    |    4
     Lib/test/test_subprocess.py |    4
     Lib/test/test_tarfile.py    |    4
     Lib/test/test_urllib.py     |    8 +
     Lib/test/test_urllib2.py    |    4
     Lib/test/test_xml_etree.py  |    6 +
     Lib/traceback.py            |   10 +-
     Lib/unittest/runner.py      |    4
     Modules/_ctypes/callproc.c  |   12 +-
     Modules/_ssl.c              |   10 +-
     Modules/_tkinter.c          |    6 -
     Modules/getpath.c           |  100 ++++++++++++++++++--
     Modules/main.c              |   46 +++++
     Modules/posixmodule.c       |   18 ++-
     Modules/pyexpat.c           |   11 +-
     Modules/zipimport.c         |  210 ++++++++++++++++++++++++++++++++------
     Objects/codeobject.c        |    7 +
     Objects/exceptions.c        |   49 ++++++----
     Objects/fileobject.c        |    6 -
     Objects/moduleobject.c      |   22 +++-
     Objects/unicodeobject.c     |   22 +++-
     Parser/tokenizer.c          |   18 ++-
     Python/_warnings.c          |   26 ++++-
     Python/ast.c                |   10 +-
     Python/bltinmodule.c        |   33 ++++--
     Python/ceval.c              |    4
     Python/compile.c            |   12 ++
     Python/errors.c             |    4
     Python/import.c             |   88 ++++++++++++------
     Python/pythonrun.c          |   39 ++++----
     Python/traceback.c          |   39 ++++++--
     38 files changed, 625 insertions(+), 265 deletions(-)

    @loewis
    Mannequin

    loewis mannequin commented Apr 20, 2010

    I haven't reviewed the patch in detail yet, but it seems to me that it fixes independent issues. -1000 on that. One problem, one bug report in the tracker, one commit.

    If this issue is about the import machinery not working anymore if there is a non-ASCII character in the path, then why the heck does it touch posixmodule.c????

    As for modules that have non-ASCII characters in their module name: this is, again, an unrelated issue (ISTM), so if you want to deal with it, please create a new issue.

    @vstinner
    Member Author

    > I haven't reviewed the patch in detail yet, but it seems to me that
    > it fixes independent issues.

    Right. At first I only wanted to fix the import machinery, but then I fixed a lot of "independent" issues in order to test the patch on import. All fixes are related to surrogates. I'm splitting the big patch into small parts: see the dependency list of this issue.

    I will open a new issue for the import machinery. But this patch requires extra changes which are now discussed in new issues.

    > (...) why the heck does it touch posixmodule.c?

    I opened issue bpo-8391 for this change: "os.execvpe() doesn't support surrogates in env".

    @vstinner vstinner changed the title from "Support surrogates in import ; install Python in a non-ASCII directory" to "Improve support of PEP 383 (surrogates) in Python3: meta-issue" Apr 20, 2010

    @vstinner
    Member Author

    vstinner commented May 4, 2010

    I opened a different issue to use surrogates in the Python module path: bpo-8611, but that issue is not specific to surrogates ("Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)").

    @vstinner
    Member Author

    I created a new svn branch for my work on import in unicode. I will open a new issue for it, so I am closing this one.

    @vstinner
    Member Author

    Remove dependency on bpo-6697 to be able to close this issue.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022