Issue 10114: compile() doesn't support the PEP 383 (surrogates)

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54323

classification

Title:	compile() doesn't support the PEP 383 (surrogates)
Type:		Stage:
Components:	Interpreter Core, Unicode	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, terry.reedy, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-10-15 12:34 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
code_encoding.patch	vstinner, 2010-10-15 22:58

Messages (13)
msg118762 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-15 12:34
Example: $ ./python Python 3.2a3+ (py3k, Oct 15 2010, 14:31:59) >>> compile('', 'abc\uDC80', 'exec') ... UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 3: surrogates not allowed Attached patch encodes manually the filename to utf-8 with surrogateescape. I found this problem while testing Python with an ASCII locale encoding (LANG=C ./python Lib/test/regrtest.py). Example: $ LANG=C ./python -m base64 -e setup.py ... UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' ...
msg118833 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-10-15 21:50
I think the title is slightly misleading. As I read the patch, the issue is that PyArg_ParseTupleAndKeywords requires that string args to C functions be valid Unicode strings (and that it does this by trying to encode to utf-8). Your patch subverts this by redefining filename to be a generic object, with a looser custom-coded test. It is not clear to me that filename, out of all string args to builtins, should be excepted this way. It seems to me that any real filename should be real unicode and there is no need for fake names that are not.
msg118836 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-15 22:04
#6543 changed code->co_filename encoding from filesystem encoding+surrogateescape to utf-8+strict. With my patch, compile('', '\udcc3\udca9', 'exec').co_filename gives 'é', it doesn't depend on the filesystem encoding. But 'é' cannot be used with all filesystem encodings, eg. with ascii locale encoding (C locale), use it raises an error. I now think that it was a bad idea to use utf-8 instead of the fileystem encoding. All filenames should use the filesystem encoding in Python.
msg118837 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-15 22:08
See also #9713.
msg118842 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-15 22:58
> All filenames should use the filesystem encoding in Python. Here is a new patch [code_encoding.patch] implementing this idea: - Use filesystem encoding (and surrogateescape) to encode/decode paths in compile() and the parser, instead of utf-8 in strict mode - Ensure that co_filename attribute can be used as a filename (eg. to not raise UnicodeEncodeError on Linux) - compile() builtin supports bytes filenames - _Py_FindSourceFile() (traceback.c) encodes paths of sys.path into the filesystem encoding, as do find_module() (import.c) - PyRun_SimpleFileExFlags() sets __file__ attribute using the filesystem encoding The patch restores the situation before #6543.
msg118843 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-10-15 22:58
Pardon my ignorance, but given that code.co_filename is a string attribute given as a string, which is to say, unicode in 3.x, I do not see what filesystem encodings, or any other encoding to bytes should really have to do with the attribute. I actually would have expected compile to take your example argument 'abc\uDC80' and paste it onto the code object unchanged. The only issue to me is whether any string should be allowed or only legal-unicode strings. Anything else would seem like a 2.x holdover. If PyBytes_AS_STRING (macro version of PyBytes_AsString) is the implementation of str(bytes_object) (as I would guess from the doc), then as I read your patch, it will produce rather strange 'filenames'. >>> str('abc\uDC80'.encode("utf-8", "surrogateescape")) "b'abc\\x80'" always wrapped in b'...'. If not that, what does it do (with no decoding specified)?
msg118845 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-15 23:17
> I do not see what filesystem encodings, or any other encoding > to bytes should really have to do with the [code.co_filename]. co_filename attribute is used to display the traceback: Python opens the related file, read the source code line and display it. On Windows, co_filename is directly used because Windows accepts unicode for filenames. But on other OSes, you have to encode the filename to the filesystem encoding. If your filesystem encoding is 'ascii' (eg. C locale) and co_filename is a non-ascii filename (eg. 'test_é.py'), encode co_filename will raise a UnicodeEncodeError. You can test it simply by using os.fsencode(): $ ./python Python 3.2a3+ (py3k:85551:85553M, Oct 16 2010, 00:54:03) >>> import sys; sys.getfilesystemencoding() 'utf-8' >>> import os; os.fsencode('é') b'\xc3\xa9' $ LANG= ./python Python 3.2a3+ (py3k:85551:85553M, Oct 16 2010, 00:54:03) >>> import sys; sys.getfilesystemencoding() 'ascii' >>> import os; os.fsencode('\xe9') ... UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' ... Said differently, co_filename should be encodable to the filesystem encoding (os.fsencode(co_filename) should not raise an error).
msg118846 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-15 23:18
Remove [compile_surrogates.patch] because it creates filenames unencode to the filesystem encoding. Eg. compile('', '\udcc3\udca9', 'exec').co_filename gives 'é' even if the filesystem encoding is 'ascii'.
msg118865 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-16 12:11
> Here is a new patch [code_encoding.patch] implementing this idea: > - Use filesystem encoding (and surrogateescape) to encode/decode > paths in compile() and the parser, instead of utf-8 in strict mode > (...) > The patch restores the situation before #6543. About Python 3.1 compatibility: Python 3.1 doesn't support non-ascii paths with a locale different than utf-8 (see issue #8611), so it doesn't change anything for Python 3.1 (it doesn't work anyway, with utf-8 or filesystem encoding).
msg118866 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-16 12:22
Oh, I just realized that Python 3.1.2 (last Python 3.1 release) was released the 21st March, whereas r82063 (commit for #6543) was made the 17st June. So the encoding change was not released yet.
msg118868 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-16 13:51
Commited to 3.2 (r85569+r85570). I wait for the buildbot before porting the patch to 3.1 and close the issue. There is already a regression on Gentoo buildbot with ascii locale encoding, test_doctest test_zipimport_support: http://www.python.org/dev/buildbot/all/builders/AMD64%20Gentoo%20Wide%203.x/builds/106
msg118871 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-16 14:16
I created #10123 for the test_doctest regression.
msg119099 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-19 01:22
Buildbots are green again (#10123 is closed). I ported the fix to Python 3.1 (r85716). Close this issue.

History
Date	User	Action	Args
2022-04-11 14:57:07	admin	set	github: 54323
2010-10-19 01:22:19	vstinner	set	status: open -> closed resolution: fixed messages: + msg119099
2010-10-16 14:16:12	vstinner	set	messages: + msg118871
2010-10-16 13:51:23	vstinner	set	messages: + msg118868
2010-10-16 12:22:19	vstinner	set	messages: + msg118866
2010-10-16 12:11:56	vstinner	set	messages: + msg118865
2010-10-15 23:19:17	vstinner	set	files: - compile_surrogates.patch
2010-10-15 23:18:55	vstinner	set	messages: + msg118846
2010-10-15 23:17:36	vstinner	set	messages: + msg118845
2010-10-15 22:58:41	terry.reedy	set	messages: + msg118843
2010-10-15 22:58:13	vstinner	set	files: + code_encoding.patch messages: + msg118842
2010-10-15 22:08:57	vstinner	set	messages: + msg118837
2010-10-15 22:04:54	vstinner	set	messages: + msg118836
2010-10-15 21:50:09	terry.reedy	set	nosy: + terry.reedy messages: + msg118833
2010-10-15 13:41:07	pitrou	set	nosy: + benjamin.peterson
2010-10-15 12:34:50	vstinner	create