This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: compile() doesn't support the PEP 383 (surrogates)
Type: Stage:
Components: Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, terry.reedy, vstinner
Priority: normal Keywords: patch

Created on 2010-10-15 12:34 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
code_encoding.patch vstinner, 2010-10-15 22:58
Messages (13)
msg118762 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-15 12:34
Example:

$ ./python
Python 3.2a3+ (py3k, Oct 15 2010, 14:31:59) 
>>> compile('', 'abc\uDC80', 'exec')
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 3: surrogates not allowed

Attached patch encodes manually the filename to utf-8 with surrogateescape.

I found this problem while testing Python with an ASCII locale encoding (LANG=C ./python Lib/test/regrtest.py). Example:

  $ LANG=C ./python -m base64 -e setup.py 
  ...
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' ...
msg118833 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-10-15 21:50
I think the title is slightly misleading. As I read the patch, the issue is that PyArg_ParseTupleAndKeywords requires that string args to C functions be valid Unicode strings (and that it does this by trying to encode to utf-8). Your patch subverts this by redefining filename to be a generic object, with a looser custom-coded test. It is not clear to me that filename, out of all string args to builtins, should be excepted this way. It seems to me that any real filename should be real unicode and there is no need for fake names that are not.
msg118836 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-15 22:04
#6543 changed code->co_filename encoding from filesystem encoding+surrogateescape to utf-8+strict.

With my patch, compile('', '\udcc3\udca9', 'exec').co_filename gives 'é', it doesn't depend on the filesystem encoding. But 'é' cannot be used with all filesystem encodings, eg. with ascii locale encoding (C locale), use it raises an error.

I now think that it was a bad idea to use utf-8 instead of the fileystem encoding. All filenames should use the filesystem encoding in Python.
msg118837 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-15 22:08
See also #9713.
msg118842 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-15 22:58
> All filenames should use the filesystem encoding in Python.

Here is a new patch [code_encoding.patch] implementing this idea:

 - Use filesystem encoding (and surrogateescape) to encode/decode paths in compile() and the parser, instead of utf-8 in strict mode
 - Ensure that co_filename attribute can be used as a filename (eg. to not raise UnicodeEncodeError on Linux)
 - compile() builtin supports bytes filenames
 - _Py_FindSourceFile() (traceback.c) encodes paths of sys.path into the filesystem encoding, as do find_module() (import.c)
 - PyRun_SimpleFileExFlags() sets __file__ attribute using the filesystem encoding

The patch restores the situation before #6543.
msg118843 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-10-15 22:58
Pardon my ignorance, but given that code.co_filename is a string attribute given as a string, which is to say, unicode in 3.x, I do not see what filesystem encodings, or any other encoding to bytes should really have to do with the attribute. I actually would have expected compile to take your example argument 'abc\uDC80' and paste it onto the code object unchanged. The only issue to me is whether any string should be allowed or only legal-unicode strings. Anything else would seem like a 2.x holdover.

If PyBytes_AS_STRING (macro version of PyBytes_AsString) is the implementation of str(bytes_object) (as I would guess from the doc), then as I read your patch, it will produce rather strange 'filenames'.
>>> str('abc\uDC80'.encode("utf-8", "surrogateescape"))
"b'abc\\x80'"
always wrapped in b'...'.

If not that, what does it do (with no decoding specified)?
msg118845 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-15 23:17
> I do not see what filesystem encodings, or any other encoding 
> to bytes should really have to do with the [code.co_filename].

co_filename attribute is used to display the traceback: Python opens the related file, read the source code line and display it. On Windows, co_filename is directly used because Windows accepts unicode for filenames. But on other OSes, you have to encode the filename to the filesystem encoding.

If your filesystem encoding is 'ascii' (eg. C locale) and co_filename is a non-ascii filename (eg. 'test_é.py'), encode co_filename will raise a UnicodeEncodeError. You can test it simply by using os.fsencode():

$ ./python 
Python 3.2a3+ (py3k:85551:85553M, Oct 16 2010, 00:54:03) 
>>> import sys; sys.getfilesystemencoding()
'utf-8'
>>> import os; os.fsencode('é')
b'\xc3\xa9'

$ LANG= ./python 
Python 3.2a3+ (py3k:85551:85553M, Oct 16 2010, 00:54:03) 
>>> import sys; sys.getfilesystemencoding()
'ascii'
>>> import os; os.fsencode('\xe9')
...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' ...

Said differently, co_filename should be encodable to the filesystem encoding (os.fsencode(co_filename) should not raise an error).
msg118846 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-15 23:18
Remove [compile_surrogates.patch] because it creates filenames unencode to the filesystem encoding. Eg. compile('', '\udcc3\udca9', 'exec').co_filename gives 'é' even if the filesystem encoding is 'ascii'.
msg118865 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-16 12:11
> Here is a new patch [code_encoding.patch] implementing this idea:
> - Use filesystem encoding (and surrogateescape) to encode/decode
> paths in compile() and the parser, instead of utf-8 in strict mode
> (...)
> The patch restores the situation before #6543.

About Python 3.1 compatibility: Python 3.1 doesn't support non-ascii paths with a locale different than utf-8 (see issue #8611), so it doesn't change anything for Python 3.1 (it doesn't work anyway, with utf-8 or filesystem encoding).
msg118866 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-16 12:22
Oh, I just realized that Python 3.1.2 (last Python 3.1 release) was released the 21st March, whereas r82063 (commit for #6543) was made the 17st June. So the encoding change was not released yet.
msg118868 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-16 13:51
Commited to 3.2 (r85569+r85570). I wait for the buildbot before porting the patch to 3.1 and close the issue. There is already a regression on Gentoo buildbot with ascii locale encoding, test_doctest test_zipimport_support:

http://www.python.org/dev/buildbot/all/builders/AMD64%20Gentoo%20Wide%203.x/builds/106
msg118871 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-16 14:16
I created #10123 for the test_doctest regression.
msg119099 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-19 01:22
Buildbots are green again (#10123 is closed). I ported the fix to Python 3.1 (r85716). Close this issue.
History
Date User Action Args
2022-04-11 14:57:07adminsetgithub: 54323
2010-10-19 01:22:19vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg119099
2010-10-16 14:16:12vstinnersetmessages: + msg118871
2010-10-16 13:51:23vstinnersetmessages: + msg118868
2010-10-16 12:22:19vstinnersetmessages: + msg118866
2010-10-16 12:11:56vstinnersetmessages: + msg118865
2010-10-15 23:19:17vstinnersetfiles: - compile_surrogates.patch
2010-10-15 23:18:55vstinnersetmessages: + msg118846
2010-10-15 23:17:36vstinnersetmessages: + msg118845
2010-10-15 22:58:41terry.reedysetmessages: + msg118843
2010-10-15 22:58:13vstinnersetfiles: + code_encoding.patch

messages: + msg118842
2010-10-15 22:08:57vstinnersetmessages: + msg118837
2010-10-15 22:04:54vstinnersetmessages: + msg118836
2010-10-15 21:50:09terry.reedysetnosy: + terry.reedy
messages: + msg118833
2010-10-15 13:41:07pitrousetnosy: + benjamin.peterson
2010-10-15 12:34:50vstinnercreate