msg118762 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-15 12:34 |
Example:
$ ./python
Python 3.2a3+ (py3k, Oct 15 2010, 14:31:59)
>>> compile('', 'abc\uDC80', 'exec')
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 3: surrogates not allowed
Attached patch encodes manually the filename to utf-8 with surrogateescape.
I found this problem while testing Python with an ASCII locale encoding (LANG=C ./python Lib/test/regrtest.py). Example:
$ LANG=C ./python -m base64 -e setup.py
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' ...
|
msg118833 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2010-10-15 21:50 |
I think the title is slightly misleading. As I read the patch, the issue is that PyArg_ParseTupleAndKeywords requires that string args to C functions be valid Unicode strings (and that it does this by trying to encode to utf-8). Your patch subverts this by redefining filename to be a generic object, with a looser custom-coded test. It is not clear to me that filename, out of all string args to builtins, should be excepted this way. It seems to me that any real filename should be real unicode and there is no need for fake names that are not.
|
msg118836 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-15 22:04 |
#6543 changed code->co_filename encoding from filesystem encoding+surrogateescape to utf-8+strict.
With my patch, compile('', '\udcc3\udca9', 'exec').co_filename gives 'é', it doesn't depend on the filesystem encoding. But 'é' cannot be used with all filesystem encodings, eg. with ascii locale encoding (C locale), use it raises an error.
I now think that it was a bad idea to use utf-8 instead of the fileystem encoding. All filenames should use the filesystem encoding in Python.
|
msg118837 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-15 22:08 |
See also #9713.
|
msg118842 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-15 22:58 |
> All filenames should use the filesystem encoding in Python.
Here is a new patch [code_encoding.patch] implementing this idea:
- Use filesystem encoding (and surrogateescape) to encode/decode paths in compile() and the parser, instead of utf-8 in strict mode
- Ensure that co_filename attribute can be used as a filename (eg. to not raise UnicodeEncodeError on Linux)
- compile() builtin supports bytes filenames
- _Py_FindSourceFile() (traceback.c) encodes paths of sys.path into the filesystem encoding, as do find_module() (import.c)
- PyRun_SimpleFileExFlags() sets __file__ attribute using the filesystem encoding
The patch restores the situation before #6543.
|
msg118843 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2010-10-15 22:58 |
Pardon my ignorance, but given that code.co_filename is a string attribute given as a string, which is to say, unicode in 3.x, I do not see what filesystem encodings, or any other encoding to bytes should really have to do with the attribute. I actually would have expected compile to take your example argument 'abc\uDC80' and paste it onto the code object unchanged. The only issue to me is whether any string should be allowed or only legal-unicode strings. Anything else would seem like a 2.x holdover.
If PyBytes_AS_STRING (macro version of PyBytes_AsString) is the implementation of str(bytes_object) (as I would guess from the doc), then as I read your patch, it will produce rather strange 'filenames'.
>>> str('abc\uDC80'.encode("utf-8", "surrogateescape"))
"b'abc\\x80'"
always wrapped in b'...'.
If not that, what does it do (with no decoding specified)?
|
msg118845 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-15 23:17 |
> I do not see what filesystem encodings, or any other encoding
> to bytes should really have to do with the [code.co_filename].
co_filename attribute is used to display the traceback: Python opens the related file, read the source code line and display it. On Windows, co_filename is directly used because Windows accepts unicode for filenames. But on other OSes, you have to encode the filename to the filesystem encoding.
If your filesystem encoding is 'ascii' (eg. C locale) and co_filename is a non-ascii filename (eg. 'test_é.py'), encode co_filename will raise a UnicodeEncodeError. You can test it simply by using os.fsencode():
$ ./python
Python 3.2a3+ (py3k:85551:85553M, Oct 16 2010, 00:54:03)
>>> import sys; sys.getfilesystemencoding()
'utf-8'
>>> import os; os.fsencode('é')
b'\xc3\xa9'
$ LANG= ./python
Python 3.2a3+ (py3k:85551:85553M, Oct 16 2010, 00:54:03)
>>> import sys; sys.getfilesystemencoding()
'ascii'
>>> import os; os.fsencode('\xe9')
...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' ...
Said differently, co_filename should be encodable to the filesystem encoding (os.fsencode(co_filename) should not raise an error).
|
msg118846 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-15 23:18 |
Remove [compile_surrogates.patch] because it creates filenames unencode to the filesystem encoding. Eg. compile('', '\udcc3\udca9', 'exec').co_filename gives 'é' even if the filesystem encoding is 'ascii'.
|
msg118865 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-16 12:11 |
> Here is a new patch [code_encoding.patch] implementing this idea:
> - Use filesystem encoding (and surrogateescape) to encode/decode
> paths in compile() and the parser, instead of utf-8 in strict mode
> (...)
> The patch restores the situation before #6543.
About Python 3.1 compatibility: Python 3.1 doesn't support non-ascii paths with a locale different than utf-8 (see issue #8611), so it doesn't change anything for Python 3.1 (it doesn't work anyway, with utf-8 or filesystem encoding).
|
msg118866 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-16 12:22 |
Oh, I just realized that Python 3.1.2 (last Python 3.1 release) was released the 21st March, whereas r82063 (commit for #6543) was made the 17st June. So the encoding change was not released yet.
|
msg118868 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-16 13:51 |
Commited to 3.2 (r85569+r85570). I wait for the buildbot before porting the patch to 3.1 and close the issue. There is already a regression on Gentoo buildbot with ascii locale encoding, test_doctest test_zipimport_support:
http://www.python.org/dev/buildbot/all/builders/AMD64%20Gentoo%20Wide%203.x/builds/106
|
msg118871 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-16 14:16 |
I created #10123 for the test_doctest regression.
|
msg119099 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-10-19 01:22 |
Buildbots are green again (#10123 is closed). I ported the fix to Python 3.1 (r85716). Close this issue.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:07 | admin | set | github: 54323 |
2010-10-19 01:22:19 | vstinner | set | status: open -> closed resolution: fixed messages:
+ msg119099
|
2010-10-16 14:16:12 | vstinner | set | messages:
+ msg118871 |
2010-10-16 13:51:23 | vstinner | set | messages:
+ msg118868 |
2010-10-16 12:22:19 | vstinner | set | messages:
+ msg118866 |
2010-10-16 12:11:56 | vstinner | set | messages:
+ msg118865 |
2010-10-15 23:19:17 | vstinner | set | files:
- compile_surrogates.patch |
2010-10-15 23:18:55 | vstinner | set | messages:
+ msg118846 |
2010-10-15 23:17:36 | vstinner | set | messages:
+ msg118845 |
2010-10-15 22:58:41 | terry.reedy | set | messages:
+ msg118843 |
2010-10-15 22:58:13 | vstinner | set | files:
+ code_encoding.patch
messages:
+ msg118842 |
2010-10-15 22:08:57 | vstinner | set | messages:
+ msg118837 |
2010-10-15 22:04:54 | vstinner | set | messages:
+ msg118836 |
2010-10-15 21:50:09 | terry.reedy | set | nosy:
+ terry.reedy messages:
+ msg118833
|
2010-10-15 13:41:07 | pitrou | set | nosy:
+ benjamin.peterson
|
2010-10-15 12:34:50 | vstinner | create | |