Python interpreter uses Unicode surrogate pairs only before the pyc is created #47547
Comments
Problem: when you have Unicode characters with a code point greater than
Tested on:
Steps to reproduce the problem:
(Instead of using reload() it is also possible to create a function and
Expected behavior:
Further information:
On my Linux box sys.maxunicode == 1114111 and len(u'\U00010123') == 1, |
Simpler way to reproduce this (on Linux):
$ rm unicodetest.pyc
$
$ python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: u'\ud800\udd23' u'\U00010123'
$
$ python -c 'import unicodetest'
Result: True
Len: 1 1
Repr: u'\U00010123' u'\U00010123'
Storing surrogates in UTF-32 is ill-formed[1], so the first part
The repr could go either way, as Unicode doesn't cover escape sequences.
The bigger problem is how much we prohibit ill-formed character
[1] Search for D90 in http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
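The two reprs in the output above are related by the standard UTF-16 surrogate arithmetic. As a minimal sketch (the helper names `to_surrogate_pair` and `from_surrogate_pair` are made up for illustration and are not part of Python), the mapping between U+10123 and the pair D800/DD23 can be checked like this:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high/low surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10123)
print(hex(high), hex(low))                   # 0xd800 0xdd23
print(hex(from_surrogate_pair(high, low)))   # 0x10123
```

This only illustrates why the same character can show up with length 2 or length 1; it says nothing about where in the interpreter the two forms diverge.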
Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16 The conversions done from the literal escaped representation to the PYC files are written using the marshal module, which uses UTF-8 as All of these codecs know about surrogates, so there must be a bug I checked on Linux using a UCS2 and a UCS4 build of Python 2.5: the |
No, the configure options are wrong - we do use UTF-16 and UTF-32. If it really was UCS-2, the repr wouldn't be u'\U00010123' on windows. |
Adam, I do know what I'm talking about: I was the lead designer of the What you see as repr() of a Unicode object is the result of applying a That said, Ezio did uncover a bug and we need to find the cause. It's case 3:
if ((s[1] & 0xc0) != 0x80 ||
(s[2] & 0xc0) != 0x80) {
errmsg = "invalid data";
startinpos = s-starts;
endinpos = startinpos+3;
goto utf8Error;
}
ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] &
0x3f);
if (ch < 0x0800) {
/* Note: UTF-8 encodings of surrogates are considered
legal UTF-8 sequences;
|
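For readers who do not read C, the three-byte branch quoted above can be sketched in Python. This is an illustration only: the function name `decode_three_byte` is hypothetical, and the surrogate rejection at the end is an assumption reflecting the stricter reading argued for elsewhere in this thread, not what the quoted codec did (its comment says surrogates were treated as legal).

```python
def decode_three_byte(seq):
    """Decode one 3-byte UTF-8 sequence (a bytes object of length 3) to a code point."""
    if len(seq) != 3 or (seq[0] & 0xF0) != 0xE0:
        raise ValueError("not a 3-byte sequence")
    # Both trailing bytes must be continuation bytes of the form 10xxxxxx.
    if (seq[1] & 0xC0) != 0x80 or (seq[2] & 0xC0) != 0x80:
        raise ValueError("invalid data")
    ch = ((seq[0] & 0x0F) << 12) + ((seq[1] & 0x3F) << 6) + (seq[2] & 0x3F)
    if ch < 0x0800:
        # Overlong form: the value would fit in a shorter sequence.
        raise ValueError("illegal encoding")
    if 0xD800 <= ch <= 0xDFFF:
        # Assumption: strict behavior, rejecting UTF-8-encoded surrogates.
        raise ValueError("surrogates not allowed")
    return ch


print(hex(decode_three_byte(b"\xe2\x82\xac")))  # 0x20ac
```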
Marc, perhaps Unicode has refined their definitions since you last looked? Valid UTF-8 *cannot* contain surrogates[1]. If it does, you have So there are two bugs: first, the UTF-8 codec should refuse to load [1] 4th bullet point of D92 in |
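The strict behavior argued for here can be observed in current Python 3 releases, which is not necessarily how the interpreters discussed in this thread behaved. A small sketch, assuming a Python 3 version that provides the `surrogatepass` error handler:

```python
lone = "\ud800"  # a lone high surrogate

# The strict UTF-8 encoder refuses surrogates.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The surrogatepass handler emits the three-byte form anyway.
raw = lone.encode("utf-8", "surrogatepass")

# The strict decoder refuses those bytes as well.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok, raw, strict_decode_ok)
```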
Err, to clarify, the parse/compile/whatever stages is producing broken |
Ping. |
"Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16 I recently read most of the Unicode 5 standard and as near as I could
Section C.2 says "UCS-4 can now be taken effectively as an alias for the
U5 has 3 coding formats which it names UTF-8,16,32 and 7 serialization
----------------------
On WinXP, IDLE 3.0b2
>>> repr('\U00010123') # u prefix no longer needed or valid
"'𐄣'"
>>> repr('\ud800\udd23')
"'𐄣'"
# Interesting: what I cut from IDLE has 2 empty boxes instead of the one
larger square with 010 and 123 I see in Firefox. len(repr('\U00010123'))
is 4, not 3, so Firefox recognizes the surrogate pair and displays one symbol.
Entering either directly into the interpreter gives
Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit
(Intel)] on win32
>>> c='\U00010123'
>>> len(c)
2
>>> repr(c)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python30\lib\io.py", line 1428, in write
b = encoder.encode(s)
File "C:\Program Files\Python30\lib\encodings\cp437.py", line 19, in
encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position
2-3: character maps to <undefined>
2.5 gives instead "u'\\U00010123'" as reported, so I added 3.0 to the
I do wonder how can repr() work on IDLE but not the underlying
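The traceback above comes from writing the repr to a cp437 console, not from repr() itself. A hedged sketch of the same failure and two ways around it, run on a current Python 3 rather than 3.0b2, so details such as the reported positions may differ from the session above:

```python
text = "\U00010123"

# cp437 has no mapping for this character, so strict encoding fails.
try:
    text.encode("cp437")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# An error handler that substitutes an escape sequence avoids the exception.
escaped = text.encode("cp437", "backslashreplace")

# ascii() produces a repr that contains only ASCII characters.
safe_repr = ascii(text)

print(encodable, escaped, safe_repr)
```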
On 2008-08-29 23:33, Terry J. Reedy wrote:
UCS2 and UCS4 are terms which stem from the versions of Unicode See http://en.wikipedia.org/wiki/Universal_Character_Set for details. UTF-16 is a transfer encoding that is based on UCS2 by adding Whether surrogates are supported or not and how they are supported
You are mixing the internal representation of Unicode code points Also note that because Python can be built using two different internal BTW: There's no such thing as an ill-formed code unit. What you probably Please also note that because Python can be used to build valid Whether the codecs should raise exceptions and possibly let an I hope that clears up the reasoning for using UCS2/UCS4 rather |
Marc, I don't understand what you're saying. UTF-16's surrogates are Likewise, UCS-4 originally allowed a much larger range of code points, You are right in that I shouldn't have said "a pair of ill-formed code Although python may allow ill-formed sequences to be created internally |
I've got another report open about the codecs not properly reporting |
Looks like the failure mode has changed here, presumably due to issue
$ rm unicodetest.pyc
$ ./python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: '\ud800\udd23' '\U00010123'
[28877 refs]
$ ./python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: '\ud800\udd23' '\U00010123'
[28708 refs] |
I've traced down the biggest problem to decode_unicode in ast.c. It Incidentally, there's no point using the surrogatepass error handler Unfortunately there's a second problem in repr(). |
This last point is already tracked by bpo-5127. |
Patch, which uses UTF-32-BE as indicated in my last comment. Test included. |
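As a rough illustration of why a UTF-32 form sidesteps surrogates (this does not reproduce the patch, which changes C code), compare how the same character serializes in UTF-32-BE and UTF-16-BE:

```python
ch = "\U00010123"

utf32 = ch.encode("utf-32-be")  # one 4-byte code unit per code point
utf16 = ch.encode("utf-16-be")  # two 2-byte code units: a surrogate pair

print(utf32.hex())  # 00010123
print(utf16.hex())  # d800dd23

# Round-tripping through UTF-32-BE yields the original character.
assert utf32.decode("utf-32-be") == ch
```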
With some further prodding I've noticed that although the test behaves |
Committed Adam's patch in r75928. |
@benjamin.peterson: Do you plan to port r75928 to 2.7 and 3.1? If not, can you close this issue? I think that this issue's priority is minor because few people write non-BMP characters directly in Python files (maybe only one, Ezio Melotti :-)). u"\uxxxx", u"\Uxxxxxxxx" or unichr(xxx) can be used in Python 2.7 and 3.1 (without the u prefix, and with chr() instead of unichr(), for 3.1). |
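The workaround mentioned in this comment, written for Python 3 (where unichr() is spelled chr() and the u prefix is unnecessary), might look like the sketch below. It assumes a current Python 3; on the narrow builds discussed in this thread, chr() with a non-BMP argument could behave differently.

```python
# Spell the character with an escape instead of typing it literally.
from_escape = "\U00010123"

# Or build it from its code point.
from_chr = chr(0x10123)

print(from_escape == from_chr)   # True
print(hex(ord(from_escape)))     # 0x10123
```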
We are too close to the final 2.7 release; it's too late to backport. As I wrote, this feature is not important and there are many workarounds, so we don't need to backport to 3.1. Closing the issue: use Python 3.2 if you want better Unicode support ;-) |