Message102522
STINNER Victor wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
>> I also found out that, according to RFC 3629, surrogates
>> are considered invalid and they can't be encoded/decoded,
>> but the UTF-8 codec actually does it.
>
> Python2 does, but Python3 raises an error.
>
> Python 2.7a4+ (trunk:79675, Apr 3 2010, 16:11:36)
> >>> u"\uDC80".encode("utf8")
> '\xed\xb2\x80'
>
> Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
> >>> "\uDC80".encode("utf8")
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
>
> Denying the encoding of surrogates (in UTF-8) causes a lot of crashes in Python 3, because most calling functions assume that _PyUnicode_AsString() never fails: see #6687 (and #8195 and a lot of other crashes). It's not a good idea to change it in Python 2.7, because it would require a huge amount of work and we are close to the first beta of 2.7.
I wonder how that change got into the 3.x branch - I would certainly
not have approved it for the reasons given further up on this ticket.
I think we should revert that change for Python 3.2.
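
For reference, the difference discussed above can be reproduced with a short sketch. This assumes a current Python 3 interpreter rather than the 2010 builds quoted above, and it uses the "surrogatepass" error handler as a stand-in for the permissive Python 2 behaviour; the variable names are illustrative only.

```python
# Sketch only: run on a current Python 3, not the 2010 builds quoted above.
lone = "\udc80"  # a lone low surrogate, as in the quoted sessions

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" asks the codec to emit the three-byte form anyway,
# which matches the bytes shown in the Python 2.7 session.
raw = lone.encode("utf-8", "surrogatepass")

# The same handler is needed to decode those bytes back.
roundtrip = raw.decode("utf-8", "surrogatepass")

print(strict_ok, raw, roundtrip == lone)
```

Code that has to tolerate lone surrogates must opt in explicitly with an error handler like this one; the default strict mode follows the RFC 3629 restriction mentioned in the quoted comment.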

Date | User | Action | Args
2010-04-07 08:37:37 | lemburg | set | recipients: + lemburg, sjmachin, vstinner, ezio.melotti, dangra
2010-04-07 08:37:35 | lemburg | link | issue8271 messages
2010-04-07 08:37:34 | lemburg | create |