Message93518
I've traced the biggest problem to decode_unicode in ast.c. It
needs to convert everything into escapes so the source becomes pure
ASCII, which is then evaluated back into a unicode object.
Unfortunately, it uses UTF-16-BE to do so, which always splits non-BMP
characters into surrogate pairs. Switching it to UTF-32-BE is fairly
straightforward, and works even on UTF-16 (or "narrow") builds.
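
A minimal Python sketch of the difference between the two intermediate encodings. It only illustrates the idea and is not the C code in ast.c; the variable names are made up for this example, and it assumes a modern CPython (3.3+), where narrow builds no longer exist:

```python
import ast

ch = "\U0001010F"  # a single non-BMP character

# Going through UTF-16-BE yields two 16-bit code units (a surrogate pair).
utf16 = ch.encode("utf-16-be")
units16 = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
escapes16 = "".join("\\u%04x" % u for u in units16)

# Going through UTF-32-BE yields one 32-bit code unit per character.
utf32 = ch.encode("utf-32-be")
units32 = [int.from_bytes(utf32[i:i + 4], "big") for i in range(0, len(utf32), 4)]
escapes32 = "".join("\\U%08x" % u for u in units32)

# Evaluate the pure-ASCII escape text back into str objects.
from_utf16 = ast.literal_eval("'" + escapes16 + "'")  # two lone surrogates
from_utf32 = ast.literal_eval("'" + escapes32 + "'")  # the original character
```

With the \U escape the character round-trips intact, while the pair of \u escapes evaluates to two separate surrogate code points.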
Incidentally, there's no point using the surrogatepass error handler
once we actually support surrogates.
Unfortunately, there's a second problem in repr().
'\U0001010F'.isprintable() returns True on UTF-32 builds and False on
UTF-16 builds. This causes repr() to escape it unnecessarily on UTF-16
builds. repr() at least joins surrogate pairs before its internal
printable test (unlike .isprintable() or any other str method), but it
turns out all of the APIs in unicodectype.c only accept a single 16-bit
int in UTF-16 builds anyway. Fixing that will be a bigger patch than the first part.
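
To make the repr() point concrete, here is a rough Python sketch of joining surrogate pairs before testing printability. The helper join_surrogates is hypothetical and written only for this illustration; it is not the code repr() uses, and it assumes a modern CPython where chr() accepts the full code point range:

```python
def join_surrogates(units):
    """Combine UTF-16 code units into code points, pairing up surrogates."""
    points = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 decoding of a high/low surrogate pair.
            points.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            points.append(u)
            i += 1
    return points

units = [0xD800, 0xDD0F]  # '\U0001010F' as stored on a narrow build

# Testing each 16-bit unit on its own sees only surrogates.
per_unit = [chr(u).isprintable() for u in units]

# Joining first lets the test see the real code point.
joined = join_surrogates(units)
per_point = [chr(cp).isprintable() for cp in joined]
```

Tested unit by unit, both halves count as unprintable; tested after joining, the character is printable, which matches the UTF-32 result described above.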
Date | User | Action | Args |
2009-10-04 05:44:06 | Rhamphoryncus | set | recipients: + Rhamphoryncus, lemburg, terry.reedy, l.mastrodomenico, benjamin.peterson, jwilk, ezio.melotti, hippietrail |
2009-10-04 05:44:05 | Rhamphoryncus | set | messageid: <1254635045.77.0.589610642174.issue3297@psf.upfronthosting.co.za> |
2009-10-04 05:44:03 | Rhamphoryncus | link | issue3297 messages |
2009-10-04 05:43:59 | Rhamphoryncus | create | |