
Author Rhamphoryncus
Recipients Rhamphoryncus, benjamin.peterson, ezio.melotti, hippietrail, jwilk, l.mastrodomenico, lemburg, terry.reedy
Date 2009-10-04.05:43:59
Message-id <1254635045.77.0.589610642174.issue3297@psf.upfronthosting.co.za>
Content
I've traced the biggest problem to decode_unicode in ast.c.  It
converts everything into escapes so that the source becomes pure
ASCII, which is then evaluated back into a unicode object.
Unfortunately, it uses UTF-16-BE to do so, which always splits non-BMP
characters into surrogate pairs.  Switching it to UTF-32-BE is fairly
straightforward, and works even on UTF-16 (or "narrow") builds.
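
To illustrate the difference, here is a minimal Python sketch of the
escaping idea.  It is not the C code in ast.c: escape_via is a
hypothetical helper that encodes a string with the given codec and
writes each code unit as an escape, assuming only that the codec is
either "utf-16-be" or "utf-32-be".

```python
def escape_via(text, codec):
    """Escape every code unit of `text` as produced by `codec`.

    Hypothetical helper for illustration; it only mimics the idea of
    the escaping step and is not the implementation in ast.c.
    """
    if codec == "utf-16-be":
        width, fmt = 2, "\\u%04x"      # 16-bit code units
    elif codec == "utf-32-be":
        width, fmt = 4, "\\U%08x"      # 32-bit code units
    else:
        raise ValueError("unsupported codec: %r" % codec)
    data = text.encode(codec)
    units = [int.from_bytes(data[i:i + width], "big")
             for i in range(0, len(data), width)]
    return "".join(fmt % unit for unit in units)


ch = "\U0001010F"
print(escape_via(ch, "utf-16-be"))   # \ud800\udd0f  (two surrogate escapes)
print(escape_via(ch, "utf-32-be"))   # \U0001010f    (one escape)
```

With UTF-16-BE the character is emitted as two separate surrogate
escapes, while UTF-32-BE keeps it as a single escape for the whole
code point.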

Incidentally, there's no point using the surrogatepass error handler
once we actually support surrogates.

Unfortunately there's a second problem, in repr().
'\U0001010F'.isprintable() returns True on UTF-32 builds and False on
UTF-16 builds, which causes repr() to escape it unnecessarily on UTF-16
builds.  repr() at least joins surrogate pairs before its internal
printable test (unlike .isprintable() or any other str method), but it
turns out that all of the APIs in unicodectype.c accept only a single
16-bit int on UTF-16 builds anyway.  Fixing that will be a bigger patch
than the first part.
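
As a rough sketch of what "joining surrogate pairs" before a property
test means, the arithmetic looks like this in Python.  The function
name join_surrogates is hypothetical and this is not the code used by
repr(); it only shows how two 16-bit code units map back to one code
point, which is the value a character-property API would need to
receive.

```python
def join_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper for illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = join_surrogates(0xD800, 0xDD0F)
print(hex(cp))                      # 0x1010f

# A property test on the joined code point sees the real character;
# a test on either 16-bit half alone only ever sees a lone surrogate.
print(chr(cp).isprintable())
print("\ud800".isprintable())
```

An API that takes only a single 16-bit value can never be handed the
joined code point, which is why the property functions themselves
would have to change.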