Message 93518 - Python tracker


Author	Rhamphoryncus
Recipients	Rhamphoryncus, benjamin.peterson, ezio.melotti, hippietrail, jwilk, l.mastrodomenico, lemburg, terry.reedy
Date	2009-10-04.05:43:59
Message-id	<1254635045.77.0.589610642174.issue3297@psf.upfronthosting.co.za>

Content
I've traced the biggest problem to decode_unicode in ast.c.  It
needs to convert everything into a form of escapes so it becomes pure
ASCII, which is then evaluated back into a unicode object.
Unfortunately, it uses UTF-16-BE to do so, which always splits
characters outside the BMP into surrogates.  Switching it to UTF-32-BE
is fairly straightforward, and works even on UTF-16 (or "narrow") builds.
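
To illustrate the difference, here is a rough Python sketch, not the actual C code in ast.c.  The helper escape_via is a hypothetical name, and the sketch assumes a current Python 3 where str holds full code points.  It escapes every code unit of the chosen encoding and then evaluates the resulting pure-ASCII literal back into a string.

```python
import ast

def escape_via(text, encoding, unit_size):
    """Hypothetical helper: turn text into pure-ASCII escapes,
    one escape per code unit of the given big-endian encoding."""
    data = text.encode(encoding)
    fmt = "\\u{:04x}" if unit_size == 2 else "\\U{:08x}"
    units = (
        int.from_bytes(data[i:i + unit_size], "big")
        for i in range(0, len(data), unit_size)
    )
    return "".join(fmt.format(u) for u in units)

ch = "\U0001010F"

# UTF-16-BE yields two 16-bit units, so the character is split in two.
esc16 = escape_via(ch, "utf-16-be", 2)
# UTF-32-BE yields one 32-bit unit, so the character stays whole.
esc32 = escape_via(ch, "utf-32-be", 4)

# Evaluate the escaped literals back into str objects.
round16 = ast.literal_eval('"' + esc16 + '"')
round32 = ast.literal_eval('"' + esc32 + '"')

print(esc16, len(round16), round16 == ch)
print(esc32, len(round32), round32 == ch)
```

Going through 16-bit units gives back two separate surrogates rather than the original character, while the 32-bit route round-trips cleanly.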

Incidentally, there's no point using the surrogatepass error handler
once we actually support surrogates.
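
For reference, a small sketch of what surrogatepass changes at the Python level.  This assumes Python 3.4 or later, where the handler is also accepted by the UTF-32 codecs; it says nothing about how the C code calls the codec.

```python
lone = "\ud800"          # an unpaired high surrogate
astral = "\U0001010F"    # a real character outside the BMP

# The default (strict) handler refuses a lone surrogate.
try:
    lone.encode("utf-32-be")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass lets the lone surrogate through as-is.
passed = lone.encode("utf-32-be", "surrogatepass")

# A properly represented non-BMP character needs no special handler.
astral_bytes = astral.encode("utf-32-be")

print(strict_ok, passed, astral_bytes)
```

The handler only matters for unpaired surrogates; a full code point encodes fine without it.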

Unfortunately, there's a second problem in repr().
'\U0001010F'.isprintable() returns True on UTF-32 builds and False on
UTF-16 builds.  This causes repr() to escape it unnecessarily on UTF-16
builds.  repr() at least joins surrogate pairs before its internal
printable test (unlike .isprintable() or any other str method), but it
turns out all of the APIs in unicodectype.c only accept a single 16-bit
int in UTF-16 builds anyway.  That'll be a bigger patch than the first part.
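
A rough Python model of the joining step, for anyone following along.  This is only a sketch: join_surrogates is a hypothetical helper, the list of ints stands in for the 16-bit units of a narrow build, and the real logic lives in C.

```python
def join_surrogates(units):
    """Hypothetical helper: combine high/low surrogate pairs in a
    sequence of 16-bit units into full code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Standard UTF-16 pair arithmetic.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

narrow = [0xD800, 0xDD0F]        # '\U0001010F' as stored in 16-bit units
joined = join_surrogates(narrow)

# Testing each 16-bit unit on its own: surrogates are never printable.
per_unit = [chr(u).isprintable() for u in narrow]
# Testing the joined code point gives the answer a UTF-32 build reports.
per_char = [chr(cp).isprintable() for cp in joined]

print([hex(cp) for cp in joined], per_unit, per_char)
```

Checking the units one at a time reports them as unprintable, which matches the unnecessary escaping; the joined code point is only useful if the classification functions accept a value wider than 16 bits.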
