Message 69575 - Python tracker



Author	Rhamphoryncus
Recipients	Rhamphoryncus, ezio.melotti, lemburg
Date	2008-07-11.23:33:38
Message-id	<1215819220.37.0.132294711295.issue3297@psf.upfronthosting.co.za>
In-reply-to

Content
Simpler way to reproduce this (on Linux):

$ rm unicodetest.pyc 
$ 
$ python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: u'\ud800\udd23' u'\U00010123'
$ 
$ python -c 'import unicodetest'
Result: True
Len: 1 1
Repr: u'\U00010123' u'\U00010123'

Storing surrogates in UTF-32 is ill-formed[1], so the first part
definitely shouldn't be failing on Linux (with a UTF-32 build).

The repr could go either way, as Unicode doesn't cover escape sequences.
We could allow u'\ud800\udd23' literals to magically become
u'\U00010123' on UTF-32 builds.  We already allow repr(u'\ud800\udd23')
to magically become "u'\U00010123'" on UTF-16 builds (which is why the
repr test always passes there, rather than always failing).
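
For reference, the arithmetic that relates the pair \ud800\udd23 to the
single code point U+10123 can be sketched as below.  This is only an
illustration in Python 3 syntax; combine_surrogates is a hypothetical
helper name, not something that exists in CPython.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it encodes.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # The high surrogate carries the top 10 bits and the low surrogate
    # the bottom 10 bits of the offset from U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD800, 0xDD23)))  # prints 0x10123
```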

The bigger problem is how much we prohibit ill-formed character
sequences.  We already prevent values above U+10FFFF, but not
inappropriate surrogates.
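
As a rough sketch of what a stricter check could look like, the function
below rejects both kinds of ill-formed values in a sequence of UTF-32
code units: anything above U+10FFFF and anything in the surrogate range.
The name is_well_formed_utf32 is made up for this example, and it works
on plain integers rather than on the interpreter's internal buffers.

```python
def is_well_formed_utf32(code_units):
    """Return True if every value is a Unicode scalar value.

    Hypothetical check for illustration; operates on a sequence of ints.
    """
    for cu in code_units:
        if cu < 0 or cu > 0x10FFFF:
            return False  # outside the Unicode code space
        if 0xD800 <= cu <= 0xDFFF:
            return False  # surrogate code point, ill-formed in UTF-32
    return True


print(is_well_formed_utf32([0x10123]))         # prints True
print(is_well_formed_utf32([0xD800, 0xDD23]))  # prints False
```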


[1] Search for D90 in http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
