Author Rhamphoryncus
Recipients Rhamphoryncus, ezio.melotti, lemburg
Date 2008-07-12.19:03:49
Message-id <1215889432.77.0.572868807141.issue3297@psf.upfronthosting.co.za>
In-reply-to
Content
Marc, perhaps the Unicode Standard has refined its definitions since you last looked?

Valid UTF-8 *cannot* contain surrogates[1].  If it does, you have
CESU-8[2][3], not UTF-8.

So there are two bugs: first, the UTF-8 codec should refuse to load
surrogates.  Second, since the original bug showed up before the .pyc is
created, something in the parse/compilation/whatever stage is producing
CESU-8.
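
To make the distinction concrete, here is a minimal sketch that compares the two
byte forms for U+10000. It assumes a current Python 3 interpreter, whose strict
"utf-8" codec rejects surrogates; it does not reproduce the 2008 codec behaviour
discussed above. The helper name is_valid_utf8 is hypothetical, and the
"surrogatepass" error handler is used only to build the CESU-8 style bytes for
illustration.

```python
# U+10000 is the first code point outside the Basic Multilingual Plane.
ch = "\U00010000"

# Well-formed UTF-8: one four-byte sequence.
utf8 = ch.encode("utf-8")

# CESU-8 style: split into UTF-16 surrogates, then encode each surrogate
# as its own three-byte sequence.
utf16 = ch.encode("utf-16-be")
surrogates = [
    int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)
]
cesu8 = b"".join(
    chr(s).encode("utf-8", "surrogatepass") for s in surrogates
)


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes under the strict codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(utf8.hex(), is_valid_utf8(utf8))
print(cesu8.hex(), is_valid_utf8(cesu8))
```

The first form is what a conforming encoder should emit. The second is the
six-byte surrogate-pair form that a conforming UTF-8 decoder should refuse.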


[1] 4th bullet point of D92 in
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
[2] http://unicode.org/reports/tr26/
[3] http://en.wikipedia.org/wiki/CESU-8