Message 102065 - Python tracker

Author	sjmachin
Recipients	dangra, ezio.melotti, lemburg, sjmachin
Date	2010-04-01.07:29:44
SpamBayes Score	8.584105e-07
Marked as misclassified	No
Message-id	<1270106987.33.0.884133439023.issue8271@psf.upfronthosting.co.za>
In-reply-to

Content
#ezio.melotti: """I'm considering valid all the bytes that start with '10...'"""

Sorry, WRONG. Read what I wrote: """Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow."""

Consider these sequences: (1) E0 80 80 (2) E0 9F 80. Both are invalid (over-long) sequences. Specifically, after an E0 starter byte the first continuation byte may not be in 80-9F. Those bytes start with '10...' but they are invalid after an E0 starter byte.
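To see why both sequences are over-long, here is a rough sketch that applies the plain 3-byte bit-unpacking with no validity checks. The helper name naive_decode3 is made up for illustration and is not anything in CPython. A 3-byte sequence is only legitimate for code points from U+0800 upward, and both examples land below that.

```python
def naive_decode3(b0, b1, b2):
    """Unpack a 3-byte UTF-8 sequence with no validity checks at all."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# Smallest code point that genuinely needs three bytes.
MIN_3BYTE = 0x800

for seq in ((0xE0, 0x80, 0x80), (0xE0, 0x9F, 0x80), (0xE0, 0xA0, 0x80)):
    cp = naive_decode3(*seq)
    verdict = "over-long" if cp < MIN_3BYTE else "ok"
    print(" ".join("%02X" % b for b in seq), "-> U+%04X" % cp, verdict)
```

E0 80 80 unpacks to U+0000 and E0 9F 80 to U+07C0, both of which fit in fewer bytes; E0 A0 80 is the first sequence that reaches U+0800.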

Please read "Table 3-7. Well-Formed UTF-8 Byte Sequences" and the surrounding text in Unicode 5.2.0 chapter 3, bearing in mind that CPython (for good reasons) doesn't implement the surrogates restriction, so the special case for starter byte ED is not used in CPython. Note the other 3 special cases for the first continuation byte (starter bytes ED, F0 and F4).
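The special cases in Table 3-7 can be written down as a small lookup. This is only a sketch of the table, not the CPython decoder; the names SPECIAL_SECOND_BYTE, second_byte_ok and the allow_surrogates flag are hypothetical, the flag being there to model a decoder that skips the ED restriction.

```python
# Allowed range for the first continuation byte, keyed by starter byte.
# Starters not listed here accept the full 80-BF range.
SPECIAL_SECOND_BYTE = {
    0xE0: (0xA0, 0xBF),  # 80-9F would be an over-long 3-byte form
    0xED: (0x80, 0x9F),  # A0-BF would encode a surrogate
    0xF0: (0x90, 0xBF),  # 80-8F would be an over-long 4-byte form
    0xF4: (0x80, 0x8F),  # 90-BF would go past U+10FFFF
}

def second_byte_ok(starter, second, allow_surrogates=False):
    """Return True if `second` may follow `starter` as first continuation byte."""
    if not (0xC2 <= starter <= 0xF4):
        return False  # C0, C1 and F5-FF never start a well-formed sequence
    if starter == 0xED and allow_surrogates:
        lo, hi = 0x80, 0xBF  # assumption: decoder ignores the surrogates rule
    else:
        lo, hi = SPECIAL_SECOND_BYTE.get(starter, (0x80, 0xBF))
    return lo <= second <= hi
```

With this, second_byte_ok(0xE0, 0x9F) is False even though 9F starts with '10...', while second_byte_ok(0xE1, 0x9F) is True.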

History
Date	User	Action	Args
2010-04-01 07:29:47	sjmachin	set	recipients: + sjmachin, lemburg, ezio.melotti, dangra
2010-04-01 07:29:47	sjmachin	set	messageid: <1270106987.33.0.884133439023.issue8271@psf.upfronthosting.co.za>
2010-04-01 07:29:45	sjmachin	link	issue8271 messages
2010-04-01 07:29:45	sjmachin	create