Message 129495 - Python tracker


Author	ezio.melotti
Recipients	belopolsky, dangra, ezio.melotti, lemburg, pitrou, sjmachin, vstinner
Date	2011-02-26.03:31:22
Message-id	<1298691083.18.0.884885059166.issue8271@psf.upfronthosting.co.za>
In-reply-to

Content

After I emailed the Unicode Consortium about the corner case I found, they updated the "Best Practices for Using U+FFFD"[0], which now says:
"""
 Another example illustrates the application of the concept of maximal subpart for UTF-8 continuation bytes outside the allowable ranges defined in Table 3-7. The UTF-8 sequence <41 E0 9F 80 41> is ill-formed, because <9F> is not an allowed second byte of a UTF-8 sequence commencing with <E0>. In this case, there is an unconvertible offset at <E0> and the maximal subpart at that offset is also <E0>. The subsequence <E0 9F> cannot be a maximal subpart, because it is not an initial subsequence of any well-formed UTF-8 code unit sequence.
"""

The result of decoding that string with Python is:
>>> b'\x41\xE0\x9F\x80\x41'.decode('utf-8', 'replace')
'A��A'
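i.e. the bytes <E0 9F> are wrongly treated as a maximal subpart and replaced with a single '�' (the second '�' is the \x80). Under the practice quoted above, <E0>, <9F> and <80> would each be replaced separately, giving 'A���A'.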
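
To make the maximal-subpart rule concrete, here is a minimal pure-Python sketch of a replacing decoder driven by the byte ranges of Table 3-7. The names _trail_ranges and decode_replace are hypothetical, chosen only for illustration; this is not the CPython decoder and not the patch mentioned in this message, just one way the recommended behaviour could be expressed.

```python
def _trail_ranges(lead):
    """Allowed (low, high) ranges for the bytes that follow `lead`,
    per Table 3-7 of the Unicode Standard; None if `lead` cannot
    start a multi-byte sequence. (Hypothetical helper.)"""
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [(0x80, 0xBF)] * 2
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]
    if lead == 0xF0:
        return [(0x90, 0xBF)] + [(0x80, 0xBF)] * 2
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF)] * 3
    if lead == 0xF4:
        return [(0x80, 0x8F)] + [(0x80, 0xBF)] * 2
    return None


def decode_replace(data):
    """Decode UTF-8 bytes, emitting one U+FFFD per maximal subpart
    of an ill-formed subsequence. (Hypothetical sketch.)"""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII byte: always well-formed on its own.
            out.append(chr(b))
            i += 1
            continue
        ranges = _trail_ranges(b)
        if ranges is None:
            # Stray continuation byte or invalid lead byte:
            # replaced as a single code unit.
            out.append('\ufffd')
            i += 1
            continue
        # Extend the sequence only while each byte is in the range
        # allowed at its position for this lead byte.
        j = i + 1
        for low, high in ranges:
            if j < n and low <= data[j] <= high:
                j += 1
            else:
                break
        if j - i == len(ranges) + 1:
            # Complete, well-formed sequence.
            out.append(data[i:j].decode('utf-8'))
        else:
            # data[i:j] is the maximal subpart; replace it once.
            out.append('\ufffd')
        i = j
    return ''.join(out)


print(ascii(decode_replace(b'\x41\xE0\x9F\x80\x41')))
```

For the sequence <41 E0 9F 80 41>, <9F> falls outside the A0..BF range allowed after <E0>, so the sketch replaces <E0> alone and then treats <9F> and <80> as separate stray continuation bytes.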

I'll work on a patch and see how it comes out.

[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 96
