Message 248392 - Python tracker


Author	serhiy.storchaka
Recipients	ezio.melotti, gvanrossum, kennyluck, lemburg, loewis, mjpieters, pitrou, python-dev, serhiy.storchaka, tchrist, vstinner
Date	2015-08-11.06:04:30
Message-id	<1439273071.36.0.963320980607.issue12892@psf.upfronthosting.co.za>

Content

There are two causes:

1. UTF-16 and UTF-32 are based on 2- and 4-byte code units. If the surrogateescape error handler supported UTF-16 and UTF-32, encoding could produce data that can't be decoded back correctly. For example, in UTF-16-LE: '\udcac \udcac' -> b'\xac\x20\x00\xac' -> '\u20ac\uac00' == '€가'.

2. ASCII bytes (0x00-0x7F) can't be escaped with surrogateescape. UTF-16 and UTF-32 data can contain ASCII bytes in illegal sequences (b'\xD8\x00' in UTF-16-BE, b'abcd' in UTF-32). For the same reason, surrogateescape is not compatible with UTF-7 and CP037.
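
Both causes can be illustrated with a short sketch. The function hypothetical_utf16le_surrogateescape and the helper raises are made up for this illustration and are not part of CPython. The first simulates what a UTF-16-LE encoder would emit if it wrote each escaped surrogate back as a single raw byte. The rest uses only the standard str.encode and bytes.decode calls; 'utf-32-be' is chosen here as one concrete byte order for the UTF-32 case.

```python
# Cause 1: a hypothetical UTF-16-LE encoder that writes every escaped
# surrogate (U+DC80..U+DCFF) back as one raw byte. This is an assumption
# made for illustration, not how any real codec behaves.
def hypothetical_utf16le_surrogateescape(text):
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)        # one byte: breaks 2-byte alignment
        else:
            out += ch.encode('utf-16-le')  # normal 2-byte code unit(s)
    return bytes(out)

original = '\udcac \udcac'
data = hypothetical_utf16le_surrogateescape(original)
# Strict decoding succeeds, so the damage goes unnoticed.
decoded = data.decode('utf-16-le')
round_trip_ok = (decoded == original)


# Cause 2: surrogateescape only covers bytes 0x80..0xFF.
def raises(exc_type, func):
    """Hypothetical helper: True if func() raises exc_type."""
    try:
        func()
    except exc_type:
        return True
    return False

# A non-ASCII byte is escaped to a lone surrogate and restored on encoding.
escaped = b'\xac'.decode('ascii', 'surrogateescape')
restored = escaped.encode('ascii', 'surrogateescape')

# U+DC41 would correspond to the ASCII byte 0x41; the handler refuses it.
ascii_refused = raises(
    UnicodeEncodeError,
    lambda: '\udc41'.encode('ascii', 'surrogateescape'))

# Invalid sequences that contain ASCII-range bytes (0x00, or only letters).
utf16_invalid = raises(UnicodeDecodeError,
                       lambda: b'\xd8\x00'.decode('utf-16-be'))
utf32_invalid = raises(UnicodeDecodeError,
                       lambda: b'abcd'.decode('utf-32-be'))
```

In the first part, the two escaped bytes pair up with the two bytes of the space, so the result decodes as two unrelated valid characters rather than failing. In the second part, the invalid inputs include bytes below 0x80, which the handler has no surrogate representation for.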
