Message 401997 - Python tracker


Author	rafaelblsilva
Recipients	eryksun, ezio.melotti, lemburg, paul.moore, python-dev, rafaelblsilva, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Date	2021-09-17.01:50:07
Message-id	<1631843409.06.0.221584888466.issue45120@roundup.psfhosted.org>

Content
Since encodings are a complex topic, debating this seems necessary. I looked into the subject after running into an encoding issue in a MySQL connector: https://github.com/PyMySQL/mysqlclient/pull/502

MySQL itself mislabels "latin1": what MySQL calls "latin1" actually behaves like cp1252. As Inada Naoki pointed out:

"""
See this: https://dev.mysql.com/doc/refman/8.0/en/charset-we-sets.html

MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

So latin1 in MySQL is actually cp1252.
"""
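
The mapping described in the quote can be emulated in Python. The following is only a minimal sketch, not MySQL's or CPython's implementation: it decodes with the standard `cp1252` codec and, for any byte that codec rejects, falls back to the Unicode code point with the same numeric value. The error-handler name `c1_fallback` and the function `decode_mysql_latin1` are hypothetical names chosen for this example.

```python
import codecs


def _c1_fallback(exc):
    """Map each undecodable byte to the code point with the same value."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # e.g. byte 0x81 becomes U+0081, as the MySQL documentation describes
        return "".join(chr(b) for b in bad), exc.end
    raise exc


# "c1_fallback" is a made-up handler name, registered only for this sketch.
codecs.register_error("c1_fallback", _c1_fallback)


def decode_mysql_latin1(data: bytes) -> str:
    """Decode bytes the way the quoted text says MySQL's latin1 behaves."""
    return data.decode("cp1252", errors="c1_fallback")


print(repr(decode_mysql_latin1(b"\x80")))  # the Euro sign
print(repr(decode_mysql_latin1(b"\x81")))  # U+0081
```

Using an error handler rather than a hand-written table keeps the sketch short and leaves every defined cp1252 entry untouched.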

You can verify MySQL's behavior by decoding the byte 0x80 and looking at the resulting string. A quick test I find useful:

On MySQL:
select convert(unhex('80') using latin1); -- -> returns "€"

On PostgreSQL:
select convert_from(E'\\x80'::bytea, 'WIN1252'); -- -> returns "€"
select convert_from(E'\\x80'::bytea, 'LATIN1'); -- -> returns the C1 control character U+0080 (UTF-8 bytes "0xc2 0x80")
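
The same check can be sketched in Python with the standard library codecs. The helper names `decode_byte` and `undecodable_bytes` are made up for this example. The list that `undecodable_bytes` returns depends on the codec tables shipped with the interpreter in use, so no specific output is claimed for it here.

```python
def decode_byte(value: int, encoding: str) -> str:
    """Decode a single byte value with the given codec."""
    return bytes([value]).decode(encoding)


def undecodable_bytes(encoding: str, start: int = 0x80, stop: int = 0x9F) -> list:
    """Return the byte values in [start, stop] that the codec refuses to decode."""
    rejected = []
    for value in range(start, stop + 1):
        try:
            decode_byte(value, encoding)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


euro = decode_byte(0x80, "cp1252")   # the Euro sign, U+20AC
c1 = decode_byte(0x80, "latin-1")    # U+0080, a C1 control character
c1_utf8 = c1.encode("utf-8")         # the two bytes 0xc2 0x80

print(repr(euro), repr(c1), c1_utf8)
print([hex(b) for b in undecodable_bytes("cp1252")])
```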

I decided to try to fix this behavior in Python because I always found it a little odd to get errors for those code points. A discussion I find particularly useful is this one: https://comp.lang.python.narkive.com/C9oHYxxu/latin1-and-cp1252-inconsistent

I think the participants in that discussion did not notice the "WindowsBestFit" folder at:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/

Digging through the commits to look for dates, I realized that Amaury Forgeot d'Arc created a tool that generates the Windows encodings from calls to "MultiByteToWideChar", and it does produce the same mapping that is available on the Unicode website. I've attached the file it generated.


Since there might be legacy systems that rely on the current behavior, I don't think backporting this update to older Python versions is a good idea. That is why I think it should land only in new versions and be treated as new behavior.

The benefit I see in this update is that it prevents further confusion about the expected behavior of these encodings.
