Issue 45120: Windows cp encodings "UNDEFINED" entries update


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/89283

classification

Title:	Windows cp encodings "UNDEFINED" entries update
Type:	behavior	Stage:	patch review
Components:	Demos and Tools, Library (Lib), Unicode, Windows	Versions:

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	eryksun, ezio.melotti, lemburg, paul.moore, python-dev, rafaelblsilva, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Priority:	normal	Keywords:	patch

Created on 2021-09-06 20:30 by rafaelblsilva, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
cp1252_from_genwincodec.py	rafaelblsilva, 2021-09-17 01:50

Pull Requests
URL	Status	Linked
PR 28189	open	python-dev, 2021-09-06 20:47

Messages (8)
msg401181 - (view)	Author: Rafael Belo (rafaelblsilva) *	Date: 2021-09-06 20:30
There is a mismatch between specification and behavior in some Windows encodings. The specifications of some older Windows code pages list certain bytes as "UNDEFINED", whereas in practice Windows maps them, and that behavior is documented in a separate set of mapping files named "bestfit". For example, CP1252 has a corresponding bestfit1252:

CP1252: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
bestfit1252: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

In CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED", whereas in bestfit1252 they map to \u0081 \u008d \u008f \u0090 \u009d respectively.

In the Windows API, the function 'MultiByteToWideChar' exhibits the bestfit1252 behavior.

This issue and PR propose a correction for this behavior, updating the Windows code pages in which some code points were defined as "UNDEFINED" to the corresponding bestfit mapping.

Related issue: https://bugs.python.org/issue28712
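The gap described above can be observed from Python itself on any platform. The following is a minimal sketch, not part of the PR; the helper name undefined_bytes is hypothetical and only the standard codecs are assumed.

```python
def undefined_bytes(encoding):
    """Return the byte values that a single-byte codec refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# Python's cp1252 follows CP1252.TXT and leaves five bytes undefined.
cp1252_gaps = undefined_bytes("cp1252")
# latin-1 maps every byte to the code point with the same value.
latin1_gaps = undefined_bytes("latin-1")

print([hex(v) for v in cp1252_gaps])
print(latin1_gaps)
```

The same helper can be pointed at any other single-byte code page to list the entries that the unicode.org table marks as "UNDEFINED".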
msg401991 - (view)	Author: Steve Dower (steve.dower) *	Date: 2021-09-16 20:56
Thanks for the PR. Just wanted to acknowledge that we've seen it. Unfortunately, I'm not feeling confident to take this change right now - encodings are a real minefield, and we need to think through the implications. It's been a while since I've done that, so could take some time. Unless one of the other people who have spent time working on this comes in and says they've thought it through and this is the best approach. In which case I'll happily trust them :)
msg401993 - (view)	Author: Eryk Sun (eryksun) *	Date: 2021-09-16 22:11
> in CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED",
> whereas in bestfit1252, they map to \u0081 \u008d \u008f
> \u0090 \u009d respectively

This is the normal mapping in Windows, not a best-fit encoding. Within Windows, you can access the native encoding via codecs.code_page_encode() and codecs.code_page_decode(). For example:

    >>> codecs.code_page_encode(1252, '\x81\x8d\x8f\x90\x9d')[0]
    b'\x81\x8d\x8f\x90\x9d'

    >>> codecs.code_page_decode(1252, b'\x81\x8d\x8f\x90\x9d')[0]
    '\x81\x8d\x8f\x90\x9d'

WinAPI WideCharToMultiByte() uses a best-fit encoding unless the flag WC_NO_BEST_FIT_CHARS is passed. For example, with code page 1252, Greek "α" is best-fit encoded as Latin b"a". code_page_encode() uses the native best-fit encoding when the "replace" error handler is specified. For example:

    >>> codecs.code_page_encode(1252, 'α', 'replace')[0]
    b'a'

Regarding Python's encodings, if you need a specific mapping to match Windows, I think this should be discussed on a case-by-case basis. I see no benefit to supporting a mapping such as "\x81" <-> b"\x81" in code page 1252. That it's not mapped in Python is possibly a small benefit, since to some extent this helps to catch a mismatched encoding. For example, code page 1251 (Cyrillic) maps ordinal b"\x81" to "Ѓ" (i.e. "\u0403").
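The point about catching a mismatched encoding can be illustrated with Python's own table-based codecs, which work on any platform. This is only a sketch of that argument; the variable names are made up for the example.

```python
# "\u0403" is CYRILLIC CAPITAL LETTER GJE, which code page 1251 stores as 0x81.
data = "\u0403".encode("cp1251")

# Decoding the same byte as cp1252 fails in strict mode, because Python's
# cp1252 table leaves 0x81 undefined. The failure exposes the wrong guess.
try:
    data.decode("cp1252")
    mismatch_caught = False
except UnicodeDecodeError:
    mismatch_caught = True

print(data, mismatch_caught)
```

If 0x81 were mapped to U+0081 in cp1252, the decode would succeed silently and return a C1 control character instead of the intended letter.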
msg401997 - (view)	Author: Rafael Belo (rafaelblsilva) *	Date: 2021-09-17 01:50
As encodings are indeed a complex topic, debating this seems like a necessity. I researched this topic when I found an encoding issue regarding a MySQL connector: https://github.com/PyMySQL/mysqlclient/pull/502

In MySQL itself "latin1" and "cp1252" are mislabeled: what MySQL calls "latin1" presents the behavior of cp1252. As Inada Naoki pointed out:

"""
See this: https://dev.mysql.com/doc/refman/8.0/en/charset-we-sets.html

MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

So latin1 in MySQL is actually cp1252.
"""

You can verify this by passing the byte 0x80 through it to get the string representation. A quick test I find useful:

On MySQL:

    select convert(unhex('80') using latin1); -- -> returns "€"

On PostgreSQL:

    select convert_from(E'\\x80'::bytea, 'WIN1252'); -- -> returns "€"
    select convert_from(E'\\x80'::bytea, 'LATIN1'); -- -> returns the C1 control character "0xc2 0x80"

I decided to try to fix this behavior in Python because I always found it a little odd to receive errors for those code points.

A discussion I find particularly useful is this one: https://comp.lang.python.narkive.com/C9oHYxxu/latin1-and-cp1252-inconsistent

I think the participants there didn't notice the "WindowsBestFit" folder at: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/

Digging through the commits to look for dates, I realized that Amaury Forgeot d'Arc created a tool to generate the Windows encodings based on calls to "MultiByteToWideChar", which indeed generates the same mapping available on the Unicode website. I've attached the file generated by it.

Since there might be legacy systems which rely on this "specific" behavior, I don't think "back-porting" this update to older Python versions is a good idea. That is why I think this should come in new versions and be treated as "new behavior".

The benefit I see in updating this is to prevent further confusion about the expected behavior when dealing with those encodings.
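For code that has to read MySQL "latin1" data today, the translation quoted above (0x81 to U+0081 and so on) can be approximated without touching the cp1252 codec. The following is a minimal sketch using a custom error handler; the handler name "c1fallback" and the function decode_mysql_latin1 are hypothetical and not part of the PR or of any library.

```python
import codecs


def c1_fallback(exc):
    """Decode-only handler: map each undecodable byte to the code point
    that has the same numeric value (for cp1252 these are C1 controls)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end


# Register under a made-up name; this does not change cp1252 itself.
codecs.register_error("c1fallback", c1_fallback)


def decode_mysql_latin1(data):
    """Decode bytes the way the MySQL documentation describes its latin1."""
    return data.decode("cp1252", errors="c1fallback")


sample = decode_mysql_latin1(b"\x80 \x81 \x9d")
print(ascii(sample))
```

Strict cp1252 decoding stays available for callers that still want the error, so a mismatched encoding can be detected where that matters.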
msg402007 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2021-09-17 07:34
Just to be clear: The Python code page encodings are (mostly) taken from the unicode.org set of mappings (ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/). This is our standards body for such mappings, where possible. In some cases, the Unicode consortium does not provide such mappings and we resort to other standards (ISO, commonly used mapping files in OSes, Wikipedia, etc).

Changes to the existing mapping codecs should only be done in case corrections are applied to the mappings under those names by the standard bodies.

If you want to add variants such as the best fit ones from MS, we'd have to add them under a different name, e.g. bestfit1252 (see ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/). Otherwise, interop with other systems would no longer work.

From Eryk's description it sounds like we should always add WC_NO_BEST_FIT_CHARS as an option to MultiByteToWideChar() in order to make sure it doesn't use best fit variants unless explicitly requested.
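A variant under a different name can be prototyped outside the standard library with a codec search function. The sketch below is an assumption about how such a variant could look, not the PR's implementation: it reuses the decoding table from Lib/encodings/cp1252.py, fills only the undefined slots with the same-valued code points, and registers the result as "bestfit1252" (the name suggested above, used here purely for illustration).

```python
import codecs
from encodings import cp1252

# Python's charmap tables mark undefined entries with U+FFFE.
_UNDEFINED = "\ufffe"

# Keep every defined cp1252 entry; fill each gap with chr(byte value).
decoding_table = "".join(
    chr(index) if char == _UNDEFINED else char
    for index, char in enumerate(cp1252.decoding_table)
)
encoding_table = codecs.charmap_build(decoding_table)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, encoding_table)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, decoding_table)


def _search(name):
    # Only answer for the illustrative name; leave all others alone.
    if name == "bestfit1252":
        return codecs.CodecInfo(name="bestfit1252", encode=_encode, decode=_decode)
    return None


codecs.register(_search)

print(ascii(b"\x80\x81".decode("bestfit1252")))
```

Because the stock "cp1252" codec is untouched, existing callers keep the strict unicode.org behavior and only code that opts in by name sees the extra mappings.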
msg402045 - (view)	Author: Eryk Sun (eryksun) *	Date: 2021-09-17 12:25
> From Eryk's description it sounds like we should always add
> WC_NO_BEST_FIT_CHARS as an option to MultiByteToWideChar()
> in order to make sure it doesn't use best fit variants
> unless explicitly requested.

The concept of a "best fit" encoding is unrelated to decoding with MultiByteToWideChar(). By default WideCharToMultiByte() best-fit encodes some otherwise unmapped ordinals to characters in the code page that have similar glyphs. This doesn't round trip (e.g. "α" -> b"a" -> "a"). The WC_NO_BEST_FIT_CHARS flag prevents this behavior. code_page_encode() uses WC_NO_BEST_FIT_CHARS for legacy encodings, unless the "replace" error handler is used.

Windows maps every value in single-byte ANSI code pages to a Unicode ordinal, which round trips between MultiByteToWideChar() and WideCharToMultiByte(). Unless otherwise defined, a value in the range 0x80-0x9F is mapped to the corresponding ordinal in the C1 controls block. Otherwise values that have no legacy definition are mapped to a private use area (e.g. U+E000 - U+F8FF).

There is no option to make MultiByteToWideChar() fail for byte values that map to a C1 control code. But mappings to the private use area are strictly invalid, and MultiByteToWideChar() will fail in these cases if the flag MB_ERR_INVALID_CHARS is used. code_page_decode() always uses this flag, but to reliably fail one needs to pass final=True, since the codec doesn't know it's a single-byte encoding. For example:

    >>> codecs.code_page_decode(1253, b'\xaa', 'strict')
    ('', 0)

    >>> codecs.code_page_decode(1253, b'\xaa', 'strict', True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'cp1253' codec can't decode bytes in position 0--1: No mapping for the Unicode character exists in the target code page.

Here are the mappings to the private use area in the single-byte "ANSI" code pages:

    1255 Hebrew
    0xD9    U+F88D
    0xDA    U+F88E
    0xDB    U+F88F
    0xDC    U+F890
    0xDD    U+F891
    0xDE    U+F892
    0xDF    U+F893
    0xFB    U+F894
    0xFC    U+F895
    0xFF    U+F896

Note that 0xCA is defined as the Hebrew character U+05BA [1]. The definition is missing in the unicode.org data and Python's "cp1255" encoding.

    874 Thai
    0xDB    U+F8C1
    0xDC    U+F8C2
    0xDD    U+F8C3
    0xDE    U+F8C4
    0xFC    U+F8C5
    0xFD    U+F8C6
    0xFE    U+F8C7
    0xFF    U+F8C8

    1253 Greek
    0xAA    U+F8F9
    0xD2    U+F8FA
    0xFF    U+F8FB

    1257 Baltic
    0xA1    U+F8FC
    0xA5    U+F8FD

There's no way to get these private use area results from code_page_decode(), but code_page_encode() allows them. For example:

    >>> codecs.code_page_encode(1253, '\uf8f9')[0]
    b'\xaa'

---
[1] https://en.wikipedia.org/wiki/Windows-1255
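The private use area table above can be written down as data and compared against Python's table-based codecs on any platform. This is a sketch only: the dictionary is transcribed from the listing in this message, and the names PUA_MAPPINGS, is_private_use and python_leaves_undefined are hypothetical.

```python
# Byte value -> private use code point, per code page, as listed above.
PUA_MAPPINGS = {
    1255: {0xD9: 0xF88D, 0xDA: 0xF88E, 0xDB: 0xF88F, 0xDC: 0xF890,
           0xDD: 0xF891, 0xDE: 0xF892, 0xDF: 0xF893, 0xFB: 0xF894,
           0xFC: 0xF895, 0xFF: 0xF896},
    874: {0xDB: 0xF8C1, 0xDC: 0xF8C2, 0xDD: 0xF8C3, 0xDE: 0xF8C4,
          0xFC: 0xF8C5, 0xFD: 0xF8C6, 0xFE: 0xF8C7, 0xFF: 0xF8C8},
    1253: {0xAA: 0xF8F9, 0xD2: 0xF8FA, 0xFF: 0xF8FB},
    1257: {0xA1: 0xF8FC, 0xA5: 0xF8FD},
}


def is_private_use(code_point):
    """True for code points in the BMP private use area (U+E000 - U+F8FF)."""
    return 0xE000 <= code_point <= 0xF8FF


def python_leaves_undefined(code_page, byte_value):
    """True if Python's cpNNNN codec refuses to decode this byte."""
    try:
        bytes([byte_value]).decode(f"cp{code_page}")
    except UnicodeDecodeError:
        return True
    return False


for code_page, table in PUA_MAPPINGS.items():
    for byte_value, code_point in table.items():
        print(code_page, hex(byte_value), f"U+{code_point:04X}",
              python_leaves_undefined(code_page, byte_value))
```

The last column shows, for the running interpreter, whether the corresponding entry is still undefined in Python's own codec.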
msg402087 - (view)	Author: Rafael Belo (rafaelblsilva) *	Date: 2021-09-17 20:16
Eryk,

Regarding codecsmodule.c, I don't really know its inner workings or how it is connected to other modules, so changes at that level are not critical for this use case. But it is good to think about and evaluate that level too, since there might be some tricky situations on Windows systems because of that grey zone.

My proposal really aims to enhance the Lib/encodings/ module, and, as Marc-Andre Lemburg advised, to change those mappings only in case of official corrections to the standard itself. I now think that following those standards "strictly" is a good idea.

On top of that, adding them under different names seems like a better idea anyway, since those standards can be seen as different if you take a strict look at the Unicode definitions. Adding them would suffice for the needs that might arise, would still allow for catching mismatched encodings, and could even be "backported" to older Python versions.

I will adjust the PR according to these comments. Thanks for the feedback!
msg402104 - (view)	Author: Eryk Sun (eryksun) *	Date: 2021-09-17 22:34
Rafael, I was discussing code_page_decode() and code_page_encode() both as an alternative for compatibility with other programs and also to explore how MultiByteToWideChar() and WideCharToMultiByte() work -- particularly to explain best-fit mappings, which do not roundtrip. MultiByteToWideChar() does not exhibit "best fit" behavior. I don't even know what that would mean in the context of decoding.

With the exception of one change to code page 1255, the definitions that you're looking to add are just for the C1 controls and private use area codes, which are not meaningful. Windows uses these arbitrary definitions to be able to roundtrip between the system ANSI and Unicode APIs.

Note that Python's "mbcs" (i.e. "ansi") and "oem" encodings use the code-page codec. For example:

    >>> _winapi.GetACP()
    1252

    >>> '\x81\x8d\x8f\x90\x9d'.encode('ansi')
    b'\x81\x8d\x8f\x90\x9d'

Best-fit encode "α" in code page 1252 [1]:

    >>> 'α'.encode('ansi', 'replace')
    b'a'

In your PR, the change to code page 1255 to add b"\xca" <-> "\u05ba" is the only change that I think is really worthwhile because the unicode.org data has it wrong. You can get the proper character name for the comment using the unicodedata module:

    >>> print(unicodedata.name('\u05ba'))
    HEBREW POINT HOLAM HASER FOR VAV

I'm +0 in favor of leaving the mappings undefined where Windows completes legacy single-byte code pages by using C1 control codes and private use area codes. It would have been fine if Python's code-page encodings had always been based on the "WindowsBestFit" tables, but only the decoding MBTABLE, since it's reasonable.

Ideally, I don't want anything to use the best-fit mappings in WCTABLE. I would rather that the 'replace' handler for code_page_encode() used the replacement character (U+FFFD) or system default character. But the world is not ideal; the system ANSI API uses the WCTABLE best-fit encoding. Back in the day with Python 2.7, it was easy to demonstrate how insidious this is. For example, in 2.7.18:

    >>> os.listdir(u'.')
    [u'\u03b1']

    >>> os.listdir('.')
    ['a']

---
[1] https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
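The unicodedata lookup mentioned above can be wrapped in a small helper that produces a commented table entry for a new mapping. This is a sketch under assumptions: the helper name table_entry is hypothetical, and the line layout only approximates the style of the generated files in Lib/encodings/.

```python
import unicodedata


def table_entry(byte_value, char):
    """Format one decoding-table line with the character's Unicode name."""
    # ascii() yields an escaped literal such as '\u05ba' for non-ASCII input.
    return f"    {ascii(char)}  #  0x{byte_value:02X} -> {unicodedata.name(char)}"


entry = table_entry(0xCA, "\u05ba")
print(entry)
```

Taking the name from unicodedata keeps the comment consistent with the Unicode database instead of relying on a hand-typed description.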

History
Date	User	Action	Args
2022-04-11 14:59:49	admin	set	github: 89283
2021-09-17 22:34:22	eryksun	set	messages: + msg402104
2021-09-17 20:16:07	rafaelblsilva	set	messages: + msg402087
2021-09-17 12:25:25	eryksun	set	messages: + msg402045
2021-09-17 07:34:38	lemburg	set	messages: + msg402007
2021-09-17 01:50:09	rafaelblsilva	set	files: + cp1252_from_genwincodec.py messages: + msg401997
2021-09-16 22:11:25	eryksun	set	messages: + msg401993
2021-09-16 20:56:49	steve.dower	set	nosy: + serhiy.storchaka, eryksun messages: + msg401991
2021-09-06 20:47:12	python-dev	set	keywords: + patch nosy: + python-dev pull_requests: + pull_request26615 stage: patch review
2021-09-06 20:30:12	rafaelblsilva	create