Issue 28712: Non-Windows mappings for a couple of Windows code pages



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/72898

classification

Title:	Non-Windows mappings for a couple of Windows code pages
Type:	behavior	Stage:
Components:	Unicode, Windows	Versions:	Python 3.10, Python 3.9, Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Artoria2e5, benjamin.peterson, eryksun, ezio.melotti, ned.deily, paul.moore, serhiy.storchaka, steve.dower, tim.golden, zach.ware
Priority:	normal	Keywords:

Created on 2016-11-16 05:40 by Artoria2e5, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
windows10_14959.txt	Artoria2e5, 2016-11-16 05:40	Windows 10b14959 output
win10_14959_py36.txt	Artoria2e5, 2016-11-16 13:36	Correct Windows 10b14959 output
pycp_ctypes.py	Artoria2e5, 2016-11-16 14:31	A test script that runs on Windows and uses native mbcs as a reference.
codepage_table.csv	eryksun, 2016-11-16 21:06

Messages (23)
msg280914 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016-11-16 05:40
Mappings for 0x81 and 0x8D in multiple Windows code pages diverge from what Windows does. Attached is a script that tests for this behavior. (These two bytes are not necessarily the only problems, but they are certainly the most widespread and best-known ones. Again, refer to the Unicode best-fit tables for something that works.)

This problem is seen in Python 2.7.10 on Windows 10b14959, but it has apparently been known for a long time[1]. Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.

[1]: https://ftfy.readthedocs.io/en/latest/#module-ftfy.bad_codecs.sloppy
msg280915 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016-11-16 05:44
> Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.

... but since the Cygwin packagers did not enable Win32 APIs for their build, I cannot test the script directly.
msg280918 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-16 06:51
It seems to me there is something wrong with your test. For example, decoding b'\x81\x8d' from CP1251 (as well as from any other code page!) gives you u'\x81\x8d', but in CP1251 the codes 0x81 and 0x8D are assigned to different characters: 'Ѓ' (U+0403) and 'Ќ' (U+040C).

    0x81 0x0403 #CYRILLIC CAPITAL LETTER GJE
    0x8D 0x040C #CYRILLIC CAPITAL LETTER KJE

[1] https://en.wikipedia.org/wiki/Windows-1251
[2] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT
[3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1251.txt
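For readers without a Windows machine, the point about CP1251 can be checked against Python's own bundled codec tables. This is only a minimal sketch: it exercises Python's tables, not the Windows API, and the variable names are illustrative.

```python
# Decode the two bytes under discussion with Python's bundled tables.
data = b"\x81\x8d"

# cp1251 assigns both bytes to Cyrillic letters (GJE and KJE).
cyrillic = data.decode("cp1251")

# Python's cp1252 table leaves both bytes undefined, so strict
# decoding (the default) raises UnicodeDecodeError.
try:
    data.decode("cp1252")
    cp1252_strict_ok = True
except UnicodeDecodeError:
    cp1252_strict_ok = False

print(ascii(cyrillic), cp1252_strict_ok)
```

The cp1252 failure is the divergence reported in this issue: Windows maps those bytes to C1 control codes instead of rejecting them.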
msg280944 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016-11-16 13:36
Ugh... This is weird. Attached is a corrected version that uses Python 3.6's 'code page' methods. I have modified the script a little to make sure it runs on Py3.
msg280949 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-16 13:53
What is the output of the new script?
msg280956 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016-11-16 14:31
The output is already attached as win10_14959_py36.txt. PS: after playing with ctypes, I got a version of pycp that works with Py < 3.3 too (attached with comment).
msg280965 - (view)	Author: Steve Dower (steve.dower) *	Date: 2016-11-16 16:48
So is this a bug in the hardcoded encoding tables in Python? I briefly considered making them all use the OS functions, but then they'll be inconsistent with other platforms (where the tables should work fine). Do you have a proposed fix? That will help illustrate where the problem is.
msg280967 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016-11-16 16:55
Yes, it's a table issue. My suggested fix is to replace them all with the WindowsBestFit tables, which is where MS currently redirects visitors of https://msdn.microsoft.com/en-us/globalization/mt767590. The old "WINDOWS" tables appear to have been abandoned long ago.
msg280970 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016-11-16 17:03
... On the other hand, I am happy to use these Win32 functions if they are faster, but still the table should be made correct in the first place. (See also issue28343 (936) and issue28693 (950) for problems with DBCS Chinese code pages.)
msg280971 - (view)	Author: Steve Dower (steve.dower) *	Date: 2016-11-16 17:40
No idea which is faster, but the tables have better compatibility. However, I'm not sure that changing the tables in already released versions is a great idea, since it could "corrupt" programs without warning. Adding the release managers to weigh in - my gut feel is that targeted table fixes plus validation tests are okay for 3.6 if we hurry, but are probably not suitable for 2.7 or 3.5.
msg280973 - (view)	Author: Ned Deily (ned.deily) *	Date: 2016-11-16 17:48
I'm not qualified to offer a technical opinion on Windows matters like this so, for 3.6, I leave it to your discretion, Steve. If you do decide to push this change, please do so before 3.6.0b4 on Monday.
msg280974 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-16 17:53
Codecs are strict by default in Python. Call MultiByteToWideChar() with the MB_ERR_INVALID_CHARS flag as Python does. You also could use _codecs.code_page_decode().
msg280979 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-11-16 19:18
Serhiy, single-byte codepages map every byte value, even if it's just to a Unicode C1 control code [1]. For example:

    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    MB_ERR_INVALID_CHARS = 0x00000008

    def mbtwc_errcheck(result, func, args):
        if not result and args[-1]:
            raise ctypes.WinError(ctypes.get_last_error())
        return args

    kernel32.MultiByteToWideChar.errcheck = mbtwc_errcheck

    def decode(codepage, data, strict=True):
        flags = MB_ERR_INVALID_CHARS if strict else 0
        n = kernel32.MultiByteToWideChar(codepage, flags, data, len(data), None, 0)
        buf = (ctypes.c_wchar * n)()
        kernel32.MultiByteToWideChar(codepage, flags, data, len(data), buf, n)
        return buf.value

    codepages = [437, 874] + list(range(1250, 1259))
    for cp in codepages:
        print('cp%d:' % cp, ascii(decode(cp, b'\x81\x8d')))

Output:

    cp437: '\xfc\xec'
    cp874: '\x81\x8d'
    cp1250: '\x81\u0164'
    cp1251: '\u0403\u040c'
    cp1252: '\x81\x8d'
    cp1253: '\x81\x8d'
    cp1254: '\x81\x8d'
    cp1255: '\x81\x8d'
    cp1256: '\u067e\u0686'
    cp1257: '\x81\xa8'
    cp1258: '\x81\x8d'

[1]: https://en.wikipedia.org/wiki/C0_and_C1_control_codes
msg280980 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-16 19:37
Thanks Eryk. Could you please run the following script and attach the output?

    import codecs
    codepages = [424, 856, 857, 864, 869, 874, 932, 949, 950, 1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258]
    for cp in codepages:
        table = []
        for i in range(256):
            try:
                c = codecs.code_page_decode(cp, bytes([i]), None, True)
            except Exception:
                c = None
            table.append(c)
        print(cp, ascii(table))
msg280983 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-11-16 20:02
How about just the ASCII repr of the 256 decoded characters in CSV? I don't think the list of 2-tuple results is useful. For these single-byte codepages it's always 1 byte consumed.
msg280986 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-16 20:19
This would be helpful too if every byte is decoded to exactly 1 character.
msg280989 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-11-16 20:38
I don't think the 2nd tuple element is useful when decoding a single byte. It either works or it doesn't, such as failing for non-ASCII bytes with multibyte codepages such as 932 and 950. I'm attaching the output from the following, which you should be able to open in a spreadsheet:

    import codecs
    codepages = [424, 856, 857, 864, 869, 874, 932, 949, 950, 1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258]
    for cp in codepages:
        table = []
        for i in range(256):
            try:
                c = codecs.code_page_decode(cp, bytes([i]), None, True)
                c = ascii(c[0])
            except Exception:
                c = None
            table.append(c)
        print(cp, *table, sep=',')
msg280990 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-11-16 21:06
I rewrote it using the csv module since I can't remember the escaping rules.
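The attached csv-based script is not reproduced in this thread. As a rough, portable counterpart for comparison, the sketch below dumps Python's own bundled decoding tables in a similar CSV layout, so the rows can be diffed against codepage_table.csv. It does not call the Windows API, and the names `python_table` and `write_tables` are hypothetical helpers, not part of the attachment.

```python
import csv
import sys

# Same code page list as in the scripts above.
CODEPAGES = [424, 856, 857, 864, 869, 874, 932, 949, 950,
             1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258]

def python_table(cp):
    """Return 256 entries: ascii() of the decoded character, or None
    when Python's bundled codec rejects the single byte."""
    row = []
    for i in range(256):
        try:
            row.append(ascii(bytes([i]).decode("cp%d" % cp)))
        except UnicodeDecodeError:
            row.append(None)
    return row

def write_tables(out=sys.stdout):
    """Write one CSV row per code page; None becomes an empty field."""
    writer = csv.writer(out)
    for cp in CODEPAGES:
        writer.writerow([cp] + python_table(cp))

if __name__ == "__main__":
    write_tables()
```

Letting the csv module do the quoting avoids hand-escaping entries such as `','` that contain the delimiter.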
msg281014 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016-11-17 00:18
> Codecs are strict by default in Python. Call MultiByteToWideChar() with the MB_ERR_INVALID_CHARS flag as Python does.

Great catch. Without MB_ERR_INVALID_CHARS or WC_NO_BEST_FIT_CHARS, Windows would perform the "best fit" behavior described in the BestFit files. Best-fit entries are not marked explicitly in these files (they didn't add '<< Best Fit Mapping' as in the readme), so identifying them requires checking whether a reverse mapping exists[1]. When MB_ERR_INVALID_CHARS is set, Windows would perform a strict check.

[1]: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

By the way, will there be a 'mbcsbestfitreplace' error handler on Windows to invoke "best fit" behavior? It might be useful for interoperating with common Windows programs and users. (An implementation for other platforms could be constructed from the WindowsBestFit charts, but it might be too large relative to its usefulness.)
msg281019 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-11-17 02:25
The ANSI and OEM codepages are conveniently supported on a Windows system as the encodings 'mbcs' and 'oem' (new in 3.6). The best-fit mapping is used by the 'replace' error handler (see the encode_code_page_flags function in Objects/unicodeobject.c). For other Windows codepages, while it's not as convenient, you can use codecs.code_page_encode. For example:

    >>> codecs.code_page_encode(1252, 'α', 'replace')
    (b'a', 1)

For decoding, MB_ERR_INVALID_CHARS has no effect on single-byte codepages because they map every byte. It only affects decoding byte sequences that are invalid in multibyte codepages such as 932 and 65001. Without this flag, invalid sequences are silently decoded as the codepage's Unicode default character. This is usually "?", but for 932 it's Katakana middle dot (U+30FB), and for UTF-8 it's U+FFFD.

codecs.code_page_decode uses MB_ERR_INVALID_CHARS almost always, except for UTF-7 (see the decode_code_page_flags function). So its 'replace' error handling is completely Python's own implementation. For example:

MultiByteToWideChar without MB_ERR_INVALID_CHARS:

    >>> decode(932, b'\xe05', strict=False)
    '\u30fb'

versus code_page_decode:

    >>> codecs.code_page_decode(932, b'\xe05', 'replace', True)
    ('\ufffd5', 2)
msg281023 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-11-17 06:58
Windows API doc is not easy to understand. I wrote this doc when I fixed code pages in Python 3: http://unicodebook.readthedocs.io/operating_systems.html#windows
msg281026 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-17 08:07
Thank you Eryk. That is what I wanted. I just missed that code_page_decode() returns a tuple.

It seems Windows maps undefined codes to Unicode characters if they are in the range 0x80-0x9f and raises an error if they are outside of this range. But if the code starts a multibyte sequence, the single byte is an error even if it is in the range 0x80-0x9f (codepages 932, 949, 950).

This could be emulated either by decoding with errors='surrogateescape' and postprocessing the result (replace '\udc80'-'\udc9f' with '\x80'-'\x9f' and handle '\udca0'-'\udcff' as an error) or by writing a custom error handler that does the job (but this would probably need several error handlers corresponding to 'strict', 'replace', 'ignore', etc). Adding a new codec is of course an option too.

There are a few other minor differences between Python and Windows:

cp864: On Windows 0x25 is mapped to '%' (U+0025) instead of '٪' (U+066A).
cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.
cp1255: 0xCA is mapped to U+05BA instead of being undefined.

The first two differences can be handled by postprocessing; the last one needs a change to the codec.
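The surrogateescape approach described above might look roughly like the following. This is a minimal sketch for single-byte code pages only (it assumes one decoded character per input byte, so it does not cover 932, 949 or 950), and `decode_like_windows` is a hypothetical helper name, not an existing API.

```python
def decode_like_windows(data, encoding="cp1252"):
    """Emulate strict Windows decoding on top of Python's tables.

    Assumes a single-byte code page, so character positions in the
    decoded text equal byte positions in the input.
    """
    # Undefined bytes come back as lone surrogates U+DC80..U+DCFF.
    text = data.decode(encoding, errors="surrogateescape")
    out = []
    for pos, ch in enumerate(text):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDC9F:
            # Undefined byte in 0x80-0x9f: map to the C1 control code.
            out.append(chr(cp - 0xDC00))
        elif 0xDCA0 <= cp <= 0xDCFF:
            # Undefined byte outside that range: treat as an error.
            raise UnicodeDecodeError(
                encoding, data, pos, pos + 1,
                "undefined byte outside 0x80-0x9f")
        else:
            out.append(ch)
    return "".join(out)

print(ascii(decode_like_windows(b"\x81\x8d", "cp1252")))
```

Postprocessing keeps the stock codec tables untouched, at the cost of a second pass over the decoded text.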
msg281044 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-11-17 15:54
Thanks, Serhiy. When I looked at this previously, I mistakenly assumed that any undefined codes would be decoded using the codepage's default Unicode character. But for single-byte codepages in the range above 0x9F, Windows instead maps undefined codes to the Private Use Area (PUA). For example, using decode() from above:

    ERROR_NO_UNICODE_TRANSLATION = 0x0459
    codepages = 857, 864, 874, 1253, 1255, 1257
    for cp in codepages:
        undefined = []
        for i in range(256):
            b = bytes([i])
            try:
                decode(cp, b)
            except OSError as e:
                if e.winerror == ERROR_NO_UNICODE_TRANSLATION:
                    c = decode(cp, b, False)
                    undefined.append('{:02x}=>{:04x}'.format(ord(b), ord(c)))
        print(cp, *undefined, sep=', ')

output:

    857, d5=>f8bb, e7=>f8bc, f2=>f8bd
    864, a6=>f8be, a7=>f8bf, ff=>f8c0
    874, db=>f8c1, dc=>f8c2, dd=>f8c3, de=>f8c4, fc=>f8c5, fd=>f8c6, fe=>f8c7, ff=>f8c8
    1253, aa=>f8f9, d2=>f8fa, ff=>f8fb
    1255, d9=>f88d, da=>f88e, db=>f88f, dc=>f890, dd=>f891, de=>f892, df=>f893, fb=>f894, fc=>f895, ff=>f896
    1257, a1=>f8fc, a5=>f8fd

Do you think Python's 'replace' handler should prevent adding the MB_ERR_INVALID_CHARS flag for PyUnicode_DecodeCodePageStateful? One benefit is that the PUA code can be encoded back to the original byte value:

    >>> codecs.code_page_encode(1257, '\uf8fd')
    (b'\xa5', 1)

> cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.

Windows maps these byte values to PUA codes if the MB_ERR_INVALID_CHARS flag isn't used:

    >>> decode(932, b'\xa0\xfd\xfe\xff', False)
    '\uf8f0\uf8f1\uf8f2\uf8f3'
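On platforms without the Windows API, the non-strict mapping shown above could be approximated with a custom error handler on top of Python's bundled tables. The following is only a sketch: the handler name "winpua1257" and the factory `make_windows_handler` are hypothetical, the PUA table is copied from the cp1257 row of the output above, and one handler per code page is used because the handler itself is not told which code page is being decoded.

```python
import codecs

# Byte -> PUA code point, copied from the cp1257 output listed above.
PUA_1257 = {0xA1: 0xF8FC, 0xA5: 0xF8FD}

def make_windows_handler(pua_table):
    """Build a decode error handler for one single-byte code page."""
    def handler(exc):
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        chars = []
        for byte in exc.object[exc.start:exc.end]:
            if 0x80 <= byte <= 0x9F:
                # Undefined C1-range byte: map to the same code point.
                chars.append(chr(byte))
            elif byte in pua_table:
                # Undefined byte above 0x9F: use the PUA code point.
                chars.append(chr(pua_table[byte]))
            else:
                raise exc
        # Resume decoding right after the offending bytes.
        return "".join(chars), exc.end
    return handler

codecs.register_error("winpua1257", make_windows_handler(PUA_1257))

decoded = b"A\xa1\xa5".decode("cp1257", errors="winpua1257")
print(ascii(decoded))
```

This only covers decoding; encoding the PUA code points back to their original bytes, as code_page_encode does on Windows, would need a matching encode handler.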

History
Date	User	Action	Args
2022-04-11 14:58:39	admin	set	github: 72898
2021-03-08 19:02:53	vstinner	set	nosy: - vstinner
2021-03-04 22:30:28	larry	set	nosy: - larry
2021-03-04 21:56:19	eryksun	set	versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.5, Python 3.6, Python 3.7
2016-11-17 15:54:34	eryksun	set	messages: + msg281044
2016-11-17 08:07:49	serhiy.storchaka	set	messages: + msg281026
2016-11-17 06:58:23	vstinner	set	messages: + msg281023
2016-11-17 02:25:03	eryksun	set	messages: + msg281019
2016-11-17 00:18:06	Artoria2e5	set	messages: + msg281014
2016-11-16 21:06:46	eryksun	set	files: + codepage_table.csv messages: + msg280990
2016-11-16 20:52:11	eryksun	set	files: - codepage_table.csv
2016-11-16 20:38:18	eryksun	set	files: + codepage_table.csv messages: + msg280989
2016-11-16 20:19:04	serhiy.storchaka	set	messages: + msg280986
2016-11-16 20:02:04	eryksun	set	messages: + msg280983
2016-11-16 19:37:39	serhiy.storchaka	set	messages: + msg280980
2016-11-16 19:18:00	eryksun	set	nosy: + eryksun messages: + msg280979
2016-11-16 17:53:56	serhiy.storchaka	set	messages: + msg280974
2016-11-16 17:48:12	ned.deily	set	messages: + msg280973
2016-11-16 17:40:03	steve.dower	set	nosy: + larry, benjamin.peterson, ned.deily messages: + msg280971
2016-11-16 17:03:32	Artoria2e5	set	messages: + msg280970
2016-11-16 16:55:20	Artoria2e5	set	messages: + msg280967
2016-11-16 16:48:11	steve.dower	set	messages: + msg280965 versions: - Python 3.3, Python 3.4
2016-11-16 16:11:14	serhiy.storchaka	set	nosy: + paul.moore, tim.golden, zach.ware, steve.dower components: + Windows
2016-11-16 14:31:34	Artoria2e5	set	files: - pycp.py
2016-11-16 14:31:21	Artoria2e5	set	files: + pycp_ctypes.py messages: + msg280956
2016-11-16 13:53:22	serhiy.storchaka	set	messages: + msg280949
2016-11-16 13:37:31	Artoria2e5	set	files: - pycp.py
2016-11-16 13:37:23	Artoria2e5	set	files: + pycp.py
2016-11-16 13:36:40	Artoria2e5	set	files: + win10_14959_py36.txt messages: + msg280944
2016-11-16 06:51:00	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg280918
2016-11-16 05:44:36	Artoria2e5	set	messages: + msg280915
2016-11-16 05:40:54	Artoria2e5	set	files: + windows10_14959.txt
2016-11-16 05:40:28	Artoria2e5	create