msg280914 - (view) |
Author: Mingye Wang (Artoria2e5) * |
Date: 2016-11-16 05:40 |
Mappings for 0x81 and 0x8D in multiple Windows code pages diverge from what Windows does. Attached is a script that tests for this behavior. (These two bytes are not necessarily the only problems, but they are certainly the most widespread and best-known ones. Again, refer to Unicode best fit for something that works.)
This problem is seen in Python 2.7.10 on Windows 10b14959, but it has apparently been known for a long time[1]. Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.
[1]: https://ftfy.readthedocs.io/en/latest/#module-ftfy.bad_codecs.sloppy
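The Python side of the mismatch can be reproduced on any platform, because the `cp1252` codec is a built-in table. A minimal sketch follows; the helper name `try_decode` is illustrative and is not part of the attached script.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if Python's strict codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Python's cp1252 table leaves 0x81 and 0x8D undefined, so strict decoding fails,
# whereas Windows maps both bytes to the C1 controls U+0081 and U+008D.
print(try_decode(b'\x81\x8d', 'cp1252'))
```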
|
msg280915 - (view) |
Author: Mingye Wang (Artoria2e5) * |
Date: 2016-11-16 05:44 |
> Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.
... but since Cygwin packagers did not enable Win32 APIs for their build, I cannot test the script directly.
|
msg280918 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-11-16 06:51 |
It seems to me there is something wrong with your test. For example, decoding b'\x81\x8d' from CP1251 (as well as from any other codepage!) gives you u'\x81\x8d', but codes 0x81 and 0x8D are assigned to different characters: 'Ѓ' (U+0403) and 'Ќ' (U+040C).
0x81 0x0403 #CYRILLIC CAPITAL LETTER GJE
0x8D 0x040C #CYRILLIC CAPITAL LETTER KJE
[1] https://en.wikipedia.org/wiki/Windows-1251
[2] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT
[3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1251.txt
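The two table entries quoted above can be checked against Python's own `cp1251` codec. This is a small sketch; the helper `describe` is a hypothetical name introduced only for illustration.

```python
import unicodedata

def describe(byte_value, encoding):
    """Decode one byte and report its code point and Unicode character name."""
    ch = bytes([byte_value]).decode(encoding)
    return 'U+%04X %s' % (ord(ch), unicodedata.name(ch))

for b in (0x81, 0x8D):
    print('0x%02X' % b, describe(b, 'cp1251'))
```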
|
msg280944 - (view) |
Author: Mingye Wang (Artoria2e5) * |
Date: 2016-11-16 13:36 |
Ugh... This is weird. Attached is a corrected version that uses Python 3.6's 'code page' methods. I have modified the script a little to make sure it runs on Py3.
|
msg280949 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-11-16 13:53 |
What is the output of the new script?
|
msg280956 - (view) |
Author: Mingye Wang (Artoria2e5) * |
Date: 2016-11-16 14:31 |
The output is already attached as win10_14959_py36.txt.
PS: after playing with ctypes, I got a version of pycp that works with Py < 3.3 too (attached with comment).
|
msg280965 - (view) |
Author: Steve Dower (steve.dower) * |
Date: 2016-11-16 16:48 |
So is this a bug in the hardcoded encoding tables in Python? I briefly considered making them all use the OS functions, but then they'll be inconsistent with other platforms (where the tables should work fine).
Do you have a proposed fix? That will help illustrate where the problem is.
|
msg280967 - (view) |
Author: Mingye Wang (Artoria2e5) * |
Date: 2016-11-16 16:55 |
Yes, it's a table issue. My suggested fix is to replace them all with the WindowsBestFit tables, which is where MS currently redirects visitors of https://msdn.microsoft.com/en-us/globalization/mt767590. The old "WINDOWS" tables appear to have been abandoned long ago.
|
msg280970 - (view) |
Author: Mingye Wang (Artoria2e5) * |
Date: 2016-11-16 17:03 |
... On the other hand, I am happy to use these Win32 functions if they are faster, but still the table should be made correct in the first place. (See also issue28343 (936) and issue28693 (950) for problems with DBCS Chinese code pages.)
|
msg280971 - (view) |
Author: Steve Dower (steve.dower) * |
Date: 2016-11-16 17:40 |
No idea which is faster, but the tables have better compatibility.
However, I'm not sure that changing the tables in already released versions is a great idea, since it could "corrupt" programs without warning. Adding the release managers to weigh in - my gut feel is that targeted table fixes plus validation tests are okay for 3.6 if we hurry, but are probably not suitable for 2.7 or 3.5.
|
msg280973 - (view) |
Author: Ned Deily (ned.deily) * |
Date: 2016-11-16 17:48 |
I'm not qualified to offer a technical opinion on Windows matters like this so, for 3.6, I leave it to your discretion, Steve. If you do decide to push this change, please do so before 3.6.0b4 on Monday.
|
msg280974 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-11-16 17:53 |
Codecs are strict by default in Python. Call MultiByteToWideChar() with the MB_ERR_INVALID_CHARS flag as Python does. You also could use _codecs.code_page_decode().
|
msg280979 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-11-16 19:18 |
Serhiy, single-byte codepages map every byte value, even if it's just to a Unicode C1 control code [1].
For example:
import ctypes
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
MB_ERR_INVALID_CHARS = 0x00000008
def mbtwc_errcheck(result, func, args):
    if not result and args[-1]:
        raise ctypes.WinError(ctypes.get_last_error())
    return args
kernel32.MultiByteToWideChar.errcheck = mbtwc_errcheck
def decode(codepage, data, strict=True):
    flags = MB_ERR_INVALID_CHARS if strict else 0
    n = kernel32.MultiByteToWideChar(codepage, flags,
                                     data, len(data),
                                     None, 0)
    buf = (ctypes.c_wchar * n)()
    kernel32.MultiByteToWideChar(codepage, flags,
                                 data, len(data),
                                 buf, n)
    return buf.value
codepages = [437, 874] + list(range(1250, 1259))
for cp in codepages:
    print('cp%d:' % cp, ascii(decode(cp, b'\x81\x8d')))
Output:
cp437: '\xfc\xec'
cp874: '\x81\x8d'
cp1250: '\x81\u0164'
cp1251: '\u0403\u040c'
cp1252: '\x81\x8d'
cp1253: '\x81\x8d'
cp1254: '\x81\x8d'
cp1255: '\x81\x8d'
cp1256: '\u067e\u0686'
cp1257: '\x81\xa8'
cp1258: '\x81\x8d'
[1]: https://en.wikipedia.org/wiki/C0_and_C1_control_codes
|
msg280980 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-11-16 19:37 |
Thanks Eryk. Could you please run the following script and attach the output?
import codecs
codepages = [424, 856, 857, 864, 869, 874, 932, 949, 950, 1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258]
for cp in codepages:
    table = []
    for i in range(256):
        try:
            c = codecs.code_page_decode(cp, bytes([i]), None, True)
        except Exception:
            c = None
        table.append(c)
    print(cp, ascii(table))
|
msg280983 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-11-16 20:02 |
How about just the ASCII repr of the 256 decoded characters in CSV? I don't think the list of 2-tuple results is useful. For these single-byte codepages it's always 1 byte consumed.
|
msg280986 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-11-16 20:19 |
This would be helpful too if every byte is decoded to exactly 1 character.
|
msg280989 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-11-16 20:38 |
I don't think the 2nd tuple element is useful when decoding a single byte. It either works or it doesn't, such as failing for non-ASCII bytes with multibyte codepages such as 932 and 950.
I'm attaching the output from the following, which you should be able to open in a spreadsheet:
import codecs
codepages = [424, 856, 857, 864, 869, 874, 932, 949, 950,
             1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258]
for cp in codepages:
    table = []
    for i in range(256):
        try:
            c = codecs.code_page_decode(cp, bytes([i]), None, True)
            c = ascii(c[0])
        except Exception:
            c = None
        table.append(c)
    print(cp, *table, sep=',')
|
msg280990 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-11-16 21:06 |
I rewrote it using the csv module since I can't remember the escaping rules.
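For a side-by-side comparison with the attached Windows output, the same kind of table can be produced from Python's built-in codecs on any platform. The sketch below is not the attached script; the function names `builtin_table` and `write_tables` are illustrative, and it relies on the csv module for escaping, as the rewritten version does.

```python
import csv
import io

def builtin_table(encoding):
    """ASCII repr of each single byte decoded with Python's own codec, None on error."""
    row = []
    for i in range(256):
        try:
            row.append(ascii(bytes([i]).decode(encoding)))
        except UnicodeDecodeError:
            row.append(None)
    return row

def write_tables(encodings, stream):
    """Write one CSV row per encoding; csv.writer emits None as an empty field."""
    writer = csv.writer(stream)
    for enc in encodings:
        writer.writerow([enc] + builtin_table(enc))

out = io.StringIO()
write_tables(['cp1251', 'cp1252', 'cp932'], out)
print(out.getvalue())
```

Empty fields in a row mark bytes that Python's strict codec rejects, which are the cells to compare against the Windows results.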
|
msg281014 - (view) |
Author: Mingye Wang (Artoria2e5) * |
Date: 2016-11-17 00:18 |
> Codecs are strict by default in Python. Call MultiByteToWideChar() with the MB_ERR_INVALID_CHARS flag as Python does.
Great catch. Without MB_ERR_INVALID_CHARS or WC_NO_BEST_FIT_CHARS, Windows performs the "best fit" behavior described in the BestFit files. These mappings are not marked explicitly in the files (they didn't add '<< Best Fit Mapping' as the readme does), so identifying them requires checking whether a reverse mapping exists[1]. When MB_ERR_INVALID_CHARS is set, Windows performs a strict check.
[1]: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
By the way, will there be a 'mbcsbestfitreplace' error handler on Windows to invoke "best fit" behavior? It might be useful for interoperating with common Windows programs and users. (Implementation for other platforms can be constructed from WindowsBestFit charts, but it might be too large relative to its usefulness.)
|
msg281019 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-11-17 02:25 |
The ANSI and OEM codepages are conveniently supported on a Windows system as the encodings 'mbcs' and 'oem' (new in 3.6). The best-fit mapping is used by the 'replace' error handler (see the encode_code_page_flags function in Objects/unicodeobject.c). For other Windows codepages, while it's not as convenient, you can use codecs.code_page_encode. For example:
>>> codecs.code_page_encode(1252, 'α', 'replace')
(b'a', 1)
For decoding, MB_ERR_INVALID_CHARS has no effect on single-byte codepages because they map every byte. It only affects decoding byte sequences that are invalid in multibyte codepages such as 932 and 65001. Without this flag, invalid sequences are silently decoded as the codepage's Unicode default character. This is usually "?", but for 932 it's Katakana middle dot (U+30FB), and for UTF-8 it's U+FFFD. codecs.code_page_decode almost always uses MB_ERR_INVALID_CHARS; the exception is UTF-7 (see the decode_code_page_flags function). So its 'replace' error handling is completely Python's own implementation. For example:
MultiByteToWideChar without MB_ERR_INVALID_CHARS:
>>> decode(932, b'\xe05', strict=False)
'\u30fb'
versus code_page_decode:
>>> codecs.code_page_decode(932, b'\xe05', 'replace', True)
('\ufffd5', 2)
|
msg281023 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2016-11-17 06:58 |
The Windows API documentation is not easy to understand. I wrote this doc when I fixed code pages in Python 3:
http://unicodebook.readthedocs.io/operating_systems.html#windows
|
msg281026 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-11-17 08:07 |
Thank you Eryk. That is what I want. I just missed that code_page_decode() returns a tuple.
It seems that Windows maps undefined codes to Unicode characters if they are in the range 0x80-0x9f and raises an error if they are outside of this range. But if the code starts a multibyte sequence, the single byte is an error even if it is in the range 0x80-0x9f (codepages 932, 949, 950).
This could be emulated either by decoding with errors='surrogateescape' and postprocessing the result (replace '\udc80'-'\udc9f' with '\x80'-'\x9f' and handle '\udca0'-'\udcff' as an error) or by writing a custom error handler that does the job (but several error handlers would perhaps be needed, corresponding to 'strict', 'replace', 'ignore', etc.). Adding a new codec is of course an option too.
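The surrogateescape approach could look roughly like the sketch below. It assumes a single-byte code page (so that character positions equal byte positions) and implements only the 'strict' flavour; the function name `decode_like_windows` is hypothetical.

```python
def decode_like_windows(data, encoding):
    """Strict decode that keeps undefined bytes 0x80-0x9F as C1 controls.

    Assumption: `encoding` is a single-byte codec, so the index of a character
    in the decoded text is also the index of its byte in `data`.
    """
    text = data.decode(encoding, errors='surrogateescape')
    out = []
    for pos, ch in enumerate(text):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDC9F:
            # Undefined byte in 0x80-0x9F: map it to the same C1 code point.
            out.append(chr(code - 0xDC00))
        elif 0xDCA0 <= code <= 0xDCFF:
            # Undefined byte above 0x9F: still an error in strict mode.
            raise UnicodeDecodeError(encoding, data, pos, pos + 1,
                                     'character maps to <undefined>')
        else:
            out.append(ch)
    return ''.join(out)

for name in ('cp1250', 'cp1251', 'cp1252', 'cp1257'):
    print(name, ascii(decode_like_windows(b'\x81\x8d', name)))
```

A 'replace' or 'ignore' variant would change only the branch that currently raises.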
There are a few other minor differences between Python and Windows:
cp864: On Windows 0x25 is mapped to '%' (U+0025) instead of '٪' (U+066A).
cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.
cp1255: 0xCA is mapped to U+05BA instead of being undefined.
The first two differences can be handled by postprocessing; the last one requires changing the codec.
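As an illustration of the postprocessing idea for cp864, here is a minimal sketch. It assumes that 0x25 is the only byte that Python's cp864 table maps to U+066A; the function name is hypothetical.

```python
def decode_cp864_like_windows(data):
    """Decode with Python's cp864 table, then undo the 0x25 -> U+066A mapping."""
    # Assumption: no other byte decodes to U+066A, so the replacement is safe.
    return data.decode('cp864').replace('\u066a', '%')

print(ascii(b'25%'.decode('cp864')))            # Python's table
print(ascii(decode_cp864_like_windows(b'25%')))  # Windows-like result
```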
|
msg281044 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-11-17 15:54 |
Thanks, Serhiy. When I looked at this previously, I mistakenly assumed that any undefined codes would be decoded using the codepage's default Unicode character. But for single-byte codepages, Windows instead maps undefined codes above 0x9F to the Private Use Area (PUA). For example, using decode() from above:
ERROR_NO_UNICODE_TRANSLATION = 0x0459
codepages = 857, 864, 874, 1253, 1255, 1257
for cp in codepages:
    undefined = []
    for i in range(256):
        b = bytes([i])
        try:
            decode(cp, b)
        except OSError as e:
            if e.winerror == ERROR_NO_UNICODE_TRANSLATION:
                c = decode(cp, b, False)
                undefined.append('{:02x}=>{:04x}'.format(ord(b), ord(c)))
    print(cp, *undefined, sep=', ')
output:
857, d5=>f8bb, e7=>f8bc, f2=>f8bd
864, a6=>f8be, a7=>f8bf, ff=>f8c0
874, db=>f8c1, dc=>f8c2, dd=>f8c3, de=>f8c4, fc=>f8c5, fd=>f8c6, fe=>f8c7, ff=>f8c8
1253, aa=>f8f9, d2=>f8fa, ff=>f8fb
1255, d9=>f88d, da=>f88e, db=>f88f, dc=>f890, dd=>f891, de=>f892, df=>f893, fb=>f894, fc=>f895, ff=>f896
1257, a1=>f8fc, a5=>f8fd
Do you think Python's 'replace' handler should prevent adding the MB_ERR_INVALID_CHARS flag for PyUnicode_DecodeCodePageStateful? One benefit is that the PUA code can be encoded back to the original byte value:
>>> codecs.code_page_encode(1257, '\uf8fd')
(b'\xa5', 1)
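The round trip can be imitated with Python's table codecs on any platform by using a custom error handler. The sketch below covers cp1257 only, takes its two PUA assignments from the output listed above, and registers a handler under the made-up name 'pua1257'; it is an illustration, not a proposal for how PyUnicode_DecodeCodePageStateful should behave.

```python
import codecs

# Byte -> PUA character, copied from the cp1257 line of the output above.
PUA_1257 = {0xA1: '\uf8fc', 0xA5: '\uf8fd'}
PUA_1257_REVERSE = {ch: bytes([b]) for b, ch in PUA_1257.items()}

def pua1257(exc):
    """Map undefined cp1257 bytes to PUA characters and back; re-raise otherwise."""
    chunk = exc.object[exc.start:exc.end]
    try:
        if isinstance(exc, UnicodeDecodeError):
            return ''.join(PUA_1257[b] for b in chunk), exc.end
        if isinstance(exc, UnicodeEncodeError):
            # Encoding error handlers may return bytes, which are copied verbatim.
            return b''.join(PUA_1257_REVERSE[c] for c in chunk), exc.end
    except KeyError:
        pass
    raise exc

codecs.register_error('pua1257', pua1257)

text = b'\xa5'.decode('cp1257', 'pua1257')
print(ascii(text), text.encode('cp1257', 'pua1257'))
```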
> cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.
Windows maps these byte values to PUA codes if the MB_ERR_INVALID_CHARS flag isn't used:
>>> decode(932, b'\xa0\xfd\xfe\xff', False)
'\uf8f0\uf8f1\uf8f2\uf8f3'
|
|
Date | User | Action | Args |
2022-04-11 14:58:39 | admin | set | github: 72898 |
2021-03-08 19:02:53 | vstinner | set | nosy: - vstinner |
2021-03-04 22:30:28 | larry | set | nosy: - larry |
2021-03-04 21:56:19 | eryksun | set | versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.5, Python 3.6, Python 3.7 |
2016-11-17 15:54:34 | eryksun | set | messages: + msg281044 |
2016-11-17 08:07:49 | serhiy.storchaka | set | messages: + msg281026 |
2016-11-17 06:58:23 | vstinner | set | messages: + msg281023 |
2016-11-17 02:25:03 | eryksun | set | messages: + msg281019 |
2016-11-17 00:18:06 | Artoria2e5 | set | messages: + msg281014 |
2016-11-16 21:06:46 | eryksun | set | files: + codepage_table.csv; messages: + msg280990 |
2016-11-16 20:52:11 | eryksun | set | files: - codepage_table.csv |
2016-11-16 20:38:18 | eryksun | set | files: + codepage_table.csv; messages: + msg280989 |
2016-11-16 20:19:04 | serhiy.storchaka | set | messages: + msg280986 |
2016-11-16 20:02:04 | eryksun | set | messages: + msg280983 |
2016-11-16 19:37:39 | serhiy.storchaka | set | messages: + msg280980 |
2016-11-16 19:18:00 | eryksun | set | nosy: + eryksun; messages: + msg280979 |
2016-11-16 17:53:56 | serhiy.storchaka | set | messages: + msg280974 |
2016-11-16 17:48:12 | ned.deily | set | messages: + msg280973 |
2016-11-16 17:40:03 | steve.dower | set | nosy: + larry, benjamin.peterson, ned.deily; messages: + msg280971 |
2016-11-16 17:03:32 | Artoria2e5 | set | messages: + msg280970 |
2016-11-16 16:55:20 | Artoria2e5 | set | messages: + msg280967 |
2016-11-16 16:48:11 | steve.dower | set | messages: + msg280965; versions: - Python 3.3, Python 3.4 |
2016-11-16 16:11:14 | serhiy.storchaka | set | nosy: + paul.moore, tim.golden, zach.ware, steve.dower; components: + Windows |
2016-11-16 14:31:34 | Artoria2e5 | set | files: - pycp.py |
2016-11-16 14:31:21 | Artoria2e5 | set | files: + pycp_ctypes.py; messages: + msg280956 |
2016-11-16 13:53:22 | serhiy.storchaka | set | messages: + msg280949 |
2016-11-16 13:37:31 | Artoria2e5 | set | files: - pycp.py |
2016-11-16 13:37:23 | Artoria2e5 | set | files: + pycp.py |
2016-11-16 13:36:40 | Artoria2e5 | set | files: + win10_14959_py36.txt; messages: + msg280944 |
2016-11-16 06:51:00 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg280918 |
2016-11-16 05:44:36 | Artoria2e5 | set | messages: + msg280915 |
2016-11-16 05:40:54 | Artoria2e5 | set | files: + windows10_14959.txt |
2016-11-16 05:40:28 | Artoria2e5 | create | |