classification
Title: Non-Windows mappings for a couple of Windows code pages
Type: behavior Stage:
Components: Unicode, Windows Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Artoria2e5, benjamin.peterson, eryksun, ezio.melotti, larry, ned.deily, paul.moore, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords:

Created on 2016-11-16 05:40 by Artoria2e5, last changed 2016-11-17 15:54 by eryksun.

Files
File name Uploaded Description
windows10_14959.txt Artoria2e5, 2016-11-16 05:40 Windows 10b14959 output
win10_14959_py36.txt Artoria2e5, 2016-11-16 13:36 Correct Windows 10b14959 output
pycp_ctypes.py Artoria2e5, 2016-11-16 14:31 A test script that runs on Windows and uses native mbcs as a reference.
codepage_table.csv eryksun, 2016-11-16 21:06
Messages (23)
msg280914 - (view) Author: Mingye Wang (Artoria2e5) * Date: 2016-11-16 05:40
Mappings for 0x81 and 0x8D in multiple Windows code pages diverge from what Windows does. Attached is a script that tests for this behavior. (These two bytes are not necessarily the only problems, but they are certainly the most widespread and best-known ones. Again, refer to the Unicode best-fit tables for something that works.)

This problem is seen in Python 2.7.10 on Windows 10b14959, but apparently it has been known for a long time[1]. Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.
  [1]: https://ftfy.readthedocs.io/en/latest/#module-ftfy.bad_codecs.sloppy
msg280915 - (view) Author: Mingye Wang (Artoria2e5) * Date: 2016-11-16 05:44
> Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.

... but since Cygwin packagers did not enable Win32 APIs for their build, I cannot test the script directly.
msg280918 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-16 06:51
It seems to me there is something wrong with your test. For example, decoding b'\x81\x8d' from CP1251 (as well as from any other codepage!) gives you u'\x81\x8d', but codes 0x81 and 0x8D are assigned to different characters: 'Ѓ' (U+0403) and 'Ќ' (U+040C).

0x81	0x0403	#CYRILLIC CAPITAL LETTER GJE
0x8D	0x040C	#CYRILLIC CAPITAL LETTER KJE

[1] https://en.wikipedia.org/wiki/Windows-1251
[2] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT
[3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1251.txt
msg280944 - (view) Author: Mingye Wang (Artoria2e5) * Date: 2016-11-16 13:36
Ugh... This is weird. Attached is a corrected version that uses Python 3.6's 'code page' methods. I have modified the script a little to make sure it runs on Py3.
msg280949 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-16 13:53
What is the output of the new script?
msg280956 - (view) Author: Mingye Wang (Artoria2e5) * Date: 2016-11-16 14:31
The output is already attached as win10_14959_py36.txt.

PS: after playing with ctypes, I got a version of pycp that works with Py < 3.3 too (attached with comment).
msg280965 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-11-16 16:48
So is this a bug in the hardcoded encoding tables in Python? I briefly considered making them all use the OS functions, but then they'll be inconsistent with other platforms (where the tables should work fine).

Do you have a proposed fix? That will help illustrate where the problem is.
msg280967 - (view) Author: Mingye Wang (Artoria2e5) * Date: 2016-11-16 16:55
Yes, it's a table issue. My suggested fix is to replace them all with the WindowsBestFit tables, which is where MS currently redirects visitors of https://msdn.microsoft.com/en-us/globalization/mt767590. The old "WINDOWS" tables appear to have been abandoned long ago.
msg280970 - (view) Author: Mingye Wang (Artoria2e5) * Date: 2016-11-16 17:03
... On the other hand, I am happy to use these Win32 functions if they are faster, but still the table should be made correct in the first place. (See also issue28343 (936) and issue28693 (950) for problems with DBCS Chinese code pages.)
msg280971 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-11-16 17:40
No idea which is faster, but the tables have better compatibility.

However, I'm not sure that changing the tables in already released versions is a great idea, since it could "corrupt" programs without warning. Adding the release managers to weigh in - my gut feel is that targeted table fixes plus validation tests are okay for 3.6 if we hurry, but are probably not suitable for 2.7 or 3.5.
msg280973 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-11-16 17:48
I'm not qualified to offer a technical opinion on Windows matters like this so, for 3.6, I leave it to your discretion, Steve.  If you do decide to push this change, please do so before 3.6.0b4 on Monday.
msg280974 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-16 17:53
Codecs are strict by default in Python. Call MultiByteToWideChar() with the MB_ERR_INVALID_CHARS flag as Python does. You also could use _codecs.code_page_decode().
msg280979 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-11-16 19:18
Serhiy, single-byte codepages map every byte value, even if it's just to a Unicode C1 control code [1]. 

For example:

    import ctypes
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    MB_ERR_INVALID_CHARS = 0x00000008

    def mbtwc_errcheck(result, func, args):
        if not result and args[-1]:
            raise ctypes.WinError(ctypes.get_last_error())
        return args

    kernel32.MultiByteToWideChar.errcheck = mbtwc_errcheck

    def decode(codepage, data, strict=True):
        flags = MB_ERR_INVALID_CHARS if strict else 0
        n = kernel32.MultiByteToWideChar(codepage, flags,
                                         data, len(data),
                                         None, 0)
        buf = (ctypes.c_wchar * n)()
        kernel32.MultiByteToWideChar(codepage, flags,
                                     data, len(data),
                                     buf, n)
        return buf.value


    codepages = [437, 874] + list(range(1250, 1259))
    for cp in codepages:
        print('cp%d:' % cp, ascii(decode(cp, b'\x81\x8d')))

Output:
    
    cp437: '\xfc\xec'
    cp874: '\x81\x8d'
    cp1250: '\x81\u0164'
    cp1251: '\u0403\u040c'
    cp1252: '\x81\x8d'
    cp1253: '\x81\x8d'
    cp1254: '\x81\x8d'
    cp1255: '\x81\x8d'
    cp1256: '\u067e\u0686'
    cp1257: '\x81\xa8'
    cp1258: '\x81\x8d'

[1]: https://en.wikipedia.org/wiki/C0_and_C1_control_codes
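
For comparison, Python's built-in charmap codecs leave some of these bytes undefined instead of mapping them to C1 controls. The following is a minimal sketch that lists the bytes a codec rejects in strict mode. The helper name undefined_bytes is made up for illustration, and it only exercises Python's own tables, not the Win32 API:

```python
def undefined_bytes(encoding):
    """Return the single-byte values that `encoding` refuses to decode strictly."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# Python's cp1252 table leaves five bytes unassigned.
print([hex(v) for v in undefined_bytes('cp1252')])
```

With Python's cp1252 table this reports 0x81, 0x8d, 0x8f, 0x90 and 0x9d, whereas the MultiByteToWideChar output above maps 0x81 and 0x8d to U+0081 and U+008D.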
msg280980 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-16 19:37
Thanks Eryk. Could you please run the following script and attach the output?

    import codecs
    codepages = [424, 856, 857, 864, 869, 874, 932, 949, 950, 1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258]
    for cp in codepages:
        table = []
        for i in range(256):
            try:
                c = codecs.code_page_decode(cp, bytes([i]), None, True)
            except Exception:
                c = None
            table.append(c)
        print(cp, ascii(table))
msg280983 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-11-16 20:02
How about just the ASCII repr of the 256 decoded characters in CSV? I don't think the list of 2-tuple results is useful. For these single-byte codepages it's always 1 byte consumed.
msg280986 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-16 20:19
This would be helpful too if every byte is decoded to exactly 1 character.
msg280989 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-11-16 20:38
I don't think the 2nd tuple element is useful when decoding a single byte. It either works or it doesn't, such as failing for non-ASCII bytes with multibyte codepages such as 932 and 950. 

I'm attaching the output from the following, which you should be able to open in a spreadsheet:

    import codecs
    codepages = [424, 856, 857, 864, 869, 874, 932, 949, 950,
                 1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258]
    for cp in codepages:
        table = []
        for i in range(256):
            try:
                c = codecs.code_page_decode(cp, bytes([i]), None, True)
                c = ascii(c[0])
            except Exception:
                c = None
            table.append(c)
        print(cp, *table, sep=',')
msg280990 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-11-16 21:06
I rewrote it using the csv module since I can't remember the escaping rules.
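
The rewritten script itself is not reproduced in this thread. A csv-module version might look roughly like the sketch below. It is not the attached script: the names write_table and builtin_decode_byte are hypothetical, and the default decoder uses Python's own 'cpNNN' tables so that the sketch runs on any platform. On Windows, a wrapper around codecs.code_page_decode could be passed as decode_byte instead.

```python
import csv
import io


def builtin_decode_byte(codepage, value):
    """Stand-in decoder (an assumption): use Python's own 'cpNNN' table."""
    try:
        return bytes([value]).decode('cp%d' % codepage)
    except UnicodeDecodeError:
        return None


def write_table(stream, codepages, decode_byte=builtin_decode_byte):
    """Write one CSV row per code page: the number, then 256 ASCII reprs."""
    writer = csv.writer(stream)  # csv takes care of quoting and escaping
    for cp in codepages:
        row = [cp]
        for value in range(256):
            char = decode_byte(cp, value)
            # An empty cell marks a byte that failed to decode.
            row.append('' if char is None else ascii(char))
        writer.writerow(row)


buffer = io.StringIO()
write_table(buffer, [1252, 1251])
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[0][0], len(rows[0]))
```

Each row has 257 cells, so a spreadsheet shows the code page number in the first column and byte values 0x00 to 0xFF in the remaining ones.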
msg281014 - (view) Author: Mingye Wang (Artoria2e5) * Date: 2016-11-17 00:18
> Codecs are strict by default in Python. Call MultiByteToWideChar() with the MB_ERR_INVALID_CHARS flag as Python does.

Great catch. Without MB_ERR_INVALID_CHARS or WC_NO_BEST_FIT_CHARS Windows would perform the "best fit" behavior described in the BestFit files, which is not marked explicitly (they didn't add '<< Best Fit Mapping' like in the readme) in these files and requires checking for existence of reverse mapping[1]. When MB_ERR_INVALID_CHARS is set, Windows would perform a strict check.
  [1]: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

By the way, will there be a 'mbcsbestfitreplace' error handler on Windows to invoke "best fit" behavior? It might be useful for interoperating with common Windows programs and users. (Implementation for other platforms can be constructed from WindowsBestFit charts, but it might be too large relative to its usefulness.)
msg281019 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-11-17 02:25
The ANSI and OEM codepages are conveniently supported on a Windows system as the encodings 'mbcs' and 'oem' (new in 3.6). The best-fit mapping is used by the 'replace' error handler (see the encode_code_page_flags function in Objects/unicodeobject.c). For other Windows codepages, while it's not as convenient, you can use codecs.code_page_encode. For example:

    >>> codecs.code_page_encode(1252, 'α', 'replace')
    (b'a', 1)

For decoding, MB_ERR_INVALID_CHARS has no effect on decoding single-byte codepages because they map every byte. It only affects decoding byte sequences that are invalid in multibyte codepages such as 932 and 65001. Without this flag, invalid sequences are silently decoded as the codepage's Unicode default character. This is usually "?", but for 932 it's Katakana middle dot (U+30FB), and for UTF-8 it's U+FFFD. codecs.code_page_decode uses MB_ERR_INVALID_CHARS almost always, except not for UTF-7 (see the decode_code_page_flags function). So its 'replace' error handling is completely Python's own implementation. For example:

MultiByteToWideChar without MB_ERR_INVALID_CHARS:

    >>> decode(932, b'\xe05', strict=False)
    '\u30fb'

versus code_page_decode:

    >>> codecs.code_page_decode(932, b'\xe05', 'replace', True)
    ('\ufffd5', 2)
msg281023 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-17 06:58
Windows API doc is not easy to understand. I wrote this doc when I fixed
code pages in Python 3:
http://unicodebook.readthedocs.io/operating_systems.html#windows
msg281026 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-17 08:07
Thank you Eryk. That is what I want. I just missed that code_page_decode() returns a tuple.

It seems Windows maps undefined codes to Unicode characters if they are in the range 0x80-0x9f and raises an error if they are outside of this range. But if the code starts a multibyte sequence, the single byte is an error even if it is in the range 0x80-0x9f (codepages 932, 949, 950).

This could be emulated either by decoding with errors='surrogateescape' and postprocessing the result (replace '\udc80'-'\udc9f' with '\x80'-'\x9f' and handle '\udca0'-'\udcff' as errors) or by writing a custom error handler that does the job (although several error handlers might be needed, corresponding to 'strict', 'replace', 'ignore', etc.). Adding a new codec is of course an option too.
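
The surrogateescape variant could be sketched as follows. This is only an illustration of the postprocessing idea, assuming a single-byte codec (so that character positions equal byte positions) and strict handling of the remaining undefined bytes. The function name decode_c1_passthrough is made up for this example:

```python
def decode_c1_passthrough(data, encoding):
    """Decode with a single-byte codec, passing undefined 0x80-0x9F bytes
    through as C1 controls and rejecting undefined 0xA0-0xFF bytes."""
    # Undefined bytes come back as lone surrogates U+DC80..U+DCFF.
    text = data.decode(encoding, 'surrogateescape')
    chars = []
    for index, ch in enumerate(text):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDC9F:
            # Undefined byte in 0x80-0x9F: map to the same code point.
            chars.append(chr(code - 0xDC00))
        elif 0xDCA0 <= code <= 0xDCFF:
            # Undefined byte in 0xA0-0xFF: treat as an error (strict mode).
            raise UnicodeDecodeError(encoding, data, index, index + 1,
                                     'character maps to <undefined>')
        else:
            chars.append(ch)
    return ''.join(chars)


print(ascii(decode_c1_passthrough(b'\x80\x81\x8d', 'cp1252')))
```

Bytes that the codec already defines are unaffected, because no codec table maps a byte to a surrogate code point.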

There are a few other minor differences between Python and Windows:

cp864: On Windows 0x25 is mapped to '%' (U+0025) instead of '٪' (U+066A).
cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.
cp1255: 0xCA is mapped to U+05BA instead of being undefined.

The first two differences can be handled by postprocessing; the last one requires changing the codec.
msg281044 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-11-17 15:54
Thanks, Serhiy. When I looked at this previously, I mistakenly assumed that any undefined codes would be decoded using the codepage's default Unicode character. But for single-byte codepages in the range above 0x9F, Windows instead maps undefined codes to the Private Use Area (PUA). For example, using decode() from above:

    ERROR_NO_UNICODE_TRANSLATION = 0x0459
    codepages = 857, 864, 874, 1253, 1255, 1257
    for cp in codepages:
        undefined = []
        for i in range(256):
            b = bytes([i])
            try:
                decode(cp, b)
            except OSError as e:
                if e.winerror == ERROR_NO_UNICODE_TRANSLATION:
                    c = decode(cp, b, False)
                    undefined.append('{:02x}=>{:04x}'.format(ord(b), ord(c)))
        print(cp, *undefined, sep=', ')

output:

        857, d5=>f8bb, e7=>f8bc, f2=>f8bd
        864, a6=>f8be, a7=>f8bf, ff=>f8c0
        874, db=>f8c1, dc=>f8c2, dd=>f8c3, de=>f8c4, fc=>f8c5, fd=>f8c6, fe=>f8c7, ff=>f8c8
        1253, aa=>f8f9, d2=>f8fa, ff=>f8fb
        1255, d9=>f88d, da=>f88e, db=>f88f, dc=>f890, dd=>f891, de=>f892, df=>f893, fb=>f894, fc=>f895, ff=>f896
        1257, a1=>f8fc, a5=>f8fd

Do you think Python's 'replace' handler should prevent adding the MB_ERR_INVALID_CHARS flag for PyUnicode_DecodeCodePageStateful? One benefit is that the PUA code can be encoded back to the original byte value:

    >>> codecs.code_page_encode(1257, '\uf8fd')
    (b'\xa5', 1)

> cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.

Windows maps these byte values to PUA codes if the MB_ERR_INVALID_CHARS flag isn't used:

    >>> decode(932, b'\xa0\xfd\xfe\xff', False)
    '\uf8f0\uf8f1\uf8f2\uf8f3'
History
Date User Action Args
2016-11-17 15:54:34 eryksun set messages: + msg281044
2016-11-17 08:07:49 serhiy.storchaka set messages: + msg281026
2016-11-17 06:58:23 vstinner set messages: + msg281023
2016-11-17 02:25:03 eryksun set messages: + msg281019
2016-11-17 00:18:06 Artoria2e5 set messages: + msg281014
2016-11-16 21:06:46 eryksun set files: + codepage_table.csv; messages: + msg280990
2016-11-16 20:52:11 eryksun set files: - codepage_table.csv
2016-11-16 20:38:18 eryksun set files: + codepage_table.csv; messages: + msg280989
2016-11-16 20:19:04 serhiy.storchaka set messages: + msg280986
2016-11-16 20:02:04 eryksun set messages: + msg280983
2016-11-16 19:37:39 serhiy.storchaka set messages: + msg280980
2016-11-16 19:18:00 eryksun set nosy: + eryksun; messages: + msg280979
2016-11-16 17:53:56 serhiy.storchaka set messages: + msg280974
2016-11-16 17:48:12 ned.deily set messages: + msg280973
2016-11-16 17:40:03 steve.dower set nosy: + larry, benjamin.peterson, ned.deily; messages: + msg280971
2016-11-16 17:03:32 Artoria2e5 set messages: + msg280970
2016-11-16 16:55:20 Artoria2e5 set messages: + msg280967
2016-11-16 16:48:11 steve.dower set messages: + msg280965; versions: - Python 3.3, Python 3.4
2016-11-16 16:11:14 serhiy.storchaka set nosy: + paul.moore, tim.golden, zach.ware, steve.dower; components: + Windows
2016-11-16 14:31:34 Artoria2e5 set files: - pycp.py
2016-11-16 14:31:21 Artoria2e5 set files: + pycp_ctypes.py; messages: + msg280956
2016-11-16 13:53:22 serhiy.storchaka set messages: + msg280949
2016-11-16 13:37:31 Artoria2e5 set files: - pycp.py
2016-11-16 13:37:23 Artoria2e5 set files: + pycp.py
2016-11-16 13:36:40 Artoria2e5 set files: + win10_14959_py36.txt; messages: + msg280944
2016-11-16 06:51:00 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg280918
2016-11-16 05:44:36 Artoria2e5 set messages: + msg280915
2016-11-16 05:40:54 Artoria2e5 set files: + windows10_14959.txt
2016-11-16 05:40:28 Artoria2e5 create