Message 281044 - Python tracker

Author	eryksun
Recipients	Artoria2e5, benjamin.peterson, eryksun, ezio.melotti, larry, ned.deily, paul.moore, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Date	2016-11-17.15:54:34
Message-id	<1479398074.31.0.0391134593896.issue28712@psf.upfronthosting.co.za>

Content
Thanks, Serhiy. When I looked at this previously, I mistakenly assumed that any undefined codes would be decoded using the codepage's default Unicode character. But for single-byte codepages, Windows instead maps undefined codes in the range above 0x9F to the Private Use Area (PUA). For example, using decode() from above:

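    # Win32 error code reported when a byte has no Unicode translation.
    ERROR_NO_UNICODE_TRANSLATION = 0x0459
    codepages = 857, 864, 874, 1253, 1255, 1257
    for cp in codepages:
        undefined = []
        for i in range(256):
            b = bytes([i])
            try:
                # Strict decode: fails for bytes the codepage leaves undefined.
                decode(cp, b)
            except OSError as e:
                if e.winerror == ERROR_NO_UNICODE_TRANSLATION:
                    # Decode again without MB_ERR_INVALID_CHARS to see the mapping.
                    c = decode(cp, b, False)
                    undefined.append('{:02x}=>{:04x}'.format(ord(b), ord(c)))
        print(cp, *undefined, sep=', ')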

Output:

    857, d5=>f8bb, e7=>f8bc, f2=>f8bd
    864, a6=>f8be, a7=>f8bf, ff=>f8c0
    874, db=>f8c1, dc=>f8c2, dd=>f8c3, de=>f8c4, fc=>f8c5, fd=>f8c6, fe=>f8c7, ff=>f8c8
    1253, aa=>f8f9, d2=>f8fa, ff=>f8fb
    1255, d9=>f88d, da=>f88e, db=>f88f, dc=>f890, dd=>f891, de=>f892, df=>f893, fb=>f894, fc=>f895, ff=>f896
    1257, a1=>f8fc, a5=>f8fd
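
As a quick sanity check on the listing above, here is a minimal sketch that parses those lines and confirms that every undefined byte is above 0x9F and every target falls in the BMP Private Use Area (U+E000 to U+F8FF). The `OUTPUT` string is copied from the listing; `parse_undefined` is a hypothetical helper name, not something from the issue.

```python
# Output lines copied verbatim from the listing above.
OUTPUT = """\
857, d5=>f8bb, e7=>f8bc, f2=>f8bd
864, a6=>f8be, a7=>f8bf, ff=>f8c0
874, db=>f8c1, dc=>f8c2, dd=>f8c3, de=>f8c4, fc=>f8c5, fd=>f8c6, fe=>f8c7, ff=>f8c8
1253, aa=>f8f9, d2=>f8fa, ff=>f8fb
1255, d9=>f88d, da=>f88e, db=>f88f, dc=>f890, dd=>f891, de=>f892, df=>f893, fb=>f894, fc=>f895, ff=>f896
1257, a1=>f8fc, a5=>f8fd
"""


def parse_undefined(text):
    """Return {codepage: {byte value: code point}} (hypothetical helper)."""
    table = {}
    for line in text.splitlines():
        cp, *pairs = line.split(', ')
        table[int(cp)] = {
            int(byte, 16): int(code, 16)
            for byte, code in (pair.split('=>') for pair in pairs)
        }
    return table


table = parse_undefined(OUTPUT)
for cp, mapping in table.items():
    # Every undefined byte lies above 0x9F ...
    assert all(byte > 0x9F for byte in mapping)
    # ... and every target is in the BMP Private Use Area.
    assert all(0xE000 <= code <= 0xF8FF for code in mapping.values())
```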

Do you think Python's 'replace' handler should avoid setting the MB_ERR_INVALID_CHARS flag for PyUnicode_DecodeCodePageStateful? One benefit is that the PUA code can be encoded back to the original byte value:

    >>> codecs.code_page_encode(1257, '\uf8fd')
    (b'\xa5', 1)
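
To illustrate why the round trip matters, here is a platform-independent sketch. It is only a model: it uses Python's built-in `cp1257` charmap codec as a stand-in for the Windows codepage, and hard-codes the two cp1257 pairs from the output above. `PUA_1257`, `decode_with_pua` and `encode_with_pua` are hypothetical names for illustration, not part of `codecs` or of the proposed change.

```python
# Byte -> PUA code point pairs for cp1257, taken from the output above.
PUA_1257 = {0xA1: 0xF8FC, 0xA5: 0xF8FD}
REVERSE_1257 = {chr(code): bytes([byte]) for byte, code in PUA_1257.items()}


def decode_with_pua(data):
    """Model of decoding without MB_ERR_INVALID_CHARS (hypothetical helper)."""
    chars = []
    for byte in data:
        if byte in PUA_1257:
            chars.append(chr(PUA_1257[byte]))
        else:
            # Assumption: the built-in codec agrees with Windows elsewhere.
            chars.append(bytes([byte]).decode('cp1257'))
    return ''.join(chars)


def encode_with_pua(text):
    """Inverse of decode_with_pua (hypothetical helper)."""
    return b''.join(
        REVERSE_1257[ch] if ch in REVERSE_1257 else ch.encode('cp1257')
        for ch in text
    )


data = b'\xa1abc\xa5'
# PUA mapping keeps the two undefined bytes distinct, so it round-trips.
round_trip = encode_with_pua(decode_with_pua(data))
# The usual 'replace' behavior collapses both bytes to U+FFFD, which
# cannot be mapped back to the original byte values.
lossy = data.decode('cp1257', 'replace')
```

In this model `round_trip` equals `data`, while `lossy` contains two indistinguishable replacement characters.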

> cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.

Windows maps these byte values to PUA codes if the MB_ERR_INVALID_CHARS flag isn't used:

    >>> decode(932, b'\xa0\xfd\xfe\xff', False)
    '\uf8f0\uf8f1\uf8f2\uf8f3'
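
The `decode()` helper used in the snippets above is defined earlier in the thread and is not reproduced in this message. A minimal sketch of what such a helper might look like, assuming it wraps the Win32 `MultiByteToWideChar` function through `ctypes` and that its third argument toggles `MB_ERR_INVALID_CHARS`; the parameter names here are guesses, and the real helper may differ.

```python
import ctypes
import sys

MB_ERR_INVALID_CHARS = 0x00000008  # dwFlags value for MultiByteToWideChar


def decode(codepage, data, strict=True):
    """Decode bytes with a Windows codepage (assumed shape of the helper)."""
    if not data:
        return ''
    if sys.platform != 'win32':
        raise NotImplementedError('MultiByteToWideChar requires Windows')
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    mb2wc = kernel32.MultiByteToWideChar
    mb2wc.argtypes = (ctypes.c_uint, ctypes.c_ulong, ctypes.c_char_p,
                      ctypes.c_int, ctypes.c_wchar_p, ctypes.c_int)
    mb2wc.restype = ctypes.c_int
    flags = MB_ERR_INVALID_CHARS if strict else 0
    # A first call with a NULL output buffer returns the required length.
    size = mb2wc(codepage, flags, data, len(data), None, 0)
    if size == 0:
        # Raises OSError with .winerror set, e.g. ERROR_NO_UNICODE_TRANSLATION.
        raise ctypes.WinError(ctypes.get_last_error())
    buf = ctypes.create_unicode_buffer(size)
    size = mb2wc(codepage, flags, data, len(data), buf, size)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf[:size]
```

With `strict=True` an undefined byte raises `OSError`; with `strict=False` Windows applies whatever fallback mapping it has, which is how the PUA values above were observed.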

History
Date	User	Action	Args
2016-11-17 15:54:34	eryksun	set	recipients: + eryksun, paul.moore, vstinner, larry, tim.golden, benjamin.peterson, ned.deily, ezio.melotti, zach.ware, serhiy.storchaka, steve.dower, Artoria2e5
2016-11-17 15:54:34	eryksun	set	messageid: <1479398074.31.0.0391134593896.issue28712@psf.upfronthosting.co.za>
2016-11-17 15:54:34	eryksun	link	issue28712 messages
2016-11-17 15:54:34	eryksun	create