Message 402045 - Python tracker


Author	eryksun
Recipients	eryksun, ezio.melotti, lemburg, paul.moore, python-dev, rafaelblsilva, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Date	2021-09-17.12:25:24
Message-id	<1631881525.12.0.206585633642.issue45120@roundup.psfhosted.org>

Content

> From Eryk's description it sounds like we should always add 
> WC_NO_BEST_FIT_CHARS as an option to MultiByteToWideChar() 
> in order to make sure it doesn't use best fit variants 
> unless explicitly requested.

The concept of a "best fit" encoding is unrelated to decoding with MultiByteToWideChar(). By default WideCharToMultiByte() best-fit encodes some otherwise unmapped ordinals to characters in the code page that have similar glyphs. This doesn't round trip (e.g. "α" -> b"a" -> "a"). The WC_NO_BEST_FIT_CHARS flag prevents this behavior. code_page_encode() uses WC_NO_BEST_FIT_CHARS for legacy encodings, unless the "replace" error handler is used.
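
As a rough, portable way to see what "doesn't round trip" means, here is a minimal sketch of a round-trip check. The helper name round_trips() is made up for illustration. It uses Python's own table-based codecs, not the Windows API; those codecs never best-fit, so in this respect they behave like an encoder that always passes WC_NO_BEST_FIT_CHARS.

```python
def round_trips(text, encoding, errors="strict"):
    """Return True if text survives encode -> decode unchanged.

    Hypothetical helper for illustration only. It relies on Python's
    bundled codecs, which raise instead of best-fitting.
    """
    try:
        data = text.encode(encoding, errors)
    except UnicodeEncodeError:
        # No mapping and no best-fit substitution: not encodable at all.
        return False
    return data.decode(encoding) == text
```

With this helper, round_trips("α", "cp1253") is true because the Greek code page defines the character, while round_trips("α", "cp1252") is false. A best-fit encoder would instead succeed silently and produce a byte that decodes to a different character.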

Windows maps every value in single-byte ANSI code pages to a Unicode ordinal, which round trips between MultiByteToWideChar() and WideCharToMultiByte(). Unless otherwise defined, a value in the range 0x80-0x9F is mapped to the corresponding ordinal in the C1 controls block. Otherwise, values that have no legacy definition are mapped to the private use area (U+E000 - U+F8FF).
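
The C1 rule above can be approximated on any platform with a custom error handler for Python's bundled codecs. This is only a sketch of the described behavior, not how Windows implements it: the handler name "c1fallback" and the function decode_like_windows() are invented for this example, and bytes that Windows would map to the private use area are deliberately left as errors.

```python
import codecs


def _c1_fallback(exc):
    # Hypothetical error handler: an undefined byte in 0x80-0x9F becomes
    # the C1 control with the same ordinal; anything else stays an error.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if bad and all(0x80 <= b <= 0x9F for b in bad):
            return "".join(chr(b) for b in bad), exc.end
    raise exc


# "c1fallback" is an arbitrary name chosen for this sketch.
codecs.register_error("c1fallback", _c1_fallback)


def decode_like_windows(data, encoding):
    """Decode with a bundled codec, filling undefined 0x80-0x9F bytes with C1."""
    return data.decode(encoding, "c1fallback")
```

For example, Python's "cp1252" table leaves 0x81 undefined, so decode_like_windows(b'\x81', 'cp1252') returns '\x81', while defined bytes such as 0x80 still decode normally.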

There is no option to make MultiByteToWideChar() fail for byte values that map to a C1 control code. But mappings to the private use area are strictly invalid, and MultiByteToWideChar() will fail in these cases if the flag MB_ERR_INVALID_CHARS is used. code_page_decode() always uses this flag, but to reliably fail one needs to pass final=True, since the codec doesn't know it's a single-byte encoding. For example:

    >>> codecs.code_page_decode(1253, b'\xaa', 'strict')
    ('', 0)

    >>> codecs.code_page_decode(1253, b'\xaa', 'strict', True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'cp1253' codec can't decode bytes in position 0--1: 
    No mapping for the Unicode character exists in the target code page.
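
If the input is known to be complete, a small wrapper can always pass final=True so the failure is reliable. This is a minimal sketch; the function name decode_code_page_strict() is made up, and since codecs.code_page_decode() only exists on Windows the wrapper raises OSError elsewhere.

```python
import codecs


def decode_code_page_strict(code_page, data):
    """Decode complete input with a Windows code page, failing on invalid bytes.

    Hypothetical wrapper for illustration. It passes final=True so the
    codec does not hold back trailing bytes as a possible partial sequence.
    """
    decode = getattr(codecs, "code_page_decode", None)
    if decode is None:
        raise OSError("codecs.code_page_decode() is only available on Windows")
    text, _consumed = decode(code_page, data, "strict", True)
    return text
```

On Windows, calling decode_code_page_strict(1253, b'\xaa') should raise UnicodeDecodeError in the same way as the second call in the session above.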

Here are the mappings to the private use area in the single-byte "ANSI" code pages:

    1255 Hebrew
    0xD9    U+F88D
    0xDA    U+F88E
    0xDB    U+F88F
    0xDC    U+F890
    0xDD    U+F891
    0xDE    U+F892
    0xDF    U+F893
    0xFB    U+F894
    0xFC    U+F895
    0xFF    U+F896

Note that 0xCA is defined as the Hebrew character U+05BA [1]. The definition is missing in the unicode.org data and Python's "cp1255" encoding.

    874 Thai
    0xDB    U+F8C1
    0xDC    U+F8C2
    0xDD    U+F8C3
    0xDE    U+F8C4
    0xFC    U+F8C5
    0xFD    U+F8C6
    0xFE    U+F8C7
    0xFF    U+F8C8

    1253 Greek
    0xAA    U+F8F9
    0xD2    U+F8FA
    0xFF    U+F8FB

    1257 Baltic
    0xA1    U+F8FC
    0xA5    U+F8FD
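
For scripting purposes, the tables above can be written down as a dictionary. The sketch below only transcribes the listed values; the names PUA_MAPPINGS and find_pua_bytes() are invented for this example, and nothing here calls the Windows API.

```python
# Byte value -> private use code point, transcribed from the tables above.
PUA_MAPPINGS = {
    1255: {0xD9: 0xF88D, 0xDA: 0xF88E, 0xDB: 0xF88F, 0xDC: 0xF890,
           0xDD: 0xF891, 0xDE: 0xF892, 0xDF: 0xF893, 0xFB: 0xF894,
           0xFC: 0xF895, 0xFF: 0xF896},
    874: {0xDB: 0xF8C1, 0xDC: 0xF8C2, 0xDD: 0xF8C3, 0xDE: 0xF8C4,
          0xFC: 0xF8C5, 0xFD: 0xF8C6, 0xFE: 0xF8C7, 0xFF: 0xF8C8},
    1253: {0xAA: 0xF8F9, 0xD2: 0xF8FA, 0xFF: 0xF8FB},
    1257: {0xA1: 0xF8FC, 0xA5: 0xF8FD},
}


def find_pua_bytes(code_page, data):
    """Return (offset, byte, code point) for each byte listed in the tables.

    Hypothetical helper: code pages without an entry yield an empty list.
    """
    table = PUA_MAPPINGS.get(code_page, {})
    return [(i, b, table[b]) for i, b in enumerate(data) if b in table]
```

A check like find_pua_bytes(1253, data) can flag input that a strict decode would reject before the data ever reaches the codec.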

There's no way to get these private use area results from code_page_decode(), but code_page_encode() allows them. For example:

    >>> codecs.code_page_encode(1253, '\uf8f9')[0]
    b'\xaa'

---

[1] https://en.wikipedia.org/wiki/Windows-1255
