Message 402104 - Python tracker


Author	eryksun
Recipients	eryksun, ezio.melotti, lemburg, paul.moore, python-dev, rafaelblsilva, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Date	2021-09-17.22:34:22
Message-id	<1631918062.41.0.627876124193.issue45120@roundup.psfhosted.org>

Content

Rafael, I was discussing code_page_decode() and code_page_encode() both as an alternative for compatibility with other programs and also to explore how MultiByteToWideChar() and WideCharToMultiByte() work -- particularly to explain best-fit mappings, which do not roundtrip. MultiByteToWideChar() does not exhibit "best fit" behavior. I don't even know what that would mean in the context of decoding. 

With the exception of one change to code page 1255, the definitions that you're looking to add are just for the C1 controls and private use area codes, which are not meaningful. Windows uses these arbitrary definitions to be able to roundtrip between the system ANSI and Unicode APIs.

Note that Python's "mbcs" (i.e. "ansi") and "oem" encodings use the code-page codec. For example:

    >>> _winapi.GetACP()
    1252

    >>> '\x81\x8d\x8f\x90\x9d'.encode('ansi')
    b'\x81\x8d\x8f\x90\x9d'
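
For contrast, here is a small cross-platform sketch of how Python's own table-based "cp1252" codec treats those five bytes. It assumes the standard-library charmap codec leaves them undefined, which is the gap the proposed definitions would fill. "latin-1" is used here only as a stand-in for the identity mapping of bytes to C1 controls; it is not the Windows code-page codec itself. The helper name `decodes_with_cp1252` is made up for illustration.

```python
# Bytes that code page 1252 leaves unassigned; Windows maps them to C1 controls.
UNASSIGNED = b'\x81\x8d\x8f\x90\x9d'


def decodes_with_cp1252(byte_value: int) -> bool:
    """Return True if Python's table-based cp1252 codec defines this byte."""
    try:
        bytes([byte_value]).decode('cp1252')
    except UnicodeDecodeError:
        return False
    return True


# Collect the bytes that the table-based codec refuses to decode.
undefined = [hex(b) for b in UNASSIGNED if not decodes_with_cp1252(b)]
print(undefined)

# latin-1 is the identity mapping, so it stands in for the Windows behavior
# of decoding these bytes to the C1 controls with the same ordinals.
text = UNASSIGNED.decode('latin-1')
assert text == '\x81\x8d\x8f\x90\x9d'
assert text.encode('latin-1') == UNASSIGNED
```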

Best-fit encode "α" in code page 1252 [1]:

    >>> 'α'.encode('ansi', 'replace')
    b'a'

In your PR, the change to code page 1255 to add b"\xca" <-> "\u05ba" is the only change that I think is really worthwhile because the unicode.org data has it wrong. You can get the proper character name for the comment using the unicodedata module:

    >>> print(unicodedata.name('\u05ba'))
    HEBREW POINT HOLAM HASER FOR VAV
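
A minimal sketch of generating such a comment for any byte/character pair. The helper `mapping_comment` and its output layout are hypothetical and are not the format used by the codec tables in the PR.

```python
import unicodedata


def mapping_comment(byte_value: int, char: str) -> str:
    """Format one byte <-> code point pair with its Unicode character name."""
    # Without a default, unicodedata.name() raises ValueError for unnamed
    # characters such as C1 controls, so supply a placeholder instead.
    name = unicodedata.name(char, '<unnamed>')
    return f"0x{byte_value:02x} <-> U+{ord(char):04X}  # {name}"


print(mapping_comment(0xCA, '\u05ba'))
```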

I'm +0 in favor of leaving the mappings undefined where Windows completes legacy single-byte code pages by using C1 control codes and private use area codes. It would have been fine if Python's code-page encodings had always been based on the "WindowsBestFit" tables, but only on the decoding MBTABLE, since its mappings are reasonable.

Ideally, I don't want anything to use the best-fit mappings in WCTABLE. I would rather that the 'replace' handler for code_page_encode() used the replacement character (U+FFFD) or system default character. But the world is not ideal; the system ANSI API uses the WCTABLE best-fit encoding. Back in the day with Python 2.7, it was easy to demonstrate how insidious this is. For example, in 2.7.18:

    >>> os.listdir(u'.')
    [u'\u03b1']

    >>> os.listdir('.')
    ['a']
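
The collision can be simulated without Windows or Python 2. This is only a sketch: `BEST_FIT` is a hypothetical one-entry excerpt of a WCTABLE-style mapping (only the "α" to "a" pair comes from the example above), and `best_fit_encode` is an illustrative helper, not a real API.

```python
# Hypothetical one-entry excerpt of a best-fit (WCTABLE-style) mapping.
# Only the alpha -> "a" pair is taken from the message above.
BEST_FIT = {'α': 'a'}


def best_fit_encode(text: str) -> bytes:
    """Encode to cp1252, substituting best-fit characters first (simulation)."""
    substituted = ''.join(BEST_FIT.get(ch, ch) for ch in text)
    return substituted.encode('cp1252', 'replace')


names = ['α', 'a']  # two distinct Unicode file names
encoded = [best_fit_encode(n) for n in names]
print(encoded)

# Both names collapse to the same bytes, so the bytes no longer identify
# which file was meant, and decoding cannot recover the original.
assert encoded[0] == encoded[1]
assert encoded[0].decode('cp1252') != names[0]
```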

---

[1] https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
