Message 401993 - Python tracker

Author	eryksun
Recipients	eryksun, ezio.melotti, lemburg, paul.moore, python-dev, rafaelblsilva, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Date	2021-09-16.22:11:25
Message-id	<1631830285.67.0.580506382086.issue45120@roundup.psfhosted.org>

Content
> in CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED", 
> whereas in bestfit1252, they map to \u0081 \u008d \u008f 
> \u0090 \u009d respectively

This is the normal mapping in Windows, not a best-fit encoding. On Windows, you can access the native encoding via codecs.code_page_encode() and codecs.code_page_decode(). For example:

    >>> import codecs
    >>> codecs.code_page_encode(1252, '\x81\x8d\x8f\x90\x9d')[0]
    b'\x81\x8d\x8f\x90\x9d'

    >>> codecs.code_page_decode(1252, b'\x81\x8d\x8f\x90\x9d')[0]
    '\x81\x8d\x8f\x90\x9d'
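
For contrast, here is a minimal sketch of how Python's own table-based "cp1252" codec treats the same five bytes on any platform. The helper name rejected_by_python_cp1252 is made up for this illustration; it is not part of the codecs module.

```python
# The five byte values discussed above, which Python's cp1252 table leaves undefined.
UNDEFINED_BYTES = b"\x81\x8d\x8f\x90\x9d"


def rejected_by_python_cp1252(data):
    """Return the byte values in `data` that Python's cp1252 codec refuses to decode.

    Hypothetical helper for illustration only.
    """
    rejected = []
    for value in data:
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


print([hex(value) for value in rejected_by_python_cp1252(UNDEFINED_BYTES)])
```

With Python's codec all five values are reported as rejected, whereas the native Windows code page round-trips them as shown in the session above.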

WinAPI WideCharToMultiByte() uses a best-fit encoding unless the flag WC_NO_BEST_FIT_CHARS is passed. For example, with code page 1252, Greek "α" is best-fit encoded as Latin b"a". code_page_encode() uses the native best-fit encoding when the "replace" error handler is specified. For example:

    >>> codecs.code_page_encode(1252, 'α', 'replace')[0]
    b'a'
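
As a rough comparison, the sketch below encodes the same character with Python's table-based "cp1252" codec, which has no best-fit step and substitutes "?" under the "replace" handler. The function name encode_alpha is hypothetical, and the native call is guarded because codecs.code_page_encode() only exists on Windows.

```python
import codecs


def encode_alpha():
    """Encode Greek alpha with Python's cp1252 codec and, where available,
    with the native Windows code page. Hypothetical helper for illustration."""
    # Python's charmap codec: no best-fit mapping, so "replace" yields b"?".
    portable = "α".encode("cp1252", "replace")

    # Native Windows encoding; the function is absent on other platforms.
    native = None
    if hasattr(codecs, "code_page_encode"):
        native = codecs.code_page_encode(1252, "α", "replace")[0]

    return portable, native


portable, native = encode_alpha()
print(portable)  # b'?'
print(native)    # None when not running on Windows
```

So the same "replace" handler gives different bytes depending on whether the native code page or Python's own mapping table is used.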

Regarding Python's encodings, if you need a specific mapping to match Windows, I think this should be discussed on a case-by-case basis. I see no benefit to supporting a mapping such as "\x81" <-> b"\x81" in code page 1252. That it's not mapped in Python is possibly a small benefit, since to some extent this helps to catch a mismatched encoding. For example, code page 1251 (Cyrillic) maps the byte b"\x81" to "Ѓ" (i.e. "\u0403").
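
The mismatch-catching point can be sketched with Python's built-in codecs, assuming the data really came from code page 1251. The helper decodes_with is a hypothetical name used only for this illustration.

```python
def decodes_with(data, encoding):
    """Return the decoded text, or None if `data` is not valid in `encoding`.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


sample = b"\x81"  # valid in code page 1251, undefined in Python's cp1252

print(decodes_with(sample, "cp1251"))  # Ѓ
print(decodes_with(sample, "cp1252"))  # None
```

Because Python's cp1252 leaves this byte unmapped, decoding with the wrong code page fails loudly instead of silently producing a C1 control character.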
