classification
Title: codecs.lookup() ignores non-ASCII characters, whereas encodings.normalize_encoding() copies them
Type: Stage: patch review
Components: Library (Lib) Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: lemburg, serhiy.storchaka, shihai1991, vstinner
Priority: normal Keywords: patch

Created on 2020-01-14 21:54 by vstinner, last changed 2020-03-19 10:40 by shihai1991.

Pull Requests
URL Status Linked Edit
PR 18845 closed shihai1991, 2020-03-08 09:50
PR 18987 open shihai1991, 2020-03-14 15:03
PR 19069 open shihai1991, 2020-03-19 10:40
Messages (6)
msg360004 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 21:54
bpo-37751 changed codecs.lookup() in a subtle way: non-ASCII characters are now ignored, whereas they were copied unmodified previously.

I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy.

Moreover, it seems like there is no test on how the encoding names are normalized in codecs.register(). I recall that using codecs.register() in an unit test causes troubles since there is no API to unregister a search function. Maybe we should just add a private function for test in _testcapi.

Serhiy Storchaka wrote an example on my PR:
https://github.com/python/cpython/pull/17997/files

> There are other differences. For example, normalize_encoding("КОИ-8") returns "кои_8", but codecs.lookup normalizes it to "8".

> The comment in the sources is also not correct.
msg363645 - (view) Author: hai shi (shihai1991) * Date: 2020-03-08 09:55
> I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy.

How about calling `encodings.normalize_encoding() in codecs.normalizestring()` to keep same behavior?(I create PR18845)

> Maybe we should just add a private function for test in _testcapi

I can try to add some test cases in next weekend ;)
msg364180 - (view) Author: hai shi (shihai1991) * Date: 2020-03-14 15:02
> How about calling `encodings.normalize_encoding() in codecs.normalizestring()` to keep same behavior?(I create PR18845)

I have try this idea, but it make the testcase of test_io.py failed because some object will call `codecs.Lookup()` in `__del__()`.-->extension module will be cleaned before calling `__del__().`

> I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy.

I try to add a `_Py_normalize_unicode_encoding()` in unicodeobject.c to support non-ASCII encoding names' normalization(PR18987), but this PR caused many testcases failed.

For example:

In master:
python3.9 -c "print('a\xac\u1234\u20ac\u8000\U0010ffff'.encode('iso-8859-15', 'namereplace'))"
result:
b'a\xac\\N{ETHIOPIC SYLLABLE SEE}\xa4\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff'

after PR18987:
./python -c "print('a\xac\u1234\u20ac\u8000\U0010ffff'.encode('iso-8859-15', 'namereplace'))"
result:
b'a\xac\\N{ETHIOPIC SYLLABLE SEE}\\N{EURO SIGN}\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff'
msg364449 - (view) Author: hai shi (shihai1991) * Date: 2020-03-17 16:57
> Maybe we should just add a private function for test in _testcapi.

Oh, there have a problem with this idea:
struct _is is defined in internal/pycore_pystate.h.
_testcapi must test the public Python C API, not CPython internal C API
msg364452 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-17 17:03
> _testcapi must test the public Python C API, not CPython internal C API

Use _testinternalcapi in this case.
msg364456 - (view) Author: hai shi (shihai1991) * Date: 2020-03-17 17:16
> Use _testinternalcapi in this case.
oh, forgive me. I don't atttention this extension module before :(
History
Date User Action Args
2020-03-19 10:40:43shihai1991setpull_requests: + pull_request18424
2020-03-17 17:16:32shihai1991setmessages: + msg364456
2020-03-17 17:03:30vstinnersetmessages: + msg364452
2020-03-17 16:57:57shihai1991setmessages: + msg364449
2020-03-14 15:03:52shihai1991setpull_requests: + pull_request18344
2020-03-14 15:02:55shihai1991setmessages: + msg364180
2020-03-08 09:55:01shihai1991setmessages: + msg363645
2020-03-08 09:50:41shihai1991setkeywords: + patch
nosy: + shihai1991

pull_requests: + pull_request18202
stage: patch review
2020-01-14 21:54:42vstinnercreate