This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: codecs.lookup() ignores non-ASCII characters, whereas encodings.normalize_encoding() copies them
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: lemburg, serhiy.storchaka, shihai1991, vstinner
Priority: normal Keywords: patch

Created on 2020-01-14 21:54 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 18845 closed shihai1991, 2020-03-08 09:50
PR 18987 closed shihai1991, 2020-03-14 15:03
PR 19069 merged shihai1991, 2020-03-19 10:40
PR 22219 merged shihai1991, 2020-09-12 18:10
PR 22360 merged shihai1991, 2020-09-22 14:51
PR 17997 vstinner, 2020-10-01 16:23
Messages (11)
msg360004 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 21:54
bpo-37751 changed codecs.lookup() in a subtle way: non-ASCII characters are now ignored, whereas they were copied unmodified previously.

I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy.

Moreover, it seems like there is no test on how the encoding names are normalized in codecs.register(). I recall that using codecs.register() in an unit test causes troubles since there is no API to unregister a search function. Maybe we should just add a private function for test in _testcapi.

Serhiy Storchaka wrote an example on my PR:
https://github.com/python/cpython/pull/17997/files

> There are other differences. For example, normalize_encoding("КОИ-8") returns "кои_8", but codecs.lookup normalizes it to "8".

> The comment in the sources is also not correct.
msg363645 - (view) Author: Hai Shi (shihai1991) * (Python triager) Date: 2020-03-08 09:55
> I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy.

How about calling `encodings.normalize_encoding() in codecs.normalizestring()` to keep same behavior?(I create PR18845)

> Maybe we should just add a private function for test in _testcapi

I can try to add some test cases in next weekend ;)
msg364180 - (view) Author: Hai Shi (shihai1991) * (Python triager) Date: 2020-03-14 15:02
> How about calling `encodings.normalize_encoding() in codecs.normalizestring()` to keep same behavior?(I create PR18845)

I have try this idea, but it make the testcase of test_io.py failed because some object will call `codecs.Lookup()` in `__del__()`.-->extension module will be cleaned before calling `__del__().`

> I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy.

I try to add a `_Py_normalize_unicode_encoding()` in unicodeobject.c to support non-ASCII encoding names' normalization(PR18987), but this PR caused many testcases failed.

For example:

In master:
python3.9 -c "print('a\xac\u1234\u20ac\u8000\U0010ffff'.encode('iso-8859-15', 'namereplace'))"
result:
b'a\xac\\N{ETHIOPIC SYLLABLE SEE}\xa4\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff'

after PR18987:
./python -c "print('a\xac\u1234\u20ac\u8000\U0010ffff'.encode('iso-8859-15', 'namereplace'))"
result:
b'a\xac\\N{ETHIOPIC SYLLABLE SEE}\\N{EURO SIGN}\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff'
msg364449 - (view) Author: Hai Shi (shihai1991) * (Python triager) Date: 2020-03-17 16:57
> Maybe we should just add a private function for test in _testcapi.

Oh, there have a problem with this idea:
struct _is is defined in internal/pycore_pystate.h.
_testcapi must test the public Python C API, not CPython internal C API
msg364452 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-17 17:03
> _testcapi must test the public Python C API, not CPython internal C API

Use _testinternalcapi in this case.
msg364456 - (view) Author: Hai Shi (shihai1991) * (Python triager) Date: 2020-03-17 17:16
> Use _testinternalcapi in this case.
oh, forgive me. I don't atttention this extension module before :(
msg377772 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-01 16:22
See also bpo-37751 and my PR 17997.
msg378281 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-08 19:21
New changeset 3f342376ab0da3b4c8a38a27edfafba8e8cdf52d by Hai Shi in branch 'master':
bpo-39337: Add a test case for normalizing of codec names (GH-19069)
https://github.com/python/cpython/commit/3f342376ab0da3b4c8a38a27edfafba8e8cdf52d
msg378626 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-14 15:43
New changeset c5b049b91ca50c615f9a5425055c2b79a82ac547 by Hai Shi in branch 'master':
bpo-39337: encodings.normalize_encoding() now ignores non-ASCII characters (GH-22219)
https://github.com/python/cpython/commit/c5b049b91ca50c615f9a5425055c2b79a82ac547
msg378628 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-14 15:46
Thanks Hai Shi for the change and for new codecs and encodings tests.
msg378630 - (view) Author: Hai Shi (shihai1991) * (Python triager) Date: 2020-10-14 15:55
Thanks for everyone's continus review :)
History
Date User Action Args
2022-04-11 14:59:25adminsetgithub: 83518
2020-10-14 15:55:50shihai1991setmessages: + msg378630
2020-10-14 15:46:55vstinnersetstatus: open -> closed
versions: + Python 3.10, - Python 3.9
messages: + msg378628

resolution: fixed
stage: patch review -> resolved
2020-10-14 15:43:38vstinnersetmessages: + msg378626
2020-10-08 19:21:16vstinnersetmessages: + msg378281
2020-10-01 16:23:20vstinnersetpull_requests: + pull_request21499
2020-10-01 16:22:44vstinnersetmessages: + msg377772
2020-09-22 14:51:27shihai1991setpull_requests: + pull_request21401
2020-09-12 18:10:22shihai1991setpull_requests: + pull_request21274
2020-03-19 10:40:43shihai1991setpull_requests: + pull_request18424
2020-03-17 17:16:32shihai1991setmessages: + msg364456
2020-03-17 17:03:30vstinnersetmessages: + msg364452
2020-03-17 16:57:57shihai1991setmessages: + msg364449
2020-03-14 15:03:52shihai1991setpull_requests: + pull_request18344
2020-03-14 15:02:55shihai1991setmessages: + msg364180
2020-03-08 09:55:01shihai1991setmessages: + msg363645
2020-03-08 09:50:41shihai1991setkeywords: + patch
nosy: + shihai1991

pull_requests: + pull_request18202
stage: patch review
2020-01-14 21:54:42vstinnercreate