classification
Title: In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores.
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, hroncok, lemburg, qigangxu, shihai1991, vstinner
Priority: normal Keywords: patch

Created on 2019-08-03 11:34 by qigangxu, last changed 2020-01-14 21:54 by vstinner.

Pull Requests
URL Status Linked Edit
PR 15092 merged qigangxu, 2019-08-03 12:45
PR 17997 open vstinner, 2020-01-14 12:40
Messages (14)
msg348953 - (view) Author: Jordon.X (qigangxu) * Date: 2019-08-03 11:34
In codecs.c, when _PyCodec_Lookup() calls normalizestring(), both spaces and hyphens should be converted to underscores. It should not convert spaces to hyphens.

See: https://github.com/python/peps/blob/master/pep-0100.txt, Codecs (Coder/Decoders) Lookup
msg348954 - (view) Author: Jordon.X (qigangxu) * Date: 2019-08-03 11:55
and I will try to fix it.
msg348956 - (view) Author: hai shi (shihai1991) * Date: 2019-08-03 12:57
Hm, the description (https://github.com/python/cpython/blob/master/Python/codecs.c#L53) and the code (https://github.com/python/cpython/blob/master/Python/codecs.c#L74) are a bit inconsistent.
msg348959 - (view) Author: Jordon.X (qigangxu) * Date: 2019-08-03 13:13
The design and code in the following four places need to be consistent:

No.1 https://github.com/python/peps/blob/master/pep-0100.txt#L292
No.2 https://github.com/python/cpython/blob/master/Python/codecs.c#L113
No.3 https://github.com/python/cpython/blob/master/Python/codecs.c#L53  
No.4 https://github.com/python/cpython/blob/master/Python/codecs.c#L74
msg349448 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2019-08-12 08:37
Jordon is right. Conversion has to be to underscores, not hyphens. I guess this bug was introduced when the normalization function was converted to C.
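
The rule agreed on above can be sketched in Python. This is only an illustration of the behavior as stated in this report, not the C code in Python/codecs.c; the helper name `normalize_name` is hypothetical. It handles just the two characters named here and leaves other punctuation untouched, whereas the latexcodec messages further down suggest the merged change also affects names containing "+".

```python
def normalize_name(name: str) -> str:
    """Lower-case ASCII letters and map spaces and hyphens to underscores.

    Hypothetical helper: it mirrors the rule stated in this report,
    not the actual normalizestring() implementation.
    """
    out = []
    for ch in name:
        if ch in (" ", "-"):
            # Both separators collapse to an underscore.
            out.append("_")
        elif "A" <= ch <= "Z":
            # Only ASCII upper-case letters are lower-cased.
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)


print(normalize_name("UTF-8"))
print(normalize_name("ISO 8859-1"))
```

With this rule, "UTF-8" and "utf_8" map to the same lookup key, which is what the description in PEP 100 calls for.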
msg350086 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-21 13:26
New changeset 20f59fe1f7748ae899aceee4cb560e5e1f528a1f by Victor Stinner (Jordon Xu) in branch 'master':
bpo-37751: Fix codecs.lookup() normalization (GH-15092)
https://github.com/python/cpython/commit/20f59fe1f7748ae899aceee4cb560e5e1f528a1f
msg350087 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-21 13:27
Thanks for the fix Jordon Xu.

IMHO this change is not strictly a bugfix, but more like an enhancement. I close the issue.

If you consider that a backport to Python 3.7 and 3.8 is needed, please say so.
msg350155 - (view) Author: Jordon.X (qigangxu) * Date: 2019-08-22 04:42
Thanks vstinner. I also don't think it's necessary to backport to older versions. Closing this issue is fine.
msg359970 - (view) Author: Miro Hrončok (hroncok) * Date: 2020-01-14 12:34
The change is backwards incompatible and a backport would break things. See for example how it breaks latexcodec:

https://bugzilla.redhat.com/show_bug.cgi?id=1789613#c2
msg359971 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 12:41
> The change is backwards incompatible and a backport would break things. See for example how it breaks latexcodec:

I reopen the issue. I proposed PR 17997 to *document* the incompatible change in What's New in Python 3.9. IMO it's a deliberate change and it's correct.

I rely on Marc-Andre Lemburg who implemented codecs and encodings modules. He wrote: "Jordon is right. Conversion has to be to underscores, not hyphens.".
msg359972 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 12:42
It seems quite easy to update latexcodec project to support Python 3.9. I proposed a solution there:
https://bugzilla.redhat.com/show_bug.cgi?id=1789613#c6
msg359973 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2020-01-14 13:07
Just to clarify: the change in the C implementation was the breaking change. The patch just restores the previous behavior: https://github.com/python/cpython/blob/master/Lib/encodings/__init__.py#L43

Please note that external codec packages should not rely on the semantics of the Python stdlib encodings package's search function. They should really register their own search function: https://docs.python.org/3.9/library/codecs.html#codecs.register

It's good practice to always only use ASCII lower case chars and the underscore for codec names.
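
As a rough sketch of that advice, an external package could register its own search function along these lines. The codec name `demo_utf8` and the function `demo_search` are made up for illustration, and the codec simply reuses the built-in UTF-8 encoder and decoder.

```python
import codecs


def demo_search(name: str):
    """Search function for a hypothetical 'demo_utf8' codec.

    Return a CodecInfo for names this package owns and None otherwise,
    so that other registered search functions still get a chance.
    """
    if name != "demo_utf8":
        return None
    utf8 = codecs.lookup("utf-8")
    return codecs.CodecInfo(
        encode=utf8.encode,
        decode=utf8.decode,
        name="demo_utf8",
    )


# Registration is process-wide.
codecs.register(demo_search)

print("abc".encode("demo_utf8"))
print(b"abc".decode("demo_utf8"))
```

Because the name contains only lower-case ASCII letters, digits and an underscore, it should reach the search function unchanged under either the old or the new normalization.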
msg359974 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 13:11
> Please note that external codec packages should not rely on the semantics of the Python stdlib encodings package's search function.

latexcodec does register a search function.

> It's good practice to always only use ASCII lower case chars and the underscore for codec names.

latexcodec uses encoding names like "latex+ascii" and their search function used "+" as a separator.

Don't worry, I just fixed latexcodec, my fix is already merged upstream! I simply changed the search function to split on "_" if the name contains "_".

* https://github.com/mcmtroffaes/latexcodec/commit/a30ae2cf061d7369b1aaa8179ddd1b486974fdad
* https://github.com/mcmtroffaes/latexcodec/pull/76
* https://github.com/mcmtroffaes/latexcodec/issues/75
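
The approach described above (split on "_" if the name contains "_") can be sketched as follows. This is not the actual latexcodec code; `split_codec_name` is a hypothetical helper, and it assumes that the individual name parts never contain an underscore themselves.

```python
def split_codec_name(name: str) -> list:
    """Split a combined codec name into its parts.

    Accepts both the original spelling ("latex+ascii") and the form a
    search function may receive after normalization ("latex_ascii").
    """
    # Prefer "_" when present, since the name was probably normalized.
    separator = "_" if "_" in name else "+"
    return name.split(separator)


print(split_codec_name("latex+ascii"))
print(split_codec_name("latex_ascii"))
```

Accepting both separators keeps one search function usable on interpreters with either normalization behavior.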
msg360005 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 21:54
I created bpo-39337: codecs.lookup() ignores non-ASCII characters, whereas encodings.normalize_encoding() copies them.
History
Date User Action Args
2020-01-14 21:54:58 vstinner set messages: + msg360005
2020-01-14 13:11:13 vstinner set messages: + msg359974
2020-01-14 13:07:34 lemburg set messages: + msg359973
2020-01-14 12:42:40 vstinner set messages: + msg359972
2020-01-14 12:41:44 vstinner set status: closed -> open; resolution: fixed ->; messages: + msg359971
2020-01-14 12:40:02 vstinner set pull_requests: + pull_request17401
2020-01-14 12:34:30 hroncok set nosy: + hroncok; messages: + msg359970
2019-08-22 04:42:11 qigangxu set messages: + msg350155
2019-08-21 13:27:23 vstinner set status: open -> closed; resolution: fixed; messages: + msg350087; stage: patch review -> resolved
2019-08-21 13:26:33 vstinner set messages: + msg350086
2019-08-12 08:37:08 lemburg set nosy: + lemburg; messages: + msg349448
2019-08-03 13:13:33 qigangxu set messages: + msg348959
2019-08-03 12:57:14 shihai1991 set nosy: + shihai1991; messages: + msg348956
2019-08-03 12:45:33 qigangxu set keywords: + patch; stage: patch review; pull_requests: + pull_request14838
2019-08-03 11:55:54 qigangxu set messages: + msg348954
2019-08-03 11:34:13 qigangxu create