classification
Title: In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores.
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: akdor1154, bodograumann, ezio.melotti, gregory.p.smith, lemburg, mark, methane, miss-islington, qigangxu, shihai1991, vstinner
Priority: normal Keywords: patch

Created on 2019-08-03 11:34 by qigangxu, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15092 merged qigangxu, 2019-08-03 12:45
PR 17997 closed vstinner, 2020-01-14 12:40
PR 23096 merged shihai1991, 2020-11-02 05:01
PR 25643 merged methane, 2021-04-27 01:38
PR 25659 merged miss-islington, 2021-04-27 14:13
PR 25677 merged miss-islington, 2021-04-28 00:37
Messages (31)
msg348953 - (view) Author: Jordon.X (qigangxu) * Date: 2019-08-03 11:34
In codecs.c, when _PyCodec_Lookup() calls normalizestring(), both spaces and hyphens should be converted to underscores. Currently it converts spaces to hyphens instead.

See: https://github.com/python/peps/blob/master/pep-0100.txt, section "Codecs (Coder/Decoders) Lookup".
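
To make the difference concrete, here is a minimal Python sketch of the two behaviors for plain ASCII names. The function names `pep100_normalize` and `old_c_normalize` are hypothetical; the real code is the C function normalizestring(), and this only illustrates the rule described above.

```python
def pep100_normalize(name):
    """PEP 100 wording: lower case, hyphens and spaces become underscores."""
    return name.lower().replace("-", "_").replace(" ", "_")


def old_c_normalize(name):
    """Behavior reported here: lower case, spaces become hyphens."""
    return name.lower().replace(" ", "-")


# Same input, two different names handed to a search function.
print(pep100_normalize("ISO 8859-1"))  # iso_8859_1
print(old_c_normalize("ISO 8859-1"))   # iso-8859-1
```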
msg348954 - (view) Author: Jordon.X (qigangxu) * Date: 2019-08-03 11:55
and I will try to fix it.
msg348956 - (view) Author: Hai Shi (shihai1991) * (Python triager) Date: 2019-08-03 12:57
Hm, there is a mismatch between the description (https://github.com/python/cpython/blob/master/Python/codecs.c#L53) and the code (https://github.com/python/cpython/blob/master/Python/codecs.c#L74).
msg348959 - (view) Author: Jordon.X (qigangxu) * Date: 2019-08-03 13:13
The design and code in the following four places need to be consistent:

No.1 https://github.com/python/peps/blob/master/pep-0100.txt#L292
No.2 https://github.com/python/cpython/blob/master/Python/codecs.c#L113
No.3 https://github.com/python/cpython/blob/master/Python/codecs.c#L53
No.4 https://github.com/python/cpython/blob/master/Python/codecs.c#L74
msg349448 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2019-08-12 08:37
Jordon is right. Conversion has to be to underscores, not hyphens. I guess this bug was introduced when the normalization function was converted to C.
msg350086 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-21 13:26
New changeset 20f59fe1f7748ae899aceee4cb560e5e1f528a1f by Victor Stinner (Jordon Xu) in branch 'master':
bpo-37751: Fix codecs.lookup() normalization (GH-15092)
https://github.com/python/cpython/commit/20f59fe1f7748ae899aceee4cb560e5e1f528a1f
msg350087 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-21 13:27
Thanks for the fix Jordon Xu.

IMHO this change is not strictly a bugfix, but more like an enhancement. I close the issue.

If you consider that a backport to Python 3.7 and 3.8 is needed, please say so.
msg350155 - (view) Author: Jordon.X (qigangxu) * Date: 2019-08-22 04:42
Thanks vstinner. I also don't think it's necessary to backport to the old versions. Closing this issue is fine.
msg359970 - (view) Author: Miro Hrončok (hroncok) * Date: 2020-01-14 12:34
The change is backwards incompatible and a backport would break things. See for example how it breaks latexcodec:

https://bugzilla.redhat.com/show_bug.cgi?id=1789613#c2
msg359971 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 12:41
> The change is backwards incompatible and a backport would break things. See for example how it breaks latexcodec:

I reopen the issue. I proposed PR 17997 to *document* the incompatible change in What's New in Python 3.9. IMO it's a deliberate change and it's correct.

I rely on Marc-Andre Lemburg who implemented codecs and encodings modules. He wrote: "Jordon is right. Conversion has to be to underscores, not hyphens.".
msg359972 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 12:42
It seems quite easy to update latexcodec project to support Python 3.9. I proposed a solution there:
https://bugzilla.redhat.com/show_bug.cgi?id=1789613#c6
msg359973 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2020-01-14 13:07
Just to clarify: the change in the C implementation was the breaking change. The patch just restores the previous behavior: https://github.com/python/cpython/blob/master/Lib/encodings/__init__.py#L43

Please note that external codec packages should not rely on the semantics of the Python stdlib encodings package's search function. They should really register their own search function: https://docs.python.org/3.9/library/codecs.html#codecs.register

It's good practice to always only use ASCII lower case chars and the underscore for codec names.
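
As an illustration of the advice above, a minimal sketch of a package registering its own search function could look like this. The codec name `demo_utf8` and the function `demo_search` are made up for the example; the codec simply reuses the built-in UTF-8 encoder and decoder.

```python
import codecs

# Hypothetical codec name: ASCII lower case letters, digits and an underscore only.
CODEC_NAME = "demo_utf8"


def demo_search(name):
    """Return a CodecInfo for CODEC_NAME, or None for any other name."""
    if name != CODEC_NAME:
        return None
    utf8 = codecs.lookup("utf-8")
    return codecs.CodecInfo(
        name=CODEC_NAME,
        encode=utf8.encode,
        decode=utf8.decode,
    )


codecs.register(demo_search)

encoded = "héllo".encode("demo_utf8")
decoded = encoded.decode("demo_utf8")
print(encoded, decoded)
```

Because the name contains no hyphens, spaces or other punctuation, it should reach the search function unchanged whichever normalization the interpreter applies.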
msg359974 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 13:11
> Please note that external codec packages should not rely on the semantics of the Python stdlib encodings package's search function.

latexcodec does register a search function.

> It's good practice to always only use ASCII lower case chars and the underscore for codec names.

latexcodec uses encoding names like "latex+ascii" and their search function used "+" as a separator.

Don't worry, I just fixed latexcodec, my fix is already merged upstream! I simply changed the search function to split on "_" if the name contains "_".

* https://github.com/mcmtroffaes/latexcodec/commit/a30ae2cf061d7369b1aaa8179ddd1b486974fdad
* https://github.com/mcmtroffaes/latexcodec/pull/76
* https://github.com/mcmtroffaes/latexcodec/issues/75
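
The idea behind that fix can be sketched as follows. This is not the actual latexcodec code, just a hypothetical helper `split_codec_name` showing how a search function can accept both the old "+" separator and the "_" it may receive after normalization.

```python
def split_codec_name(name):
    """Split 'latex+ascii' or 'latex_ascii' into ('latex', 'ascii')."""
    # Prefer "_" when present, since the lookup may have replaced "+" with it.
    separator = "_" if "_" in name else "+"
    prefix, _, inner = name.partition(separator)
    return prefix, inner


print(split_codec_name("latex+ascii"))  # ('latex', 'ascii')
print(split_codec_name("latex_ascii"))  # ('latex', 'ascii')
```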
msg360005 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-14 21:54
I created bpo-39337: codecs.lookup() ignores non-ASCII characters, whereas encodings.normalize_encoding() copies them.
msg391648 - (view) Author: akdor1154 (akdor1154) Date: 2021-04-23 01:37
If I understand the target of this issue, this is a breaking change in Python 3.9.

E.g. see https://github.com/SAP/PyHDB/issues/149

Ideally if we did not intend to break libraries then can this be fixed?
Or if it is acceptable to have such a breaking change, can it be documented as such? (maybe this is https://github.com/python/cpython/pull/23096 ? though I would not interpret that as a breaking change at first glance)
msg391652 - (view) Author: Hai Shi (shihai1991) * (Python triager) Date: 2021-04-23 04:55
> Ideally if we did not intend to break libraries then can this be fixed?
> Or if it is acceptable to have such a breaking change.

Hi akdor1154, thanks for the notice. It was in fact a bugfix :) https://bugs.python.org/issue37751#msg349448

> maybe this is https://github.com/python/cpython/pull/23096 ? though I would not interpret that as a breaking change at first glance

@victor ping :)
msg391653 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-04-23 05:47
I think it is too late. Python 3.9 has been released already. Reverting the change would also be a breaking change.

PEP 100 says:
"Search functions are expected to take one argument, the encoding name in all lower case letters and with hyphens and spaces converted to underscores"
https://www.python.org/dev/peps/pep-0100/#codecs-coder-decoders-lookup

But codecs.register() says:
"Search functions are expected to take one argument, being the encoding name in all lower case letters".

I don't know the historical reason why the two documents are inconsistent.
https://docs.python.org/3/library/codecs.html#codecs.register
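
One way to check which form a search function actually receives on a given interpreter is a small probe. This is only a sketch: `probe_search` and the encoding name are made up, and the recorded name depends on the Python version (underscores with the change discussed in this issue, hyphens with the behavior it replaced).

```python
import codecs

seen = []


def probe_search(name):
    """Record the name handed over by codecs.lookup() and decline it."""
    seen.append(name)
    return None


codecs.register(probe_search)

try:
    codecs.lookup("Demo-Probe Name")
except LookupError:
    # Expected: no registered search function knows this encoding.
    pass

print(seen[-1])
```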
msg391654 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-04-23 05:53
codecs.register() was added in this commit.
https://github.com/python/cpython/commit/e2d67f98d1aade1059b2ff3278672b2ffbaf180e

And its docstring has been added in this commit.
https://github.com/python/cpython/commit/0ae2981dec3de96a1f7d63b0535992cf1462ac92

Neither commit describes why the normalization differs from PEP 100.
msg391664 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-04-23 07:45
On 23.04.2021 03:37, akdor1154 wrote:
> 
> akdor1154 <akdor1154@gmail.com> added the comment:
> 
> If I understand the target of this issue, this is a breaking change in Python 3.9.
> 
> E.g. see https://github.com/SAP/PyHDB/issues/149
> 
> Ideally if we did not intend to break libraries then can this be fixed?
> Or if it is acceptable to have such a breaking change, can it be documented as such? (maybe this is https://github.com/python/cpython/pull/23096 ? though I would not interpret that as a breaking change at first glance)

This patch only restored the behavior we had before (and for many many
years). It's not breaking, it's in fact resolving a break which was
caused by an earlier change:

https://bugs.python.org/issue37751#msg349448

Please note that search functions determine how to map codec names
to codec implementations. The codec search function in the encodings
package uses one way to do this (and depends on how the package
is structured).

The approach taken by the encodings search function is listed here:
https://github.com/python/cpython/blob/master/Lib/encodings/__init__.py#L43

Other search functions can work in different ways.

Now, unfortunately, parts of this kind of normalization have also made
their way into the codecs module itself and into the Unicode
implementation, and perhaps not always in a way which allows search
functions to use a different approach or which is consistent.

As I mentioned before, the safest way to go about this is to use
alnum only names for codecs, with the addition of underscores to
separate important parts.

The Python implementation should make sure that such names continue
to work when passed through any codec name normalization.
msg391666 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-04-23 08:00
On 23.04.2021 07:47, Inada Naoki wrote:
> 
> Inada Naoki <songofacandy@gmail.com> added the comment:
> 
> I think it is too late. Python 3.9 has been released already. Reverting the change would also be a breaking change.
> 
> PEP 100 says:
> "Search functions are expected to take one argument, the encoding name in all lower case letters and with hyphens and spaces converted to underscores"
> https://www.python.org/dev/peps/pep-0100/#codecs-coder-decoders-lookup
> 
> But codecs.register() says:
> "Search functions are expected to take one argument, being the encoding name in all lower case letters".
> 
> I don't know the historical reason why the two documents are inconsistent.
> https://docs.python.org/3/library/codecs.html#codecs.register

I guess just an oversight on my part.

PEP 100 is certainly what I meant and implemented. I should have also
made it clear in PEP 100 that I meant lower case ASCII letters.
msg391671 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-23 09:10
New changeset 32980fb669a6857276da18895fcc0cb6f6fbb544 by Hai Shi in branch 'master':
bpo-37751: Document codecs.lookup() change in What's New in Python 3.9 (GH-23096)
https://github.com/python/cpython/commit/32980fb669a6857276da18895fcc0cb6f6fbb544
msg391785 - (view) Author: Hai Shi (shihai1991) * (Python triager) Date: 2021-04-24 16:55
Thanks Marc-Andre for your clarification of PEP 100.
Thanks Inada and Victor for your review and merge.

Now that PR 23096 is merged, I suggest closing this bpo.
If there are any other questions, we can reopen it.
msg392077 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-27 14:25
New changeset 531c81038e28b6cfa0f9791467bf671c88c6f4c4 by Miss Islington (bot) in branch '3.9':
bpo-37751: Document codecs.lookup() change in What's New in Python 3.9 (GH-23096) (GH-25659)
https://github.com/python/cpython/commit/531c81038e28b6cfa0f9791467bf671c88c6f4c4
msg392159 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-04-28 00:37
New changeset 5c84bb506aaca01f5f750116d8f7a41d41f8124d by Inada Naoki in branch 'master':
bpo-37751: Update `codecs.register()` doc. (GH-25643)
https://github.com/python/cpython/commit/5c84bb506aaca01f5f750116d8f7a41d41f8124d
msg392161 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-04-28 02:26
New changeset cf9d65c5af7905d9e9945a297dbbf15d3bcace15 by Miss Islington (bot) in branch '3.9':
bpo-37751: Update `codecs.register()` doc. (GH-25643)
https://github.com/python/cpython/commit/cf9d65c5af7905d9e9945a297dbbf15d3bcace15
msg392178 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-28 10:07
Thanks Inada-san for documenting the change in codecs.register() doc!
msg395730 - (view) Author: Bodo Graumann (bodograumann) Date: 2021-06-13 06:44
Unfortunately this is not quite finished yet.

First of all, the change is bigger than what is documented: “Changed in version 3.9: Hyphens and spaces are converted to underscore.”

In reality, the behavior is now:
| Normalization works as follows: all non-alphanumeric
| characters except the dot used for Python package names are
| collapsed and replaced with a single underscore, e.g. '  -;#'
| becomes '_'. Leading and trailing underscores are removed.
Cf. [encodings/__init__.py](https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Lib/encodings/__init__.py#L47-L50)

Secondly, this change breaks lots of iconv codecs with the python-iconv binding. E.g. `ASCII//TRANSLIT` is now normalized to `ascii_translit`, which iconv does not understand. Codec names which use hyphens also break, and if I'm not mistaken not all of them have aliases in iconv without hyphens.
Cf. [python-iconv #4](https://github.com/bodograumann/python-iconv/issues/4)
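
For illustration, the collapsing rule can be approximated in pure Python as below. This is a sketch under assumptions: `collapse_normalize` is a hypothetical name, it additionally lower-cases the result, and it is not the stdlib implementation.

```python
def collapse_normalize(name):
    """Collapse runs of non-alphanumerics (except '.') into one underscore."""
    chars = []
    pending_separator = False
    for ch in name:
        if (ch.isascii() and ch.isalnum()) or ch == ".":
            # Emit a single underscore for the run just skipped,
            # but never as the first character of the result.
            if pending_separator and chars:
                chars.append("_")
            chars.append(ch.lower())
            pending_separator = False
        else:
            pending_separator = True
    return "".join(chars)


print(collapse_normalize("ASCII//TRANSLIT"))  # ascii_translit
print(collapse_normalize("ISO-8859-1"))       # iso_8859_1
```

Under this rule the `//` that iconv relies on cannot survive the lookup, so a search function only ever sees the collapsed form.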

The codecs api feels extremely well-fitting for integrating iconv in python and any alternative I can think of seems unsatisfactory.
Please advise.
msg395764 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-13 21:33
> The codecs api feels extremely well-fitting for integrating iconv in python and any alternative I can think of seems unsatisfactory.

This issue is now closed, would you mind opening a new issue?
msg411536 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-25 00:19
https://bugs.python.org/issue46508 filed to track fixing the regression that this caused, namely garbage codec values being accepted and used.
msg411537 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-25 00:21
(note: this might not be the true cause of that issue; though it sounds potentially related - I haven't investigated far enough yet)
msg411538 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-25 00:29
note that Bodo's own followup issue about the breaking change for python-iconv was filed as https://bugs.python.org/issue44723
History
Date User Action Args
2022-04-11 14:59:18 admin set github: 81932
2022-01-25 00:29:03 gregory.p.smith set messages: + msg411538
2022-01-25 00:21:26 gregory.p.smith set messages: + msg411537
2022-01-25 00:19:35 gregory.p.smith set nosy: + gregory.p.smith; messages: + msg411536
2021-06-13 21:33:10 vstinner set messages: + msg395764
2021-06-13 06:44:16 bodograumann set nosy: + bodograumann; messages: + msg395730
2021-04-28 10:07:12 vstinner set messages: + msg392178
2021-04-28 02:26:18 methane set messages: + msg392161
2021-04-28 00:37:19 miss-islington set pull_requests: + pull_request24368
2021-04-28 00:37:09 methane set messages: + msg392159
2021-04-27 14:25:30 vstinner set messages: + msg392077
2021-04-27 14:13:48 miss-islington set nosy: + miss-islington; pull_requests: + pull_request24350
2021-04-27 01:38:19 methane set pull_requests: + pull_request24336
2021-04-24 16:55:03 shihai1991 set status: open -> closed; resolution: fixed; messages: + msg391785; stage: patch review -> resolved
2021-04-23 09:10:51 vstinner set messages: + msg391671
2021-04-23 08:04:25 hroncok set nosy: - hroncok
2021-04-23 08:00:59 lemburg set messages: + msg391666
2021-04-23 07:45:16 lemburg set messages: + msg391664; title: In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores. -> In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores.
2021-04-23 05:53:07 methane set nosy: + mark; messages: + msg391654
2021-04-23 05:47:33 methane set nosy: + methane; messages: + msg391653
2021-04-23 04:55:39 shihai1991 set messages: + msg391652
2021-04-23 01:37:32 akdor1154 set nosy: + akdor1154; messages: + msg391648
2020-11-02 05:01:42 shihai1991 set stage: resolved -> patch review; pull_requests: + pull_request22010
2020-01-14 21:54:58 vstinner set messages: + msg360005
2020-01-14 13:11:13 vstinner set messages: + msg359974
2020-01-14 13:07:34 lemburg set messages: + msg359973
2020-01-14 12:42:40 vstinner set messages: + msg359972
2020-01-14 12:41:44 vstinner set status: closed -> open; resolution: fixed -> (no value); messages: + msg359971
2020-01-14 12:40:02 vstinner set pull_requests: + pull_request17401
2020-01-14 12:34:30 hroncok set nosy: + hroncok; messages: + msg359970
2019-08-22 04:42:11 qigangxu set messages: + msg350155
2019-08-21 13:27:23 vstinner set status: open -> closed; resolution: fixed; messages: + msg350087; stage: patch review -> resolved
2019-08-21 13:26:33 vstinner set messages: + msg350086
2019-08-12 08:37:08 lemburg set nosy: + lemburg; messages: + msg349448
2019-08-03 13:13:33 qigangxu set messages: + msg348959
2019-08-03 12:57:14 shihai1991 set nosy: + shihai1991; messages: + msg348956
2019-08-03 12:45:33 qigangxu set keywords: + patch; stage: patch review; pull_requests: + pull_request14838
2019-08-03 11:55:54 qigangxu set messages: + msg348954
2019-08-03 11:34:13 qigangxu create