Issue 44723: Codec name normalization breaks custom codecs

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/88886

classification

Title:	Codec name normalization breaks custom codecs
Type:	behavior	Stage:
Components:	Unicode	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	bodograumann, ezio.melotti, gregory.p.smith, methane, vstinner
Priority:	normal	Keywords:

Created on 2021-07-23 10:18 by bodograumann, last changed 2022-04-11 14:59 by admin.

Messages (1)
msg398042 - (view)

Author: Bodo Graumann (bodograumann)

Date: 2021-07-23 10:18

This is a follow-up on https://bugs.python.org/issue37751 concerning normalization of codec names.

First of all, the changes made therein are not documented correctly.
The implementation describes the normalization as follows:
| Normalization works as follows: all non-alphanumeric
| characters except the dot used for Python package names are
| collapsed and replaced with a single underscore, e.g. '  -;#'
| becomes '_'. Leading and trailing underscores are removed.
Cf. [encodings/__init__.py](https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Lib/encodings/__init__.py#L47-L50)
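
To make the quoted rule concrete, here is a minimal sketch of it. The function name `normalize_sketch` is hypothetical and the regular expression is only an approximation of the description above, not the actual CPython code:

```python
import re


def normalize_sketch(name: str) -> str:
    """Approximate the quoted rule (hypothetical helper, not CPython's code)."""
    # Collapse every run of characters that are not ASCII letters,
    # digits, or dots into a single underscore.
    collapsed = re.sub(r"[^A-Za-z0-9.]+", "_", name)
    # Remove leading and trailing underscores.
    return collapsed.strip("_")


for example in ("utf-8", "a  -;#b", "ASCII//TRANSLIT", "encodings.utf_8"):
    print(repr(example), "->", repr(normalize_sketch(example)))
```

Under this sketch, slashes and hyphens are treated like any other punctuation, so the separators that some third-party codec names rely on are lost.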

The documentation, however, only states:
| Search functions are expected to take one argument, being the encoding name in all lower case letters with hyphens and spaces converted to underscores.
Cf. https://docs.python.org/3/library/codecs.html#codecs.register

Secondly, this change breaks many iconv codecs when used through the python-iconv binding. For example, `ASCII//TRANSLIT` is now normalized to `ascii_translit`, which iconv does not understand. Codec names that contain hyphens also break, and, if I'm not mistaken, not all of them have hyphen-free aliases in iconv.
Cf. [python-iconv #4](https://github.com/bodograumann/python-iconv/issues/4)
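
One way to observe what a custom search function actually receives is to register one that only records its argument. This is a small sketch; `recording_search` and `seen` are illustrative names, and the recorded value depends on the Python version (per this report, Python 3.9 passes `ascii_translit`):

```python
import codecs

seen = []  # names handed to our search function by the codec registry


def recording_search(name):
    """Record the name passed by the registry and report 'not found'."""
    seen.append(name)
    return None


codecs.register(recording_search)
try:
    # No codec matches, so the registry consults every search function
    # and finally raises LookupError.
    codecs.lookup("ASCII//TRANSLIT")
except LookupError:
    pass
finally:
    # codecs.unregister() only exists on newer Python versions.
    if hasattr(codecs, "unregister"):
        codecs.unregister(recording_search)

print(seen)
```

A binding such as python-iconv only ever sees the recorded name, so it cannot recover the original spelling to pass on to iconv.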

How about first looking up the given name and only then, if the given name could not be found, looking for the codec by its normalized name?
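
The proposed lookup order could be sketched roughly like this. Everything here is hypothetical: `lookup_exact_then_normalized`, the toy search functions, and the simplified normalization are illustrations of the idea, not the codec registry's real API or implementation:

```python
import re
from typing import Callable, Optional, Sequence

SearchFunction = Callable[[str], Optional[object]]


def normalize_sketch(name: str) -> str:
    """Simplified normalization: lower-case, collapse punctuation runs."""
    return re.sub(r"[^a-z0-9.]+", "_", name.lower()).strip("_")


def lookup_exact_then_normalized(
    name: str, search_functions: Sequence[SearchFunction]
) -> object:
    """Try the name as given first, then fall back to its normalized form."""
    candidates = [name]
    normalized = normalize_sketch(name)
    if normalized != name:
        candidates.append(normalized)
    for candidate in candidates:
        for search in search_functions:
            result = search(candidate)
            if result is not None:
                return result
    raise LookupError(f"unknown encoding: {name}")


# Toy search functions standing in for real codec providers.
def iconv_like(name: str) -> Optional[object]:
    return "iconv codec" if name == "ASCII//TRANSLIT" else None


def stdlib_like(name: str) -> Optional[object]:
    return "stdlib codec" if name == "utf_8" else None


print(lookup_exact_then_normalized("ASCII//TRANSLIT", [stdlib_like, iconv_like]))
print(lookup_exact_then_normalized("UTF-8", [stdlib_like, iconv_like]))
```

With this order, a name that only a custom provider understands is matched in its original spelling, while names like `UTF-8` still resolve through the normalized fallback.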

History
Date	User	Action	Args
2022-04-11 14:59:47	admin	set	github: 88886
2022-01-25 02:02:30	methane	set	nosy: + methane
2022-01-25 00:29:59	gregory.p.smith	set	nosy: + gregory.p.smith
2021-07-23 10:18:59	bodograumann	create