classification
Title: Codec name normalization breaks custom codecs
Type: behavior Stage:
Components: Unicode Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: bodograumann, ezio.melotti, vstinner
Priority: normal Keywords:

Created on 2021-07-23 10:18 by bodograumann, last changed 2021-07-23 10:18 by bodograumann.

Messages (1)
msg398042 - (view) Author: Bodo Graumann (bodograumann) Date: 2021-07-23 10:18
This is a follow up on https://bugs.python.org/issue37751 concerning normalization of codec names.

First of all, the changes made therein are not documented correctly.
In the implementation
| Normalization works as follows: all non-alphanumeric
| characters except the dot used for Python package names are
| collapsed and replaced with a single underscore, e.g. '  -;#'
| becomes '_'. Leading and trailing underscores are removed.”
Cf. [encodings/__init__.py](https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Lib/encodings/__init__.py#L47-L50)

The documentation however only states that:
| Search functions are expected to take one argument, being the encoding name in all lower case letters with hyphens and spaces converted to underscores
Cf. https://docs.python.org/3/library/codecs.html#codecs.register

Secondly, this change breaks lots of iconv codecs with the python-iconv binding. E.g. `ASCII//TRANSLIT` is now normalized to `ascii_translit`, which iconv does not understand. Codec names which use hyphens also break and iinm not all of them have aliases in iconv without hyphens.
Cf. [python-iconv #4](https://github.com/bodograumann/python-iconv/issues/4)

How about first looking up the given name and only then, if the given name could not be found, looking for the codec by its normalized name?
History
Date User Action Args
2021-07-23 10:18:59bodograumanncreate