This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Stricter codec names
Type: behavior Stage:
Components: Documentation, Library (Lib) Versions: Python 3.0, Python 3.1, Python 2.7, Python 2.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: lemburg Nosy List: belopolsky, ezio.melotti, georg.brandl, lemburg, mrabarnett, pitrou
Priority: normal Keywords:

Created on 2009-05-02 08:00 by ezio.melotti, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
iana.py belopolsky, 2011-02-24 02:10
Messages (15)
msg86933 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-05-02 08:00
I noticed that codec names[1]:
1) can contain random/unnecessary spaces and punctuation;
2) have several aliases that could probably be removed;

A few examples of valid codec names (done with Python 3):
>>> s = 'xxx'
>>> s.encode('utf')
b'xxx'
>>> s.encode('utf-')
b'xxx'
>>> s.encode('}Utf~->8<-~siG{ ;)')
b'\xef\xbb\xbfxxx'

'utf' is an alias for UTF-8, and it doesn't quite make sense to me that
'utf' alone refers to UTF-8.
'utf-' could be a mistyped 'utf-8', 'utf-7' or even 'utf-16'; I'd like
it to raise an error instead.
The third example is probably not something that can be found in the
real world (I hope), but it shows how permissive the parsing of the names is.

Apparently the whitespace is removed, the punctuation is used to
split the name into several parts, and then the check is performed.
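For example, the stdlib's own normalization helper shows how the exotic spelling above collapses down to 'utf_8_sig' (a sketch; the exact normalization steps vary between Python versions):

```python
import codecs
import encodings

# The registry lower-cases the name and maps spaces/hyphens to
# underscores; the stdlib search function then collapses every
# remaining run of non-alphanumeric characters to one underscore.
for name in ('utf-8', ' UTF  8 ', '}Utf~->8<-~siG{ ;)'):
    normalized = encodings.normalize_encoding(name.lower())
    print('%r -> %r -> %r' % (name, normalized, codecs.lookup(name).name))
```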


About the aliases: in the documentation the "official" name for the
UTF-8 codec is 'utf_8' and there are 3 more aliases: U8, UTF, utf8. For
ISO-8859-1, the "official" name is 'latin_1' and there are 7 more
aliases: iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1.
The Zen says "There should be one—and preferably only one—obvious way to
do it.", so I suggest that we:
1) disallow random punctuation and spaces within the name (only allowing
leading and trailing spaces);
2) change the default names to, for example, 'utf-8' and 'iso-8859-1'
instead of 'utf_8' and 'iso8859_1' (the names are case-insensitive);
3) remove the unnecessary aliases, for example 'UTF' and 'U8' for UTF-8,
and 'iso8859-1', '8859', 'latin', and 'L1' for ISO-8859-1.

This last point could break some code and may require a
DeprecationWarning. If there are good reasons to keep these aliases
around, only the other two issues can be addressed.
If the name of the codec has to be a valid variable name (that is,
without '-'), only the documentation could be changed to list 'utf-8',
'iso-8859-1', etc. as the preferred names.

[1]: http://docs.python.org/library/codecs.html#standard-encodings
     http://docs.python.org/3.0/library/codecs.html#standard-encodings
msg86935 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-05-02 09:20
I don't think this is a good idea.  Accepting all common forms for
encoding names means that you can usually give Python an encoding name
from, e.g., an HTML page, or any other file or system that specifies an
encoding.  If we only supported, e.g., "UTF-8" and no other spelling,
that would make life much more difficult.  If you look into
encodings/__init__.py, you can see that throwing out all
non-alphanumerics is a conscious design choice in encoding name
normalization.

The only thing I don't know is why "utf" is an alias for utf-8.

Assigning to Marc-Andre, who implemented most of the codec system.
msg86937 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-02 11:35
Is there any reason for allowing "utf" as an alias to utf-8? It sounds
much too ambiguous. The other silly variants (those with lots of
spurious punctuation characters) could be forbidden too.
msg86956 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-05-02 16:22
How about a 'full' form and a 'key' form generated by the function:

def codec_key(name):
    return name.lower().replace("-", "").replace("_", "")

The key form would be the key to an available codec, and the key
generated by a user-supplied codec name would have to match one of those
keys.

For example:

Full: "UTF-8", key: "utf8".

Full: "ISO-8859-1", key: "iso88591".
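Under this scheme, lookup would reduce to a plain dict access on the key form; a minimal sketch (the table contents and the `lookup` helper here are illustrative, not the actual codec registry):

```python
def codec_key(name):
    return name.lower().replace("-", "").replace("_", "")

# Hypothetical table mapping key forms to full codec names.
codecs_table = {"utf8": "UTF-8", "iso88591": "ISO-8859-1"}

def lookup(name):
    try:
        return codecs_table[codec_key(name)]
    except KeyError:
        raise LookupError("unknown encoding: %r" % name)

print(lookup("Utf_8"))        # matches via key "utf8"
print(lookup("ISO-8859-1"))   # matches via key "iso88591"
```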
msg87034 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-05-03 07:07
Actually I'd like to have some kind of convention mainly when the user
writes the encoding as a string, e.g. s.encode('utf-8'). Indeed, if the
encoding comes from a webpage or somewhere else it makes sense to have
some flexibility.

I think that 'utf-8' is the most widely used name for the UTF-8 codec,
yet it's not even mentioned in the table of the standard encodings. So
someone will use 'utf-8', someone else 'utf_8', and some users could even
pick one of the aliases, like 'U8'.

It's probably enough to document 'utf-8', 'iso-8859-1', and similar as
the "preferred forms" and to explain why and how the codec names are
normalized and what the valid aliases are.

Regarding the ambiguity of 'UTF', it is not the only one, there's also
'LATIN' among the aliases of ISO-8859-1.
msg87103 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-05-04 08:25
On 2009-05-02 11:20, Georg Brandl wrote:
> Georg Brandl <georg@python.org> added the comment:
> 
> I don't think this is a good idea.  Accepting all common forms for
> encoding names means that you can usually give Python an encoding name
> from, e.g. a HTML page, or any other file or system that specifies an
> encoding.  If we only supported, e.g., "UTF-8" and no other spelling,
> that would make life much more difficult.  If you look into
> encodings/__init__.py, you can see that throwing out all
> non-alphanumerics is a conscious design choice in encoding name
> normalization.
> 
> The only thing I don't know is why "utf" is an alias for utf-8.
> 
> Assigning to Marc-Andre, who implemented most of codecs.

-1 on making codec names strict.

The reason why we have so many aliases is to enhance compatibility
with other software and data, not to encourage use of these aliases
in Python itself.
msg87140 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-05-04 17:04
So, do you also think "utf" and "latin" should stay?
msg87144 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-05-04 18:06
Well, there are multiple UTF encodings, so no to "utf".

Are there multiple Latin encodings? Not in Python 2.6.2 under those names.

I'd probably insist on names that are strictish(?), i.e. correct, give or
take a '-' or '_'.
msg87226 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-05-05 08:31
On 2009-05-04 19:04, Georg Brandl wrote:
> Georg Brandl <georg@python.org> added the comment:
> 
> So, do you also think "utf" and "latin" should stay?

For Python 3.x, I think those can be removed. For 2.x it's better to
keep them.

Note that UTF-8 was the first official Unicode transfer encoding,
that's why it's sometimes referred to as "UTF".

The situation is similar for Latin-1. It was the first of a series of
encodings defined by ECMA which was later published by ISO under the name
ISO-8859 - long after the name "Latin-1" became popular, which is why
it's the default name in Python.
msg129238 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-02-24 01:37
What is the status of this?  Status=open and Resolution=rejected contradict each other.

This discussion is relevant for issue11303.  Currently alias lookup incurs huge performance penalty in some cases.
msg129239 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-02-24 02:10
> Accepting all common forms for
> encoding names means that you can usually give Python an encoding name
> from, e.g. a HTML page, or any other file or system that specifies an
> encoding.

I don't buy this argument.  Running the attached script on http://www.iana.org/assignments/character-sets shows that there are hundreds of registered charsets that are not accepted by Python:

$ ./python.exe iana.py| wc -l
     413

Any serious HTML or XML processing software should be based on the IANA character-sets file rather than on the ad-hoc list of aliases that made it into encodings/aliases.py.
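The attached iana.py isn't reproduced here, but the same check can be sketched against a handful of names from the IANA registry (the list below is an illustrative subset, not the full registry file):

```python
import codecs

# A few character set names from the IANA registry (illustrative
# subset of http://www.iana.org/assignments/character-sets).
iana_names = ["UTF-8", "ISO-8859-1", "ANSI_X3.110-1983", "ISO-2022-CN-EXT"]

# Collect the registered names that Python's codec lookup rejects.
unsupported = []
for name in iana_names:
    try:
        codecs.lookup(name)
    except LookupError:
        unsupported.append(name)

print(unsupported)
```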
msg129248 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-02-24 04:00
Ezio and I discussed the implementation of alias lookup on IRC, and neither of us was able to point to the function that strips non-alphanumeric characters from encoding names.

It turns out that there are three "normalize" functions that are successively applied to the encoding name during evaluation of str.encode/str.decode.

1. normalize_encoding() in unicodeobject.c
2. normalizestring() in codecs.c
3. normalize_encoding() in encodings/__init__.py

Each performs a slightly different transformation and only the last one strips non-alphanumeric characters.

The complexity of codec lookup is comparable with that of the import mechanism!
msg129254 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-24 09:05
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> What is the status of this.  Status=open and Resolution=rejected contradict each other.

Sorry, forgot to close the ticket.
msg129255 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-24 09:20
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
>> Accepting all common forms for
>> encoding names means that you can usually give Python an encoding name
>> from, e.g. a HTML page, or any other file or system that specifies an
>> encoding.
> 
> I don't buy this argument.  Running attached script on http://www.iana.org/assignments/character-sets shows that there are hundreds of registered charsets that are not accepted by python:
> 
> $ ./python.exe iana.py| wc -l
>      413
> 
> Any serious HTML or XML processing software should be based on the IANA character-sets file rather than on the ad-hoc list of aliases that made it into encodings/aliases.py.

Let's do a reality check:

How often do you see requests for additions to the aliases we
have in Python?  Perhaps one every year, if at all.

We take great care not to add aliases that are not in common
use or that do not have a proven track record of really being
compatible to the codec in question.

If you think we are missing some aliases, please open tickets
for them, indicating why these should be added.

If you really want complete IANA coverage, I suggest you create
a normalization module which maps the IANA names to our names
and upload it to PyPI.
msg129257 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-24 09:29
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> Ezio and I discussed on IRC the implementation of alias lookup and neither of us was able to point out to the function that strips non-alphanumeric characters from encoding names.

I think you are misunderstanding the way the codec registry works.

You register codec search functions with it which then have to try
to map a given encoding name to a codec module.

The stdlib ships with one such function (defined in encodings/__init__.py).
This is registered with the codec registry per default.

The codec search function takes care of any normalization and conversion
to the module name used by the codecs from that codec package.
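Concretely, a search function is just a callable registered with `codecs.register()`; a toy sketch that routes a made-up name onto the stdlib UTF-8 codec (the name 'example-enc' is hypothetical, chosen only for illustration):

```python
import codecs

def search(name):
    # The registry hands the search function an already-normalized
    # name (lower case, hyphens and spaces mapped to underscores).
    if name == "example_enc":
        return codecs.lookup("utf-8")
    return None  # not ours; let other registered search functions try

codecs.register(search)
print(b"abc".decode("Example-Enc"))  # prints "abc"
```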

> It turns out that there are three "normalize" functions that are successively applied to the encoding name during evaluation of str.encode/str.decode.
> 
> 1. normalize_encoding() in unicodeobject.c

This was added to have the few shortcuts we have in the C code
for commonly used codecs match more encoding aliases.

The shortcuts completely bypass the codec registry and also
bypass the function call overhead incurred by codecs
run via the codec registry.

> 2. normalizestring() in codecs.c

This is the normalization applied by the codec registry. See PEP 100
for details:

"""
    Search functions are expected to take one argument, the encoding
    name in all lower case letters and with hyphens and spaces
    converted to underscores, ...
"""

> 3. normalize_encoding() in encodings/__init__.py

This is part of the stdlib encodings package's codec search
function.

> Each performs a slightly different transformation and only the last one strips non-alphanumeric characters.
> 
> The complexity of codec lookup is comparable with that of the import mechanism!

It's flexible, but not really complex.

I hope the above clarifies the reasons for the three normalization
functions.
History
Date User Action Args
2022-04-11 14:56:48adminsetgithub: 50152
2011-02-24 09:29:13lemburgsetnosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett
messages: + msg129257
2011-02-24 09:20:37lemburgsetnosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett
messages: + msg129255
2011-02-24 09:06:08lemburgsetstatus: open -> closed
nosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett
2011-02-24 09:05:45lemburgsetnosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett
messages: + msg129254
2011-02-24 04:00:52belopolskysetnosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett
messages: + msg129248
2011-02-24 02:10:19belopolskysetfiles: + iana.py
nosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett
messages: + msg129239
2011-02-24 01:37:25belopolskysetnosy: + belopolsky
messages: + msg129238
2009-05-05 08:31:26lemburgsetmessages: + msg87226
2009-05-04 18:06:32mrabarnettsetmessages: + msg87144
2009-05-04 17:04:54georg.brandlsetmessages: + msg87140
2009-05-04 08:25:48lemburgsetmessages: + msg87103
2009-05-03 07:07:20ezio.melottisetmessages: + msg87034
2009-05-02 16:22:02mrabarnettsetnosy: + mrabarnett
messages: + msg86956
2009-05-02 11:35:26pitrousetstatus: pending -> open
nosy: + pitrou
messages: + msg86937

2009-05-02 09:20:25georg.brandlsetstatus: open -> pending

nosy: + lemburg
messages: + msg86935

assignee: georg.brandl -> lemburg
resolution: rejected
2009-05-02 08:00:19ezio.melotticreate