Issue 5902
Created on 2009-05-02 08:00 by ezio.melotti, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files

File name | Uploaded | Description
---|---|---
iana.py | belopolsky, 2011-02-24 02:10 |

Messages (15)
msg86933 - Author: Ezio Melotti (ezio.melotti) - Date: 2009-05-02 08:00

I noticed that codec names[1]:

1) can contain random/unnecessary spaces and punctuation;
2) have several aliases that could probably be removed.

A few examples of valid codec names (done with Python 3):

```
>>> s = 'xxx'
>>> s.encode('utf')
b'xxx'
>>> s.encode('utf-')
b'xxx'
>>> s.encode('}Utf~->8<-~siG{ ;)')
b'\xef\xbb\xbfxxx'
```

'utf' is an alias for UTF-8, and it doesn't quite make sense to me that 'utf' alone refers to UTF-8. 'utf-' could be a mistyped 'utf-8', 'utf-7' or even 'utf-16'; I'd like it to raise an error instead. The third example is probably not something that can be found in the real world (I hope), but it shows how permissive the parsing of the names is. Apparently the whitespace is removed and the punctuation is used to split the name into several parts before the check is performed.

About the aliases: in the documentation the "official" name for the UTF-8 codec is 'utf_8' and there are 3 more aliases: U8, UTF, utf8. For ISO-8859-1, the "official" name is 'latin_1' and there are 7 more aliases: iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1.

The Zen says "There should be one—and preferably only one—obvious way to do it.", so I suggest to:

1) disallow random punctuation and spaces within the name (only allow leading and trailing spaces);
2) change the default names to, for example, 'utf-8' and 'iso-8859-1' instead of 'utf_8' and 'iso8859_1' (the names are case-insensitive);
3) remove the unnecessary aliases, for example 'UTF' and 'U8' for UTF-8, and 'iso8859-1', '8859', 'latin' and 'L1' for ISO-8859-1.

This last point could break some code and may need a DeprecationWarning. If there are good reasons to keep these aliases around, the other two points can still be addressed. If the name of the codec has to be a valid variable name (that is, without '-'), only the documentation could be changed to list 'utf-8', 'iso-8859-1', etc. as the preferred names.

[1]: http://docs.python.org/library/codecs.html#standard-encodings
http://docs.python.org/3.0/library/codecs.html#standard-encodings
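For illustration, a minimal sketch of the kind of normalization described above (the helper name and exact rules here are assumptions, not the actual CPython implementation):

```python
import re

def loose_normalize(name):
    # Hypothetical helper: lowercase the name, then treat any run of
    # non-alphanumeric characters as a single separator.  Under rules
    # like these, '}Utf~->8<-~siG{ ;)' collapses to 'utf_8_sig'.
    parts = re.split(r'[^0-9a-z]+', name.lower())
    return '_'.join(p for p in parts if p)

print(loose_normalize('}Utf~->8<-~siG{ ;)'))  # -> utf_8_sig
print(loose_normalize('  ISO 8859-1  '))      # -> iso_8859_1
```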
msg86935 - Author: Georg Brandl (georg.brandl) - Date: 2009-05-02 09:20

I don't think this is a good idea. Accepting all common forms for encoding names means that you can usually give Python an encoding name from, e.g., an HTML page, or any other file or system that specifies an encoding. If we only supported, e.g., "UTF-8" and no other spelling, that would make life much more difficult.

If you look into encodings/__init__.py, you can see that throwing out all non-alphanumerics is a conscious design choice in encoding name normalization.

The only thing I don't know is why "utf" is an alias for utf-8.

Assigning to Marc-Andre, who implemented most of the codecs.
msg86937 - Author: Antoine Pitrou (pitrou) - Date: 2009-05-02 11:35

Is there any reason for allowing "utf" as an alias for utf-8? It sounds much too ambiguous. The other silly variants (those with lots of spurious punctuation characters) could be forbidden too.
msg86956 - Author: Matthew Barnett (mrabarnett) - Date: 2009-05-02 16:22

How about a 'full' form and a 'key' form generated by the function:

```python
def codec_key(name):
    return name.lower().replace("-", "").replace("_", "")
```

The key form would be the key to an available codec, and the key generated by a user-supplied codec name would have to match one of those keys. For example: Full: "UTF-8", key: "utf8". Full: "ISO-8859-1", key: "iso88591".
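As a usage sketch of this proposal (the available-codec table below is illustrative, not an actual API):

```python
def codec_key(name):
    return name.lower().replace("-", "").replace("_", "")

# Hypothetical table of available codecs, indexed by their key form.
AVAILABLE = {codec_key(full): full for full in ("UTF-8", "ISO-8859-1")}

def resolve(user_name):
    # A user-supplied name matches only if its key form is known.
    try:
        return AVAILABLE[codec_key(user_name)]
    except KeyError:
        raise LookupError("unknown encoding: %r" % user_name)

print(resolve("utf_8"))       # -> UTF-8
print(resolve("Iso-8859-1"))  # -> ISO-8859-1
```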
msg87034 - Author: Ezio Melotti (ezio.melotti) - Date: 2009-05-03 07:07

Actually, I'd like to have some kind of convention mainly for when the user writes the encoding as a string, e.g. s.encode('utf-8'). Indeed, if the encoding comes from a webpage or somewhere else, it makes sense to have some flexibility.

I think that 'utf-8' is the most widely used name for the UTF-8 codec, yet it's not even mentioned in the table of the standard encodings. So someone will use 'utf-8', someone else 'utf_8', and some users could even pick one of the aliases, like 'U8'. It's probably enough to add 'utf-8', 'iso-8859-1' and similar as the "preferred forms" and explain why and how the codec names are normalized and which aliases are valid.

Regarding the ambiguity of 'UTF': it is not the only one; there's also 'LATIN' among the aliases of ISO-8859-1.
msg87103 - Author: Marc-Andre Lemburg (lemburg) - Date: 2009-05-04 08:25

On 2009-05-02 11:20, Georg Brandl wrote:
> I don't think this is a good idea. Accepting all common forms for
> encoding names means that you can usually give Python an encoding name
> from, e.g., an HTML page, or any other file or system that specifies an
> encoding. If we only supported, e.g., "UTF-8" and no other spelling,
> that would make life much more difficult. If you look into
> encodings/__init__.py, you can see that throwing out all
> non-alphanumerics is a conscious design choice in encoding name
> normalization.
>
> The only thing I don't know is why "utf" is an alias for utf-8.

-1 on making codec names strict.

The reason why we have so many aliases is to enhance compatibility with other software and data, not to encourage use of these aliases in Python itself.
msg87140 - Author: Georg Brandl (georg.brandl) - Date: 2009-05-04 17:04

So, do you also think "utf" and "latin" should stay?
msg87144 - Author: Matthew Barnett (mrabarnett) - Date: 2009-05-04 18:06

Well, there are multiple UTF encodings, so no to "utf". Are there multiple Latin encodings? Not in Python 2.6.2 under those names. I'd probably insist on names that are strictish(?), i.e. correct, give or take a '-' or '_'.
msg87226 - Author: Marc-Andre Lemburg (lemburg) - Date: 2009-05-05 08:31

On 2009-05-04 19:04, Georg Brandl wrote:
> So, do you also think "utf" and "latin" should stay?

For Python 3.x, I think those can be removed. For 2.x it's better to keep them.

Note that UTF-8 was the first official Unicode transfer encoding; that's why it's sometimes referred to as "UTF".

The situation is similar for Latin-1. It was the first of a series of encodings defined by ECMA which was later published by ISO under the name ISO-8859, long after the name "Latin-1" became popular, which is why it's the default name in Python.
msg129238 - Author: Alexander Belopolsky (belopolsky) - Date: 2011-02-24 01:37

What is the status of this? Status=open and Resolution=rejected contradict each other.

This discussion is relevant for issue11303. Currently, alias lookup incurs a huge performance penalty in some cases.
msg129239 - Author: Alexander Belopolsky (belopolsky) - Date: 2011-02-24 02:10

> Accepting all common forms for encoding names means that you can
> usually give Python an encoding name from, e.g., an HTML page, or any
> other file or system that specifies an encoding.

I don't buy this argument. Running the attached script on http://www.iana.org/assignments/character-sets shows that there are hundreds of registered charsets that are not accepted by Python:

```
$ ./python.exe iana.py | wc -l
413
```

Any serious HTML or XML processing software should be based on the IANA character-sets file rather than on the ad-hoc list of aliases that made it into encodings/aliases.py.
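The attached iana.py is not reproduced in the tracker; a minimal sketch of a script performing this kind of check might look as follows (the URL is the one from the message, but the parsing of the registry file is simplified and assumed, and the resulting count will depend on the Python version):

```python
import codecs
import re
from urllib.request import urlopen

URL = "http://www.iana.org/assignments/character-sets"

def iana_names(text):
    # Assumption: the plain-text registry lists entries on lines of
    # the form "Name: <charset> ..." and "Alias: <charset> ...".
    for line in text.splitlines():
        m = re.match(r'(?:Name|Alias)\s*:\s*(\S+)', line)
        if m and m.group(1) != 'None':
            yield m.group(1)

text = urlopen(URL).read().decode('ascii', 'replace')
for name in sorted(set(iana_names(text))):
    try:
        codecs.lookup(name)
    except LookupError:
        print(name)  # registered with IANA but unknown to Python
```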
msg129248 - Author: Alexander Belopolsky (belopolsky) - Date: 2011-02-24 04:00

Ezio and I discussed the implementation of alias lookup on IRC, and neither of us was able to point to the function that strips non-alphanumeric characters from encoding names. It turns out that there are three "normalize" functions that are successively applied to the encoding name during evaluation of str.encode/str.decode:

1. normalize_encoding() in unicodeobject.c
2. normalizestring() in codecs.c
3. normalize_encoding() in encodings/__init__.py

Each performs a slightly different transformation, and only the last one strips non-alphanumeric characters. The complexity of codec lookup is comparable with that of the import mechanism!
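The effect of the last of these three can be observed directly; a small demonstration (the outputs shown in comments reflect CPython 3.x behavior as observed and may differ between versions):

```python
import codecs
from encodings import normalize_encoding

# The stdlib search function collapses runs of non-alphanumeric
# characters, which is why wildly different spellings of the same
# name all resolve to the same codec.
print(normalize_encoding('}Utf~->8<-~siG{ ;)'))  # Utf_8_siG
print(codecs.lookup('UTF!!8').name)              # utf-8
```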
msg129254 - Author: Marc-Andre Lemburg (lemburg) - Date: 2011-02-24 09:05

Alexander Belopolsky wrote:
> What is the status of this? Status=open and Resolution=rejected
> contradict each other.

Sorry, forgot to close the ticket.
msg129255 - Author: Marc-Andre Lemburg (lemburg) - Date: 2011-02-24 09:20

Alexander Belopolsky wrote:
> I don't buy this argument. Running the attached script on
> http://www.iana.org/assignments/character-sets shows that there are
> hundreds of registered charsets that are not accepted by Python.
>
> Any serious HTML or XML processing software should be based on the
> IANA character-sets file rather than on the ad-hoc list of aliases
> that made it into encodings/aliases.py.

Let's do a reality check: how often do you see requests for additions to the aliases we have in Python? Perhaps one every year, if at all.

We take great care not to add aliases that are not in common use or that do not have a proven track record of really being compatible with the codec in question. If you think we are missing some aliases, please open tickets for them, indicating why these should be added.

If you really want complete IANA coverage, I suggest you create a normalization module which maps the IANA names to our names and upload it to PyPI.
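Such a normalization module could be little more than a mapping layer; a hypothetical sketch (the module name, mapping entries, and API below are invented for illustration, not an existing package):

```python
# iana_aliases.py -- hypothetical mapping of IANA charset names to
# Python codec names; only a couple of illustrative entries shown.
import codecs

IANA_TO_PYTHON = {
    'csisolatin1': 'iso8859-1',
    'ansi_x3.4-1968': 'ascii',
}

def lookup(name):
    # Fall back to Python's own lookup when the IANA name is unknown.
    key = name.strip().lower()
    return codecs.lookup(IANA_TO_PYTHON.get(key, key))
```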
msg129257 - Author: Marc-Andre Lemburg (lemburg) - Date: 2011-02-24 09:29

Alexander Belopolsky wrote:
> Ezio and I discussed the implementation of alias lookup on IRC, and
> neither of us was able to point to the function that strips
> non-alphanumeric characters from encoding names.

I think you are misunderstanding the way the codec registry works. You register codec search functions with it, which then have to try to map a given encoding name to a codec module. The stdlib ships with one such function (defined in encodings/__init__.py), which is registered with the codec registry by default. The codec search function takes care of any normalization and conversion to the module name used by the codecs from that codec package.

> It turns out that there are three "normalize" functions that are
> successively applied to the encoding name during evaluation of
> str.encode/str.decode.
>
> 1. normalize_encoding() in unicodeobject.c

This was added so that the few shortcuts we have in the C code for commonly used codecs match more encoding aliases. The shortcuts completely bypass the codec registry and also avoid the function call overhead incurred by codecs run via the codec registry.

> 2. normalizestring() in codecs.c

This is the normalization applied by the codec registry. See PEP 100 for details:

"""
Search functions are expected to take one argument, the encoding name in all lower case letters and with hyphens and spaces converted to underscores, ...
"""

> 3. normalize_encoding() in encodings/__init__.py

This is part of the stdlib encodings package's codec search function.

> Each performs a slightly different transformation and only the last
> one strips non-alphanumeric characters.
>
> The complexity of codec lookup is comparable with that of the import
> mechanism!

It's flexible, but not really complex. I hope the above clarifies the reasons for the three normalization functions.
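To illustrate the search-function mechanism described above, a minimal sketch of a custom search function registered with the codec registry (the alias handled here is invented for illustration; the hyphen handling is defensive, since the exact pre-normalization applied before search functions are called has varied between versions):

```python
import codecs

def my_search(name):
    # The registry lowercases the name before calling search functions
    # (see PEP 100); normalize hyphens ourselves to be safe.
    if name.replace('-', '_') == 'my_utf8_alias':
        info = codecs.lookup('utf-8')
        return codecs.CodecInfo(info.encode, info.decode,
                                streamreader=info.streamreader,
                                streamwriter=info.streamwriter,
                                name='my-utf8-alias')
    return None  # let the other registered search functions try

codecs.register(my_search)
print(b'abc'.decode('MY-UTF8-ALIAS'))  # abc
```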
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:56:48 | admin | set | github: 50152
2011-02-24 09:29:13 | lemburg | set | nosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett; messages: + msg129257
2011-02-24 09:20:37 | lemburg | set | nosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett; messages: + msg129255
2011-02-24 09:06:08 | lemburg | set | status: open -> closed; nosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett
2011-02-24 09:05:45 | lemburg | set | nosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett; messages: + msg129254
2011-02-24 04:00:52 | belopolsky | set | nosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett; messages: + msg129248
2011-02-24 02:10:19 | belopolsky | set | files: + iana.py; nosy: lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett; messages: + msg129239
2011-02-24 01:37:25 | belopolsky | set | nosy: + belopolsky; messages: + msg129238
2009-05-05 08:31:26 | lemburg | set | messages: + msg87226
2009-05-04 18:06:32 | mrabarnett | set | messages: + msg87144
2009-05-04 17:04:54 | georg.brandl | set | messages: + msg87140
2009-05-04 08:25:48 | lemburg | set | messages: + msg87103
2009-05-03 07:07:20 | ezio.melotti | set | messages: + msg87034
2009-05-02 16:22:02 | mrabarnett | set | nosy: + mrabarnett; messages: + msg86956
2009-05-02 11:35:26 | pitrou | set | status: pending -> open; nosy: + pitrou; messages: + msg86937
2009-05-02 09:20:25 | georg.brandl | set | status: open -> pending; nosy: + lemburg; messages: + msg86935; assignee: georg.brandl -> lemburg; resolution: rejected
2009-05-02 08:00:19 | ezio.melotti | create |