
Author ezio.melotti
Recipients ezio.melotti, georg.brandl
Date 2009-05-02.08:00:17
Message-id <1241251220.61.0.609037785801.issue5902@psf.upfronthosting.co.za>
In-reply-to
Content
I noticed that codec names[1]:
1) can contain random/unnecessary spaces and punctuation;
2) have several aliases that could probably be removed.

A few examples of valid codec names (done with Python 3):
>>> s = 'xxx'
>>> s.encode('utf')
b'xxx'
>>> s.encode('utf-')
b'xxx'
>>> s.encode('}Utf~->8<-~siG{ ;)')
b'\xef\xbb\xbfxxx'

'utf' is an alias for UTF-8, and it doesn't quite make sense to me that
'utf' alone should refer to UTF-8.
'utf-' could be a mistyped 'utf-8', 'utf-7' or even 'utf-16'; I'd like
it to raise an error instead.
The third example is probably not something that can be found in the
real world (I hope), but it shows how permissive the parsing of the names is.
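
To see which codec a given spelling actually resolves to, the canonical name can be read back from codecs.lookup(). This is only a small sketch: resolve() is a helper name made up for illustration, and whether the odd spellings are accepted may differ between Python versions, so no output is claimed here.

```python
import codecs

def resolve(name):
    """Return the canonical codec name for `name`, or None if the lookup fails.

    `resolve` is a hypothetical helper for this illustration only.
    """
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Raised when no registered search function recognizes the name.
        return None

# Print what each spelling from the examples resolves to on this interpreter.
for spelling in ('utf', 'utf-', 'utf-8', '}Utf~->8<-~siG{ ;)'):
    print(repr(spelling), '->', resolve(spelling))
```

A result of None means the interpreter rejected the name, which is what I'd like to happen for 'utf-'.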

Apparently whitespace is removed, the punctuation is used to split the
name into several parts, and only then is the lookup performed.
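
The behaviour described above can be approximated as follows. This is my guess at the observable effect, not the actual implementation: sketch_normalize() is a hypothetical name, and joining the parts with '_' and lower-casing them are assumptions made so that the result looks like the module-style names in the documentation.

```python
import re

def sketch_normalize(name):
    """Approximate the permissive name handling described above.

    Hypothetical helper: drops all whitespace, treats every run of
    non-alphanumeric characters as a separator, and joins the remaining
    parts with '_' (the separator and the lower-casing are assumptions).
    """
    no_space = ''.join(name.split())           # remove all whitespace
    parts = re.split(r'[^0-9A-Za-z]+', no_space)
    return '_'.join(p for p in parts if p).lower()

print(sketch_normalize('utf-'))
print(sketch_normalize('}Utf~->8<-~siG{ ;)'))
```

Under these assumptions a trailing '-' simply disappears, which would explain why 'utf-' is treated like 'utf'.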


About the aliases: in the documentation the "official" name for the
UTF-8 codec is 'utf_8' and there are 3 more aliases: U8, UTF, utf8. For
ISO-8859-1, the "official" name is 'latin_1' and there are 7 more
aliases: iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1.
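
The aliases known to the interpreter can also be listed from the encodings.aliases module, whose aliases dict maps each alias to a codec module name. A minimal sketch (aliases_for() is a hypothetical helper; the dict stores aliases in lower case with underscores, so the spellings differ slightly from the documentation table, and the exact set depends on the Python version):

```python
from encodings.aliases import aliases

def aliases_for(module_name):
    """Return the sorted aliases that map to `module_name` (e.g. 'utf_8').

    Hypothetical helper for illustration; it only reads the stdlib table.
    """
    return sorted(alias for alias, target in aliases.items()
                  if target == module_name)

print(aliases_for('utf_8'))
print(aliases_for('latin_1'))
```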
The Zen says "There should be one-- and preferably only one --obvious way to
do it.", so I suggest to:
1) disallow random punctuation and spaces within the name (only allow
leading and trailing spaces);
2) change the default names to, for example: 'utf-8', 'iso-8859-1'
instead of 'utf_8' and 'latin_1'. The names are case-insensitive.
3) remove the unnecessary aliases, for example: 'UTF', 'U8' for UTF-8
and 'iso8859-1', '8859', 'latin', 'L1' for ISO-8859-1.
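
As a rough illustration of point 1, a stricter check could look like this. It is only a sketch of one possible rule, not a proposal for the exact grammar: is_strict_codec_name() is a hypothetical name, and allowing a single '-', '_' or '.' between alphanumeric parts is my assumption.

```python
import re

# Assumed rule: alphanumeric parts separated by exactly one '-', '_' or '.'.
_STRICT_NAME = re.compile(r'[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*')

def is_strict_codec_name(name):
    """Return True if `name` is well-formed under the assumed strict rule.

    Leading and trailing whitespace is tolerated; anything else that is
    not part of the pattern makes the name invalid.
    """
    return _STRICT_NAME.fullmatch(name.strip()) is not None

for candidate in (' utf-8 ', 'utf-', '}Utf~->8<-~siG{ ;)'):
    print(repr(candidate), is_strict_codec_name(candidate))
```

This only checks the shape of the name; whether the name refers to an existing codec would still be decided by the normal lookup.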

This last point could break some code and may need a
DeprecationWarning. If there are good reasons to keep these aliases around,
only the other two issues need to be addressed.
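
A deprecation period could work roughly like the wrapper below. This is a sketch under my own assumptions, not how the codec machinery is structured: lookup_with_warning() and DISCOURAGED_ALIASES are hypothetical names, and the table only contains the aliases mentioned in this message.

```python
import codecs
import warnings

# Hypothetical table: discouraged alias -> preferred spelling.
DISCOURAGED_ALIASES = {
    'utf': 'utf-8',
    'u8': 'utf-8',
    'iso8859-1': 'iso-8859-1',
    '8859': 'iso-8859-1',
    'latin': 'iso-8859-1',
    'l1': 'iso-8859-1',
}

def lookup_with_warning(name):
    """Look up a codec, warning first if a discouraged alias is used."""
    key = name.strip().lower()
    preferred = DISCOURAGED_ALIASES.get(key)
    if preferred is not None:
        warnings.warn(
            f"codec alias {name!r} is discouraged; use {preferred!r} instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # The alias still works during the deprecation period.
    return codecs.lookup(key)
```

Existing code keeps working with such a wrapper, and the warning tells the user which spelling to switch to.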
If the name of the codec has to be a valid variable name (that is,
without '-'), only the documentation could be changed to have 'utf-8',
'iso-8859-1', etc. as the preferred names.

[1]: http://docs.python.org/library/codecs.html#standard-encodings
     http://docs.python.org/3.0/library/codecs.html#standard-encodings
History
Date User Action Args
2009-05-02 08:00:21 ezio.melotti set recipients: + ezio.melotti, georg.brandl
2009-05-02 08:00:20 ezio.melotti set messageid: <1241251220.61.0.609037785801.issue5902@psf.upfronthosting.co.za>
2009-05-02 08:00:19 ezio.melotti link issue5902 messages
2009-05-02 08:00:17 ezio.melotti create