Message 129280 - Python tracker


Author	vstinner
Recipients	belopolsky, eric.araujo, ezio.melotti, jcea, lemburg, sdaoden, vstinner
Date	2011-02-24.16:20:06
Message-id	<1298564410.15.0.973479289946.issue11303@psf.upfronthosting.co.za>
In-reply-to

Content

I think that the normalization function in unicodeobject.c (only used by internal functions) can skip any character other than a-z, A-Z and 0-9. Something like:

>>> import re
>>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
... 
>>> normalize("UTF-8")
'utf8'
>>> normalize("ISO-8859-1")
'iso88591'
>>> normalize("latin1")
'latin1'

So ISO-8859-1, ISO8859-1, LATIN-1, latin1, UTF-8, utf8, etc. will be normalized to iso88591, latin1 and utf8.

I don't know of any encoding name where a character outside a-z, A-Z, 0-9 means anything special. But I don't know all encoding names! :-)
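
For comparison, the same idea can be written without the re module, as a plain character filter that is closer in shape to a loop in C. This is only a sketch: normalize_ascii and _KEEP are made-up names for illustration, not the actual function in unicodeobject.c, and the alias list is just the names mentioned above plus utf_8.

```python
import string

# Hypothetical helper for illustration only; not the real unicodeobject.c code.
# Characters that survive normalization: a-z and 0-9 (input is lowercased first).
_KEEP = frozenset(string.ascii_lowercase + string.digits)


def normalize_ascii(name):
    """Lowercase `name` and drop every character that is not a-z or 0-9."""
    return "".join(ch for ch in name.lower() if ch in _KEEP)


# Spelling variants of the same encoding collapse to a single key.
for alias in ("UTF-8", "utf_8", "utf8", "ISO-8859-1", "ISO8859-1", "LATIN-1", "latin1"):
    print(alias, "->", normalize_ascii(alias))
```

Note that iso88591 and latin1 stay distinct keys here: the filter only removes punctuation and case differences, it does not know that two different names refer to the same encoding.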
