
Author: lemburg
Recipients: belopolsky, eric.araujo, ezio.melotti, jcea, lemburg, sdaoden, vstinner
Date: 2011-02-24 16:31:31
Message-id: <4D6687E2.20607@egenix.com>
In-reply-to: <1298564410.15.0.973479289946.issue11303@psf.upfronthosting.co.za>
Content
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> I think that the normalization function in unicodeobject.c (only used for internal functions) can skip any character other than a-z, A-Z and 0-9. Something like:
> 
> >>> import re
> >>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
> ...
> >>> normalize("UTF-8")
> 'utf8'
> >>> normalize("ISO-8859-1")
> 'iso88591'
> >>> normalize("latin1")
> 'latin1'
> 
> So ISO-8859-1, ISO8859-1, LATIN-1, latin1, UTF-8, utf8, etc. will be normalized to iso88591, latin1 and utf8.
> 
> I don't know any encoding name where a character outside a-z, A-Z, 0-9 means anything special. But I don't know all encoding names! :-)

I think that, rather than just removing any hyphens, spaces, etc., the
function should additionally:

 * add a hyphen whenever one is missing and there is a switch
   from [a-z] to [0-9]

That way you end up with the correct names for the given set of
optimized encoding names.
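
For illustration, a minimal sketch of that variant. The name
normalize_encoding and the regex-based approach are only assumptions for
this example, not the actual unicodeobject.c implementation:

    import re

    def normalize_encoding(name):
        # Lowercase and drop everything outside a-z / 0-9, as in the
        # function quoted above...
        stripped = re.sub("[^a-z0-9]", "", name.lower())
        # ...then re-insert a hyphen at every switch from a letter to a
        # digit, so that "utf8" and "UTF-8" both come out as "utf-8".
        return re.sub("([a-z])([0-9])", r"\1-\2", stripped)

    # normalize_encoding("utf8")       -> 'utf-8'
    # normalize_encoding("UTF-8")      -> 'utf-8'
    # normalize_encoding("latin1")     -> 'latin-1'
    # normalize_encoding("Latin-1")    -> 'latin-1'
    # normalize_encoding("ISO-8859-1") -> 'iso-88591' (a hyphen is only
    #                                     added at the letter/digit switch)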
History
Date                 User     Action  Args
2011-02-24 16:31:35  lemburg  set     recipients: + lemburg, jcea, belopolsky, vstinner, ezio.melotti, eric.araujo, sdaoden
2011-02-24 16:31:31  lemburg  link    issue11303 messages
2011-02-24 16:31:31  lemburg  create