Message 129322 - Python tracker

Author	vstinner
Recipients	belopolsky, eric.araujo, ezio.melotti, jcea, lemburg, sdaoden, vstinner
Date	2011-02-24.23:56:19
SpamBayes Score	4.2182924e-12
Marked as misclassified	No
Message-id	<1298591781.45.0.843671847574.issue11303@psf.upfronthosting.co.za>
In-reply-to

Content
>> That won't work, Victor, since it makes invalid encoding
>> names valid, e.g. 'utf(=)-8'.

> .. but this *is* valid: ...

Ah yes, it's because of encodings.normalize_encoding(). It's funny: we have 3 functions to normalize an encoding name, and each one does something different :-) E.g. encodings.normalize_encoding() doesn't replace non-ASCII letters and doesn't convert to lowercase.
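To see what encodings.normalize_encoding() does with such names, a quick check like the one below can help. It is only a sketch: the sample names are arbitrary, and the exact output may differ between Python versions, so nothing beyond the case-preservation point is claimed here.

```python
from encodings import normalize_encoding

# Print how the pure-Python normalizer rewrites a few sample spellings.
# The sample names are illustrative only.
for name in ("UTF-8", "utf(=)-8", "Latin 1"):
    print(repr(name), "->", repr(normalize_encoding(name)))

# The function keeps the original letter case; lowercasing happens elsewhere.
upper = normalize_encoding("UTF-8")
print(upper)
```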

more_aggressive_normalization.patch changes all 3 normalization functions and adds tests for encodings.normalize_encoding().

I think that speed and backward compatibility are more important than conforming to IANA or other standards.

Even if "~~ utf#8 ~~" is ugly, I don't think that it really matters that we accept it.
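As a rough illustration of what a more aggressive normalization could look like, here is a minimal sketch. It is not the code from the patch: the function name aggressive_normalize() and its exact rules (lowercase, ASCII only, runs of other characters collapsed to one underscore) are assumptions made for this example.

```python
def aggressive_normalize(name: str) -> str:
    """Hypothetical helper, not taken from the patch.

    Lowercases ASCII letters, keeps ASCII digits and '.', and collapses
    every run of other characters (including non-ASCII) into a single '_'.
    Leading and trailing runs are dropped.
    """
    chars = []
    pending_sep = False
    for ch in name:
        if ch.isascii() and (ch.isalnum() or ch == "."):
            if pending_sep and chars:
                chars.append("_")
            chars.append(ch.lower())
            pending_sep = False
        else:
            pending_sep = True
    return "".join(chars)


for spelling in ("UTF-8", "utf(=)-8", "~~ utf#8 ~~"):
    print(repr(spelling), "->", repr(aggressive_normalize(spelling)))
```

With rules like these, all three spellings collapse to the same key, which is the trade-off being discussed: lookups stay simple and fast, at the price of accepting ugly names.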

--

If you don't want to touch the normalization functions and would rather just add more aliases to the C fast paths, we should also add utf8, utf16 and utf32.
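The hyphen-less spellings already resolve to the same codecs through the alias table, which can be checked with codecs.lookup(). The sketch below only shows that the names are equivalent; it says nothing about whether a given spelling takes a C fast path, which is the performance question raised here.

```python
import codecs

# Map each spelling to the canonical name reported by the codec registry.
names = {
    spelling: codecs.lookup(spelling).name
    for spelling in ("utf-8", "utf8", "UTF8", "utf16", "utf32")
}

for spelling, canonical in names.items():
    print(spelling, "->", canonical)
```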

Uses of "utf8" in Python: random.Random.seed(), smtpd.SMTPChannel.collect_incoming_data(), tarfile, multiprocessing.connection (xml serialization).

PS: On error, the UTF-8 decoder raises a UnicodeDecodeError with "utf8" as the encoding name :-)
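The name reported by the decoder can be inspected through the exception's encoding attribute. This is a small sketch; the value printed depends on the Python version ("utf8" at the time of this message, while later releases report "utf-8"), so no fixed output is shown.

```python
reported_name = None

try:
    # 0xFF can never start a valid UTF-8 sequence, so this always fails.
    b"\xff".decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.encoding holds the codec name the decoder reports about itself.
    reported_name = exc.encoding

print(reported_name)
```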

History
Date	User	Action	Args
2011-02-24 23:56:21	vstinner	set	recipients: + vstinner, lemburg, jcea, belopolsky, ezio.melotti, eric.araujo, sdaoden
2011-02-24 23:56:21	vstinner	set	messageid: <1298591781.45.0.843671847574.issue11303@psf.upfronthosting.co.za>
2011-02-24 23:56:19	vstinner	link	issue11303 messages
2011-02-24 23:56:19	vstinner	create