Message 209385 - Python tracker


Author	ncoghlan
Recipients	Arfrever, elixir, ishimoto, jwilk, loewis, methane, mrabarnett, ncoghlan, nikratio, pitrou, rurpy2, serhiy.storchaka, vstinner
Date	2014-01-27.03:44:39
Message-id	<1390794280.19.0.796391625712.issue15216@psf.upfronthosting.co.za>
In-reply-to

Content
A given encoding may have multiple aliases, as well as variant spellings that are normalised before the codec lookup is done. Doing the lookup first runs both names through all of the normalisation and aliasing machinery, so that we then compare the *canonical* names. For example:

>>> import codecs
>>> codecs.lookup('ANSI_X3.4_1968').name
'ascii'
>>> codecs.lookup('ansi_x3.4_1968').name
'ascii'
>>> codecs.lookup('ansi-x3.4-1968').name
'ascii'
>>> codecs.lookup('ASCII').name
'ascii'
>>> codecs.lookup('ascii').name
'ascii'

A public "codecs.is_same_encoding" API might be a worthwhile and self-documenting addition, rather than just adding a comment that explains the need for the canonicalisation dance.
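As a rough sketch of what such a helper could look like (the name `is_same_encoding` is only the suggestion above, not an existing `codecs` function, and the behaviour for unknown names is an assumption), it would just compare the canonical names that `codecs.lookup` reports:

```python
import codecs


def is_same_encoding(name_a, name_b):
    """Hypothetical helper: True if both names resolve to the same codec.

    Both names go through codecs.lookup(), so aliases and variant
    spellings are normalised before the canonical names are compared.
    Assumption: an unknown name simply propagates LookupError.
    """
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


same_alias = is_same_encoding('ANSI_X3.4_1968', 'ascii')
same_spelling = is_same_encoding('UTF8', 'utf-8')
different = is_same_encoding('utf-8', 'latin-1')
```

Here `same_alias` and `same_spelling` come out true and `different` comes out false. Whether an unknown name should raise or just return False is a design question this sketch does not try to settle.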

As far as the second question goes, for non-seekable output streams, this API is inherently a case of "here be dragons" - that's a large part of the reason why it took so long for us to accept it as a feature we really should provide. We need to support writing a BOM to sys.stdout and sys.stderr - potentially doing so in the middle of existing output isn't really any different from the chance of implicitly switching encodings mid-stream.
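To illustrate the mid-stream BOM case, here is a small sketch that uses an in-memory `io.BytesIO` as a stand-in for a non-seekable stream. It does not use the API under discussion; the buffer and the sample strings are assumptions made purely for illustration:

```python
import codecs
import io

# Stand-in for an output stream that already carries some ASCII output.
raw = io.BytesIO()
prefix = "plain ascii output\n".encode("ascii")
raw.write(prefix)

# Switching to UTF-16 at this point: the "utf-16" codec emits a BOM
# at the start of whatever it encodes, so the BOM lands mid-stream.
raw.write("switched\n".encode("utf-16"))

data = raw.getvalue()
mid_stream_bom = data[len(prefix):len(prefix) + 2]
bom_is_mid_stream = mid_stream_bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

A consumer reading `data` with a single encoding would misinterpret one half or the other, which is the same hazard as any implicit encoding switch part-way through a stream.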

History
Date	User	Action	Args
2014-01-27 03:44:40	ncoghlan	set	recipients: + ncoghlan, loewis, ishimoto, pitrou, vstinner, jwilk, mrabarnett, Arfrever, methane, nikratio, rurpy2, serhiy.storchaka, elixir
2014-01-27 03:44:40	ncoghlan	set	messageid: <1390794280.19.0.796391625712.issue15216@psf.upfronthosting.co.za>
2014-01-27 03:44:40	ncoghlan	link	issue15216 messages
2014-01-27 03:44:39	ncoghlan	create