Message 152399 - Python tracker

Author	kennyluck
Recipients	ezio.melotti, kennyluck
Date	2012-01-31.17:27:55
SpamBayes Score	1.7640334e-12
Marked as misclassified	No
Message-id	<1328030876.68.0.627537179228.issue13913@psf.upfronthosting.co.za>
In-reply-to

Content
Since Python 3.2.2 (I don't have an earlier version to test with),

>>> "\udc80".encode("utf-8")
UnicodeEncodeError: *utf-8* codec can't encode character '\udc80'...

but

>>> b"\xff".decode("utf-8")
UnicodeDecodeError: *utf8* codec can't decode byte 0xff in position 0

and the table in the documentation of the codecs module suggests *utf_8* as the name of the codec, which I believe to be equivalent to "utf-8" because '-' is not a valid character in an identifier.
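The mismatch can also be checked programmatically rather than by reading the messages. The following is only a minimal sketch: the helper name reported_codec_names is hypothetical, and it relies on the `encoding` attribute that UnicodeEncodeError and UnicodeDecodeError carry. The names it returns depend on the interpreter version, so no particular output is claimed here.

```python
def reported_codec_names(text="\udc80", data=b"\xff", codec="utf-8"):
    """Return the codec names carried by the encode and decode errors.

    Hypothetical helper for illustration; the default inputs are the
    ones used in the report above.
    """
    encode_name = decode_name = None

    try:
        text.encode(codec)      # a lone surrogate cannot be encoded
    except UnicodeEncodeError as exc:
        encode_name = exc.encoding

    try:
        data.decode(codec)      # 0xff is never valid in UTF-8
    except UnicodeDecodeError as exc:
        decode_name = exc.encoding

    return encode_name, decode_name


encode_name, decode_name = reported_codec_names()
print("encode error reports:", encode_name)
print("decode error reports:", decode_name)
print("same spelling:", encode_name == decode_name)
```

If the two spellings differ, any code that compares `exc.encoding` against a fixed string has to handle both.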

Can we at least make the above two consistent? I would go for "utf-8", which was probably introduced for rejecting surrogates, but "utf8" has been there for years. What do we do? I am happy to submit patches for all branches. These are one-liners anyway.

The backward compatibility risk should be pretty low, since code usually doesn't read the encoding from these errors, and I don't see any use of PyUnicode(Encode|Decode)Error_GetEncoding in trunk, although I'm using it for issue #12892.

Also, "latin_1" displays as *latin-1* but "iso2022-jp" displays as *iso2022_jp*. I care less about this nit though.
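The same kind of check works for these two codecs. Again this is just a sketch: encoding_in_error is a hypothetical helper, and it assumes that U+20AC (the euro sign) cannot be encoded by either latin_1 or iso2022-jp, so both calls raise UnicodeEncodeError. The exact spellings printed may vary between Python versions.

```python
def encoding_in_error(codec, text="\u20ac"):
    """Return the codec name stored on the UnicodeEncodeError, if any.

    Hypothetical helper for illustration; returns None when the text
    encodes without error.
    """
    try:
        text.encode(codec)
    except UnicodeEncodeError as exc:
        return exc.encoding
    return None


# Compare the name that was requested with the name the error reports.
for requested in ("latin_1", "iso2022-jp"):
    print(requested, "->", encoding_in_error(requested))
```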

History
Date	User	Action	Args
2012-01-31 17:27:56	kennyluck	set	recipients: + kennyluck, ezio.melotti
2012-01-31 17:27:56	kennyluck	set	messageid: <1328030876.68.0.627537179228.issue13913@psf.upfronthosting.co.za>
2012-01-31 17:27:56	kennyluck	link	issue13913 messages
2012-01-31 17:27:55	kennyluck	create