utf-8 or utf8 or utf-8 (codec display name inconsistency) #58121

kennyluck · 2012-01-31T17:27:56Z

BPO	13913
Nosy	@vstinner, @ezio-melotti, @merwok

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2012-02-14.00:19:45.958>
created_at = <Date 2012-01-31.17:27:56.095>
labels = ['type-feature', 'expert-unicode']
title = 'utf-8 or utf8 or utf-8 (codec display name inconsistency)'
updated_at = <Date 2012-02-15.22:44:40.620>
user = 'https://bugs.python.org/kennyluck'

bugs.python.org fields:

activity = <Date 2012-02-15.22:44:40.620>
actor = 'python-dev'
assignee = 'none'
closed = True
closed_date = <Date 2012-02-14.00:19:45.958>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2012-01-31.17:27:56.095>
creator = 'kennyluck'
dependencies = []
files = []
hgrepos = []
issue_num = 13913
keywords = []
message_count = 7.0
messages = ['152399', '152421', '153308', '153309', '153417', '153437', '153446']
nosy_count = 5.0
nosy_names = ['vstinner', 'ezio.melotti', 'eric.araujo', 'python-dev', 'kennyluck']
pr_nums = []
priority = 'low'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue13913'
versions = ['Python 3.3']

kennyluck · 2012-01-31T17:27:55Z

In Python 3.2.2 (I don't have an earlier version to test with),

>>> "\udc80".encode("utf-8")
UnicodeEncodeError: *utf-8* codec can't encode character '\udc80'...

but

>>> b"\xff".decode("utf-8")
UnicodeDecodeError: *utf8* codec can't decode byte 0xff in position 0

and the table on the documentation of the codec module suggests *utf_8* as the name of the codec, which I believe to be equivalent to "utf_8" because '-' is not a valid character of an identifier.

Can we at least make the above two consistent? I would go for "utf-8", which was probably introduced for rejecting surrogates, but "utf8" has been there for years. What do we do? I am happy to submit patches for all branches. These are one-liners anyway.

The backward compatibility risk should be pretty low as usually you don't get encoding from these errors and I don't see any use of PyUnicode(Encode|Decode)Error_GetEncoding in trunk, although I'm using it for issue bpo-12892.

Also, "latin_1" displays as *latin-1* but "iso2022-jp" displays as *iso2022_jp*. I care less about this nit though.
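For anyone who wants to check which name their own interpreter reports, here is a minimal sketch. It relies only on the `encoding` attribute that `UnicodeEncodeError` and `UnicodeDecodeError` carry; the helper name `reported_codec_names` is made up for illustration and is not part of CPython.

```python
def reported_codec_names(codec="utf-8"):
    """Return the codec names carried by an encode error and a decode error.

    Hypothetical helper for illustration only.
    """
    encode_name = decode_name = None
    try:
        "\udc80".encode(codec)  # a lone surrogate cannot be encoded as UTF-8
    except UnicodeEncodeError as exc:
        encode_name = exc.encoding
    try:
        b"\xff".decode(codec)  # 0xff is never a valid byte in UTF-8
    except UnicodeDecodeError as exc:
        decode_name = exc.encoding
    return encode_name, decode_name


if __name__ == "__main__":
    # On an interpreter with the inconsistency described above, the two
    # names differ; where it has been fixed, they should match.
    print(reported_codec_names())
```

If the two values differ, the interpreter still shows the inconsistency reported here.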

kennyluck · 2012-02-01T00:42:30Z

and the table on the documentation of the codec module suggests *utf_8*
as the name of the codec, which I believe to be equivalent to "utf_8"
because '-' is not a valid character of an identifier.

typo: equivalent to "utf_8" → equivalent to "utf-8".

python-dev · 2012-02-14T00:17:37Z

New changeset c861c0a7f40c by Victor Stinner in branch '3.2':
Issue bpo-13913: normalize utf-8 codec name in UTF-8 decoder
http://hg.python.org/cpython/rev/c861c0a7f40c

New changeset af1a9508f7fa by Victor Stinner in branch 'default':
(Merge 3.2) Issue bpo-13913: normalize utf-8 codec name in UTF-8 decoder
http://hg.python.org/cpython/rev/af1a9508f7fa

vstinner · 2012-02-14T00:19:46Z

Use codecs.lookup(alias).name to get the normalized name of a codec. Examples:

>>> import codecs
>>> codecs.lookup('utf-8').name
'utf-8'
>>> codecs.lookup('iso-8859-1').name
'iso8859-1'
>>> codecs.lookup('latin1').name
'iso8859-1'
>>> codecs.lookup('iso2022_jp').name
'iso2022_jp'

All issues look to be addressed, so I close the issue. Thanks for the report!
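As a rough sketch of how the `codecs.lookup(alias).name` approach could be wrapped for comparing user-supplied encoding names: the helpers `canonical_name` and `same_codec` below are hypothetical names invented for this example, and the only library behavior assumed is that `codecs.lookup` raises `LookupError` for an unknown encoding.

```python
import codecs


def canonical_name(alias):
    """Return the normalized codec name for an alias, or None if unknown.

    Hypothetical helper built on codecs.lookup().
    """
    try:
        return codecs.lookup(alias).name
    except LookupError:
        return None


def same_codec(first, second):
    """Return True if both aliases resolve to the same known codec."""
    name = canonical_name(first)
    return name is not None and name == canonical_name(second)


if __name__ == "__main__":
    for alias in ("utf8", "UTF_8", "latin1", "iso-8859-1", "no-such-codec"):
        print(alias, "->", canonical_name(alias))
    print(same_codec("latin-1", "iso-8859-1"))
```

Comparing normalized names this way avoids depending on whichever spelling happens to appear in an error message.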

merwok · 2012-02-15T17:09:06Z

You need to update test_pep3120: http://www.python.org/dev/buildbot/all/builders/AMD64%20Gentoo%20Wide%203.2/builds/910/steps/test/logs/stdio/text

python-dev · 2012-02-15T21:25:03Z

New changeset 5b8f146103fa by Victor Stinner in branch '3.2':
Issue bpo-13913: Fix test_pep3120 for the UTF-8 codec name
http://hg.python.org/cpython/rev/5b8f146103fa

New changeset 170a224ce01e by Victor Stinner in branch 'default':
(Merge 3.2) Issue bpo-13913: Fix test_pep3120 for the UTF-8 codec name
http://hg.python.org/cpython/rev/170a224ce01e

python-dev · 2012-02-15T22:44:41Z

New changeset 824ddf6a30f2 by Victor Stinner in branch '3.2':
Issue bpo-13913: Another fix test_pep3120 for the UTF-8 codec name
http://hg.python.org/cpython/rev/824ddf6a30f2

New changeset 2cfba214c243 by Victor Stinner in branch 'default':
(Merge 3.2) Issue bpo-13913: Another fix test_pep3120 for the UTF-8 codec name
http://hg.python.org/cpython/rev/2cfba214c243

kennyluck (mannequin) added the topic-unicode and type-bug labels on Jan 31, 2012

merwok added the type-feature label and removed the type-bug label on Feb 4, 2012

vstinner closed this as completed Feb 14, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022
