classification
Title: Encoding alias "unicode"
Type: enhancement
Stage: resolved
Components: Unicode
Versions: Python 3.3
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, georg.brandl, kxroberto, loewis, vstinner
Priority: normal
Keywords:

Created on 2011-11-19 11:35 by kxroberto, last changed 2011-11-25 21:09 by loewis. This issue is now closed.

Messages (9)
msg147936 - Author: kxroberto (kxroberto) Date: 2011-11-19 11:35
"unicode" does not seem to be an official alias for any Unicode encoding.
Yet it is quite frequent on the web, and obviously means UTF-8.
(Search for '"text/html; charset=unicode"' on Google.)
Chrome and IE display such pages as UTF-8. (Mozilla treats them as ASCII, so characters get mixed up.)

Should it be added to aliases.py?

--- ./aliases.py
+++ ./aliases.py
@@ -511,6 +511,7 @@
     'utf8'               : 'utf_8',
     'utf8_ucs2'          : 'utf_8',
     'utf8_ucs4'          : 'utf_8',
+    'unicode'            : 'utf_8',
 
     # uu_codec codec
     'uu'                 : 'uu_codec',
msg147937 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-19 11:49
Sorry, but it's not obvious that Unicode means UTF-8.
msg147938 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2011-11-19 12:03
Definitely; this will just serve to create more confusion for beginners over what a Unicode string is:

unicodestring.encode('unicode')   <- WTF?
msg147969 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-19 20:28
Joining the chorus: people who need it in their application will have to add it themselves (monkeypatching the aliases dictionary as appropriate).
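A minimal sketch of the monkeypatching approach mentioned above, assuming the application has decided on its own that the non-standard label "unicode" should mean UTF-8. The mapping itself is an application-level assumption, not something Python or any standard defines. The entry is best added early, before the name is first looked up, since the encodings package appears to cache failed lookups.

```python
import codecs
import encodings.aliases

# Application-level assumption: treat the non-canonical label "unicode"
# as UTF-8. Keys in this dictionary are normalized names (lowercase,
# underscores); values are module names in the encodings package.
encodings.aliases.aliases["unicode"] = "utf_8"

# The codec machinery now resolves the label through the patched table.
info = codecs.lookup("unicode")
print(info.name)

# Bytes labelled "unicode" are decoded as UTF-8.
decoded = b"caf\xc3\xa9".decode("unicode")
print(decoded)
```

The change affects the whole process, so every library running in the same interpreter sees the alias as well.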
msg148309 - Author: kxroberto (kxroberto) Date: 2011-11-25 08:22
I wonder where the frequent charset=unicode originated and who invented it. But:


"Sorry, but it's not obvious that Unicode means UTF-8."

When I first came across
<meta content="text/html; charset=unicode" http-equiv="Content-Type"/>
on the web, I guessed it was UTF-8 without looking. It even sounds colloquially reasonable ;-)  And it's right in 99.999% of cases.
(UTF-16 is less frequent than this non-canonical "unicode".)


"Definitely; this will just serve to create more confusion for beginners over what a Unicode string is:
unicodestring.encode('unicode')   <- WTF?"

I doubt any Python tutorial writer or encoding menu writer would pose that example. That string comes in on technical paths: web, MIME, etc.
aliases.py already contains many other names that are not canonical. frequency > convenience > alias


"Joining the chorus: people who need it in their application will have to add it themselves (monkeypatching the aliases dictionary as appropriate)."

Those people would first need to be aware of the option: either be all-seeing, or wait for the first bug reports ...


Reverse question: what would be the downside of having this alias?
msg148312 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-25 11:43
> <meta content="text/html; charset=unicode" http-equiv="Content-Type"/>

Python is not a language written for the web; it's a generic language for programming
anything! If you have a problem parsing an HTML page, the special case should
be added to the HTML parser, not to the language.

Do you have the encoding issue with a parser included in Python
(html.parser.*)? If you have the issue with a third-party parser, you have to
report the bug there.
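As a rough illustration of handling the special case at the parser or application level rather than in the language, the sketch below maps charset labels found in web pages to codecs Python knows. The names `CHARSET_OVERRIDES` and `resolve_charset` are hypothetical, and treating "unicode" as UTF-8 is an assumption made by this example only.

```python
import codecs

# Hypothetical, application-specific overrides for non-standard labels.
# Mapping "unicode" to UTF-8 is this example's assumption, not a standard.
CHARSET_OVERRIDES = {"unicode": "utf-8"}


def resolve_charset(label, default="utf-8"):
    """Return a codec name Python recognizes for a charset label.

    Falls back to `default` when the label is unknown to Python.
    """
    name = label.strip().lower()
    name = CHARSET_OVERRIDES.get(name, name)
    try:
        return codecs.lookup(name).name
    except LookupError:
        return default


print(resolve_charset("unicode"))
print(resolve_charset("UTF8"))
print(resolve_charset("x-not-a-charset", default="ascii"))
```

Keeping the override table next to the parsing code leaves Python's global alias table untouched, so other code in the same process is not affected.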
msg148353 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2011-11-25 19:38
The mapping "unicode" -> "utf-8" is simply not defined unambiguously, in addition to being factually wrong. For example, when Microsoft talks about Unicode they mean UTF-16.
msg148354 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-25 19:46
> For example, when Microsoft talks about Unicode they mean UTF-16.

Sorry, but UTF-16 is ambiguous: do you mean UTF-16-LE or UTF-16-BE? ;-)
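The byte-order point can be seen directly with Python's built-in codecs. This is only a small sketch using the standard "utf-16", "utf-16-le" and "utf-16-be" codec names; the variable names are illustrative.

```python
import codecs

text = "A"

# The same character yields different bytes depending on byte order.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")
print(little, big)

# The plain "utf-16" codec writes a byte order mark (BOM) first, which
# lets a decoder work out the byte order that follows.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Decoding with "utf-16" uses the BOM to recover the original text.
print(with_bom.decode("utf-16") == text)
```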
msg148362 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-25 21:09
> Reverse question: what would be the downside of having this alias?

Please accept that this issue is closed.
History
Date                 User          Action  Args
2011-11-25 21:09:54  loewis        set     messages: + msg148362
2011-11-25 19:46:49  vstinner      set     messages: + msg148354
2011-11-25 19:38:42  georg.brandl  set     messages: + msg148353
2011-11-25 11:43:21  vstinner      set     messages: + msg148312
2011-11-25 08:22:26  kxroberto     set     messages: + msg148309
2011-11-19 20:57:25  ezio.melotti  set     stage: resolved; versions: - Python 2.6, Python 3.1, Python 2.7, Python 3.2, Python 3.4
2011-11-19 20:28:46  loewis        set     nosy: + loewis; messages: + msg147969
2011-11-19 12:03:32  georg.brandl  set     status: open -> closed; nosy: + georg.brandl; messages: + msg147938; resolution: rejected
2011-11-19 11:49:47  vstinner      set     nosy: + vstinner; messages: + msg147937
2011-11-19 11:36:16  kxroberto     set     nosy: + ezio.melotti; type: enhancement; components: + Unicode; versions: + Python 2.6, Python 3.1, Python 2.7, Python 3.2, Python 3.3, Python 3.4
2011-11-19 11:35:12  kxroberto     create