Issue 14176: Fix unicode literals


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/58384

classification

Title:	Fix unicode literals
Type:	behavior	Stage:
Components:	Unicode	Versions:	Python 3.3

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:		Nosy List:	Jean-Michel.Fauth, benjamin.peterson, ezio.melotti, georg.brandl, jmfauth, loewis, r.david.murray, terry.reedy
Priority:	normal	Keywords:

Created on 2012-03-02 12:36 by Jean-Michel.Fauth, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (15)
msg154763 - (view)	Author: Jean-Michel Fauth (Jean-Michel.Fauth)	Date: 2012-03-02 12:36
Now that PEP 414 has been accepted, I can only strongly recommend fixing the problem of unicode literals as a partial workaround.
>>> print u'abcœé€'
abcé
>>>
If these six characters are not rendered correctly, you should read:
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LIGATURE OE
LATIN SMALL LETTER E WITH ACUTE
EURO SIGN
It is not necessary to give here the list of the numerous libs that do not understand u'unicode literals' as shown above. (I wrote all my Py2 code in a u'unicode mode', and I know how hard it is to have to select between the u'' or unicode() variants.)
Face it. Python has never worked [], Python does not work, Python will never work. More important, it is more than clear to me that there is no willingness to solve this issue. (The holy compatibility with non-working code.)
[] Except the pure ASCII series (Py 1.5) and the Python 3[0,1,2] series.
No offense. I'm pretty sure the creator of this PEP is not even able to type on his machine the list of the 42 characters supposed to be available in the typographies (plural) used by the different countries speaking French. The whole free/open source software disaster in all its splendor.
Regards.
jmf
msg154765 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2012-03-02 13:53
What exactly is the bug you're reporting?
Python 2.7.2 (default, Oct 27 2011, 22:35:02)
[GCC 4.5.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'abcœé€'
abcœé€
msg154782 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-03-02 17:06
What operating system and what terminal are you using? If Windows: what code page does your terminal run in?
msg154792 - (view)	Author: Jean-Michel Fauth (Jean-Michel.Fauth)	Date: 2012-03-02 20:08
I deliberately hid the information about the interactive interpreter used, just to show you the "experience" of a new Python user. (This is what I'm showing to potential Python devs who are interested in this tool; I know Python and have used it since v. 1.5.6 as a non computer scientist.)
The interactive interpreter was:
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>>
In that precise case, it was Windows 7 Pro (Windows 7 Professionnel, in French because of a Swiss French version) and IDLE is just the IDLE an end user sees after a fresh installation. I can assure you, such behaviour exists / existed on all Windows versions I used (from Win98, Win2000, ...) with all the Python 2 versions since the introduction of unicode.
The technical reasons/aspects: "sys.defaultencoding", non iso-8859-1 chars [#], non-working unicode literals, sys.stdout.encoding = 'cp1252' and so on.
[#] For those who do not know, one cannot write text in French with Latin-1.
Please do not take my aggressive (I recognize it) but sometimes necessary message badly. IDLE is not the cause; I use IDLE here as an example to show the disaster of code containing unicode literals. I'm not really happy to see this mess again in Py3.3 [†]; the key point being unicode literals. Pandora's box is open.
[†] In fact, I will somehow never see or suffer from it. Decisions have been taken.
jmf
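The Latin-1 footnote in the message above can be checked directly. The following is a minimal Python 3 sketch, not code from the thread; the helper name `encodable` is hypothetical. It shows that œ (U+0153) and € (U+20AC) have no byte in Latin-1, while cp1252 covers all six characters of the test string.

```python
# The six characters from the report, written with escapes so the
# example does not depend on how this page itself is encoded.
TEXT = "abc\u0153\u00e9\u20ac"  # a, b, c, œ, é, €


def encodable(text, encoding):
    """Return True if every character of `text` can be encoded in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


latin1_ok = encodable(TEXT, "latin-1")
cp1252_ok = encodable(TEXT, "cp1252")
# Characters of TEXT that Latin-1 cannot represent.
missing_in_latin1 = [c for c in TEXT if not encodable(c, "latin-1")]

print(latin1_ok, cp1252_ok)
print([hex(ord(c)) for c in missing_in_latin1])
```

This only illustrates the character-set gap; it says nothing about which encoding a given shell or terminal actually applies.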
msg154793 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2012-03-02 20:13
Well, let me soothe your mind then: in Python 3, '...' and u'...' will be absolutely equal, so you won't find any more "mess" with the changes from PEP 414.
msg154794 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-03-02 20:32
Unless I'm misunderstanding, this is a duplicate of issue 1602. You will note that the problem is not with Python (or open source software in general), the problem is that Microsoft treats the command line as a second (or third, or fourth) class citizen.
msg154796 - (view)	Author: Jean-Michel Fauth (Jean-Michel.Fauth)	Date: 2012-03-02 20:35
Sorry, I neglected the most important information. Python 3.2 is working perfectly. It is simply impossible to create non-valid strings (type/class 'str') from a keyboard (that is, strings not created programmatically). Like the limited character set I used when I wrote my first program on a PDP-8. Porting Py 2 code was child's play.
msg154797 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-03-02 20:40
OK, so I still don't understand what problem it is you are reporting. What do you mean by "can't create non-valid strings"? Of course you can't. (I don't see how you could do that programmatically, either, although that depends heavily on your definition of non-valid.) Are you reporting that cmd.exe has no support for entering French characters? That wouldn't be a Python bug. Are you reporting that IDLE lacks the keyboard support for French? (I don't use IDLE, so I don't know if that is true or not.)
msg154798 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-03-02 20:41
I'm changing the title since PEP 414 has no bearing here.
msg154806 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2012-03-02 22:01
As I explained to J-M when he posted much the same to python-list, Idle's French keyboard support is faulty because tcl/tk's French keyboard support is faulty. A patch for this was recently applied to tcl/tk. I hope it will be in a released version that we can incorporate in 3.3. I am sure we all wish that Microsoft (and Apple) would take more of a lead in moving to a one Unicode world from a 200 encodings and codepages world. I am sometimes as frustrated at the current situation as J-M. But unless he can identify a valid Python bug, we should close this.
msg154807 - (view)	Author: Jean-Michel Fauth (Jean-Michel.Fauth)	Date: 2012-03-02 22:10
You do not get it or I do not explain it correctly. I do not care if Py 3.3 accepts '...' or u'...'. I'm only afraid that Py 3.3 is suffering from the same non-working behaviour Python 2 is suffering from. I have seen so many things...
I can only use a Py2/Py3 analogy, the types being different. In Python 2, the u'...' and the unicode('...', 'coding') are not equivalent. This leads and has led to a lot of non-working code. unicode() is always working, while u'...' may not work. A lot of libs are accepting unicode() and are failing when they have to accept u'...'. That would mean in Python 3, '...' works and u'...' will not work.
Once again, an illustration with IDLE / Py2.
>>> import unicodedata as ud
>>> for c in u'abcéœ€':
        print ud.name(c)
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LETTER E WITH ACUTE
Traceback (most recent call last):
  File "<pyshell#3>", line 2, in <module>
    print ud.name(c)
ValueError: no such name
>>> # but
>>> import sys
>>> for c in unicode('abcéœ€', sys.stdout.encoding):
        print ud.name(c)
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LIGATURE OE
EURO SIGN
>>>
Of course, this is actually a no problem with Py 3.
I know nothing about the internals of Python. I have however noticed this guilty behaviour happens especially with non iso-8859-1 chars, valid byte string chars but equivalent chars with unicode code point > 255. Unfortunately, these are all chars which are so important in French. (I heard about similar problems with the mac-roman coding. I do not know the status.)
So, if this (u'...') works in Py 3.3, the problem can be considered as "solved". At least you have been informed about this potential issue. It still remains that this is a serious problem on Py 2.
jmf
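One plausible reading of the transcript above, which the thread itself does not confirm, is that the bytes typed for the u'...' literal were cp1252 but were decoded as Latin-1. The Python 3 sketch below only reproduces that symptom under this assumption; it makes no claim about IDLE's internals, and the helper `names` is hypothetical.

```python
import unicodedata

# Bytes a cp1252 environment would produce for 'abcéœ€'.
raw = b"abc\xe9\x9c\x80"


def names(text):
    """Map each character to its Unicode name, or None if it has none."""
    return [unicodedata.name(ch, None) for ch in text]


as_cp1252 = raw.decode("cp1252")   # the intended text
as_latin1 = raw.decode("latin-1")  # 0x9C and 0x80 become C1 control characters

print(names(as_cp1252))
print(names(as_latin1))
```

Without the `None` default, `unicodedata.name` raises `ValueError` for the two control characters, which matches the "no such name" error in the transcript.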
msg154808 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2012-03-02 23:21
> That would mean in Python 3, '...' works and u'...' will not work.
You misunderstand the PEP: in 3.3, '...' and u'...' will be exactly the same. The only change is that the interpreter will ignore the u prefix instead of raising SyntaxError. It will be as if 'u' were not there. The only purpose is to let 2.x code run in 3.x without requiring the user to erase the 'u'.
I can see how you could misunderstand and think that the 'u' prefix must have some meaning. But it does not. The addition is a bit controversial but Guido approved it with the expectation that it will encourage more conversion of 2.x libraries to run on 3.3. In any case, the tracker is not the place for further discussion of the value of the PEP.
> Once again, an illustration with IDLE / Py2.
...
> Of course, this is actually a no problem with Py 3.
...
> It still remains that this is a serious problem on Py 2.
We are painfully aware that 2.x has problems with unicode. You do not need to tell us. I believe that most of the problems that could be sensibly fixed in 2.x have been fixed. 3.0 fixed more problems by changing the language. 3.3 fixes still more problems by changing the internal implementation of unicode, along with the C API, and the meaning of the language on some systems. People who want to avoid all the problems that have been fixed should use 3.3 either from the repository or when it is released.
> So, if this (u'...') works in Py 3.3, the problem can be considered as "solved".
I am glad you agree and I will close the issue. Please use python-list for any further discussion or questions.
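The point about the ignored prefix is easy to verify on any interpreter that implements PEP 414 (Python 3.3 or later). This is a small illustrative sketch, not code from the thread; the variable names are arbitrary.

```python
# Python 3.3 or later: the u prefix is accepted and has no effect.
plain = 'abc\u0153\u00e9\u20ac'
prefixed = u'abc\u0153\u00e9\u20ac'

same_value = plain == prefixed
# Both literals produce the one and only text type, str.
same_type = type(plain) is type(prefixed) is str

print(same_value, same_type)
```

On Python 3.0 to 3.2 the second assignment would fail to compile with a SyntaxError, which is the behaviour the PEP removes.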
msg154829 - (view)	Author: jmf (jmfauth)	Date: 2012-03-03 11:03
Preliminary remark. I'm sending this via Gmail, so it may happen that the glyphs you see are ill-formed or transformed by Google. Be assured I'm typing the "right" glyphs.
No, no and no. This is not a tkinter issue.
This "strange" behaviour, I do not find a better word, happens with many libraries, whether Python core libs or external libs. To tell you the truth, and despite my experience, I never succeeded in narrowing down the problem exactly. In Python 2 sometimes, understand with some pieces of code / software, it "works" and sometimes it simply does not. The libs used here are just the first ones that came to my mind.
-----
wxPython 2.8-ansi build.
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\py\shell.py", line 1242, in writeOut
    self.write(text)
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\py\shell.py", line 1000, in write
    self.AddText(text)
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\stc.py", line 1425, in AddText
    return _stc.StyledTextCtrl_AddText(*args, **kwargs)
  File "c:\python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4-5: character maps to <undefined>
abcéœ€
>>>
----
PySide, passing "unicode" to a text widget.
Passing u'abcéœ€' works.
Passing unicode('abcéœ€', 'cp1252') works.
Passing 'abcé€œ' doesn't ! 'œ€' are missing.
---
My interactive wx interpreter using wxPython. Strings as frame title.
True
ok
Traceback (most recent call last):
  File "<psi last command>", line 1, in <module>
  File "c:\Python27\lib\site-packages\wx-2.8-msw-ansi\wx\_windows.py", line 505, in __init__
    _windows_.Frame_swiginit(self,_windows_.new_Frame(*args, **kwargs))
  File "c:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 5-6: character maps to <undefined>
True
ok
---
And so on with many libs. You may argue that these libs are guilty. I may argue that Python is somehow guilty, because it lets users write non-working code.
And practically in all the cases, the main problem is due to the usage of unicode literals.
Just to show you, I'm quite comfortable with all this coding stuff. The results in my interactive interpreter. Special hack, unfortunately non-portable, works only with Windows and cp1252.
abcé??
>>> unicode('abcéœ€', sys.stdout.encoding)
abcéœ€
>>> print u'abcéœ€'
abcé??
>>> print unicode('abcéœ€', sys.stdout.encoding)
abcéœ€
As I am aware of this "feature", all my code is perfectly working. I'm paying attention to the necessity of the usage of u'...' or unicode(...). Unfortunately, this is not the general case in a lot of code I see that is supposed to deal with texts.
To draw a conclusion. You are wise enough to understand that, when I'm saying "Python just does not work", I'm unfortunately not so far away from reality. I really, very really, expect all this mess (sorry for the word) will not reappear in Py 3.3. Let's wait.
'abcéœ€'
>>> print('abcéœ€')
abcéœ€
>>>
Regards,
Jean-Michel Fauth
PS The u() trick does not help.
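Both the "position 4-5" in the first traceback and the abcé?? output are consistent with the string holding U+009C and U+0080 in place of œ and €. That is an assumption about what the shells handed to the codec, not something the thread establishes. Under that assumption, a Python 3 sketch of the two symptoms looks like this:

```python
# Assumed contents of the mis-decoded string: 'abcé' followed by the
# C1 control characters U+009C and U+0080 instead of 'œ' and '€'.
suspect = "abc\xe9\x9c\x80"

try:
    suspect.encode("cp1252")
    error_start = None
except UnicodeEncodeError as exc:
    # Index of the first character that cp1252 cannot encode.
    error_start = exc.start

# With errors="replace", each unencodable character becomes '?'.
replaced = suspect.encode("cp1252", errors="replace")

# The intended text, by contrast, encodes without trouble.
intended = "abc\xe9\u0153\u20ac".encode("cp1252")

print(error_start, replaced, intended)
```

The strict encode fails at index 4, the replacing encode yields the question marks, and the intended characters map to the very bytes 0x9C and 0x80 whose code points caused the failure.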
msg154834 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2012-03-03 13:03
I'd like to encourage you to not try this sort of thing out from an interactive interpreter (incidentally, where does "<psi last command>" come from? It doesn't look like Python's REPL). As David and Terry noted, interactions with such a console, be it Windows' "cmd" or IDLE, have their very own idiosyncrasies and bugs.
That said, in Python 2.x source files the following two expressions are identical:
* u'abcœé€'
* unicode('abcœé€', 'encoding the file is in')
Both result in a Unicode string with the six characters/codepoints you mentioned. There won't be any code that works with one but not the other. Of course there are libraries that do not handle Unicode strings in general (nothing to do with literals!) correctly, but as you yourself said, that is a problem with the libraries.
Lastly, please read PEP 414 if you are not completely sure what it is proposing. You will see that it merely affects the available syntax for Unicode literals and allows the "u" again.
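The equivalence described above concerns Python 2 source files, where `unicode()` exists. As a rough Python 3 analogue, offered only as a sketch, the example below compiles a small module held as cp1252 bytes with a PEP 263 coding declaration and compares its string literal with the same bytes decoded by hand. The module text and variable names are made up for illustration.

```python
# A tiny module stored as cp1252 bytes. The first line is a PEP 263
# coding declaration, so the compiler decodes the literal as cp1252.
module_text = "# -*- coding: cp1252 -*-\nvalue = 'abc\u0153\u00e9\u20ac'\n"
source_bytes = module_text.encode("cp1252")

namespace = {}
exec(compile(source_bytes, "<cp1252 demo>", "exec"), namespace)
from_literal = namespace["value"]

# Decoding the literal's bytes by hand with the file's encoding.
from_decode = b"abc\x9c\xe9\x80".decode("cp1252")

print(from_literal == from_decode)
```

When the declared encoding matches the bytes actually in the file, the literal and the explicit decode agree; a mismatch between the two is where differences can creep in.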
msg154907 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-03-04 20:09
I propose to close this issue as invalid (although out-of-date might be fine as well). Jean-Michel is apparently unable to describe what issue precisely he wants to see fixed, rather than just complaining that open source is a disaster. I don't think we can do anything about open source being a disaster, and I'm not able to reproduce that perception.
Jean-Michel: please try to use this bug tracker in the way it is intended, i.e. report one bug at a time, following this structure:
- this is what I did
- this is what happened
- this is what should have happened instead

History
Date	User	Action	Args
2022-04-11 14:57:27	admin	set	github: 58384
2012-03-04 20:09:55	loewis	set	messages: + msg154907
2012-03-03 13:03:38	georg.brandl	set	messages: + msg154834
2012-03-03 11:03:39	jmfauth	set	nosy: + jmfauth messages: + msg154829
2012-03-02 23:21:06	terry.reedy	set	status: open -> closed resolution: out of date messages: + msg154808
2012-03-02 22:10:57	Jean-Michel.Fauth	set	messages: + msg154807
2012-03-02 22:01:35	terry.reedy	set	nosy: + terry.reedy messages: + msg154806
2012-03-02 20:41:44	r.david.murray	set	messages: + msg154798 title: Fix unicode literals (for PEP 414) -> Fix unicode literals
2012-03-02 20:40:59	r.david.murray	set	messages: + msg154797
2012-03-02 20:35:50	Jean-Michel.Fauth	set	messages: + msg154796
2012-03-02 20:32:27	r.david.murray	set	nosy: + r.david.murray messages: + msg154794
2012-03-02 20:13:10	georg.brandl	set	nosy: + georg.brandl messages: + msg154793
2012-03-02 20:08:41	Jean-Michel.Fauth	set	messages: + msg154792
2012-03-02 17:06:56	loewis	set	nosy: + loewis messages: + msg154782
2012-03-02 13:58:43	ezio.melotti	set	nosy: + ezio.melotti type: behavior components: + Unicode, - None
2012-03-02 13:53:25	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg154765
2012-03-02 12:36:56	Jean-Michel.Fauth	create