
classification
Title: Fix unicode literals
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.3
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: Jean-Michel.Fauth, benjamin.peterson, ezio.melotti, georg.brandl, jmfauth, loewis, r.david.murray, terry.reedy
Priority: normal
Keywords:

Created on 2012-03-02 12:36 by Jean-Michel.Fauth, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (15)
msg154763 - Author: Jean-Michel Fauth (Jean-Michel.Fauth) Date: 2012-03-02 12:36
Now that PEP 414 has been accepted, I can
only strongly recommend fixing the problem
of unicode literals, as a partial workaround.

>>> print u'abcœé€'
abcé
>>> 

If these six characters are not rendered correctly, you
should read:
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LIGATURE OE
LATIN SMALL LETTER E WITH ACUTE
EURO SIGN

It is not necessary to give here the list of
the numerous libs that do not understand
u'unicode literals' as shown above.

(I wrote all my Py2 code in a u'unicode mode',
and I know how hard it is to have to choose
between the u'' and unicode() variants.)

Face it. Python has never worked [*], Python does
not work, Python will never work. More importantly,
it is more than clear to me that there is no willingness
to solve this issue. (The holy compatibility with
non-working code.)

[*] Except the pure ASCII series (Py 1.5) and the
Python 3[0,1,2] series.

No offense. I'm pretty sure the creator of this
PEP is not even able to type on his machine the
list of the 42 characters supposed to be available
in the typographies (plural) used by the different
countries speaking French.
The whole free/open source software disaster in all
its splendor.

Regards.
jmf
msg154765 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2012-03-02 13:53
What exactly is the bug you're reporting?

Python 2.7.2 (default, Oct 27 2011, 22:35:02) 
[GCC 4.5.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'abcœé€'
abcœé€
msg154782 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-03-02 17:06
What operating system and what terminal are you using? If Windows: what code page does your terminal run in?
msg154792 - Author: Jean-Michel Fauth (Jean-Michel.Fauth) Date: 2012-03-02 20:08
I deliberately hid the information about the interactive
interpreter I used, just to show you the "experience" of a new
Python user. (This is what I'm showing to potential Python devs
who are interested in this tool; I know Python and have used it
since v. 1.5.6, as a non computer scientist.)

The interactive interpreter was:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> 

In that precise case, it was Windows 7 Pro (Windows 7
Professionnel, in French because of a Swiss French version)
and IDLE is just the IDLE an end user sees after a fresh
installation.
I can assure you, such behaviour exists / existed on all
Windows versions I have used (Win98, Win2000, ...) with all
the Python 2 versions since the introduction of unicode.

The technical reasons/aspects: "sys.defaultencoding",
non iso-8859-1 chars [#], *non working unicode literals*,
sys.stdout.encoding = 'cp1252' and so on.

[#] For those who do not know, one can not write text
in French with Latin-1.
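
The footnote above can be checked with a short Python 3 sketch. It is only an illustration: the helper name `encodable` is hypothetical, and the three sample characters are the non-ASCII ones from the first message.

```python
# Python 3 sketch: which of the sample characters fit in Latin-1, and which in cp1252?
def encodable(ch, encoding):
    """Return True if the character can be encoded with the given codec."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for ch in "éœ€":
    # Print the character, then whether Latin-1 and cp1252 can represent it.
    print(repr(ch), encodable(ch, "latin-1"), encodable(ch, "cp1252"))
```

LATIN SMALL LIGATURE OE and EURO SIGN have code points above 255, so Latin-1 (iso-8859-1) has no byte for them, while cp1252 does.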

Please do not take my aggressive (I recognize it), but sometimes
necessary message badly.

IDLE is not the cause, I use here IDLE to show as an example the
disaster of code containing *unicode literals*.

I'm not really happy to see this mess again in Py3.3 [†]; the key
point being *unicode literals*.

Pandora's box has been opened.

[†] In fact, I will somehow never see or suffer from it. Decisions
have been taken.

jmf
msg154793 - Author: Georg Brandl (georg.brandl) (Python committer) Date: 2012-03-02 20:13
Well, let me soothe your mind then: in Python 3, '...' and u'...' will be absolutely equal, so you won't find any more "mess" with the changes from PEP 414.
msg154794 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-03-02 20:32
Unless I'm misunderstanding, this is a duplicate of issue 1602.

You will note that the problem is *not* with Python (or open source software in general), the problem is that Microsoft treats the command line as a second (or third, or fourth) class citizen.
msg154796 - Author: Jean-Michel Fauth (Jean-Michel.Fauth) Date: 2012-03-02 20:35
Sorry, I neglected the most important information.

Python 3.2 is working perfectly. It is simply impossible
to create non valid strings (type/class 'str') from a
keyboard. (non programmatically created).

Like the limited character set I used when I wrote my
first program on a PDP-8.

Porting Py 2 code was child's play.
msg154797 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-03-02 20:40
OK, so I still don't understand what problem it is you are reporting.  What do you mean by "can't create non-valid strings"?  Of course you can't.  (I don't see how you could do that programmatically, either, although that depends heavily on your definition of non-valid.)

Are you reporting that cmd.exe has no support for entering French characters?  That wouldn't be a Python bug.

Are you reporting that idle lacks the keyboard support for French?  (I don't use Idle, so I don't know if that is true or not.)
msg154798 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-03-02 20:41
I'm changing the title since PEP 414 has no bearing here.
msg154806 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2012-03-02 22:01
As I explained to J-M when he posted much the same to python-list, Idle's French keyboard support is faulty because tcl/tk's French keyboard support is faulty. A patch for this was recently applied to tcl/tk. I hope it will be in a released version that we can incorporate in 3.3.

I am sure we all wish that Microsoft (and Apple) would take more of a lead in moving to a one Unicode world from a 200 encodings and codepages world. I am sometimes as frustrated at the current situation as J-M. But unless he can identify a valid *Python* bug, we should close this.
msg154807 - Author: Jean-Michel Fauth (Jean-Michel.Fauth) Date: 2012-03-02 22:10
You do not get it, or I am not explaining it correctly.

I do not care if Py 3.3 accepts '...' or u'...'. I'm only
afraid that Py 3.3 suffers from the same non-working
behaviour Python 2 suffers from. I have seen so many things...

I can only use a Py2/Py3 analogy, the types being different.

In Python 2, u'...' and unicode('...', 'coding') are
not equivalent. This leads, and has led, to a lot of
non-working code. unicode() always works, while u'...'
may not work. A lot of libs accept unicode() and
fail when given u'...'.
That would mean in Python 3, '...' works and u'...' will not work.

Once again, an *illustration* with IDLE / Py2.

>>> import unicodedata as ud
>>> for c in u'abcéœ€':
	print ud.name(c)

	
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LETTER E WITH ACUTE

Traceback (most recent call last):
  File "<pyshell#3>", line 2, in <module>
    print ud.name(c)
ValueError: no such name
>>> # but
>>> import sys
>>> for c in unicode('abcéœ€', sys.stdout.encoding):
	print ud.name(c)

	
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LIGATURE OE
EURO SIGN
>>> 

Of course, this is actually a no problem with Py 3.

I know nothing about the internals of Python. I have however
noticed this guilty behaviour happens especially with non
iso-8859-1 chars: valid byte string chars, but equivalent chars
with unicode code point > 255. Unfortunately, these are all chars
which are so important in French. (I have heard about similar problems
with the mac-roman coding. I do not know the status.)
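
One possible reading of the IDLE session above, sketched in Python 3. It rests on an assumption that the thread does not confirm: that the shell received cp1252 bytes and interpreted the u'...' literal as if those bytes were Latin-1. The helper `names` is hypothetical.

```python
import unicodedata

text = "abcéœ€"
raw = text.encode("cp1252")       # bytes as a cp1252 Windows console would deliver them
misread = raw.decode("latin-1")   # ASSUMED faulty step: same bytes read as Latin-1
correct = raw.decode("cp1252")    # what unicode('...', sys.stdout.encoding) did in the session

def names(s):
    """Return the Unicode name of each character, or None where no name exists."""
    result = []
    for ch in s:
        try:
            result.append(unicodedata.name(ch))
        except ValueError:        # "no such name", as in the traceback above
            result.append(None)
    return result

print(names(misread))
print(names(correct))
```

Under that assumption, the two bytes for œ and € become the control characters U+009C and U+0080. These have no Unicode name, which would match the `ValueError: no such name` raised right after LATIN SMALL LETTER E WITH ACUTE.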

So, if this (u'...') works in Py 3.3, the problem can
be considered as "solved".
At least you have been informed about this potential issue.
It still remains that this is a serious problem on Py 2.

jmf
msg154808 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2012-03-02 23:21
> That would mean in Python 3, '...' works and u'...' will not work.

You misunderstand the PEP: in 3.3, '...' and u'...' will be *exactly* the same. The only change is that the interpreter will ignore the u prefix instead of raising SyntaxError. It will be as if 'u' were not there. The only purpose is to let 2.x code run in 3.x without requiring the user to erase the 'u'.
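
A minimal sketch of that equivalence, assuming Python 3.3 or later; the sample string is the one used earlier in this thread.

```python
# Python 3.3+ sketch: the u prefix is accepted and ignored.
plain = 'abcœé€'
prefixed = u'abcœé€'

print(type(plain) is type(prefixed))  # both literals produce the same type, str
print(plain == prefixed)              # and the same six code points
```

On Python 3.0 to 3.2 the second assignment is a SyntaxError, which is the only thing PEP 414 changes.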

I can see how you could misunderstand and think that the 'u' prefix must have some meaning. But it does not. The addition is a bit controversial but Guido approved it with the expectation that it will encourage more conversion of 2.x libraries to run on 3.3. In any case, the tracker is not the place for further discussion of the value of the PEP.

> Once again, an *illustration* with IDLE / Py2.
...
> Of course, this is actually a no problem with Py 3.
...
> It still remains that this is a serious problem on Py 2.

We are painfully aware that 2.x has problems with unicode. You do not need to tell us. I believe that most of the problems that could be sensibly fixed in 2.x have been fixed. 3.0 fixed more problems by changing the language. 3.3 fixes still more problems by changing the internal implementation of unicode, along with the C api, and the meaning of the language on some systems. People who want to avoid all the problems that have been fixed should use 3.3 either from the repository or when it is released.

> So, if this (u'...') works in Py 3.3, the problem can
> be considered as "solved".

I am glad you agree and I will close the issue.

Please use python-list for any further discussion or questions.
msg154829 - Author: jmf (jmfauth) Date: 2012-03-03 11:03
Preliminary remark. I'm sending this via gmail, so it
may happen that the glyphs you see are ill-formed or
transformed by Google. Rest assured I'm typing the
"right" glyphs.

No, no and no. This is not a tkinter issue. This
"strange" behaviour, I cannot find a better word,
happens with many libraries, be they Python core libs
or external libs.
To tell you the truth, and despite my experience,
I never succeeded in narrowing down the problem exactly.
In Python 2 sometimes, that is with some pieces
of code / software, it "works" and sometimes it
simply does not. The libs used here are just the
first ones that came to my mind.

-----

wxPython 2.8-ansi build.

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\py\shell.py", line 1242, in writeOut
    self.write(text)
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\py\shell.py", line 1000, in write
    self.AddText(text)
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\stc.py", line 1425, in AddText
    return _stc.StyledTextCtrl_AddText(*args, **kwargs)
  File "c:\python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4-5: character maps to <undefined>

abcéœ€
>>>
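
The UnicodeEncodeError above can be reproduced in Python 3 under one assumption, which the thread does not confirm: that the text reaching the cp1252 encoder was 'abcéœ€' with its last two characters already misread as the control characters U+009C and U+0080. The helper `try_encode` is hypothetical.

```python
def try_encode(text, encoding="cp1252"):
    """Return (encoded_bytes, None) on success, or (None, (start, end)) on failure."""
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as exc:
        # exc.start and exc.end delimit the slice of text that could not be encoded.
        return None, (exc.start, exc.end)

intended = "abcéœ€"           # the six characters as typed
misread = "abcé\x9c\x80"      # ASSUMED state of the string inside the ansi build

print(try_encode(intended))   # cp1252 has bytes for all six characters
print(try_encode(misread))    # fails from index 4 on: cp1252 has no byte for U+009C or U+0080
```

If this is what happened, the failure says nothing about u'...' literals as such. It only shows that the string had been decoded with the wrong codec before it reached the library.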

----

PySide, passing "unicode" to a text widget.

Passing u'abcéœ€' works.
Passing unicode('abcéœ€', 'cp1252') works.
Passing 'abcé€œ' doesn't !  'œ€' are missing.

---

My interactive wx interpreter using wxPython. Strings
as frame title.

True

ok

Traceback (most recent call last):
  File "<psi last command>", line 1, in <module>
  File "c:\Python27\lib\site-packages\wx-2.8-msw-ansi\wx\_windows.py", line 505, in __init__
    _windows_.Frame_swiginit(self,_windows_.new_Frame(*args, **kwargs))
  File "c:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 5-6: character maps to <undefined>

True

ok

---

And so on with many libs.

You may argue that these libs are guilty.

I may argue that Python is somehow guilty, because it
lets users write non-working code.
And in practically all the cases, the main problem is due
to the usage of unicode literals.

Just to show you that I'm quite comfortable with all this
coding stuff, here are the results from my interactive interpreter.
A special hack, unfortunately non-portable, which works
only with Windows and cp1252.

abcé??
>>> unicode('abcéœ€', sys.stdout.encoding)
abcéœ€
>>> print u'abcéœ€'
abcé??
>>> print unicode('abcéœ€', sys.stdout.encoding)
abcéœ€

As I am aware of this "feature", all my code is
working perfectly. I pay attention to whether
u'...' or unicode(...) is needed.
Unfortunately, this is not generally the case in a lot of
code I see that is supposed to deal with text.

To draw a conclusion.

You are wise enough to understand that, when I'm
saying "Python just does not work", I'm unfortunately
not so far away from reality.

I really, very really, hope all this mess (sorry
for the word) will not reappear in Py 3.3.

Let's wait.

'abcéœ€'
>>> print('abcéœ€')
abcéœ€
>>>

Regards,
Jean-Michel Fauth

PS The u() trick does not help.
msg154834 - Author: Georg Brandl (georg.brandl) (Python committer) Date: 2012-03-03 13:03
I'd like to encourage you to not try this sort of thing out from an interactive interpreter (incidentally, where does "<psi last command>" come from? It doesn't look like Python's REPL).

As David and Terry noted, interactions with such a console, be it Windows' "cmd" or IDLE, have their very own idiosyncrasies and bugs.

That said, in Python 2.x *source files* the following two expressions are identical:

* u'abcœé€'
* unicode('abcœé€', 'encoding the file is in')

Both result in a Unicode string with the six characters/codepoints you mentioned.  There won't be any code that works with one but not the other.
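
A Python 3 analogue of that statement, offered as a sketch only. Python 3 has no unicode() builtin, so `bytes.decode` stands in for it, and the "source file" is a hypothetical two-line module held in memory and encoded in cp1252 with a matching coding declaration.

```python
# A tiny module encoded in cp1252, declaring its own encoding (PEP 263).
source = "# -*- coding: cp1252 -*-\ns = 'abcœé€'\n".encode("cp1252")

# Compile and run the bytes as Python source; the compiler honours the coding line.
namespace = {}
exec(compile(source, "<cp1252 source>", "exec"), namespace)
literal = namespace["s"]

# Decode the same bytes explicitly with the file's encoding.
decoded = b"abc\x9c\xe9\x80".decode("cp1252")

print(literal == decoded)
print([hex(ord(c)) for c in literal])
```

Both routes yield the same six code points, because the compiler decodes the literal with the encoding the file declares. The interactive shells discussed above differ precisely because they have no such declaration to rely on.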

Of course there are libraries that do not handle Unicode strings in general (nothing to do with literals!) correctly, but as you yourself said, that is a problem with the libraries.

Lastly, please read PEP 414 if you are not completely sure what it is proposing.  You will see that it merely affects the available syntax for Unicode literals and allows the "u" again.
msg154907 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-03-04 20:09
I propose to close this issue as invalid (although out-of-date might be fine as well). Jean-Michel is apparently unable to describe what issue *precisely* he wants to see fixed, rather than just complaining that open source is a disaster. I don't think we can do anything about open source being a disaster, and I'm not able to reproduce that perception.

Jean-Michel: please try to use this bug tracker in the way it is intended, i.e. report one bug at a time, following this structure:
- this is what I did
- this is what happened
- this is what should have happened instead
History
Date User Action Args
2022-04-11 14:57:27 admin set github: 58384
2012-03-04 20:09:55 loewis set messages: + msg154907
2012-03-03 13:03:38 georg.brandl set messages: + msg154834
2012-03-03 11:03:39 jmfauth set nosy: + jmfauth
messages: + msg154829
2012-03-02 23:21:06 terry.reedy set status: open -> closed
resolution: out of date
messages: + msg154808
2012-03-02 22:10:57 Jean-Michel.Fauth set messages: + msg154807
2012-03-02 22:01:35 terry.reedy set nosy: + terry.reedy
messages: + msg154806
2012-03-02 20:41:44 r.david.murray set messages: + msg154798
title: Fix unicode literals (for PEP 414) -> Fix unicode literals
2012-03-02 20:40:59 r.david.murray set messages: + msg154797
2012-03-02 20:35:50 Jean-Michel.Fauth set messages: + msg154796
2012-03-02 20:32:27 r.david.murray set nosy: + r.david.murray
messages: + msg154794
2012-03-02 20:13:10 georg.brandl set nosy: + georg.brandl
messages: + msg154793
2012-03-02 20:08:41 Jean-Michel.Fauth set messages: + msg154792
2012-03-02 17:06:56 loewis set nosy: + loewis
messages: + msg154782
2012-03-02 13:58:43 ezio.melotti set nosy: + ezio.melotti
type: behavior
components: + Unicode, - None
2012-03-02 13:53:25 benjamin.peterson set nosy: + benjamin.peterson
messages: + msg154765
2012-03-02 12:36:56 Jean-Michel.Fauth create