This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author ezio.melotti
Recipients ezio.melotti, lemburg
Date 2008-07-06.06:58:41
SpamBayes Score 0.00017741352
Marked as misclassified No
Message-id <1215327524.42.0.065323704773.issue3297@psf.upfronthosting.co.za>
In-reply-to
Content
Problem: when you have Unicode characters with a code point greater than
U+FFFF written directly in the source file (that is, not in the form
u'\Uxxxxxxxx' but as normal chars in a u'' string) the interpreter uses
surrogate pairs for representing these characters only if the pyc
doesn't exist. When the pyc is created it uses a "normal" character
(\Uxxxxxxxx instead of the pair \uxxxx\uxxxx). This could lead to an
unexpected behavior while comparing Unicode strings or in other
situations (even if it could be solved without problems in different
ways - using u'\Uxxxxxxx' or u'\uxxx' instead of the characters,
encoding them before comparing - there shouldn't be differences between
a py and its pyc).

Tested on:
Ubuntu 8.04 with python 2.4: Uses a surrogate pair.
Ubuntu 8.04 with python 2.5: Uses a surrogate pair.
Windows XP SP2 with python 2.4: Uses a "normal" character.

Steps to reproduce the problem:
 1a. download the attached file or create it following the next step;
 1b. in a UTF-8-aware console write `print 
unichr(int('10123', 16))` (or any codepoint >= 10000), copy the printed
character (depending on the console it could be a box, two box or a
character) in a file with the lines `# -*- coding: utf-8 -*-`, `print
'Result:', u'<paste here the char>' == u'\U00010123'` and `print
'Repr:', repr(u'<paste here the char>'), repr(u'\U00010123')`. Save the
file in UTF-8;
 2. open a python interpreter and import the file (`import
unicodetest`). It should print `Result: False` and `Repr:
u'\ud800\udd23' u'\U00010123'` (the character is represented as a
surrogate pair). During this step the pyc file is created.
 3. from the python interpreter write `reload(unicodetest)`. Now it
should print `Result: True` and `Repr: u'\U00010123' u'\U00010123'` (the
char is represented as a "normal" character). Any other reload will
print True. If you delete the pyc and reload again it will print False.

(Instead of using reload() is also possible to create a function and
call it from the module when it's loaded and again with
unicodetest.func(), the result will be the same.)

Expected behavior:
The interpreter should use the same representation in both the situation
(and print True in both the tests).
Another solution could be to change the behavior of == to return True if
a normal char is compared with its surrogate pair (if it makes sense).

Further informations:
The character used for the test is part of the "Unicode Plane 1" (see
http://en.wikipedia.org/wiki/Basic_Multilingual_Plane).
More information about the surrogate pairs can be found here:
http://en.wikipedia.org/wiki/Surrogate_pair#Encoding_of_characters_outside_the_BMP
History
Date User Action Args
2008-07-06 06:58:44ezio.melottisetspambayes_score: 0.000177414 -> 0.00017741352
recipients: + ezio.melotti, lemburg
2008-07-06 06:58:44ezio.melottisetspambayes_score: 0.000177414 -> 0.000177414
messageid: <1215327524.42.0.065323704773.issue3297@psf.upfronthosting.co.za>
2008-07-06 06:58:43ezio.melottilinkissue3297 messages
2008-07-06 06:58:41ezio.melotticreate