Message 69321 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	ezio.melotti
Recipients	ezio.melotti, lemburg
Date	2008-07-06.06:58:41
SpamBayes Score	0.00017741352
Marked as misclassified	No
Message-id	<1215327524.42.0.065323704773.issue3297@psf.upfronthosting.co.za>
In-reply-to

Content
Problem: when you have Unicode characters with a code point greater than U+FFFF written directly in the source file (that is, not in the form u'\Uxxxxxxxx' but as normal chars in a u'' string) the interpreter uses surrogate pairs for representing these characters only if the pyc doesn't exist. When the pyc is created it uses a "normal" character (\Uxxxxxxxx instead of the pair \uxxxx\uxxxx). This could lead to an unexpected behavior while comparing Unicode strings or in other situations (even if it could be solved without problems in different ways - using u'\Uxxxxxxx' or u'\uxxx' instead of the characters, encoding them before comparing - there shouldn't be differences between a py and its pyc). Tested on: Ubuntu 8.04 with python 2.4: Uses a surrogate pair. Ubuntu 8.04 with python 2.5: Uses a surrogate pair. Windows XP SP2 with python 2.4: Uses a "normal" character. Steps to reproduce the problem: 1a. download the attached file or create it following the next step; 1b. in a UTF-8-aware console write `print unichr(int('10123', 16))` (or any codepoint >= 10000), copy the printed character (depending on the console it could be a box, two box or a character) in a file with the lines `# -- coding: utf-8 --`, `print 'Result:', u'<paste here the char>' == u'\U00010123'` and `print 'Repr:', repr(u'<paste here the char>'), repr(u'\U00010123')`. Save the file in UTF-8; 2. open a python interpreter and import the file (`import unicodetest`). It should print `Result: False` and `Repr: u'\ud800\udd23' u'\U00010123'` (the character is represented as a surrogate pair). During this step the pyc file is created. 3. from the python interpreter write `reload(unicodetest)`. Now it should print `Result: True` and `Repr: u'\U00010123' u'\U00010123'` (the char is represented as a "normal" character). Any other reload will print True. If you delete the pyc and reload again it will print False. (Instead of using reload() is also possible to create a function and call it from the module when it's loaded and again with unicodetest.func(), the result will be the same.) Expected behavior: The interpreter should use the same representation in both the situation (and print True in both the tests). Another solution could be to change the behavior of == to return True if a normal char is compared with its surrogate pair (if it makes sense). Further informations: The character used for the test is part of the "Unicode Plane 1" (see http://en.wikipedia.org/wiki/Basic_Multilingual_Plane). More information about the surrogate pairs can be found here: http://en.wikipedia.org/wiki/Surrogate_pair#Encoding_of_characters_outside_the_BMP

Problem: when you have Unicode characters with a code point greater than
U+FFFF written directly in the source file (that is, not in the form
u'\Uxxxxxxxx' but as normal chars in a u'' string) the interpreter uses
surrogate pairs for representing these characters only if the pyc
doesn't exist. When the pyc is created it uses a "normal" character
(\Uxxxxxxxx instead of the pair \uxxxx\uxxxx). This could lead to an
unexpected behavior while comparing Unicode strings or in other
situations (even if it could be solved without problems in different
ways - using u'\Uxxxxxxx' or u'\uxxx' instead of the characters,
encoding them before comparing - there shouldn't be differences between
a py and its pyc).

Tested on:
Ubuntu 8.04 with python 2.4: Uses a surrogate pair.
Ubuntu 8.04 with python 2.5: Uses a surrogate pair.
Windows XP SP2 with python 2.4: Uses a "normal" character.

Steps to reproduce the problem:
 1a. download the attached file or create it following the next step;
 1b. in a UTF-8-aware console write `print 
unichr(int('10123', 16))` (or any codepoint >= 10000), copy the printed
character (depending on the console it could be a box, two box or a
character) in a file with the lines `# -*- coding: utf-8 -*-`, `print
'Result:', u'<paste here the char>' == u'\U00010123'` and `print
'Repr:', repr(u'<paste here the char>'), repr(u'\U00010123')`. Save the
file in UTF-8;
 2. open a python interpreter and import the file (`import
unicodetest`). It should print `Result: False` and `Repr:
u'\ud800\udd23' u'\U00010123'` (the character is represented as a
surrogate pair). During this step the pyc file is created.
 3. from the python interpreter write `reload(unicodetest)`. Now it
should print `Result: True` and `Repr: u'\U00010123' u'\U00010123'` (the
char is represented as a "normal" character). Any other reload will
print True. If you delete the pyc and reload again it will print False.

(Instead of using reload() is also possible to create a function and
call it from the module when it's loaded and again with
unicodetest.func(), the result will be the same.)

Expected behavior:
The interpreter should use the same representation in both the situation
(and print True in both the tests).
Another solution could be to change the behavior of == to return True if
a normal char is compared with its surrogate pair (if it makes sense).

Further informations:
The character used for the test is part of the "Unicode Plane 1" (see
http://en.wikipedia.org/wiki/Basic_Multilingual_Plane).
More information about the surrogate pairs can be found here:
http://en.wikipedia.org/wiki/Surrogate_pair#Encoding_of_characters_outside_the_BMP

History
Date	User	Action	Args
2008-07-06 06:58:44	ezio.melotti	set	spambayes_score: 0.000177414 -> 0.00017741352 recipients: + ezio.melotti, lemburg
2008-07-06 06:58:44	ezio.melotti	set	spambayes_score: 0.000177414 -> 0.000177414 messageid: <1215327524.42.0.065323704773.issue3297@psf.upfronthosting.co.za>
2008-07-06 06:58:43	ezio.melotti	link	issue3297 messages
2008-07-06 06:58:41	ezio.melotti	create