Message 93486 - Python tracker

Author	ezio.melotti
Recipients	ArcRiley, ezio.melotti, loewis
Date	2009-10-03.10:21:57
Message-id	<1254565321.02.0.79797717449.issue7045@psf.upfronthosting.co.za>
In-reply-to

Content
I can't reproduce it either on Ubuntu 9.04 32-bit. I tried both from the
terminal and from a file, using Py3.2a0.

As Martin said, the fact that in narrow builds of Python the codepoints
outside the BMP are represented with a surrogate pair (two surrogates)
is a known "issue". This is how UTF-16 works, even if it has some
problematic side-effects.
In your example 'line[0]' is not equal to 'first' because line[0] is only
the first (high) surrogate of the pair, while 'first' holds the whole
character SHAVIAN LETTER TOT (U+10451), i.e. both surrogates.
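
To make the surrogate arithmetic concrete, here is a minimal sketch of how
UTF-16 splits a codepoint outside the BMP into two surrogates. The helper
name to_surrogate_pair is hypothetical (it is not part of Python); it only
illustrates the standard UTF-16 rule and runs on any build:

```python
def to_surrogate_pair(code_point):
    """Split a codepoint above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("codepoint is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# SHAVIAN LETTER TOT, the first character of 'line' in the report
high, low = to_surrogate_pair(0x10451)
print(hex(high), hex(low))  # 0xd801 0xdc51
```

The high surrogate 0xd801 is the same '\ud801' that shows up in the
UnicodeEncodeError for line[0].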

Regarding the traceback you pasted in the first post, have you used
print('𐑑') or print(line[0])?

This is what I get using line[0]:
>>> line = '𐑑𐑧𐑕𐑑𐑦𐑙'
>>> first = '𐑑'
>>> print(line[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in
position 0: surrogates not allowed

In this case you are getting an error because a lone surrogate is not a
valid character on its own and can't be encoded. If you use line[0:2]
instead it works, because the slice takes both surrogates of the pair:

>>> print(line[0:2])
𐑑
>>> first == line[0:2]
True
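
The same behaviour can be checked without depending on how the interpreter
was built, by writing the surrogates explicitly. This is only a sketch: the
variable names are made up for illustration, and the 'surrogatepass' error
handler with the UTF-16 codecs assumes Python 3.4 or later:

```python
lone = '\ud801'            # high surrogate only, like line[0] on a narrow build
pair = '\ud801\udc51'      # both surrogates, like line[0:2] on a narrow build

# A lone surrogate cannot be encoded as UTF-8.
try:
    lone.encode('utf-8')
    lone_is_encodable = True
except UnicodeEncodeError:
    lone_is_encodable = False

# Passing the two surrogates through UTF-16 joins them into one character.
joined = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')

# The complete character encodes to four UTF-8 bytes.
utf8_bytes = '\U00010451'.encode('utf-8')

print(lone_is_encodable, joined == '\U00010451', utf8_bytes)
```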

If you really got that error with print('𐑛'), then #3297 could be related.

Can you also try this and see what it prints?
>>> import sys
>>> sys.maxunicode
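
For reference, this is roughly how the value can be interpreted. The helper
is_narrow_build is a hypothetical name used only for this sketch: a narrow
build reports 65535 (0xFFFF) and a wide build reports 1114111 (0x10FFFF).
Note that from Python 3.3 onwards sys.maxunicode is always 1114111, so the
distinction only applies to older interpreters:

```python
import sys


def is_narrow_build(maxunicode=sys.maxunicode):
    """Return True if the given sys.maxunicode value indicates a narrow build."""
    return maxunicode == 0xFFFF


print(sys.maxunicode, "narrow" if is_narrow_build() else "wide")
```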

History
Date	User	Action	Args
2009-10-03 10:22:01	ezio.melotti	set	recipients: + ezio.melotti, loewis, ArcRiley
2009-10-03 10:22:01	ezio.melotti	set	messageid: <1254565321.02.0.79797717449.issue7045@psf.upfronthosting.co.za>
2009-10-03 10:21:58	ezio.melotti	link	issue7045 messages
2009-10-03 10:21:57	ezio.melotti	create