Message 10735 - Python tracker

Author	lemburg
Date	2002-10-10.15:30:02

Content

I'm not exactly sure why things work again, but I do
know that I looked into this some time ago. Perhaps I
simply forgot to close the bug or one of the UTF-8
codec overhauls remedied the problem.

Here's what I get with Python 2.3 (UCS4 build):

>>> len(u'\U000d0000')
1
>>> len(u"\udb00\udc00")
2
>>> u'\U000d0000' == u"\udb00\udc00"
False
>>> len(unicode(u"\udb00\udc00".encode('utf-8'), 'utf-8'))
1
>>> len(unicode(u'\U000d0000'.encode('utf-8'), 'utf-8'))
1

This is what I get with Python 2.2.1:

>>> len(u'\U000d0000')
2
>>> len(u"\udb00\udc00")
2
>>> u'\U000d0000' == u"\udb00\udc00"
1
>>> len(unicode(u"\udb00\udc00".encode('utf-8'), 'utf-8'))
2
>>> len(unicode(u'\U000d0000'.encode('utf-8'), 'utf-8'))
2

The two versions still differ in how they treat the literals, but
within each version the UTF-8 codec behaves consistently: both
forms round-trip to the same length.
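
For reference, here is a minimal sketch of why u'\U000d0000' and u"\udb00\udc00" are related at all: the second is the UTF-16 surrogate pair for the first. It is written for Python 3 rather than the 2.x interpreters used above (an assumption made so it runs on a current interpreter), and the helper name to_surrogate_pair is hypothetical, not something from the codec source.

```python
# Sketch (Python 3): map U+D0000 to its UTF-16 surrogate pair and back.
# `to_surrogate_pair` is a hypothetical helper introduced for illustration.

def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0xD0000)
pair = chr(high) + chr(low)              # two separate surrogate code points

# UTF-8 encodes the single code point U+D0000 as one 4-byte sequence.
utf8 = "\U000d0000".encode("utf-8")

# Passing the pair through UTF-16 joins it back into one code point.
recombined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

print(hex(high), hex(low), len(pair), utf8, len(recombined))
```

Note that Python 3 refuses to encode lone surrogates to UTF-8 unless the "surrogatepass" error handler is given, so the 2.x transcripts above cannot be replayed there verbatim.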

History
Date	User	Action	Args
2007-08-23 14:01:15	admin	link	issue554916 messages
2007-08-23 14:01:15	admin	create