classification
Title: encoding unicode objects greater than FFFF
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.0, Python 3.1, Python 2.7, Python 2.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, msaghaei
Priority: normal Keywords:

Created on 2009-10-09 09:12 by msaghaei, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (2)
msg93780 - Author: Mahmoud (msaghaei) Date: 2009-10-09 09:12
Odd behaviour with str.encode, codecs.Codec.encode, or similar
functions when dealing with unicode objects above U+FFFF

with 2.6
>>> u'\u10380'.encode('utf')
'\xe1\x80\xb80'

with 3.x
>>> '\u10380'.encode('utf')
b'\xe1\x80\xb80'

The correct output should be:
\xf0\x90\x8e\x80
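msg93781 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-10-09 09:16
The \u escape takes exactly four hex digits, so u'\u10380' is parsed as
U+1038 followed by the character '0'. To specify code points greater than
U+FFFF you have to use the eight-digit form u'\Uxxxxxxxx':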
>>> x = u'\u10380'
>>> x.encode('utf-8')
'\xe1\x80\xb80'
>>> x[0]
u'\u1038'
>>> x[1]
u'0'
>>> y = u'\U00010380'
>>> y.encode('utf-8')
'\xf0\x90\x8e\x80'
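
The same check can be sketched for Python 3, where string literals are unicode by default and encode() returns bytes. This is only an illustration of the escape rules discussed above; the helper name describe and the variable names are made up for the example.

```python
# Python 3 sketch: \u consumes exactly 4 hex digits, \U exactly 8.
short = '\u10380'      # U+1038 followed by the literal character '0'
full = '\U00010380'    # the single code point U+10380


def describe(s):
    """Return the code points of s as 'U+XXXX' strings (hypothetical helper)."""
    return ['U+%04X' % ord(ch) for ch in s]


short_points = describe(short)
full_points = describe(full)

# Encoding shows the difference: 3 bytes plus an ASCII '0' versus 4 bytes.
short_bytes = short.encode('utf-8')
full_bytes = full.encode('utf-8')

# chr() accepts the code point as an integer, which avoids escapes entirely.
from_int = chr(0x10380)

print(short_points, short_bytes)
print(full_points, full_bytes)
```

Building the character with chr() can be less error-prone than counting hex digits in a literal when the code point comes from data rather than source code.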
History
Date                 User          Action  Args
2022-04-11 14:56:53  admin         set     github: 51339
2009-10-09 09:16:49  ezio.melotti  set     status: open -> closed; nosy: + ezio.melotti; messages: + msg93781; resolution: not a bug; stage: resolved
2009-10-09 09:12:33  msaghaei      create