
classification
Title: marshal roundtripping for unicode
Type:
Stage:
Components: Unicode
Versions: Python 2.5
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: Carl.Friedrich.Bolz, gvanrossum, lemburg, loewis
Priority: normal
Keywords:

Created on 2007-11-13 10:53 by Carl.Friedrich.Bolz, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg57444 - Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz) Date: 2007-11-13 10:53
Marshal does not round-trip Unicode surrogate pairs on wide Unicode builds:

marshal.loads(marshal.dumps(u"\ud800\udc00")) == u'\U00010000'

This is very annoying, because the length of unicode constants differs
between the first run of a module and subsequent runs
(because the later runs load the pyc file).
msg57462 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2007-11-13 18:28
I think this is unavoidable. Depending on whether you happen to be using
a narrow or wide unicode build of Python, \Uxxxxxxxx may be turned into
a pair of surrogates anyway. It's not just marshal that's not
roundtripping; the utf-8 codec has the same issue (and so does the
utf-16 codec I presume). You will have to code around it. I think that
the alternative would be more painful in other circumstances.
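
A minimal sketch of the codec effect mentioned above, written for Python 3.4+ rather than the Python 2.5 build discussed in this issue (an assumption made so the snippet runs on a current interpreter). It uses the UTF-16 codec with the "surrogatepass" error handler as a stand-in for the old behaviour: two separately stored surrogates come back as a single code point, so the length changes across the round trip.

```python
# Two lone surrogate code points, as in the report (a str of length 2).
pair = "\ud800\udc00"

# "surrogatepass" lets the codec emit each surrogate as a raw 16-bit unit.
encoded = pair.encode("utf-16-be", "surrogatepass")

# The decoder sees a valid high/low surrogate sequence and joins it
# into the single code point U+10000.
decoded = encoded.decode("utf-16-be")

print(len(pair), len(decoded))   # 2 1
print(decoded == "\U00010000")   # True
```

The variable names are illustrative only; nothing here is taken from the marshal implementation.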
msg57469 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2007-11-13 19:29
As Guido says: this is by design. The Unicode type doesn't really
support storage of surrogates; so don't use it for that.
msg57571 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2007-11-15 22:59
I think you have a wrong understanding of round-tripping. 

In Unicode it is really irrelevant if you're using a UCS2 surrogate pair
or a UCS4 representation to describe a code point. The length of the
Unicode representation may change, but the meaning won't, so you don't
lose any information.
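
One way to "code around it", sketched under the assumption that only the sequence of code points matters: compare strings after joining any surrogate pairs. The helper name code_points is hypothetical and not part of the standard library, and the snippet is written for Python 3.

```python
def code_points(text):
    """Return the code points of text, joining high/low surrogate pairs."""
    units = [ord(ch) for ch in text]
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 arithmetic for combining a surrogate pair.
            combined = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            result.append(combined)
            i += 2
        else:
            # Ordinary code point, or an unpaired surrogate kept as-is.
            result.append(unit)
            i += 1
    return result


as_pair = "\ud800\udc00"    # surrogate pair representation
as_single = "\U00010000"    # single code point representation

print(len(as_pair) == len(as_single))                   # False
print(code_points(as_pair) == code_points(as_single))   # True
```

Comparing the lists instead of the raw strings makes the check independent of which representation a given build or pyc file happens to produce.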
History
Date                 User                 Action  Args
2022-04-11 14:56:28  admin                set     github: 45774
2007-11-15 22:59:20  lemburg              set     nosy: + lemburg; messages: + msg57571
2007-11-13 19:29:27  loewis               set     status: open -> closed; nosy: + loewis; resolution: wont fix; messages: + msg57469
2007-11-13 18:28:40  gvanrossum           set     nosy: + gvanrossum; messages: + msg57462
2007-11-13 10:53:09  Carl.Friedrich.Bolz  create