Author lemburg
Recipients Rhamphoryncus, ezio.melotti, lemburg
Date 2008-07-12.09:37:04
Adam, I do know what I'm talking about: I was the lead designer of the
Unicode integration you find in Python and implemented most of it.

What you see as repr() of a Unicode object is the result of applying a
codec to the internal representation. Please don't confuse the output of
the codec ("unicode-escape") with the internal representation.
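To make the distinction concrete, here is a minimal sketch in C. It is not CPython's actual "unicode-escape" codec: the helper name escape_code_point and its simplified rules (no special handling of newlines or quotes) are assumptions made for illustration. It shows that one stored value can be rendered as a multi-character escape without the stored value changing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of CPython): write an ASCII-only,
   escaped rendering of a single code point into buf.
   Returns the number of characters produced, as snprintf does. */
static int escape_code_point(uint32_t cp, char *buf, size_t size)
{
    assert(buf != NULL);
    /* Printable ASCII other than the backslash is copied as-is. */
    if (cp >= 0x20 && cp < 0x7f && cp != '\\')
        return snprintf(buf, size, "%c", (char)cp);
    /* Everything else becomes an escape whose width depends on the value. */
    if (cp < 0x100)
        return snprintf(buf, size, "\\x%02x", (unsigned)cp);
    if (cp < 0x10000)
        return snprintf(buf, size, "\\u%04x", (unsigned)cp);
    return snprintf(buf, size, "\\U%08x", (unsigned)cp);
}
```

Calling escape_code_point(0x10000, buf, sizeof buf) fills buf with a ten-character escape beginning with \U, yet the input is still a single 32-bit value. The escaped text is the codec's output, not the way the value is stored.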

That said, Ezio did uncover a bug and we need to find the cause. It's
likely caused by the fact that the UTF-8 codec does not recombine
surrogates on UCS4 builds. See this comment in the codec implementation:

        case 3:
            if ((s[1] & 0xc0) != 0x80 ||
                (s[2] & 0xc0) != 0x80) {
                errmsg = "invalid data";
                startinpos = s-starts;
                endinpos = startinpos+3;
                goto utf8Error;
            }
            ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
            if (ch < 0x0800) {
                /* Note: UTF-8 encodings of surrogates are considered
                   legal UTF-8 sequences;

                   XXX For wide builds (UCS-4) we should probably try
                       to recombine the surrogates into a single code
                       unit.
                */
                errmsg = "illegal encoding";
                startinpos = s-starts;
                endinpos = startinpos+3;
                goto utf8Error;
            }
            else
                *p++ = (Py_UNICODE)ch;
            break;
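
For illustration only, the recombination that the XXX note hints at could look roughly like the sketch below. This is not the codec's code and not a proposed patch: decode3 and recombine_pair are hypothetical helpers, and the sketch assumes the two surrogates arrive as two consecutive 3-byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one 3-byte UTF-8 sequence starting at s.
   Returns the decoded value, or -1 if the lead or continuation bytes
   are invalid or the encoding is overlong (value below 0x0800). */
static int32_t decode3(const unsigned char *s)
{
    assert(s != NULL);
    if ((s[0] & 0xf0) != 0xe0 ||
        (s[1] & 0xc0) != 0x80 ||
        (s[2] & 0xc0) != 0x80)
        return -1;
    int32_t ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
    return ch < 0x0800 ? -1 : ch;
}

/* Hypothetical helper: if the 6 bytes at s encode a high surrogate
   followed by a low surrogate, return the single code point they
   represent; otherwise return -1. */
static int32_t recombine_pair(const unsigned char *s, size_t len)
{
    if (len < 6)
        return -1;
    int32_t hi = decode3(s);
    int32_t lo = decode3(s + 3);
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return -1;
    /* Standard UTF-16 surrogate arithmetic. */
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

With the bytes ED A0 80 ED B0 80 (the surrogates U+D800 and U+DC00 each encoded as three bytes), recombine_pair returns 0x10000, so a wide build would store one code point instead of two surrogate code units.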
Date                 User     Action  Args
2008-07-12 09:37:07  lemburg  set     recipients: + lemburg, Rhamphoryncus, ezio.melotti
2008-07-12 09:37:06  lemburg  link    issue3297 messages
2008-07-12 09:37:04  lemburg  create