classification
Title: encode(..., 'xmlcharrefreplace') produces entities for surrogate pairs
Type: behavior Stage: resolved
Components: Library (Lib), Unicode Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, lemburg, python-dev, serhiy.storchaka, vstinner, wiml
Priority: normal Keywords: patch

Created on 2012-09-05 18:54 by wiml, last changed 2013-08-06 13:58 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
issue15866.diff ezio.melotti, 2012-09-06 09:53 Patch against 3.2.
issue15866_2.patch serhiy.storchaka, 2013-03-11 12:54 Patch against 3.2. review
Messages (11)
msg169886 - (view) Author: Wim (wiml) Date: 2012-09-05 18:54
Encoding a (well-formed) Unicode string containing a non-BMP character, using the xmlcharrefreplace error handler, will produce two XML entities for surrogate codepoints instead of one entity for the actual character.

Here's a transcript (Python 2.7.3, x86_64):


  >>> b = '\xf0\x9f\x92\x9d'
  >>> u = b.decode('utf8')
  >>> u
  u'\U0001f49d'
  >>> u.encode('ascii', errors='xmlcharrefreplace')
  '��'
  >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace')
  '��'
  >>> list(u)
  [u'\ud83d', u'\udc9d']
  >>> u.encode('utf8', errors='xmlcharrefreplace')
  '\xf0\x9f\x92\x9d'

The utf8 bytestring is correctly decoded, and the print representation shows one single Unicode character. Encoding using xmlcharrefreplace produces two XML entities, which is wrong[1]: a single non-BMP character should be represented in XML as a single entity reference, in this case presumably '💝'.

As the last two lines show, I'm using a narrow build (so the unicode strings are represented internally in UTF-16, I guess). Converting the string back to utf8 does the right thing, and emits a single UTF8 sequence representing the supplementary-plane codepoint.

(FWIW, the backslashreplace error handler also emits a surrogate pair, but I don't know if there is a complete specification for what that handler does, so it's possible that it's not wrong.)

[1] http://www.w3.org/International/questions/qa-escapes#bytheway
msg169909 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-09-06 09:53
Attached patch against 3.2 seems to fix the problem.
msg169910 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-09-06 10:26
Note that there's similar code in  charmap_encoding_error, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, and PyUnicode_EncodeDecimal, however I'm not sure how to reach these paths.
msg169911 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-09-06 11:06
Thanks to the PEP 393, this issue is already fixed in Python 3.3.

$ ./python 
Python 3.3.0rc1+ (default:ba2c1def3710+, Sep  3 2012, 23:20:25) 
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux
>>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace')
b'💝'
msg183446 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-04 13:20
I prefer a little different (simpler for me) form:

                for (p = collstart; p < collend;) {
                    Py_UCS4 ch = *p++;
                    if ((0xD800 <= ch && ch <= 0xDBFF) &&
                        (p < collend) &&
                        (0xDC00 <= *p && *p <= 0xDFFF)) {
                        ch = ((((ch & 0x03FF) << 10) |
                               ((Py_UCS4)*p++ & 0x03FF)) + 0x10000);
                    }
                    str += sprintf(str, "&#%d;", (int)ch);
                }

And please look at the loop above ("determine replacement size"). It should be corrected too. It will be simpler to use a buffer with static size (``char buffer[2+29+1+1];``) as in charmap encoder. Perhaps charmap encoder should be fixed too (and common code extracted to separate function).

I doubt about '\ud83d\udc9d' on wide build. Is it right to encode it as b'&#128157;' and not as b'&#55357;&#56477;'? This will be compatible with narrow build but will break compatibility with 3.3+. What is less evil?
msg183482 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-04 18:10
> I doubt about '\ud83d\udc9d' on wide build. Is it right to encode it as 
> b'&#128157;' and not as b'&#55357;&#56477;'?

I don't think so.  IIRC surrogates are invalid in UTF-32, and certainly shouldn't be recombined.

> This will be compatible with narrow build but will break compatibility
> with 3.3+. What is less evil?

I think it's better to be compatible with 3.3+.  This is anyway a rather obscure corner case.

Do you want to propose a new patch?
msg183524 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-05 13:33
> I think it's better to be compatible with 3.3+.  This is anyway a rather obscure corner case.

Well, we should not introduce new divergence between 3.2 wide build and 3.3.

> Do you want to propose a new patch?

I will do it.
msg183956 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-11 12:54
Here is a patch which fixes xmlcharrefreplace error handling in other places.

Unfortunately multibyte asian encoders are broken yet. I'll open a separate issue for this.
msg186921 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-14 14:28
Should we really invest time to fix bugs related to astral (non-BMP) characters with rare codecs and error handlers (CJK codecs, xmlcharrefreplace error handler)? Python 3.3 is released and has a much better support of astral characters (in many places). I don't know for CJK codecs: Python 3.3 still uses the legacy Unicode API for CJK codecs and so depend on the wchar_t type (which is 16 bits on Windows). I just fixed Python 3.4 to use the new Unicode API (PEP 393), which always support astral characters support and don't depend on the size of the wchar_t type.

I'm not against fixing Python 2.7, I'm just not interested.
msg186934 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-14 17:37
I tend to agree with Victor: if you want to fix 2.7 go ahead, but if that's too much work it's OK with me to close this issue.
msg194546 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-08-06 13:57
New changeset 719ee60fc5e2 by Serhiy Storchaka in branch '2.7':
Issue #15866: The xmlcharrefreplace error handler no more produces two XML
http://hg.python.org/cpython/rev/719ee60fc5e2
History
Date User Action Args
2013-08-06 13:58:20serhiy.storchakasetstatus: open -> closed
assignee: serhiy.storchaka
resolution: fixed
stage: patch review -> resolved
2013-08-06 13:57:03python-devsetnosy: + python-dev
messages: + msg194546
2013-04-14 17:37:30ezio.melottisetmessages: + msg186934
2013-04-14 14:28:38vstinnersetmessages: + msg186921
2013-04-13 17:58:08serhiy.storchakasetversions: - Python 3.2
2013-03-11 12:54:58serhiy.storchakasetfiles: + issue15866_2.patch

messages: + msg183956
2013-03-05 13:33:53serhiy.storchakasetmessages: + msg183524
2013-03-04 18:10:44ezio.melottisetmessages: + msg183482
2013-03-04 13:20:33serhiy.storchakasetmessages: + msg183446
2013-03-02 13:43:03ezio.melottisetnosy: + serhiy.storchaka
2012-09-06 11:06:50vstinnersetnosy: + vstinner
messages: + msg169911
2012-09-06 10:26:14ezio.melottisetnosy: + lemburg
messages: + msg169910
2012-09-06 09:53:58ezio.melottisetfiles: + issue15866.diff
versions: + Python 3.2
messages: + msg169909

keywords: + patch
stage: patch review
2012-09-05 18:54:33wimlcreate