Issue 15866: encode(..., 'xmlcharrefreplace') produces entities for surrogate pairs

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/60070

classification

Title:	encode(..., 'xmlcharrefreplace') produces entities for surrogate pairs
Type:	behavior	Stage:	resolved
Components:	Library (Lib), Unicode	Versions:	Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	ezio.melotti, lemburg, python-dev, serhiy.storchaka, vstinner, wiml
Priority:	normal	Keywords:	patch

Created on 2012-09-05 18:54 by wiml, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue15866.diff	ezio.melotti, 2012-09-06 09:53	Patch against 3.2.
issue15866_2.patch	serhiy.storchaka, 2013-03-11 12:54	Patch against 3.2.	review

Messages (11)
msg169886 - (view)	Author: Wim (wiml)	Date: 2012-09-05 18:54
Encoding a (well-formed) Unicode string containing a non-BMP character, using the xmlcharrefreplace error handler, will produce two XML entities for surrogate codepoints instead of one entity for the actual character. Here's a transcript (Python 2.7.3, x86_64): >>> b = '\xf0\x9f\x92\x9d' >>> u = b.decode('utf8') >>> u u'\U0001f49d' >>> u.encode('ascii', errors='xmlcharrefreplace') '&#55357;&#56477;' >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace') '&#55357;&#56477;' >>> list(u) [u'\ud83d', u'\udc9d'] >>> u.encode('utf8', errors='xmlcharrefreplace') '\xf0\x9f\x92\x9d' The utf8 bytestring is correctly decoded, and the print representation shows one single Unicode character. Encoding using xmlcharrefreplace produces two XML entities, which is wrong[1]: a single non-BMP character should be represented in XML as a single entity reference, in this case presumably '💝'. As the last two lines show, I'm using a narrow build (so the unicode strings are represented internally in UTF-16, I guess). Converting the string back to utf8 does the right thing, and emits a single UTF8 sequence representing the supplementary-plane codepoint. (FWIW, the backslashreplace error handler also emits a surrogate pair, but I don't know if there is a complete specification for what that handler does, so it's possible that it's not wrong.) [1] http://www.w3.org/International/questions/qa-escapes#bytheway
msg169909 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-09-06 09:53
Attached patch against 3.2 seems to fix the problem.
msg169910 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-09-06 10:26
Note that there's similar code in charmap_encoding_error, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, and PyUnicode_EncodeDecimal, however I'm not sure how to reach these paths.
msg169911 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-09-06 11:06
Thanks to the PEP 393, this issue is already fixed in Python 3.3. $ ./python Python 3.3.0rc1+ (default:ba2c1def3710+, Sep 3 2012, 23:20:25) [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace') b'💝'
msg183446 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-03-04 13:20
I prefer a little different (simpler for me) form: for (p = collstart; p < collend;) { Py_UCS4 ch = p++; if ((0xD800 <= ch && ch <= 0xDBFF) && (p < collend) && (0xDC00 <= p && p <= 0xDFFF)) { ch = ((((ch & 0x03FF) << 10) \| ((Py_UCS4)p++ & 0x03FF)) + 0x10000); } str += sprintf(str, "&#%d;", (int)ch); } And please look at the loop above ("determine replacement size"). It should be corrected too. It will be simpler to use a buffer with static size (``char buffer[2+29+1+1];``) as in charmap encoder. Perhaps charmap encoder should be fixed too (and common code extracted to separate function). I doubt about '\ud83d\udc9d' on wide build. Is it right to encode it as b'💝' and not as b'&#55357;&#56477;'? This will be compatible with narrow build but will break compatibility with 3.3+. What is less evil?
msg183482 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-04 18:10
> I doubt about '\ud83d\udc9d' on wide build. Is it right to encode it as > b'💝' and not as b'&#55357;&#56477;'? I don't think so. IIRC surrogates are invalid in UTF-32, and certainly shouldn't be recombined. > This will be compatible with narrow build but will break compatibility > with 3.3+. What is less evil? I think it's better to be compatible with 3.3+. This is anyway a rather obscure corner case. Do you want to propose a new patch?
msg183524 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-03-05 13:33
> I think it's better to be compatible with 3.3+. This is anyway a rather obscure corner case. Well, we should not introduce new divergence between 3.2 wide build and 3.3. > Do you want to propose a new patch? I will do it.
msg183956 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-03-11 12:54
Here is a patch which fixes xmlcharrefreplace error handling in other places. Unfortunately multibyte asian encoders are broken yet. I'll open a separate issue for this.
msg186921 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-04-14 14:28
Should we really invest time to fix bugs related to astral (non-BMP) characters with rare codecs and error handlers (CJK codecs, xmlcharrefreplace error handler)? Python 3.3 is released and has a much better support of astral characters (in many places). I don't know for CJK codecs: Python 3.3 still uses the legacy Unicode API for CJK codecs and so depend on the wchar_t type (which is 16 bits on Windows). I just fixed Python 3.4 to use the new Unicode API (PEP 393), which always support astral characters support and don't depend on the size of the wchar_t type. I'm not against fixing Python 2.7, I'm just not interested.
msg186934 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-04-14 17:37
I tend to agree with Victor: if you want to fix 2.7 go ahead, but if that's too much work it's OK with me to close this issue.
msg194546 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-08-06 13:57
New changeset 719ee60fc5e2 by Serhiy Storchaka in branch '2.7': Issue #15866: The xmlcharrefreplace error handler no more produces two XML http://hg.python.org/cpython/rev/719ee60fc5e2

History
Date	User	Action	Args
2022-04-11 14:57:35	admin	set	github: 60070
2013-08-06 13:58:20	serhiy.storchaka	set	status: open -> closed assignee: serhiy.storchaka resolution: fixed stage: patch review -> resolved
2013-08-06 13:57:03	python-dev	set	nosy: + python-dev messages: + msg194546
2013-04-14 17:37:30	ezio.melotti	set	messages: + msg186934
2013-04-14 14:28:38	vstinner	set	messages: + msg186921
2013-04-13 17:58:08	serhiy.storchaka	set	versions: - Python 3.2
2013-03-11 12:54:58	serhiy.storchaka	set	files: + issue15866_2.patch messages: + msg183956
2013-03-05 13:33:53	serhiy.storchaka	set	messages: + msg183524
2013-03-04 18:10:44	ezio.melotti	set	messages: + msg183482
2013-03-04 13:20:33	serhiy.storchaka	set	messages: + msg183446
2013-03-02 13:43:03	ezio.melotti	set	nosy: + serhiy.storchaka
2012-09-06 11:06:50	vstinner	set	nosy: + vstinner messages: + msg169911
2012-09-06 10:26:14	ezio.melotti	set	nosy: + lemburg messages: + msg169910
2012-09-06 09:53:58	ezio.melotti	set	files: + issue15866.diff versions: + Python 3.2 messages: + msg169909 keywords: + patch stage: patch review
2012-09-05 18:54:33	wiml	create