Issue 3672: Ill-formed surrogates not treated as errors during encoding/decoding

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47922

classification

Title:	Ill-formed surrogates not treated as errors during encoding/decoding
Type:	behavior	Stage:	test needed
Components:	Unicode	Versions:	Python 3.1

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	loewis	Nosy List:	Rhamphoryncus, benjamin.peterson, ezio.melotti, hippietrail, jwilk, lemburg, loewis, pitrou, python-dev
Priority:	release blocker	Keywords:	patch

Created on 2008-08-24 21:56 by Rhamphoryncus, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
surrogates.diff	loewis, 2009-05-02 09:48

Messages (17)
msg71889 - (view)	Author: Adam Olsen (Rhamphoryncus)	Date: 2008-08-24 21:56
The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or UTF-32 should be treated as errors. Lone surrogates in UTF-16 should probably be treated as errors too (but only during encoding/decoding; unicode objects on UTF-16 builds should allow them to be created through slicing).

http://unicode.org/faq/utf_bom.html#30
http://unicode.org/faq/utf_bom.html#42
http://unicode.org/faq/utf_bom.html#40

Lone surrogate in UTF-8 (effectively CESU-8):
>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'

Surrogate pair in UTF-8:
>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'

On a UTF-32 build, encoding a surrogate pair with UTF-16, then decoding again will produce the proper non-surrogate scalar value. This has security implications, although rare as characters outside the BMP are rare:
>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'

Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails (correctly), but encoding one does not:
>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8'

I have gotten a report of a user decoding bad data using x.decode('utf-8', 'replace'), then getting an error from Gtk+ when the ill-formed surrogates reached it.

Fixing this would cause issue 3297 to blow up loudly, rather than silently.
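For readers on Python 3, the UTF-8 cases in this report can be replayed roughly as follows. This is a minimal sketch, assuming a CPython 3 release that includes the fix discussed in this issue (3.1 or later); the helper name `rejects` is hypothetical and exists only for this illustration.

```python
# Sketch: replay the report's UTF-8 cases on Python 3 (assumes 3.1 or later).

def rejects(func, exc_type):
    """Return True if calling func() raises exc_type (hypothetical helper)."""
    try:
        func()
    except exc_type:
        return True
    return False

lone = b"\xed\xa0\x81"              # U+D801 written as three UTF-8-style bytes
pair = b"\xed\xa0\x81\xed\xb0\x80"  # U+D801 U+DC00, the CESU-8 form of U+10400

# Strict decoding and encoding are expected to refuse surrogates.
decode_lone_rejected = rejects(lambda: lone.decode("utf-8"), UnicodeDecodeError)
decode_pair_rejected = rejects(lambda: pair.decode("utf-8"), UnicodeDecodeError)
encode_lone_rejected = rejects(lambda: "\ud801".encode("utf-8"), UnicodeEncodeError)

# With "replace", the ill-formed bytes become U+FFFD rather than surrogates,
# so no surrogate code point should reach downstream consumers.
replaced = lone.decode("utf-8", "replace")
has_surrogate = any(0xD800 <= ord(ch) <= 0xDFFF for ch in replaced)

# A genuine supplementary character still round-trips through UTF-8.
round_trip = "\U00010400".encode("utf-8").decode("utf-8")
```

The Python 2 transcripts in the report are left as written; the sketch only covers the UTF-8 codec, since the UTF-16 and UTF-32 behaviour depends on the Python version and build.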
msg86736 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-04-28 13:13
We could fix it for 3.1, and perhaps leave 2.7 unchanged if some people rely on this (for whatever reason).
msg86817 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-04-29 16:54
While it's probably ok to fix the codecs, there's an issue which makes this difficult at least for the utf-8 codec: The marshal module uses utf-8 to write Unicode objects and these can and need to be able to store the full range of supported UCS2/UCS4 code points, including lone surrogates. If the utf-8 codec were changed to raise an error for these, marshal would no longer be able to write/read Unicode objects. It is likely that other existing Python code (outside the std lib) also relies on this ability. Changing this would only be possible in 3.1. The marshal module would then also have to be changed to use a different encoding which does support encoding lone surrogates. See issue 3297 for a discussion of UTF-8/16 vs. UCS2/4, the implications, motivations, etc.
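The marshal constraint described above can be checked with a short round trip. This is a sketch, assuming a current CPython 3 where marshal still accepts strings containing lone surrogates; the sample string is made up for illustration.

```python
import marshal

# Assumption: a str holding a lone surrogate must survive marshal unchanged,
# even though the strict utf-8 codec refuses to encode it.
text = "abc\ud801def"

blob = marshal.dumps(text)      # serialise the string
restored = marshal.loads(blob)  # read it back

try:
    text.encode("utf-8")        # strict UTF-8 rejects the lone surrogate
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False
```

If marshal used the strict codec directly, `marshal.dumps` would fail for this input, which is the compatibility concern raised in this message.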
msg86824 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-04-29 20:39
I think we could preserve the marshal format with yet another error handler - one that emits half surrogates into their intuitive form.
msg86839 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-04-30 08:26
On 2009-04-29 22:39, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> I think we could preserve the marshal format with yet another error
> handler - one that emits half surrogates into their intuitive form.

That's a good idea. We could have an error handler which then lets the codec accept lone surrogates for utf-8, or just add a new codec which does this and use that for marshal.

Still, this can only go into 3.1.
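The error-handler idea can be sketched in pure Python with `codecs.register_error`. This is only an illustration of the mechanism, not the patch attached to this issue: the handler name "demo_surrogates" and the function `pass_surrogates` are hypothetical, and the sketch assumes Python 3, where an encode error handler may return bytes.

```python
import codecs

def pass_surrogates(exc):
    """Hypothetical handler: emit lone surrogates as 3-byte UTF-8-style sequences."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch handles encoding only
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if not 0xD800 <= cp <= 0xDFFF:
            raise exc  # not a surrogate: keep the original error
        # Generic 3-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
        out.append(0xE0 | (cp >> 12))
        out.append(0x80 | ((cp >> 6) & 0x3F))
        out.append(0x80 | (cp & 0x3F))
    # Return the replacement bytes and the position at which to resume encoding.
    return bytes(out), exc.end

codecs.register_error("demo_surrogates", pass_surrogates)

encoded = "a\ud801b".encode("utf-8", "demo_surrogates")
```

Keeping this behaviour in an opt-in handler lets the default ("strict") codec reject surrogates while callers such as marshal request the permissive form explicitly.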
msg86873 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-05-01 09:13
Here is a patch that implements this proposed approach. It introduces a "surrogates" error handler, useful only for the utf-8 codec. If this is accepted, the implementation of PEP 383 can be simplified significantly, essentially removing the need for a separate utf-8b codec (as that could be done in the error handler, as for the other codecs).
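The patch calls the handler "surrogates"; in released CPython 3 the corresponding built-in handler is, to my knowledge, registered as "surrogatepass". A minimal usage sketch under that assumption:

```python
import codecs

lone = "\ud801"

# Opt in to the permissive behaviour for a single call.
encoded = lone.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")

# The handler is looked up by name, like any other registered error handler.
handler = codecs.lookup_error("surrogatepass")
```

Here `encoded` carries the same three bytes shown in the original report for U+D801, and decoding with the same handler restores the lone surrogate.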
msg86874 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-05-01 09:21
rietveld: http://codereview.appspot.com/52081
msg86896 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-05-01 19:48
Fixed indexing error.
msg86913 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2009-05-01 21:30
http://codereview.appspot.com/52081/diff/1/5
File Doc/library/codecs.rst (right):

http://codereview.appspot.com/52081/diff/1/5#newcode326
Line 326: In addition, the following error handlers are specific to only selected
"In addition, the following error handlers are specific to a single codec." sounds better

http://codereview.appspot.com/52081/diff/1/5#newcode335
Line 335:
There should probably be a versionchanged directive indicating that "surrogates" was added in 3.1.

http://codereview.appspot.com/52081/diff/1/6
File Lib/test/test_codecs.py (right):

http://codereview.appspot.com/52081/diff/1/6#newcode544
Line 544: def test_surrogates(self):
I think this should be split into 2 tests: "test_lone_surrogates" and "test_surrogate_handler".

http://codereview.appspot.com/52081/diff/1/4
File Objects/unicodeobject.c (right):

http://codereview.appspot.com/52081/diff/1/4#newcode157
Line 157: static PyObject *unicode_encode_call_errorhandler(const char *errors,
These prototypes are longer than 80 chars in some places. I don't think the arguments need to line up with the starting parenthesis.

http://codereview.appspot.com/52081/diff/1/4#newcode2393
Line 2393: s, size, &exc, i-1, i, &newpos);
"exc" is never Py_DECREFed.

http://codereview.appspot.com/52081/diff/1/4#newcode4110
Line 4110: if (!PyUnicode_Check(repunicode)) {
Is there a test of this case somewhere?

http://codereview.appspot.com/52081/diff/1/2
File Python/codecs.c (right):

http://codereview.appspot.com/52081/diff/1/2#newcode758
Line 758: if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
I believe PyErr_GivenExceptionMatches is more appropriate here, but given the rest of the file uses PyObject_IsInstance, it likely doesn't matter.

http://codereview.appspot.com/52081/diff/1/2#newcode771
Line 771: return NULL;
This leaks "object".

http://codereview.appspot.com/52081
msg86936 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-05-02 09:44
Reviewers: report_bugs.python.org, Benjamin,

Message:
Issues fixed in r72188.

http://codereview.appspot.com/52081/diff/1/5
File Doc/library/codecs.rst (right):

http://codereview.appspot.com/52081/diff/1/5#newcode326
Line 326: In addition, the following error handlers are specific to only selected
On 2009/05/01 21:25:44, Benjamin wrote:
> "In addition, the following error handlers are specific to a single codec."
> sounds better

Done.

http://codereview.appspot.com/52081/diff/1/5#newcode335
Line 335:
On 2009/05/01 21:25:44, Benjamin wrote:
> There should probably be a versionchanged directive indicating that "surrogates"
> was added in 3.1.

Done.

http://codereview.appspot.com/52081/diff/1/6
File Lib/test/test_codecs.py (right):

http://codereview.appspot.com/52081/diff/1/6#newcode544
Line 544: def test_surrogates(self):
On 2009/05/01 21:25:44, Benjamin wrote:
> I think this should be split into 2 tests. "test_lone_surrogates" and
> "test_surrogate_handler".

Done.

http://codereview.appspot.com/52081/diff/1/4
File Objects/unicodeobject.c (right):

http://codereview.appspot.com/52081/diff/1/4#newcode157
Line 157: static PyObject *unicode_encode_call_errorhandler(const char *errors,
On 2009/05/01 21:25:44, Benjamin wrote:
> These prototypes are longer than 80 chars some places. I don't think the
> arguments need to line up with the starting parenthesis.

Done.

http://codereview.appspot.com/52081/diff/1/4#newcode2393
Line 2393: s, size, &exc, i-1, i, &newpos);
On 2009/05/01 21:25:44, Benjamin wrote:
> "exc" is never Py_DECREFed.

Done.

http://codereview.appspot.com/52081/diff/1/4#newcode4110
Line 4110: if (!PyUnicode_Check(repunicode)) {
On 2009/05/01 21:25:44, Benjamin wrote:
> Is there a test of this case somewhere?

No. This is temporary - for PEP 383, I will have to support error handlers returning bytes here, also.

http://codereview.appspot.com/52081/diff/1/2
File Python/codecs.c (right):

http://codereview.appspot.com/52081/diff/1/2#newcode758
Line 758: if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
On 2009/05/01 21:25:44, Benjamin wrote:
> I believe PyErr_GivenExceptionMatches is more appropriate here, but given the
> rest of the file uses PyObject_IsInstance, it likely doesn't matter.

No. The interface is that an exception instance must be passed; GivenExceptionMatches would also allow for tuples and types.

http://codereview.appspot.com/52081/diff/1/2#newcode771
Line 771: return NULL;
On 2009/05/01 21:25:44, Benjamin wrote:
> This leaks "object".

Done.

Please review this at http://codereview.appspot.com/52081

Affected files:
M Doc/library/codecs.rst
M Lib/test/test_bytes.py
M Lib/test/test_codecs.py
M Lib/test/test_unicode.py
M Lib/test/test_unicodedata.py
M Objects/unicodeobject.c
M Python/codecs.c
M Python/marshal.c
msg86954 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2009-05-02 15:32
I think the new patch looks fine.
msg86966 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2009-05-02 18:54
Something I overlooked is that PyCodec_SurrogateErrors isn't exposed in any headers.
msg86967 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-05-02 18:57
Committed as r72208, blocked as r72209. As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
msg86968 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2009-05-02 19:01
2009/5/2 Martin v. Löwis <report@bugs.python.org>:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> Committed as r72208, blocked as r72209.
>
> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.

Why? All the other error handlers are exposed.
msg86970 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-05-02 19:11
>> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
>
> Why? All the other error handlers are exposed.

Sure - but what for? IMO, they all shouldn't be exposed.
msg86971 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2009-05-02 19:15
2009/5/2 Martin v. Löwis <report@bugs.python.org>:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
>>> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
>>
>> Why? All the other error handlers are exposed.
>
> Sure - but what for? IMO, they all shouldn't be exposed.

The only reason I can think of is consistency, but I don't care that much.
msg275006 - (view)	Author: Roundup Robot (python-dev)	Date: 2016-09-08 12:47
New changeset 2150eadb54c7 by Serhiy Storchaka in branch 'default': Remove old typo. https://hg.python.org/cpython/rev/2150eadb54c7

History
Date	User	Action	Args
2022-04-11 14:56:38	admin	set	github: 47922
2016-09-08 12:47:50	python-dev	set	nosy: + python-dev messages: + msg275006
2010-04-07 14:25:41	ezio.melotti	set	nosy: lemburg, loewis, Rhamphoryncus, pitrou, benjamin.peterson, jwilk, ezio.melotti, hippietrail
2009-06-16 02:47:10	hippietrail	set	nosy: + hippietrail
2009-05-02 19:15:09	benjamin.peterson	set	messages: + msg86971
2009-05-02 19:11:02	loewis	set	messages: + msg86970
2009-05-02 19:01:14	benjamin.peterson	set	messages: + msg86968
2009-05-02 18:57:28	loewis	set	status: open -> closed resolution: accepted messages: + msg86967
2009-05-02 18:54:29	benjamin.peterson	set	messages: + msg86966
2009-05-02 15:32:13	benjamin.peterson	set	assignee: benjamin.peterson -> loewis messages: + msg86954
2009-05-02 09:48:08	loewis	set	files: + surrogates.diff
2009-05-02 09:47:45	loewis	set	files: - surrogates.diff
2009-05-02 09:44:06	loewis	set	messages: + msg86936
2009-05-01 21:30:48	benjamin.peterson	set	messages: + msg86913
2009-05-01 19:48:31	loewis	set	files: + surrogates.diff messages: + msg86896
2009-05-01 19:47:36	loewis	set	files: - surrogates.diff
2009-05-01 09:21:49	loewis	set	messages: + msg86874
2009-05-01 09:13:53	loewis	set	files: + surrogates.diff priority: high -> release blocker assignee: benjamin.peterson keywords: + patch nosy: + benjamin.peterson messages: + msg86873
2009-04-30 08:27:03	lemburg	set	messages: + msg86839
2009-04-29 20:39:33	loewis	set	messages: + msg86824
2009-04-29 16:54:26	lemburg	set	messages: + msg86817
2009-04-28 17:20:22	pitrou	set	nosy: + lemburg, loewis
2009-04-28 13:13:33	pitrou	set	priority: high versions: + Python 3.1 nosy: + pitrou messages: + msg86736 stage: test needed
2009-04-25 15:05:34	jwilk	set	nosy: + jwilk
2008-09-02 06:44:56	ezio.melotti	set	nosy: + ezio.melotti
2008-08-24 21:57:15	Rhamphoryncus	set	type: behavior components: + Unicode
2008-08-24 21:56:51	Rhamphoryncus	create