Message 144646 - Python tracker


Author	ezio.melotti
Recipients	Brian.Merrell, belopolsky, ezio.melotti, merrellb, rhettinger, vstinner
Date	2011-09-29.22:22:21
Message-id	<1317334943.53.0.994493672259.issue11489@psf.upfronthosting.co.za>
In-reply-to

Content

RFC 4627 doesn't say much about lone surrogates:
A string is a sequence of zero or more Unicode characters [UNICODE].
[...]

All Unicode characters may be placed within the
quotation marks except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).

Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
[...]

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
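As a quick illustration of the surrogate-pair escape described in the RFC text above, here is a minimal Python 3 sketch. It assumes the json module's default ensure_ascii=True setting; the name G_CLEF is only for this example.

```python
import json

G_CLEF = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# With the default ensure_ascii=True, the encoder writes the character
# as a twelve-character escape: the UTF-16 surrogate pair.
encoded = json.dumps(G_CLEF)
print(encoded)

# The decoder joins the escaped pair back into a single code point.
decoded = json.loads('"\\uD834\\uDD1E"')
print(decoded == G_CLEF, len(decoded))
```

Note that Python writes the hexadecimal digits in lowercase, which the RFC allows.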

Raymond> JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load.

Even if the input strings are not encodable in UTF-8 because they contain lone surrogates, they can still be converted to \uXXXX escapes, and the resulting JSON document will be valid UTF-8.
AFAIK json always uses \uXXXX escapes for non-ASCII characters (at least with the default ensure_ascii=True), so it doesn't produce invalid UTF-8 documents.
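A small Python 3 sketch of that point, assuming the default ensure_ascii=True: the string itself cannot be encoded to UTF-8, but the JSON document produced from it is plain ASCII. The variable names are only illustrative.

```python
import json

lone = "\ud800"  # a lone high surrogate

# Encoding the string itself with the strict UTF-8 codec fails...
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ...but the JSON document contains only the ASCII escape sequence,
# so it encodes to UTF-8 without any problem.
doc = json.dumps(lone)
doc_bytes = doc.encode("utf-8")

print(encodable, doc, doc_bytes)
```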

While decoding, both json.loads('"\xed\xa0\x80"') and json.loads('"\ud800"') result in u'\ud800', but the first is not a valid UTF-8 document because it contains an invalid UTF-8 byte sequence that represents a lone surrogate, whereas the second contains only ASCII bytes and is therefore valid.
Python 2.7 should probably reject '"\xed\xa0\x80"', but since its UTF-8 codec is already somewhat permissive, I'm not sure it makes much sense to change the behavior now. Python 3 doesn't have this problem because it works only with unicode strings, so you can't pass invalid UTF-8 byte sequences.
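The difference between the two inputs can be sketched in Python 3 as follows. This is only an illustration: the byte string is checked with the strict UTF-8 codec directly rather than through json, and the variable names are made up for the example.

```python
import json

raw = b'"\xed\xa0\x80"'  # UTF-8-style byte sequence for U+D800

# The strict UTF-8 codec in Python 3 rejects the encoded surrogate.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The escaped form is pure ASCII and decodes to a lone surrogate.
escaped = json.loads('"\\ud800"')

print(strict_ok, ascii(escaped))
```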

OTOH the Unicode standard says that lone surrogates shouldn't be passed around, so we might decide to replace them with the replacement char U+FFFD, raise an error, or even provide a way to decide what should be done with them (something like the errors argument of codecs).
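To make the three options concrete, here is a rough Python 3 sketch. The function handle_surrogates and its errors values are hypothetical, loosely modelled on the codecs error handlers; nothing like this exists in the json module. It relies on the fact that in a Python 3 str any code point in the range U+D800 through U+DFFF is an unpaired surrogate.

```python
import re

# Any surrogate code point inside a Python 3 str is unpaired.
_SURROGATE = re.compile("[\ud800-\udfff]")


def handle_surrogates(text, errors="strict"):
    """Hypothetical helper: decide what to do with lone surrogates.

    errors="strict"  -> raise ValueError on the first lone surrogate
    errors="replace" -> substitute U+FFFD for each lone surrogate
    errors="pass"    -> return the text unchanged
    """
    if errors == "pass":
        return text
    if errors == "replace":
        return _SURROGATE.sub("\ufffd", text)
    if errors == "strict":
        match = _SURROGATE.search(text)
        if match is not None:
            raise ValueError(
                "lone surrogate U+%04X at index %d"
                % (ord(match.group()), match.start())
            )
        return text
    raise LookupError("unknown error handler name %r" % errors)


print(ascii(handle_surrogates("a\ud800b", errors="replace")))
```

Such a function could be applied to strings before encoding or after decoding, but where exactly it would hook into json is left open here.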

History
Date	User	Action	Args
2011-09-29 22:22:23	ezio.melotti	set	recipients: + ezio.melotti, rhettinger, belopolsky, vstinner, merrellb, Brian.Merrell
2011-09-29 22:22:23	ezio.melotti	set	messageid: <1317334943.53.0.994493672259.issue11489@psf.upfronthosting.co.za>
2011-09-29 22:22:22	ezio.melotti	link	issue11489 messages
2011-09-29 22:22:21	ezio.melotti	create