classification
Title: JSON should accept lone surrogates
Type: behavior Stage: patch review
Components: Extension Modules, Library (Lib), Unicode Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: json.dumps not parsable by json.loads (on Linux only)
View: 11489
Assigned To: serhiy.storchaka Nosy List: bob.ippolito, ezio.melotti, pitrou, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-05-04 14:38 by serhiy.storchaka, last changed 2013-05-12 19:28 by serhiy.storchaka. This issue is now closed.

Files
File name | Uploaded | Description
json_decode_lone_surrogates.patch | serhiy.storchaka, 2013-05-05 11:45 | Patch for 3.3 and 3.4
json_decode_lone_surrogates-2.7.patch | serhiy.storchaka, 2013-05-05 11:45 | Patch for 2.7
Messages (7)
msg188364 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-04 14:38
Inspired by simplejson issue [1], which also applies to the standard json module. The JSON parser in 3.3+ and in wide builds of 3.2 and earlier raises an error on invalid strings (i.e. those with an unpaired surrogate), while narrow builds and some third-party parsers do not. Wide builds are right; such JSON data is invalid. However, it would be good to be optionally more permissive toward input data. Otherwise it is not easy to process such invalid data.

I propose adding an "error" parameter to the JSON decoder and encoder with the same meaning as in string decoding/encoding. "strict" is the default, and "surrogatepass" corresponds to narrow builds (and non-strict third-party parsers).

[1] https://github.com/simplejson/simplejson/issues/62
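
For reference, a minimal sketch of what "strict" and "surrogatepass" mean in ordinary string encoding/decoding, which is the behavior the proposed parameter would mirror. It assumes Python 3 and uses only str.encode and bytes.decode; it does not show any json API.

```python
lone = "\ud8e9"  # an unpaired high surrogate

# The default "strict" handler refuses to encode a lone surrogate as UTF-8.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" lets the surrogate through and round-trips it.
raw = lone.encode("utf-8", "surrogatepass")
back = raw.decode("utf-8", "surrogatepass")
```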
msg188374 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-04 15:59
I wonder if json should simply be less strict by default. If you pass the raw unescaped character, the json module accepts it:

>>> json.loads('{"a": "\ud8e9"}')
{'a': '\ud8e9'}

It's only if you pass the escaped representation that json rejects it:

>>> json.loads('{"a": "\\ud8e9"}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/json/__init__.py", line 316, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/default/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/default/Lib/json/decoder.py", line 360, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unpaired high surrogate: line 1 column 9 (char 8)
msg188375 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-05-04 16:01
See also #11489.
msg188437 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-05 11:45
After investigating the problem more deeply, I see that a new parameter is not needed. RFC 4627 does not make exceptions for the range 0xD800-0xDFFF, and the decoder must accept lone surrogates, both escaped and unescaped. Non-BMP characters may be represented as an escaped surrogate pair, so an escaped surrogate pair may be decoded as a non-BMP character, while an unescaped surrogate pair shouldn't be.

Here is a patch with which the JSON decoder accepts encoded lone surrogates. It also fixes a bug where the Python implementation decodes "\\ud834\\u0079x" as "\U0001d179".
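
The pairing rule described above can be sketched roughly as follows. This is not the patch itself: decode_u_escapes is a hypothetical helper that handles only \uXXXX escapes, assumes Python 3.4+ (for Pattern.fullmatch), and ignores every other JSON escape.

```python
import re

_HEX4 = re.compile(r"[0-9A-Fa-f]{4}")

def decode_u_escapes(s):
    """Decode \\uXXXX escapes in s (hypothetical helper, illustration only)."""
    out = []
    i = 0
    n = len(s)
    while i < n:
        if s.startswith("\\u", i) and _HEX4.fullmatch(s, i + 2, i + 6):
            code = int(s[i + 2:i + 6], 16)
            i += 6
            # Combine only an escaped high surrogate that is immediately
            # followed by an escaped low surrogate.
            if (0xD800 <= code <= 0xDBFF and s.startswith("\\u", i)
                    and _HEX4.fullmatch(s, i + 2, i + 6)):
                low = int(s[i + 2:i + 6], 16)
                if 0xDC00 <= low <= 0xDFFF:
                    code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00)
                    i += 6
            # A lone surrogate is kept as-is instead of raising an error.
            out.append(chr(code))
        else:
            # Unescaped characters, including raw surrogates, pass through.
            out.append(s[i])
            i += 1
    return "".join(out)
```

With this sketch, an escaped pair becomes one non-BMP character, while a high surrogate followed by an ordinary escape such as \u0079 stays a lone surrogate followed by "y".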
msg188857 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2013-05-10 18:08
The patch that I wrote for simplejson is here (it differs a bit from serhiy's patch): https://github.com/simplejson/simplejson/commit/35816bfe2d0ddeb5ddcc68239683cbb35b7e3ff2

Along the way I discovered another bug in the pure-Python scanstring: int(s, 16) will parse '0xNN', whereas json expects only strings of the form 'NNNN' to work. I fixed that along with this issue by explicitly checking for x or X.
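
To illustrate the int(s, 16) pitfall, here is a small sketch. parse_hex4 is a hypothetical helper, not the simplejson fix; it is stricter than checking only for x or X, because it accepts nothing but four hex digits.

```python
_HEX_DIGITS = "0123456789abcdefABCDEF"

def parse_hex4(digits):
    """Parse exactly four hex digits (hypothetical helper, illustration only)."""
    if len(digits) != 4 or any(c not in _HEX_DIGITS for c in digits):
        raise ValueError("invalid \\uXXXX escape: %r" % (digits,))
    return int(digits, 16)

# int() with base 16 tolerates a '0x' prefix, so a 4-character
# string such as '0x41' is silently accepted.
lenient = int("0x41", 16)

strict = parse_hex4("0041")
```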
msg188868 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-10 19:25
I forgot about issue11489. After reclassification this issue is its duplicate.
msg189056 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-12 19:28
Updated patch submitted in issue11489.
History
Date | User | Action | Args
2013-05-12 19:28:20 | serhiy.storchaka | set | status: open -> closed; messages: + msg189056
2013-05-10 19:25:35 | serhiy.storchaka | set | superseder: json.dumps not parsable by json.loads (on Linux only); resolution: duplicate; messages: + msg188868
2013-05-10 18:08:10 | bob.ippolito | set | messages: + msg188857
2013-05-05 11:45:58 | serhiy.storchaka | set | files: + json_decode_lone_surrogates-2.7.patch
2013-05-05 11:45:06 | serhiy.storchaka | set | files: + json_decode_lone_surrogates.patch; title: Add a string error handler to JSON encoder/decoder -> JSON should accept lone surrogates; keywords: + patch; type: enhancement -> behavior; versions: + Python 2.7, Python 3.3; messages: + msg188437; stage: needs patch -> patch review
2013-05-04 16:01:18 | ezio.melotti | set | messages: + msg188375
2013-05-04 15:59:32 | pitrou | set | messages: + msg188374
2013-05-04 14:38:28 | serhiy.storchaka | create |