classification
Title: JSONDecoder.raw_decode breaks on leading whitespace
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Bayard Randel, Mariatta, aalien, ezio.melotti, jcea, r.david.murray, rbcollins
Priority: normal Keywords: easy, patch

Created on 2012-07-19 10:09 by aalien, last changed 2016-10-04 10:56 by Mariatta.

Files
File name Uploaded Description Edit
json-raw-decoder-fix-whitespace.diff aalien, 2012-07-19 11:40 Patch for Python 3.3.0 review
json-raw-decoder-fix-whitespace.diff aalien, 2012-07-20 12:41 Patch for Python 3.3.0, take two review
issue15393_3.6.patch Bayard Randel, 2016-09-13 00:28 review
Messages (13)
msg165829 - (view) Author: Antti Laine (aalien) Date: 2012-07-19 10:09
raw_decode on json.JSONDecoder does not handle leading whitespace. According to RFC 4627, section 2, whitespace can precede an object. With json.loads leading whitespace is handled just fine.

d = json.JSONDecoder()
d.raw_decode(' {}')
ValueError: No JSON object could be decoded

Tested with versions 2.6.7, 2.7.3, 3.1.3 and 3.2.3
msg165830 - (view) Author: Antti Laine (aalien) Date: 2012-07-19 10:16
My coworker just submitted a pull request for a possible fix to simplejson on github.

https://github.com/simplejson/simplejson/pull/38
msg165831 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-07-19 10:37
Please, post a patch for 2.7 and 3.2/3.3, with a test. Seems quite easy.

If you hurry, this could go in 3.3.0.
msg165832 - (view) Author: Antti Laine (aalien) Date: 2012-07-19 11:40
Suggestion for a patch for 3.3.0. I wasn't quite sure how the testcases were supposed to be loaded. Sorry if I made a mess ;)
msg165834 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-07-19 11:54
Antti, you are changing the signature of "decode()" and that would be a compatibility problem. Can you rewrite the patch to be more "compatible"?

In your test, please, use a bit more complicated object than "{}" :-)
msg165835 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-07-19 11:56
IMO this is not a bug, according to the current documentation it is working as designed.  raw_decode says it decodes a string that *starts with* a json document.

To my understanding, raw_decode is designed to be used when parsing a stream containing more than just one json document, including possibly non-json text, in which case it makes sense that the calling application would be responsible for (and want to) handle the whitespace around any such document.

I recommend closing this as invalid.  If the consensus is otherwise and a change is made, I think it would be an enhancement, not a bug fix.
msg165836 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-07-19 12:03
Ah, I see, you are thinking that "json document" includes the possibility of leading whitespace.  That is a reasonable interpretation...just make sure that we don't break backward compatibility.
msg165918 - (view) Author: Antti Laine (aalien) Date: 2012-07-20 12:41
> you are changing the signature of "decode()" and that would be a compatibility problem

I was changing the signature of raw_decode(), by adding a(n almost) private keyword. I really don't see how that would affect compatibility.

> you are thinking that "json document" includes the possibility of leading whitespace

Yes. While JSON does not have a real standard to follow, I believe that going by the RFC is the best thing to do, and the RFC very clearly states, that objects can be surrounded with whitespace.

decode() depends on the functionality of raw_decode(), and its own parameter _w. I have no idea why _w is a parameter. It is not used anywhere in CPython, and, beginning with an underscore, it shouldn't be used from anywhere else. Still it offers the users a possibility to break the behaviour of decode(). If it's a performance hack, making the variable being initialized only once, then I think it's a very poor one.

Now the patch leaves decode() as it is, and checks for whitespace in raw_decode(), which leads to checking for whitespace twice when using decode().

I think that a more extensive cleanup would be in order.
msg168936 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-08-23 12:03
I think skipping preceding whitespace in raw_decode() wouldn't hurt, but skipping following whitespace would be wrong.

David: Any examples of how this could break backwards compatibility?
msg168939 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-08-23 12:56
I didn't have anything specific in mind, just making a general comment about the care that needs to be taken in crafting the fix.
msg276143 - (view) Author: Bayard Randel (Bayard Randel) * Date: 2016-09-13 00:28
I've provided an updated patch for 3.6. Additionally I've updated the docstring for decoder.raw_decode to explain that whitespace at the beginning of the document will be ignored. I believe the concerns around the backwards incompatible change to the function signature were addressed by aalien's second patch.
msg276161 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2016-09-13 04:16
I think the patch should either be rejected, or also handle trailing spaces: if we're taking the RFC definition of whitespace not being structural then we should also eat trailing space, which will change the  check for extra data in decode to just checking end == s.len().

Obviously a test for the trailing case needs to be added as well.
msg276291 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-13 15:24
Robert: as noted, skipping trailing whitespace would be breaking raw_decode's contract.  It is designed to parse a json document and return the position of the last character of that document.  If there is then another json document in the stream, parsing the remainder of the stream would skip that whitespace.  If it is some other data, then the whitespace should be preserved for whatever is parsing the remainder of the stream (if anything).

In any case that behavior cannot be changed because it would break backward compatibility.

I would not object to rejecting the change, but it does make raw_decode slightly more awkward to use if it does not skip the leading whitespace.  So I guess I'm +0.5 on accepting it as is.
History
Date User Action Args
2016-10-04 10:56:05Mariattasetnosy: + Mariatta
2016-09-13 15:24:40r.david.murraysetversions: + Python 3.5, Python 3.6, Python 3.7, - Python 2.7, Python 3.2, Python 3.3, Python 3.4
2016-09-13 15:24:08r.david.murraysetmessages: + msg276291
2016-09-13 07:12:07petri.lehtinensetnosy: - petri.lehtinen
2016-09-13 04:16:10rbcollinssetnosy: + rbcollins
messages: + msg276161
2016-09-13 00:28:28Bayard Randelsetfiles: + issue15393_3.6.patch
nosy: + Bayard Randel
messages: + msg276143

2012-08-23 12:56:01r.david.murraysetmessages: + msg168939
2012-08-23 12:03:38petri.lehtinensetnosy: + petri.lehtinen
messages: + msg168936
2012-07-20 12:41:59aaliensetfiles: + json-raw-decoder-fix-whitespace.diff

messages: + msg165918
2012-07-19 12:03:06r.david.murraysetmessages: + msg165836
2012-07-19 11:56:17r.david.murraysetnosy: + r.david.murray
messages: + msg165835
2012-07-19 11:54:19jceasetmessages: + msg165834
2012-07-19 11:40:55aaliensetfiles: + json-raw-decoder-fix-whitespace.diff
keywords: + patch
messages: + msg165832
2012-07-19 10:37:45jceasetnosy: + jcea
messages: + msg165831

keywords: + easy
stage: needs patch
2012-07-19 10:25:21ezio.melottisetnosy: + ezio.melotti

versions: + Python 3.3, Python 3.4, - Python 2.6, Python 3.1
2012-07-19 10:16:14aaliensetmessages: + msg165830
2012-07-19 10:09:55aaliencreate