classification
Title: json library can't parse large (> 2^31) strings
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.2, Python 3.3, Python 3.4, Python 2.7

process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: match_start truncates large values (issue 10182)
Assigned To:
Nosy List: Dustin.Boswell, ezio.melotti, pitrou, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2012-11-30 21:40 by Dustin.Boswell, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (14)
msg176722 - (view) Author: Dustin Boswell (Dustin.Boswell) Date: 2012-11-30 21:40
Here's a command-line that parses a json string containing a large array of short strings:

python -c "import simplejson as json; json.loads('[' + '''\"asdfadf\", ''' * 100000000 + '\"asdfasf\"]') "

That works, but if you increase the size a little (so that the string is longer than 2^31 characters), it fails:

python -c "import simplejson as json; json.loads('[' + '''\"asdfadf\", ''' * 300000000 + '\"asdfasf\"]') "

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/simplejson/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/pymodules/python2.6/simplejson/decoder.py", line 338, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column -994967285 - line 1 column 3300000011 (char -994967285 - 3300000011)


Here's my version:

$ python
Python 2.6.5 (r265:79063, Oct  1 2012, 22:04:36) 
[GCC 4.4.3] on linux2
>>> import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)
('7fffffffffffffff', True)


Also note that the test above requires at least 20GB of memory (that's not a bug, just a heads-up).
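
As a sanity check on the numbers in the traceback above, the length of the failing string can be computed from its pieces without building it. This is only a sketch; the variable names are illustrative and not part of the original report.

```python
# Pieces of the JSON text built by the failing command above.
prefix = '['
item = '"asdfadf", '      # repeated element, 11 characters
suffix = '"asdfasf"]'     # final element plus closing bracket
repeat = 300000000

# Compute the total length arithmetically instead of allocating the string.
total = len(prefix) + len(item) * repeat + len(suffix)

print(total)              # 3300000011, the end column in the error message
print(2**31 < total < 2**32)
```

The result lies between 2^31 and 2^32, so it fits in a 64-bit index but not in a signed 32-bit one.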
msg176724 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-30 22:40
I saw nothing that could lead to a bug, except for a few obsolete functions 
for working with size_t (kept for compatibility with versions <2.6). Here is a 
patch that gets rid of this outdated code. I don't have enough memory to check 
whether this helps, but I think it is worth applying as a code cleanup, at 
least for 3.4.
msg176726 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-11-30 22:57
Even if the json module can't handle these values, the error message can be improved.

Serhiy, does your patch have any effect on the error message?
msg176727 - (view) Author: Dustin Boswell (Dustin.Boswell) Date: 2012-11-30 22:59
Here's a slightly smaller/cleaner test case that only requires 12GB of RAM to run:

python -c "import simplejson as json; json.loads('[' + '''\".......\", ''' * 200000000 + '0]') "

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/simplejson/__init__.py", line 307, in loads
msg176728 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-30 23:16
Issue16009 has an effect on error messages.

But this error message should not occur at all. scan_once() returns a 32-bit overflowed index (-994967285 == 3300000011 - 2**32). However, all indices in Modules/_json.c are of type Py_ssize_t.
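
The wraparound can be reproduced with plain integer arithmetic. The helper below is a hypothetical illustration of what happens when a large index is squeezed into a signed 32-bit integer; it is not code from _json.c or simplejson.

```python
def wrap_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31

end_index = 3300000011          # true end position from the traceback
wrapped = wrap_int32(end_index)

print(wrapped)                  # -994967285, the negative column reported
print(wrapped == end_index - 2**32)
```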
msg176729 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-30 23:30
Ahem... I just noticed:

    import simplejson as json

Dustin, this is not a Python issue, this is a simplejson issue. Can you reproduce the bug with the standard json module? Try something like '[%*s]' % (2**32, ''); this should require less memory (especially on 3.3+).
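
To see what that format expression produces, here is a small-scale sketch. The width of 20 is a stand-in chosen for illustration; the suggestion above uses 2**32.

```python
import json

width = 20                      # stand-in for the 2**32 suggested above
s = '[%*s]' % (width, '')       # '[' + width spaces + ']'

print(len(s))                   # width + 2
print(json.loads(s))            # an empty list: whitespace inside [] is valid JSON
```

Because the payload is only whitespace, the parser creates no per-element objects, which is why this form needs much less memory than an array of short strings.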
msg176733 - (view) Author: Dustin Boswell (Dustin.Boswell) Date: 2012-12-01 00:38
I thought simplejson was a standard module for 2.6, and got renamed to json (replacing the older json module) in later versions.

For instance, I get the same problem with 2.7 (no simplejson):

python2.7 -c "import json; json.loads('[' + '''\".......\", ''' * 200000000 + '0]') "
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/json/decoder.py", line 369, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column -2094967293 - line 1 column 2200000003 (char -2094967293 - 2200000003)


And if I use the "json" module in 2.6 (which is 10x slower and takes over 30 minutes to run), it also fails, but with a different trace:

python2.6 -c "import json; json.loads('[' + '''\".......\", ''' * 200000000 + '0]') "
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
    obj, end = self._scanner.iterscan(s, **kw).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 217, in JSONArray
    value, end = iterscan(s, idx=end, context=context).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
    return scanstring(match.string, match.end(), encoding, strict)
ValueError: end is out of bounds
msg176751 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-01 19:23
Dustin, what version of 2.7 do you use? What does "python2.7 -V" say?

Could someone please run something like the following on a self-built 64-bit Python 2.7 (this should require a little more than 2GB of memory):

  python -c "import json; json.loads('[%2200000000s' % ']')"

I suspect that this is a build bug.

Is it reproducible on Python 3?
msg176755 - (view) Author: Dustin Boswell (Dustin.Boswell) Date: 2012-12-01 20:44
Python 2.7.3 (default, Aug  3 2012, 20:01:21) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)
('7fffffffffffffff', True)
msg176757 - (view) Author: Dustin Boswell (Dustin.Boswell) Date: 2012-12-01 20:54
Yes, the bug exists on 3.1 (gcc build), as well as on the darwin build of 2.7:

python3.1 -c "import json; json.loads('[%2200000000s' % ']')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.1/json/__init__.py", line 293, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.1/json/decoder.py", line 328, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column -2094967295 - line 1 column 2200000001 (char -2094967295 - 2200000001)

python3.1
Python 3.1.2 (r312:79147, Oct 23 2012, 20:07:42) 
[GCC 4.4.3] on linux2
>>> import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)
7fffffffffffffff True




python2.7 -c "import json; json.loads('[%2200000000s' % ']')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 326, in loads
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
ValueError: Extra data: line 1 column -2094967295 - line 1 column 2200000001 (char -2094967295 - 2200000001)

python2.7
Python 2.7.2 (default, Jun 20 2012, 16:23:33) 
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
>>> import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)
('7fffffffffffffff', True)
msg176758 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-01 21:20
As Antoine Pitrou reported on IRC, this bug exists on 3.x. Sorry, but this bug can't be fixed on 2.6 and 3.1.
msg176760 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-12-01 21:29
Actually, this isn't a problem in _json.c but in the re library: JSONDecoder.raw_decode() works fine, but JSONDecoder.decode() raises:

$ ./python -c "import json.decoder; print(json.decoder.JSONDecoder().raw_decode('[%2200000000s' % ']'))"
([], 2200000001)

$ ./python -c "import json.decoder; print(json.decoder.JSONDecoder().decode('[%2200000000s' % ']'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/json/decoder.py", line 347, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column -2094967295 - line 1 column 2200000001 (char -2094967295 - 2200000001)


(decode() is basically raw_decode() followed by a call to WHITESPACE.match() from the end of the JSON object:
http://hg.python.org/cpython/file/2c04d2102534/Lib/json/decoder.py#l339
)
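
The structure described above can be sketched roughly as follows. This is a simplified illustration, not the actual Lib/json/decoder.py source: decode_sketch and the error text are made up for the example, and the WHITESPACE pattern is assumed to match the one the json module uses.

```python
import json
import re

# Assumption: same whitespace set as the json module's own pattern.
WHITESPACE = re.compile(r'[ \t\n\r]*')

def decode_sketch(s):
    """Parse one JSON value, then reject any non-whitespace trailing data."""
    decoder = json.JSONDecoder()
    # Step 1: skip leading whitespace and parse a single value.
    obj, end = decoder.raw_decode(s, WHITESPACE.match(s, 0).end())
    # Step 2: skip trailing whitespace with the regex engine. This is the
    # call whose start position was truncated for very large strings.
    end = WHITESPACE.match(s, end).end()
    # Step 3: anything left after that is reported as extra data.
    if end != len(s):
        raise ValueError("Extra data: char %d - %d" % (end, len(s)))
    return obj

print(decode_sketch('[      ]  '))
```

With a correct regex engine, step 2 always returns an index at or after the one it was given, so a negative end can only come from the match call itself.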
msg176761 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-01 21:36
Uh, this is issue10182.
msg176762 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-12-01 21:38
> Uh, this is issue10182.

Indeed, the patch there seems to fix it.
History
Date                 User              Action  Args
2022-04-11 14:57:38  admin             set     github: 60790
2012-12-01 21:38:29  pitrou            set     status: open -> closed; superseder: match_start truncates large values; messages: + msg176762; resolution: duplicate; stage: needs patch -> resolved
2012-12-01 21:36:24  serhiy.storchaka  set     messages: + msg176761
2012-12-01 21:29:28  pitrou            set     nosy: + pitrou; messages: + msg176760; stage: needs patch
2012-12-01 21:20:51  serhiy.storchaka  set     messages: + msg176758; versions: + Python 3.2, Python 3.3, Python 3.4, - Python 3.1
2012-12-01 20:54:39  Dustin.Boswell    set     messages: + msg176757; versions: + Python 3.1
2012-12-01 20:44:45  Dustin.Boswell    set     messages: + msg176755
2012-12-01 19:23:03  serhiy.storchaka  set     messages: + msg176751; versions: - Python 2.6
2012-12-01 18:37:16  serhiy.storchaka  set     files: - json_size_t_cleanup-2.7.patch
2012-12-01 18:37:09  serhiy.storchaka  set     files: - json_size_t_cleanup.patch
2012-12-01 00:38:43  Dustin.Boswell    set     messages: + msg176733
2012-11-30 23:30:49  serhiy.storchaka  set     messages: + msg176729
2012-11-30 23:16:13  serhiy.storchaka  set     messages: + msg176728
2012-11-30 22:59:50  Dustin.Boswell    set     messages: + msg176727
2012-11-30 22:57:00  ezio.melotti      set     nosy: + ezio.melotti; type: crash -> behavior; messages: + msg176726
2012-11-30 22:40:24  serhiy.storchaka  set     files: + json_size_t_cleanup.patch, json_size_t_cleanup-2.7.patch; keywords: + patch; messages: + msg176724
2012-11-30 21:53:41  pitrou            set     nosy: + vstinner, serhiy.storchaka
2012-11-30 21:40:45  Dustin.Boswell    create