msg176722 - (view) |
Author: Dustin Boswell (Dustin.Boswell) |
Date: 2012-11-30 21:40 |
Here's a command-line that parses a json string containing a large array of short strings:
python -c "import simplejson as json; json.loads('[' + '''\"asdfadf\", ''' * 100000000 + '\"asdfasf\"]') "
That works, but if you increase the size a little (so the string is longer than 2^31 characters), it fails:
python -c "import simplejson as json; json.loads('[' + '''\"asdfadf\", ''' * 300000000 + '\"asdfasf\"]') "
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/pymodules/python2.6/simplejson/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib/pymodules/python2.6/simplejson/decoder.py", line 338, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column -994967285 - line 1 column 3300000011 (char -994967285 - 3300000011)
Here's my version:
$ python
Python 2.6.5 (r265:79063, Oct 1 2012, 22:04:36)
[GCC 4.4.3] on linux2
>>> import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)
('7fffffffffffffff', True)
Also note that the test above requires at least 20GB of memory (that's not a bug, just a heads-up).
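As a rough check of the sizes involved, the length of the JSON text can be computed without allocating it. This is only a sketch: the names unit, tail, repeats and total_len are illustrative and do not appear in the original commands.

```python
# Compute the length of the JSON text built by the failing command
# without allocating the (multi-gigabyte) string itself.
unit = '"asdfadf", '   # one array element plus its separator (11 characters)
tail = '"asdfasf"]'    # last element plus the closing bracket (10 characters)
repeats = 300000000    # multiplier used in the failing command

total_len = len('[') + len(unit) * repeats + len(tail)
print(total_len)           # 3300000011, the end offset shown in the ValueError
print(total_len > 2**31)   # True: the text is longer than a signed 32-bit index

# The working command uses 100000000 repeats and stays below 2**31.
small_len = len('[') + len(unit) * 100000000 + len(tail)
print(small_len < 2**31)   # True
```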
|
msg176724 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-30 22:40 |
I saw nothing that could lead to this bug, except a few obsolete functions
for working with size_t (kept for compatibility with versions <2.6). Here is a
patch that gets rid of this outdated code. I don't have enough memory to check
whether this helps, but I think it is worth applying as a code cleanup, at
least for 3.4.
|
msg176726 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2012-11-30 22:57 |
Even if the json module can't handle these values, the error message can be improved.
Serhiy, does your patch have any effect on the error message?
|
msg176727 - (view) |
Author: Dustin Boswell (Dustin.Boswell) |
Date: 2012-11-30 22:59 |
Here's a slightly smaller/cleaner test case that only requires 12GB of RAM to run:
python -c "import simplejson as json; json.loads('[' + '''\".......\", ''' * 200000000 + '0]') "
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/pymodules/python2.6/simplejson/__init__.py", line 307, in loads
|
msg176728 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-30 23:16 |
Issue16009 has an effect on error messages.
But this error message should not occur at all. scan_once() returns a 32-bit overflowed index (-994967285 == 3300000011 - 2**32). However, all indices in Modules/_json.c are of type Py_ssize_t.
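The wraparound can be reproduced with plain integer arithmetic. This is a minimal sketch that assumes the index was truncated to a signed 32-bit value somewhere along the way; wrap_int32 is a hypothetical helper, not a function from _json.c or simplejson.

```python
def wrap_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer."""
    return (n + 2**31) % 2**32 - 2**31

# End offsets reported in this issue and the negative start offsets
# that appear next to them in the error messages.
print(wrap_int32(3300000011))   # -994967285
print(wrap_int32(2200000003))   # -2094967293
print(wrap_int32(2200000001))   # -2094967295

# Offsets that fit in 31 bits pass through unchanged.
print(wrap_int32(1100000011))   # 1100000011
```

Each negative value equals the true offset minus 2**32, which is what a truncation to 32 bits would produce.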
|
msg176729 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-30 23:30 |
Ahem... I just noticed:
import simplejson as json
Dustin, this is not a Python issue, this is a simplejson issue. Can you reproduce the bug with the standard json module? Try something like '[%*s]' % (2**32, ''), which should require less memory (especially on 3.3+).
|
msg176733 - (view) |
Author: Dustin Boswell (Dustin.Boswell) |
Date: 2012-12-01 00:38 |
I thought simplejson was a standard module for 2.6, and got renamed to json (replacing the older json module) in later versions.
For instance, I get the same problem with 2.7 (no simplejson):
python2.7 -c "import json; json.loads('[' + '''\".......\", ''' * 200000000 + '0]') "
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python2.7/json/__init__.py", line 326, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column -2094967293 - line 1 column 2200000003 (char -2094967293 - 2200000003)
And if I use the "json" module in 2.6 (which is 10x slower and takes over 30 minutes to run), it also fails, but with a different trace:
python2.6 -c "import json; json.loads('[' + '''\".......\", ''' * 200000000 + '0]') "
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
obj, end = self._scanner.iterscan(s, **kw).next()
File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib/python2.6/json/decoder.py", line 217, in JSONArray
value, end = iterscan(s, idx=end, context=context).next()
File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
return scanstring(match.string, match.end(), encoding, strict)
ValueError: end is out of bounds
|
msg176751 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-01 19:23 |
Dustin, what version of 2.7 do you use? What does "python2.7 -V" say?
Could someone please run something like the following on a self-built 64-bit Python 2.7 (this should require a little more than 2GB of memory):
python -c "import json; json.loads('[%2200000000s' % ']')"
I suspect that this is a build bug.
Is it reproducible on Python 3?
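To see what this test string looks like, here is a small-scale sketch of the same construction. The width of 20 is an arbitrary stand-in for 2200000000, chosen so the example runs instantly.

```python
import json

width = 20                      # stand-in for 2200000000 in the real test
text = '[%*s' % (width, ']')    # '[' followed by ']' right-justified in `width` columns

print(repr(text))               # '[' + 19 spaces + ']'
print(len(text))                # width + 1 == 21

# Whitespace is allowed inside an empty JSON array, so this parses as [].
parsed = json.loads(text)
print(parsed)                   # []

# At full scale the same formula gives the end offset seen in the errors.
print(2200000000 + 1)           # 2200000001
```

The padded form needs only about one byte per character of input on builds with a compact string representation, which is why it needs far less memory than the array of short strings.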
|
msg176755 - (view) |
Author: Dustin Boswell (Dustin.Boswell) |
Date: 2012-12-01 20:44 |
Python 2.7.3 (default, Aug 3 2012, 20:01:21)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)
('7fffffffffffffff', True)
|
msg176757 - (view) |
Author: Dustin Boswell (Dustin.Boswell) |
Date: 2012-12-01 20:54 |
Yes, the bug exists on 3.1 (gcc build), as well as on the darwin build of 2.7:
python3.1 -c "import json; json.loads('[%2200000000s' % ']')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.1/json/__init__.py", line 293, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.1/json/decoder.py", line 328, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column -2094967295 - line 1 column 2200000001 (char -2094967295 - 2200000001)
python3.1
Python 3.1.2 (r312:79147, Oct 23 2012, 20:07:42)
[GCC 4.4.3] on linux2
>>> import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)
7fffffffffffffff True
python2.7 -c "import json; json.loads('[%2200000000s' % ']')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 326, in loads
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
ValueError: Extra data: line 1 column -2094967295 - line 1 column 2200000001 (char -2094967295 - 2200000001)
python2.7
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
>>> import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)
('7fffffffffffffff', True)
|
msg176758 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-01 21:20 |
As Antoine Pitrou reported on IRC, this bug exists on 3.x. Sorry, but this bug can't be fixed on 2.6 and 3.1.
|
msg176760 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-12-01 21:29 |
Actually, this isn't a problem in _json.c but in the re library: JSONDecoder.raw_decode() works fine, but JSONDecoder.decode() raises:
$ ./python -c "import json.decoder; print(json.decoder.JSONDecoder().raw_decode('[%2200000000s' % ']'))"
([], 2200000001)
$ ./python -c "import json.decoder; print(json.decoder.JSONDecoder().decode('[%2200000000s' % ']'))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/antoine/cpython/default/Lib/json/decoder.py", line 347, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column -2094967295 - line 1 column 2200000001 (char -2094967295 - 2200000001)
(decode() is basically raw_decode() followed by a call to WHITESPACE.match() from the end of the JSON object:
http://hg.python.org/cpython/file/2c04d2102534/Lib/json/decoder.py#l339
)
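The relationship between the two methods can be sketched as follows. This is a simplified approximation, not the standard library source: WHITESPACE here is a plain pattern, decode_sketch is a hypothetical name, and the error message is shortened.

```python
import json
import re

# Simplified stand-in for the WHITESPACE pattern used by json.decoder.
WHITESPACE = re.compile(r'[ \t\n\r]*')

def decode_sketch(s):
    """Approximate decode(): raw_decode() plus a trailing-whitespace check."""
    decoder = json.JSONDecoder()
    start = WHITESPACE.match(s, 0).end()       # skip leading whitespace
    obj, end = decoder.raw_decode(s, start)    # parse one JSON value
    end = WHITESPACE.match(s, end).end()       # skip trailing whitespace
    if end != len(s):
        # If match(...).end() returned a truncated offset, a valid
        # document would land here and be reported as "Extra data".
        raise ValueError("Extra data: char %d - %d" % (end, len(s)))
    return obj

print(decode_sketch(' [1, 2] \n'))             # [1, 2]
print(json.JSONDecoder().raw_decode('[%20s' % ']'))   # ([], 21)
```

In this shape, any wrong offset coming back from the second WHITESPACE.match() call makes the final comparison fail even though raw_decode() parsed the input correctly, which matches the behavior shown above.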
|
msg176761 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-01 21:36 |
Uh, this is issue10182.
|
msg176762 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-12-01 21:38 |
> Uh, this is issue10182.
Indeed, the patch there seems to fix it.
|
|
Date | User | Action | Args |
2022-04-11 14:57:38 | admin | set | github: 60790 |
2012-12-01 21:38:29 | pitrou | set | status: open -> closed; superseder: match_start truncates large values; messages: + msg176762; resolution: duplicate; stage: needs patch -> resolved |
2012-12-01 21:36:24 | serhiy.storchaka | set | messages: + msg176761 |
2012-12-01 21:29:28 | pitrou | set | nosy: + pitrou; messages: + msg176760; stage: needs patch |
2012-12-01 21:20:51 | serhiy.storchaka | set | messages: + msg176758; versions: + Python 3.2, Python 3.3, Python 3.4, - Python 3.1 |
2012-12-01 20:54:39 | Dustin.Boswell | set | messages: + msg176757; versions: + Python 3.1 |
2012-12-01 20:44:45 | Dustin.Boswell | set | messages: + msg176755 |
2012-12-01 19:23:03 | serhiy.storchaka | set | messages: + msg176751; versions: - Python 2.6 |
2012-12-01 18:37:16 | serhiy.storchaka | set | files: - json_size_t_cleanup-2.7.patch |
2012-12-01 18:37:09 | serhiy.storchaka | set | files: - json_size_t_cleanup.patch |
2012-12-01 00:38:43 | Dustin.Boswell | set | messages: + msg176733 |
2012-11-30 23:30:49 | serhiy.storchaka | set | messages: + msg176729 |
2012-11-30 23:16:13 | serhiy.storchaka | set | messages: + msg176728 |
2012-11-30 22:59:50 | Dustin.Boswell | set | messages: + msg176727 |
2012-11-30 22:57:00 | ezio.melotti | set | nosy: + ezio.melotti; type: crash -> behavior; messages: + msg176726 |
2012-11-30 22:40:24 | serhiy.storchaka | set | files: + json_size_t_cleanup.patch, json_size_t_cleanup-2.7.patch; keywords: + patch; messages: + msg176724 |
2012-11-30 21:53:41 | pitrou | set | nosy: + vstinner, serhiy.storchaka |
2012-11-30 21:40:45 | Dustin.Boswell | create | |