msg197152 - (view) |
Author: Adrián Chaves Fernández (Gallaecio) |
Date: 2013-09-07 12:33 |
Calling json.load() with a file object or json.loads() with a string containing the attached JSON code raises an exception with the message 'No JSON object could be decoded'.
I’ve pasted the JSON code into http://jsonlint.com/ and it reports it as valid JSON.
This JSON code comes from the 0 A.D. game (https://github.com/0ad/0ad/blob/master/binaries/data/mods/public/civs/maur.json), and the game successfully parses it as well (with whatever they use for that). Yet it fails with json.load() and json.loads().
Note also that the rest of the JSON files of the same game folder (https://github.com/0ad/0ad/tree/master/binaries/data/mods/public/civs) do work with json.load() and json.loads().
|
msg197155 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-09-07 13:10 |
>>> a = open('/tmp/input.json')
>>> b = a.read()
>>> b[0]
'\ufeff'
>>> import json
>>> json.loads(b[1:])  # loads just fine
>>> json.loads(b)      # chokes
Whether Python's json module should handle '\ufeff' gracefully or not, I am not sure. Let me investigate it.
|
msg197158 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-09-07 13:15 |
The U+FEFF character is the byte order mark (BOM).
Reference:
http://en.wikipedia.org/wiki/Byte_Order_Mark
|
msg197160 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-09-07 13:35 |
Use the utf-8-sig encoding.
See also issue17909.
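A minimal sketch of that suggestion on Python 3, assuming a file that starts with the UTF-8 BOM bytes EF BB BF. The temporary file and its contents (`{"Code": "maur"}`) are made up for illustration and are not the real maur.json. On Python 2.7, `io.open` should accept the same `encoding` argument.

```python
import json
import os
import tempfile

# Write a small JSON document preceded by a UTF-8 BOM (bytes EF BB BF).
# The payload is a hypothetical stand-in for the file from this report.
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf" + b'{"Code": "maur"}')
    path = tmp.name

try:
    # utf-8-sig strips a leading BOM if present and otherwise
    # behaves like plain utf-8.
    with open(path, encoding="utf-8-sig") as f:
        data = json.load(f)
finally:
    os.remove(path)

print(data)
```

Because utf-8-sig also reads files without a BOM, it can be used for every file in a folder where only some files carry one.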
|
msg197163 - (view) |
Author: Adrián Chaves Fernández (Gallaecio) |
Date: 2013-09-07 14:42 |
I’ll leave how to address this up to you. Thanks a lot for finding out that the cause was the BOM. I’ve just removed it from the file and now everything works as expected.
|
msg197164 - (view) |
Author: Alyssa Coghlan (ncoghlan) *  |
Date: 2013-09-07 15:01 |
Switching to a docs bug - this won't be fixed in 2.7, but it should probably be documented as a limitation.
|
msg197745 - (view) |
Author: Anoop Thomas Mathew (Anoop.Thomas.Mathew) * |
Date: 2013-09-15 03:59 |
Patch for BOM signature documentation in json.loads
|
msg200360 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-10-19 03:47 |
I'm not sure this should be documented in json.load/loads, and I'm not sure people will look there once they get this exception.
The error is raised because the wrong codec is used (either by open() before passing the file object to json.load or by json.loads), so it's a user error rather than a problem with the json module. The error turns out to be particularly misleading because the decoding is successful even though it produces a wrong result, and the problem becomes apparent only once it reaches json.
ISTM that the documentation is already clear enough that json doesn't auto-detect encodings and uses UTF-8 by default, and that different encodings should be specified explicitly.
I think that a better and backward-compatible solution would be to detect the UTF-8 BOM and provide a better error message hinting at utf-8-sig.
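The point about the decoding step succeeding while producing the wrong result can be reproduced without any file. This is only a sketch on Python 3: the byte string is an invented example, and the exact wording of the exception depends on the Python version, so it is printed rather than assumed.

```python
import json

# Invented example input: UTF-8 BOM followed by a tiny JSON document.
raw = b"\xef\xbb\xbf" + b'{"ok": true}'

# Plain utf-8 decodes without complaint but keeps the BOM as U+FEFF.
as_utf8 = raw.decode("utf-8")
# utf-8-sig decodes the same bytes and drops the BOM.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_utf8[0]))      # first character after plain utf-8
print(repr(as_utf8_sig[0]))  # first character after utf-8-sig

# The problem only surfaces once the decoded text reaches json.
rejected = False
try:
    json.loads(as_utf8)
except ValueError as exc:
    rejected = True
    print("json.loads rejected the BOM-prefixed text:", exc)

parsed = json.loads(as_utf8_sig)
print(parsed)
```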
|
msg200361 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-10-19 04:08 |
Here is a proof of concept that raises this error:
>>> import json; json.load(open('input.json'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wolf/dev/py/2.7/Lib/json/__init__.py", line 290, in load
    **kw)
  File "/home/wolf/dev/py/2.7/Lib/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/home/wolf/dev/py/2.7/Lib/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/wolf/dev/py/2.7/Lib/json/decoder.py", line 381, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
If the idea is OK I will add tests and implement it for the Python scanner too (and possibly tweak the error message if you have better suggestions).
|
msg200362 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-10-19 04:09 |
Forgot to add that the patch is for 2.7, and it also needs to be implemented in the unicode scanner.
|
msg200368 - (view) |
Author: Alyssa Coghlan (ncoghlan) *  |
Date: 2013-10-19 04:37 |
I like the new error message as a low-risk immediate improvement that nudges people in the direction of utf-8-sig. It also leaves the door open to silently ignoring the BOM in the future without immediately committing to that approach.
|
msg200536 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-10-20 03:23 |
Here is an updated patch with tests.
|
msg200538 - (view) |
Author: Alyssa Coghlan (ncoghlan) *  |
Date: 2013-10-20 03:32 |
Updated patch looks good to me.
|
msg200540 - (view) |
Author: Alyssa Coghlan (ncoghlan) *  |
Date: 2013-10-20 03:52 |
As does the Py3k version :)
|
msg200542 - (view) |
Author: Alyssa Coghlan (ncoghlan) *  |
Date: 2013-10-20 04:12 |
Discussing this with Ezio on IRC, we decided that it probably makes more sense to do this check outside the scanner, as preliminary validation of the input passed in via the public API. That will minimise the overhead and also avoid any potential side effects if "idx==0" is ever true in cases we're not currently testing.
The tests from the current patches should be OK, though.
Ezio also found that, for Py3k, adding an explicit check for non-str input and throwing an appropriate error would also be an improvement over the status quo:
>>> import json
>>> json.loads(b'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ncoghlan/devel/py3k/Lib/json/__init__.py", line 316, in loads
    return _default_decoder.decode(s)
  File "/home/ncoghlan/devel/py3k/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: can't use a string pattern on a bytes-like object
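The idea of validating the input before it reaches the scanner could look roughly like the sketch below. `loads_checked` is a hypothetical wrapper written for illustration, not the code from the attached patches. The BOM message is the one proposed earlier in this thread, and the TypeError wording is an assumption.

```python
import json


def loads_checked(s):
    """Hypothetical wrapper: validate the input, then delegate to json.loads."""
    # Reject non-str input up front with a message that names the real problem.
    if not isinstance(s, str):
        raise TypeError(
            "the JSON object must be str, not %r" % type(s).__name__
        )
    # A str that starts with U+FEFF was most likely decoded with utf-8
    # instead of utf-8-sig, so hint at the fix.
    if s.startswith("\ufeff"):
        raise ValueError("Unexpected UTF-8 BOM (decode using utf-8-sig)")
    return json.loads(s)


print(loads_checked('{"a": 1}'))

for bad_input in (b"", "\ufeff{}"):
    try:
        loads_checked(bad_input)
    except (TypeError, ValueError) as exc:
        print(type(exc).__name__, exc)
```

Both checks run once per call on the whole input, so the scanner itself stays untouched.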
|
msg200546 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-10-20 05:24 |
I opened a new issue about improving the error message: #19307.
After further discussion on IRC, we think that both #19307 and this issue should only be applied on 3.4 (the attached patch produces an even more misleading error that would require backporting #19307).
|
msg200560 - (view) |
Author: Alyssa Coghlan (ncoghlan) *  |
Date: 2013-10-20 10:25 |
The patch needs to be rebased on top of the issue 19307 patch, but I like this approach.
I say go ahead and commit it whenever you're ready :)
|
msg200622 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-10-20 19:41 |
LGTM.
|
msg200650 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2013-10-20 23:11 |
New changeset ac016cba7e64 by Ezio Melotti in branch 'default':
#18958: Improve error message for json.load(s) while passing a string that starts with a UTF-8 BOM.
http://hg.python.org/cpython/rev/ac016cba7e64
|
msg200651 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-10-20 23:11 |
Fixed, thanks for the feedback!
|
Date | User | Action | Args
2022-04-11 14:57:50 | admin | set | github: 63158
2013-10-20 23:11:58 | ezio.melotti | set | status: open -> closed; messages: + msg200651; assignee: docs@python -> ezio.melotti; resolution: fixed; stage: patch review -> resolved
2013-10-20 23:11:30 | python-dev | set | nosy: + python-dev; messages: + msg200650
2013-10-20 19:41:21 | serhiy.storchaka | set | messages: + msg200622
2013-10-20 10:25:11 | ncoghlan | set | messages: + msg200560
2013-10-20 05:24:07 | ezio.melotti | set | files: + issue18958-3.diff; dependencies: + Improve TypeError message in json.loads(); messages: + msg200546; versions: + Python 3.4, - Python 2.7
2013-10-20 04:12:52 | ncoghlan | set | messages: + msg200542
2013-10-20 03:52:01 | ncoghlan | set | messages: + msg200540
2013-10-20 03:44:55 | ezio.melotti | set | files: + issue18958-2-py3k.diff
2013-10-20 03:32:26 | ncoghlan | set | messages: + msg200538
2013-10-20 03:23:28 | ezio.melotti | set | files: + issue18958-2.diff; messages: + msg200536; stage: needs patch -> patch review
2013-10-19 04:37:21 | ncoghlan | set | messages: + msg200368
2013-10-19 04:09:56 | ezio.melotti | set | messages: + msg200362
2013-10-19 04:08:43 | ezio.melotti | set | files: + issue18958.diff
2013-10-19 04:08:24 | ezio.melotti | set | messages: + msg200361
2013-10-19 03:47:01 | ezio.melotti | set | messages: + msg200360
2013-09-15 03:59:06 | Anoop.Thomas.Mathew | set | files: + json_BOM_signature_documentation.patch; nosy: + Anoop.Thomas.Mathew; messages: + msg197745; keywords: + patch
2013-09-13 20:15:15 | ezio.melotti | set | keywords: + easy; nosy: + ezio.melotti
2013-09-07 15:01:46 | ncoghlan | set | nosy: + docs@python, ncoghlan; messages: + msg197164; assignee: docs@python; components: + Documentation, - Extension Modules; stage: needs patch
2013-09-07 14:42:25 | Gallaecio | set | messages: + msg197163
2013-09-07 13:35:23 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg197160
2013-09-07 13:15:24 | vajrasky | set | messages: + msg197158
2013-09-07 13:10:19 | vajrasky | set | nosy: + vajrasky; messages: + msg197155
2013-09-07 12:33:15 | Gallaecio | create |