classification
Title: json.load fails to read UTF-8 file with (BOM) Byte Order Marks
Type: behavior
Stage: resolved
Components:
Versions: Python 2.7
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: Kristian.Benoit, cvrebert, santoso.wijaya, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2014-05-14 20:32 by Kristian.Benoit, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
matieres.json Kristian.Benoit, 2014-05-14 20:32 empty object not parsable by json.load
json.patch Kristian.Benoit, 2014-05-17 15:06 Skip the BOM if present
json.v2.patch Kristian.Benoit, 2014-05-17 16:17 This patch seeks to the initial position instead of 0.
Messages (7)
msg218573 - (view) Author: Kristian Benoit (Kristian.Benoit) * Date: 2014-05-14 20:32
I'm trying to parse a JSON file and keep getting a ValueError. The `file` command reports the file as being "UTF-8 Unicode (with BOM) text", vim reports it as UTF-8, ...

The json.load docs say it supports UTF-8 out of the box.

Here's a link to the file : http://donnees.ville.sherbrooke.qc.ca/storage/f/2014-03-10T17%3A45%3A18.959Z/matieres-residuelles.json
msg218579 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-05-14 21:49
In Python 2, json.loads() accepts str and unicode types. You can support JSON starting with a UTF-8 BOM using the Python codec "utf-8-sig". Example:

>>> import codecs, json
>>> codecs.BOM_UTF8 + b'{\n}'
'\xef\xbb\xbf{\n}'
>>> json.loads(codecs.BOM_UTF8 + b'{\n}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads((codecs.BOM_UTF8 + b'{\n}').decode('utf-8-sig'))
{}
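For a file on disk rather than an in-memory string, the same codec can be given when opening the file. The following is only a minimal sketch: `load_json_skipping_bom` is a hypothetical helper name, the temporary file stands in for the reporter's data, and `io.open` is used so the snippet should behave the same on Python 2.7 and Python 3.

```python
import codecs
import io
import json
import os
import tempfile


def load_json_skipping_bom(path):
    """Hypothetical helper: parse a UTF-8 JSON file, with or without a BOM."""
    # "utf-8-sig" drops a leading BOM if one is present and otherwise
    # decodes exactly like "utf-8", so plain UTF-8 files still work.
    with io.open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


# Write a small BOM-prefixed file to demonstrate (stand-in for real data).
fd, demo_path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF8 + b'{"ok": true}')

result = load_json_skipping_bom(demo_path)
os.remove(demo_path)
```

Because "utf-8-sig" also accepts input without a BOM, it can be used as the default whenever the encoding is known to be UTF-8.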
msg218594 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-05-15 07:08
Currently json.load/loads don't support binary input. See issue17909 and issue19837.
msg218643 - (view) Author: Chris Rebert (cvrebert) * Date: 2014-05-16 04:45
The new JSON RFC now at least mentions BOM handling:
https://tools.ietf.org/html/rfc7159#section-8.1 :
> Implementations MUST NOT add a byte order mark to the beginning of a
> JSON text.  In the interests of interoperability, implementations
> that parse JSON texts MAY ignore the presence of a byte order mark
> rather than treating it as an error.
msg218705 - (view) Author: Kristian Benoit (Kristian.Benoit) * Date: 2014-05-17 15:06
I added code to skip the BOM if present when encoding is either None or "utf-8". The problem I have with Victor's solution is that users don't know these files are not plain UTF-8. Most text editors say the file is UTF-8 encoded, so how can a user figure out that there are 3 hidden bytes at the start of the file?

Kristian
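One way a user could check for the hidden bytes is to look at the raw start of the data. This is a rough sketch only; `has_utf8_bom` and `file_has_utf8_bom` are hypothetical helper names and are not part of the attached patches.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the byte string starts with the 3-byte UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path):
    """Read only the first bytes of a file and check them for the BOM."""
    with open(path, "rb") as handle:
        return has_utf8_bom(handle.read(len(codecs.BOM_UTF8)))


with_bom = has_utf8_bom(codecs.BOM_UTF8 + b"{}")
without_bom = has_utf8_bom(b"{}")
```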
msg218823 - (view) Author: Santoso Wijaya (santoso.wijaya) * Date: 2014-05-19 22:33
I think you should use codecs.BOM_UTF8 rather than the hardcoded string "\xef\xbb\xbf" directly.

And why special-case UTF-8 while we're at it? What about other encodings and their BOMs?
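As an illustration of what handling the other BOMs might look like, here is a hedged sketch built on the constants in `codecs`. The `codec_from_bom` name and the table are assumptions made for this example, not what the attached patches do. UTF-32 is tested before UTF-16 because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# (BOM bytes, codec that consumes that BOM while decoding).
# Order matters: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def codec_from_bom(data, default="utf-8"):
    """Hypothetical helper: pick a codec name from a leading BOM, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


sample = u"{}".encode("utf-32")  # the "utf-32" codec writes a BOM on encode
decoded = sample.decode(codec_from_bom(sample))
```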
msg289168 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-07 15:53
This issue is outdated now that automatic encoding detection has been implemented in issue17909.
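For illustration, a short sketch of the behavior after that change, assuming Python 3.6 or later, where json.loads() accepts bytes. The variable names are only for this example. Note that a BOM in an already-decoded str is still rejected, so the "utf-8-sig" codec remains useful for text input.

```python
import codecs
import json

raw = codecs.BOM_UTF8 + b'{"ok": true}'

# Bytes input: the encoding is detected and the UTF-8 BOM is skipped.
parsed = json.loads(raw)

# Text input that still carries the BOM character is treated as an error.
try:
    json.loads(raw.decode("utf-8"))
except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
    text_error = exc
else:
    text_error = None
```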
History
Date User Action Args
2022-04-11 14:58:03 admin set github: 65708
2017-03-07 15:53:47 serhiy.storchaka set status: open -> closed; resolution: out of date; messages: + msg289168; stage: resolved
2014-05-19 22:33:07 santoso.wijaya set nosy: + santoso.wijaya; messages: + msg218823
2014-05-17 16:17:42 Kristian.Benoit set files: + json.v2.patch
2014-05-17 15:07:00 Kristian.Benoit set files: + json.patch; keywords: + patch; messages: + msg218705
2014-05-16 04:45:26 cvrebert set nosy: + cvrebert; messages: + msg218643
2014-05-15 07:08:40 serhiy.storchaka set messages: + msg218594
2014-05-15 00:52:14 pitrou set nosy: + serhiy.storchaka
2014-05-14 21:49:58 vstinner set nosy: + vstinner; messages: + msg218579
2014-05-14 20:32:52 Kristian.Benoit create