json.load fails to read UTF-8 file with (BOM) Byte Order Marks #65708

KristianBenoit · 2014-05-14T20:32:53Z

BPO	21509
Nosy	@vstinner, @serhiy-storchaka
Files	matieres.json: empty object not parsable by json.load; json.patch: Skip the BOM if present; json.v2.patch: This patch seeks to the initial position instead of 0.

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2017-03-07.15:53:47.273>
created_at = <Date 2014-05-14.20:32:52.905>
labels = ['type-bug']
title = 'json.load fails to read UTF-8 file with (BOM) Byte Order Marks'
updated_at = <Date 2017-03-07.15:53:47.272>
user = 'https://bugs.python.org/KristianBenoit'

bugs.python.org fields:

activity = <Date 2017-03-07.15:53:47.272>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date = <Date 2017-03-07.15:53:47.273>
closer = 'serhiy.storchaka'
components = []
creation = <Date 2014-05-14.20:32:52.905>
creator = 'Kristian.Benoit'
dependencies = []
files = ['35254', '35269', '35270']
hgrepos = []
issue_num = 21509
keywords = ['patch']
message_count = 7.0
messages = ['218573', '218579', '218594', '218643', '218705', '218823', '289168']
nosy_count = 5.0
nosy_names = ['vstinner', 'cvrebert', 'santoso.wijaya', 'serhiy.storchaka', 'Kristian.Benoit']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue21509'
versions = ['Python 2.7']

KristianBenoit · 2014-05-14T20:32:53Z

I'm trying to parse a JSON file and keep getting a ValueError. The file command reports the file as "UTF-8 Unicode (with BOM) text", vim reports it as UTF-8, ...

The json.load docs say it supports UTF-8 out of the box.

Here's a link to the file: http://donnees.ville.sherbrooke.qc.ca/storage/f/2014-03-10T17%3A45%3A18.959Z/matieres-residuelles.json

vstinner · 2014-05-14T21:49:59Z

In Python 2, json.loads() accepts str and unicode types. You can support JSON starting with a UTF-8 BOM using the Python codec "utf-8-sig". Example:

>>> import codecs, json
>>> codecs.BOM_UTF8 + b'{\n}'
'\xef\xbb\xbf{\n}'
>>> json.loads(codecs.BOM_UTF8 + b'{\n}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads((codecs.BOM_UTF8 + b'{\n}').decode('utf-8-sig'))
{}
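
For readers on Python 3, the same codec can be passed to open(), so the BOM is dropped while the file is decoded. This is a minimal sketch, not code from the thread; the helper name `load_json_with_bom` is hypothetical and the path is supplied by the caller:

```python
import json


def load_json_with_bom(path):
    """Load JSON from a UTF-8 file that may or may not start with a BOM.

    Hypothetical helper for illustration; it is not part of the json module.
    """
    # "utf-8-sig" skips a leading BOM when one is present and otherwise
    # decodes exactly like plain "utf-8".
    with open(path, encoding="utf-8-sig") as fp:
        return json.load(fp)
```

Because "utf-8-sig" also accepts files without a BOM, the helper can be used for both kinds of input.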

serhiy-storchaka · 2014-05-15T07:08:40Z

Currently json.load/loads don't support binary input. See bpo-17909 and bpo-19837.

cvrebert · 2014-05-16T04:45:27Z

The new JSON RFC now at least mentions BOM handling:
https://tools.ietf.org/html/rfc7159#section-8.1 :

Implementations MUST NOT add a byte order mark to the beginning of a
JSON text. In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.

KristianBenoit · 2014-05-17T15:06:59Z

I added code to skip the BOM if present when encoding is either None or "utf-8". The problem I have with Victor's solution is that users don't know these files are not plain UTF-8. Most text editors say the file is UTF-8 encoded, so how can a user figure out that there are 3 hidden bytes at the start of the file?

Kristian
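
One way a user could check for the hidden bytes is to read the start of the file in binary mode and compare it with codecs.BOM_UTF8. A minimal sketch; the function name `has_utf8_bom` is an assumption made for illustration:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the 3-byte UTF-8 BOM.

    Hypothetical helper; the name is not taken from any patch in this issue.
    """
    # Binary mode is required: text mode would decode the bytes first.
    with open(path, "rb") as fp:
        return fp.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```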

santosowijaya · 2014-05-19T22:33:07Z

I think you should use codecs.BOM_UTF8 rather than hardcoding the string "\xef\xbb\xbf" directly.

And why special-case UTF-8 while we're at it? What about other encodings and their BOMs?
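
To illustrate the question about other encodings, here is a rough sketch of BOM handling built on the constants in the codecs module. It is not the attached patch; the names `_BOMS` and `loads_skipping_bom` are hypothetical, and input without a BOM is assumed to be UTF-8:

```python
import codecs
import json

# Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so the UTF-32 marks have to be tested first.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF8, "utf-8-sig"),
)


def loads_skipping_bom(data, default="utf-8"):
    """Decode bytes according to a leading BOM, then parse them as JSON.

    Hypothetical helper; input without a BOM is decoded with `default`.
    """
    for bom, codec in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM while decoding.
            return json.loads(data.decode(codec))
    return json.loads(data.decode(default))
```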

serhiy-storchaka · 2017-03-07T15:53:47Z

This issue is outdated since automatic encoding detection was implemented in bpo-17909.
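
A short illustration of that behavior, assuming Python 3.6 or later, where json.loads accepts bytes and detects the encoding itself. The variable names are only for this example:

```python
import codecs
import json

# Same payload as in the Python 2 session above: a UTF-8 BOM followed by "{\n}".
raw = codecs.BOM_UTF8 + b'{\n}'

# With bytes input, json.loads detects the encoding (including a UTF-8 BOM)
# before parsing, so no manual decoding step is needed.
result = json.loads(raw)
print(result)  # prints {}
```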

KristianBenoit mannequin added the type-bug An unexpected behavior, bug, or error label May 14, 2014

serhiy-storchaka closed this as completed Mar 7, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022
