Issue 21509: json.load fails to read UTF-8 file with (BOM) Byte Order Marks

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65708

classification

Title:	json.load fails to read UTF-8 file with (BOM) Byte Order Marks
Type:	behavior	Stage:	resolved
Components:		Versions:	Python 2.7

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:		Nosy List:	Kristian.Benoit, cvrebert, santoso.wijaya, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2014-05-14 20:32 by Kristian.Benoit, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
matieres.json	Kristian.Benoit, 2014-05-14 20:32	empty object not parsable by json.load
json.patch	Kristian.Benoit, 2014-05-17 15:06	Skip the BOM if present
json.v2.patch	Kristian.Benoit, 2014-05-17 16:17	This patch seeks to the initial position instead of 0.

Messages (7)
msg218573	Author: Kristian Benoit (Kristian.Benoit) *	Date: 2014-05-14 20:32
I'm trying to parse a JSON file and keep getting ValueError. The file command reports the file as "UTF-8 Unicode (with BOM) text", vim reports it as UTF-8, ... The json.load docs say it supports UTF-8 out of the box. Here's a link to the file: http://donnees.ville.sherbrooke.qc.ca/storage/f/2014-03-10T17%3A45%3A18.959Z/matieres-residuelles.json
msg218579	Author: STINNER Victor (vstinner) *	Date: 2014-05-14 21:49
In Python 2, json.loads() accepts str and unicode types. You can support JSON starting with a UTF-8 BOM using the Python codec "utf-8-sig". Example:
>>> codecs.BOM_UTF8 + b'{\n}'
'\xef\xbb\xbf{\n}'
>>> json.loads(codecs.BOM_UTF8 + b'{\n}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads((codecs.BOM_UTF8 + b'{\n}').decode('utf-8-sig'))
{}
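The "utf-8-sig" workaround from msg218579 also applies when reading from a file. The following is a minimal sketch, assuming Python 3 rather than the 2.7 session above; the helper name load_json_with_bom and the temporary file are illustrative assumptions, not part of the json module or of the attached patches.

```python
import codecs
import json
import os
import tempfile


def load_json_with_bom(path):
    """Load JSON from a UTF-8 file, tolerating an optional leading BOM.

    Hypothetical helper for illustration only.
    """
    # "utf-8-sig" drops a leading BOM if one is present and otherwise
    # decodes exactly like "utf-8".
    with open(path, encoding="utf-8-sig") as fp:
        return json.load(fp)


# Write a small BOM-prefixed file to demonstrate the helper.
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "wb") as fp:
    fp.write(codecs.BOM_UTF8 + b'{"a": 1}')

result = load_json_with_bom(path)
os.remove(path)
print(result)
```

Because "utf-8-sig" treats the BOM as optional, the same call should also work for plain UTF-8 files, so the caller does not need to know in advance whether the hidden bytes are there.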
msg218594	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-05-15 07:08
Currently json.load/loads don't support binary input. See issue17909 and issue19837.
msg218643	Author: Chris Rebert (cvrebert) *	Date: 2014-05-16 04:45
The new JSON RFC now at least mentions BOM handling: https://tools.ietf.org/html/rfc7159#section-8.1 :
> Implementations MUST NOT add a byte order mark to the beginning of a
> JSON text. In the interests of interoperability, implementations
> that parse JSON texts MAY ignore the presence of a byte order mark
> rather than treating it as an error.
msg218705	Author: Kristian Benoit (Kristian.Benoit) *	Date: 2014-05-17 15:06
I added code to skip the BOM if present when encoding is either None or "utf-8". The problem I have with Victor's solution is that users don't know these files are not plain UTF-8. Most text editors say the file is UTF-8 encoded, so how can a user figure out there are 3 hidden bytes at the start of the file? Kristian
msg218823	Author: Santoso Wijaya (santoso.wijaya) *	Date: 2014-05-19 22:33
I think you should use codecs.BOM_UTF8 rather than the hardcoded string "\xef\xbb\xbf" directly. And why special-case UTF-8 while we're at it? What about other encodings and their BOMs?
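The question about other encodings can be illustrated with the BOM constants in the codecs module. This is a rough sketch under stated assumptions, written for Python 3; the function decode_with_bom and the table _BOMS are hypothetical names and do not reflect what json.patch or json.v2.patch actually do.

```python
import codecs

# UTF-32 must be checked before UTF-16, because BOM_UTF32_LE begins with
# the same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF8, "utf-8-sig"),
]


def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the encoding implied by a leading BOM, if any.

    Hypothetical helper for illustration only.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM,
            # so it does not appear in the decoded text.
            return data.decode(encoding)
    return data.decode(default)
```

For example, decode_with_bom(codecs.BOM_UTF16_LE + '{}'.encode('utf-16-le')) yields the text '{}' with no BOM character, which json.loads can then parse.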
msg289168	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-03-07 15:53
This issue is outdated since automatic encoding detection was implemented in issue17909.
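As a rough illustration of why the issue became outdated, the sketch below assumes Python 3.6 or later, where json.loads() accepts bytes and detects the encoding itself. It reuses the input from msg218579; the variable names are illustrative.

```python
import codecs
import json

data = codecs.BOM_UTF8 + b'{\n}'

# Bytes input: the encoding, including a leading UTF-8 BOM, is detected
# automatically, so no explicit "utf-8-sig" decode is needed.
from_bytes = json.loads(data)

# Text input that still carries the BOM character is rejected, so decoding
# with plain "utf-8" before parsing does not help.
try:
    json.loads(data.decode("utf-8"))
    rejected = False
except ValueError:
    rejected = True

print(from_bytes, rejected)
```

The practical upshot is that opening the file in binary mode and passing it to json.load() should be enough on those versions, while text-mode callers still need "utf-8-sig".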

History
Date	User	Action	Args
2022-04-11 14:58:03	admin	set	github: 65708
2017-03-07 15:53:47	serhiy.storchaka	set	status: open -> closed resolution: out of date messages: + msg289168 stage: resolved
2014-05-19 22:33:07	santoso.wijaya	set	nosy: + santoso.wijaya messages: + msg218823
2014-05-17 16:17:42	Kristian.Benoit	set	files: + json.v2.patch
2014-05-17 15:07:00	Kristian.Benoit	set	files: + json.patch keywords: + patch messages: + msg218705
2014-05-16 04:45:26	cvrebert	set	nosy: + cvrebert messages: + msg218643
2014-05-15 07:08:40	serhiy.storchaka	set	messages: + msg218594
2014-05-15 00:52:14	pitrou	set	nosy: + serhiy.storchaka
2014-05-14 21:49:58	vstinner	set	nosy: + vstinner messages: + msg218579
2014-05-14 20:32:52	Kristian.Benoit	create