json.load fails to read UTF-8 file with (BOM) Byte Order Marks #65708
Comments
I'm trying to parse a JSON file and keep getting ValueError. The `file` command reports it as "UTF-8 Unicode (with BOM) text", vim reports it as UTF-8, ... The json.load docs say it supports UTF-8 out of the box. Here's a link to the file: http://donnees.ville.sherbrooke.qc.ca/storage/f/2014-03-10T17%3A45%3A18.959Z/matieres-residuelles.json
In Python 2, json.loads() accepts both str and unicode. To support JSON that starts with a UTF-8 BOM, decode it with the Python codec "utf-8-sig" first. Example:
>>> import codecs, json
>>> codecs.BOM_UTF8 + b'{\n}'
'\xef\xbb\xbf{\n}'
>>> json.loads(codecs.BOM_UTF8 + b'{\n}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads((codecs.BOM_UTF8 + b'{\n}').decode('utf-8-sig'))
{}
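For reading from a file rather than a byte string, the same "utf-8-sig" idea could be sketched roughly as below. The helper name load_json_with_optional_bom is hypothetical and not part of the json module; io.open is used on the assumption that the code may need to run on both Python 2.7 and Python 3.

```python
import io
import json


def load_json_with_optional_bom(path):
    """Hypothetical helper: parse a UTF-8 JSON file that may start with a BOM."""
    # "utf-8-sig" drops a leading BOM if one is present and otherwise
    # decodes exactly like plain "utf-8".
    with io.open(path, encoding="utf-8-sig") as fp:
        return json.load(fp)
```

Because "utf-8-sig" also accepts files without a BOM, the same call should be usable for both kinds of input.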
The new JSON RFC now at least mentions BOM handling:
I added code to skip the BOM, if present, when encoding is either None or "utf-8". The problem I have with Victor's solution is that users don't know these files are not plain UTF-8. Most text editors say the file is UTF-8 encoded, so how can a user figure out that there are 3 hidden bytes at the start of the file? Kristian
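To make the "3 hidden bytes" concrete, here is a minimal sketch of detecting and skipping a UTF-8 BOM on raw bytes. It is not the patch attached to this issue; has_utf8_bom and strip_utf8_bom are hypothetical names used only for illustration.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the byte string starts with the 3-byte UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM; other input is returned unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

A user could read the first few bytes of a suspect file in binary mode and pass them to has_utf8_bom to check whether the BOM is there.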
I think you should use codecs.BOM_UTF8 rather than the hardcoded string "\xef\xbb\xbf". And why special-case UTF-8 while we're at it? What about other encodings and their BOMs?
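As a rough illustration of what handling the other BOMs might look like, the sketch below maps each BOM constant from the codecs module to a codec that consumes it. The function name codec_from_bom and the "utf-8" default are assumptions made for the example, not anything proposed in this thread.

```python
import codecs

# UTF-32 must be checked before UTF-16: BOM_UTF32_LE begins with the
# same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def codec_from_bom(data, default="utf-8"):
    """Hypothetical helper: pick a codec name based on a leading BOM."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return default
```

The "utf-16" and "utf-32" codecs read the BOM themselves to determine byte order, so the bytes can be decoded with the returned name without stripping anything first.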
This issue is outdated now that automatic encoding detection has been implemented in bpo-17909.
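For anyone landing here later, a small sketch of the behaviour after that change, assuming Python 3.6 or newer, where json.loads() accepts bytes and detects UTF-8, UTF-16 and UTF-32 input. The variable names are only illustrative.

```python
import codecs
import json

raw = codecs.BOM_UTF8 + b'{"a": 1}'

# Bytes input: the encoding (including a UTF-8 BOM) is detected automatically.
parsed = json.loads(raw)

# Text input that still carries the BOM character is rejected, so a str
# decoded with plain "utf-8" needs "utf-8-sig" instead.
try:
    json.loads(raw.decode("utf-8"))
    str_with_bom_rejected = False
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    str_with_bom_rejected = True
```

In practice this means a file with a BOM can be opened in binary mode and passed to json.load(), or opened in text mode with encoding="utf-8-sig".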