json.load fails to read UTF-8 file with (BOM) Byte Order Marks #65708

Closed
KristianBenoit mannequin opened this issue May 14, 2014 · 7 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@KristianBenoit
Mannequin

KristianBenoit mannequin commented May 14, 2014

BPO 21509
Nosy @vstinner, @serhiy-storchaka
Files
  • matieres.json: empty object not parsable by json.load
  • json.patch: Skip the BOM if present
  • json.v2.patch: This patch seeks to the initial position instead of 0.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-03-07.15:53:47.273>
    created_at = <Date 2014-05-14.20:32:52.905>
    labels = ['type-bug']
    title = 'json.load fails to read UTF-8 file with (BOM) Byte Order Marks'
    updated_at = <Date 2017-03-07.15:53:47.272>
    user = 'https://bugs.python.org/KristianBenoit'

    bugs.python.org fields:

    activity = <Date 2017-03-07.15:53:47.272>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-03-07.15:53:47.273>
    closer = 'serhiy.storchaka'
    components = []
    creation = <Date 2014-05-14.20:32:52.905>
    creator = 'Kristian.Benoit'
    dependencies = []
    files = ['35254', '35269', '35270']
    hgrepos = []
    issue_num = 21509
    keywords = ['patch']
    message_count = 7.0
    messages = ['218573', '218579', '218594', '218643', '218705', '218823', '289168']
    nosy_count = 5.0
    nosy_names = ['vstinner', 'cvrebert', 'santoso.wijaya', 'serhiy.storchaka', 'Kristian.Benoit']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue21509'
    versions = ['Python 2.7']

    @KristianBenoit
    Mannequin Author

    KristianBenoit mannequin commented May 14, 2014

    I'm trying to parse a JSON file and keep getting ValueError. The file command reports the file as being "UTF-8 Unicode (with BOM) text", vim reports it as UTF-8, ...

    The json.load docs say it supports UTF-8 out of the box.

    Here's a link to the file : http://donnees.ville.sherbrooke.qc.ca/storage/f/2014-03-10T17%3A45%3A18.959Z/matieres-residuelles.json

    @KristianBenoit KristianBenoit mannequin added the type-bug An unexpected behavior, bug, or error label May 14, 2014
    @vstinner
    Member

    In Python 2, json.loads() accepts str and unicode types. You can support JSON starting with a UTF-8 BOM using the Python codec "utf-8-sig". Example:

    >>> import codecs, json
    >>> codecs.BOM_UTF8 + b'{\n}'
    '\xef\xbb\xbf{\n}'
    >>> json.loads(codecs.BOM_UTF8 + b'{\n}')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
        return _default_decoder.decode(s)
      File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
        raise ValueError("No JSON object could be decoded")
    ValueError: No JSON object could be decoded
    >>> json.loads((codecs.BOM_UTF8 + b'{\n}').decode('utf-8-sig'))
    {}
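The same codec can be applied when reading from disk rather than from an in-memory string. The following is a minimal sketch, not part of the original comment: it writes a throwaway file (the temporary path and its contents are assumptions for illustration) and opens it with io.open so that it should behave the same on Python 2.7 and Python 3.

```python
import codecs
import io
import json
import os
import tempfile

# Create a small JSON file that starts with the UTF-8 BOM (EF BB BF).
# The file and its contents are made up for this illustration.
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b'{"ok": true}')

# "utf-8-sig" drops a leading BOM if one is present and otherwise
# decodes exactly like "utf-8", so it is safe for both kinds of file.
with io.open(path, encoding="utf-8-sig") as f:
    data = json.load(f)

os.remove(path)
print(data)
```

Because "utf-8-sig" also accepts files without a BOM, it can be used as the default for input whose origin is unknown.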

    @serhiy-storchaka
    Member

    Currently json.load/loads don't support binary input. See bpo-17909 and bpo-19837.

    @cvrebert
    Mannequin

    cvrebert mannequin commented May 16, 2014

    The new JSON RFC now at least mentions BOM handling:
    https://tools.ietf.org/html/rfc7159#section-8.1 :

    Implementations MUST NOT add a byte order mark to the beginning of a
    JSON text. In the interests of interoperability, implementations
    that parse JSON texts MAY ignore the presence of a byte order mark
    rather than treating it as an error.
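The "MAY ignore" behaviour the RFC permits can be sketched for already-decoded text as follows. The helper name loads_ignoring_bom is hypothetical and is not part of the json module; it only shows that, once the input is a text string, the BOM is a single U+FEFF character that can be skipped before parsing.

```python
import json


def loads_ignoring_bom(text):
    """Hypothetical helper: parse JSON text, skipping one leading U+FEFF."""
    # After decoding, a BOM from any Unicode encoding is the single
    # character U+FEFF at the very start of the string.
    if text.startswith(u"\ufeff"):
        text = text[1:]
    return json.loads(text)


print(loads_ignoring_bom(u"\ufeff{}"))
print(loads_ignoring_bom(u"[1, 2]"))
```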

    @KristianBenoit
    Mannequin Author

    KristianBenoit mannequin commented May 17, 2014

    I added code to skip the BOM if present when encoding is either None or "utf-8". The problem I have with Victor's solution is that users don't know these files are not plain UTF-8. Most text editors say the file is UTF-8 encoded, so how can a user figure out that there are 3 hidden bytes at the start of the file?
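One way a user could check for the three hidden bytes is to read the start of the file in binary mode and compare it with codecs.BOM_UTF8. This is a minimal sketch; has_utf8_bom is a hypothetical name and the temporary file exists only for the demonstration.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Hypothetical helper: True if the file starts with EF BB BF."""
    # Binary mode is required; in text mode the codec may hide the BOM.
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Demonstration on a throwaway file written with a BOM.
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"{}")

with_bom = has_utf8_bom(path)
os.remove(path)
print(with_bom)
```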

    Kristian

    @santosowijaya
    Mannequin

    santosowijaya mannequin commented May 19, 2014

    I think you should use codecs.BOM_UTF8 rather than the hardcoded string "\xef\xbb\xbf" directly.

    And why special-case UTF-8 while we're at it? What about other encodings and their BOMs?
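To illustrate the question about other encodings, here is a sketch that checks for the UTF-8, UTF-16 and UTF-32 BOMs using the constants from the codecs module. It is not the code from the attached patches; the names _BOMS and decode_with_bom are hypothetical, and falling back to UTF-8 when no BOM is found is an assumption.

```python
import codecs
import json

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the UTF-32 entries must be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(data, default="utf-8"):
    """Hypothetical helper: decode bytes using the BOM, if any, as a hint."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM themselves while decoding.
            return data.decode(codec)
    return data.decode(default)  # assumption: no BOM means UTF-8


utf16_doc = codecs.BOM_UTF16_LE + u"{}".encode("utf-16-le")
print(json.loads(decode_with_bom(utf16_doc)))
```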

    @serhiy-storchaka
    Member

    This issue is outdated since automatic encoding detection was implemented in bpo-17909.
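As a rough check of what that change means in practice, the sketch below assumes Python 3.6 or later, where json.loads accepts bytes and detects the encoding. It reuses the BOM-prefixed input from the earlier comment; the variable names are illustrative only.

```python
import codecs
import json

raw = codecs.BOM_UTF8 + b'{\n}'

# Bytes input: the encoding, including a UTF-8 BOM, is detected for us
# (assumes Python 3.6+; on Python 2.7 this call raises ValueError).
parsed = json.loads(raw)

# Text input that still carries the BOM character is not accepted,
# so decoding with plain "utf-8" first reintroduces the problem.
try:
    json.loads(raw.decode("utf-8"))
    str_with_bom_rejected = False
except ValueError:
    str_with_bom_rejected = True

print(parsed, str_with_bom_rejected)
```

In other words, passing the raw bytes (or a file opened in binary mode) lets the json module handle the BOM, while text input still needs "utf-8-sig".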

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022