Autodetecting JSON encoding #62109

Closed
serhiy-storchaka opened this issue May 5, 2013 · 13 comments

@serhiy-storchaka
Member

BPO 17909
Nosy @rhettinger, @ncoghlan, @pitrou, @vstinner, @ezio-melotti, @methane, @4kir4, @Julian, @berkerpeksag, @vadmium, @serhiy-storchaka, @jleedev, @matrixise, @gsnedders, @miss-islington
PRs
  • bpo-17909: Document that json.load can accept a binary IO #7366
  • [3.7] bpo-17909: Document that json.load can accept a binary IO (GH-7366) #7474
  • [3.6] bpo-17909: Document that json.load can accept a binary IO (GH-7366) #7475
Dependencies
  • bpo-12892: UTF-16 and UTF-32 codecs should reject (lone) surrogates
Files
  • json_detect_encoding_2.patch
  • json_detect_encoding_3.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ncoghlan'
    closed_at = <Date 2016-09-10.10:24:12.011>
    created_at = <Date 2013-05-05.13:10:30.725>
    labels = ['type-feature', 'library', 'expert-unicode']
    title = 'Autodetecting JSON encoding'
    updated_at = <Date 2018-06-07.10:21:22.636>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2018-06-07.10:21:22.636>
    actor = 'miss-islington'
    assignee = 'ncoghlan'
    closed = True
    closed_date = <Date 2016-09-10.10:24:12.011>
    closer = 'ncoghlan'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2013-05-05.13:10:30.725>
    creator = 'serhiy.storchaka'
    dependencies = ['12892']
    files = ['35258', '43513']
    hgrepos = []
    issue_num = 17909
    keywords = ['patch']
    message_count = 13.0
    messages = ['188442', '218608', '218616', '218640', '218641', '230053', '273908', '275611', '275612', '275614', '318918', '318920', '318922']
    nosy_count = 17.0
    nosy_names = ['rhettinger', 'ncoghlan', 'pitrou', 'vstinner', 'ezio.melotti', 'cvrebert', 'methane', 'akira', 'Julian', 'python-dev', 'berker.peksag', 'martin.panter', 'serhiy.storchaka', 'jleedev', 'matrixise', 'gsnedders', 'miss-islington']
    pr_nums = ['7366', '7474', '7475']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue17909'
    versions = ['Python 3.6']

    @serhiy-storchaka
    Member Author

    RFC 4627 specifies a method for determining the encoding (one of UTF-8, UTF-16(BE|LE) or UTF-32(BE|LE)) of encoded JSON text. The proposed preliminary patch (it doesn't include the documentation yet) allows the load() and loads() functions to accept bytes data when it is encoded with a standard Unicode encoding. Data with a BOM is also accepted (this isn't specified in RFC 4627, but is widely used).

    There is only one case where the method can misfire. The serialized string "\x00..." encoded in UTF-16LE may be erroneously detected as UTF-32LE. This case violates two rules of RFC 4627: a string was serialized instead of an object or an array, and the control character U+0000 was not escaped. Standard-conforming encoded JSON is always detected correctly.

    This patch requires the "surrogatepass" error handler for utf-16/32 (see bpo-12892 and bpo-13916).
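
The detection rule from RFC 4627, section 3, can be sketched as a lookup on which of the first four bytes are null. This is only an illustration, not the code from the attached patch: the function name `detect_rfc4627` and the UTF-8 fallback for inputs shorter than four bytes are assumptions made for the example, and BOM handling is left out.

```python
def detect_rfc4627(data: bytes) -> str:
    """Guess the encoding from the null-byte pattern of the first four bytes.

    Hypothetical helper for illustration; not the patch's detect_encoding().
    """
    head = data[:4]
    if len(head) < 4:
        # Assumption: too short for the RFC table, fall back to UTF-8.
        return "utf-8"
    nulls = tuple(byte == 0 for byte in head)
    table = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Any other pattern (xx xx xx xx in the RFC) is treated as UTF-8.
    return table.get(nulls, "utf-8")


conforming = detect_rfc4627('{"a": 1}'.encode("utf-16-le"))
# A bare string starting with an unescaped U+0000, encoded as UTF-16LE,
# begins with 22 00 00 00, which matches the UTF-32LE row of the table.
misfire = detect_rfc4627('"\x00"'.encode("utf-16-le"))
```

Here `conforming` is `"utf-16-le"`, while `misfire` is `"utf-32-le"`, which reproduces the one ambiguous case described above.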

    @serhiy-storchaka serhiy-storchaka self-assigned this May 5, 2013
    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement labels May 5, 2013
    @serhiy-storchaka
    Member Author

    All dependencies for this issue are resolved now.

    Here is an updated patch, synchronized with tip.

    @cvrebert
    Mannequin

    cvrebert mannequin commented May 15, 2014

    You'll need to also update the "Character Encodings" subsection of the json docs.

    @4kir4
    Mannequin

    4kir4 mannequin commented May 16, 2014

    Neither the JSON standard (ECMA-404) [1] nor the new JSON RFC 7159 [2] mentions
    encoding detection.

    [1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
    [2] https://tools.ietf.org/html/rfc7159#section-8.1

    From the rfc:

    JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
    encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
    interoperable in the sense that they will be read successfully by the
    maximum number of implementations; there are many implementations
    that cannot successfully read texts in other encodings (such as
    UTF-16 and UTF-32).

    Implementations MUST NOT add a byte order mark to the beginning of a
    JSON text. In the interests of interoperability, implementations
    that parse JSON texts MAY ignore the presence of a byte order mark
    rather than treating it as an error.

    @cvrebert
    Mannequin

    cvrebert mannequin commented May 16, 2014

    I agree that the state of encoding detection in the new RFC seems unclear, given that the old RFC prefaced the part about the encoding detection with:

    Since the first two characters of a JSON text will always be ASCII
    characters

    But in the new RFC:

    Appendix A. Changes from RFC 4627
    [...]
    o Changed the definition of "JSON text" so that it can be any JSON
    value, removing the constraint that it be an object or array.

    Thus,

    "ಠ_ಠ"
    whose 2nd character is decidedly non-ASCII, is now a valid JSON text (i.e. standalone JSON document).
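
To make the problem concrete, the following sketch looks at the first four bytes of that JSON text in UTF-16LE. It assumes the RFC 4627 pattern for UTF-16LE (`xx 00 xx 00`); the variable names are only illustrative.

```python
# '"ಠ_ಠ"' written with escapes: U+0CA0 is the non-ASCII character.
text = '"\u0ca0_\u0ca0"'

head = text.encode("utf-16-le")[:4]

# RFC 4627 expects xx 00 xx 00 for UTF-16LE, i.e. bytes 1 and 3 are null.
# The quote gives 22 00, but U+0CA0 gives A0 0C, so byte 3 is not null.
fits_rfc4627_utf16le = head[1] == 0 and head[3] == 0
```

With this input `head` is `b'"\x00\xa0\x0c'` and `fits_rfc4627_utf16le` is `False`, so a detector that relies only on the RFC 4627 table would not recognise the text as UTF-16LE.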

    There seems to have been a thread about encoding detection in the RFC 7159 working group, but I don't have the time to read through it all:

    Re: [Json] JSON: remove gap between Ecma-404 and IETF draft
    http://www.ietf.org/mail-archive/web/json/current/msg01936.html

    It eventually leads to a dedicated sub-thread:

    [Json] Encoding detection (Was: Re: JSON: remove gap between Ecma-404 and IETF draft)
    http://www.ietf.org/mail-archive/web/json/current/msg01959.html

    @vadmium
    Member

    vadmium commented Oct 27, 2014

    If you adjusted the detect_encoding() logic according to Pete Cordell’s table at the bottom of <http://www.ietf.org/mail-archive/web/json/current/msg01959.html>, it might work for standalone strings.

    However since the RFC encourages UTF-8 for best interoperability, I wonder if any of this autodetection is necessary. It might be simpler to just assume UTF-8, or use the “utf-8-sig” codec. Or are there real cases where detecting UTF-16 or -32 would be useful?
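
A minimal sketch of the simpler alternative, assuming the input is known to be UTF-8: decoding with the standard “utf-8-sig” codec strips a leading BOM if one is present and behaves like plain UTF-8 otherwise. The sample payload is made up for the example.

```python
import codecs
import json

raw = b'{"key": "value"}'
raw_with_bom = codecs.BOM_UTF8 + raw

# "utf-8-sig" accepts input both with and without a UTF-8 BOM.
from_bom = json.loads(raw_with_bom.decode("utf-8-sig"))
from_plain = json.loads(raw.decode("utf-8-sig"))
```

Both calls produce the same dictionary, but this approach does nothing for UTF-16 or UTF-32 input, which is the trade-off raised in the question above.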

    @matrixise
    Member

    Hi Serhiy,

    I have reviewed your patch, it seems to be ok.

    @ncoghlan
    Contributor

    Having hit the json.loads() problem recently when porting a project to Python 3, I'm keen to see this land for 3.6.

    Accordingly, assigning to myself to review and merge Serhiy's patch - if it proves necessary, we can tweak the details of the encoding detection during beta.

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 10, 2016

    New changeset e9e1bf9ec2ac by Nick Coghlan in branch 'default':
    Issue bpo-17909: Accept binary input in json.loads
    https://hg.python.org/cpython/rev/e9e1bf9ec2ac
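
As a rough illustration of the resulting behaviour in Python 3.6 and later, `json.loads()` can be given `bytes` encoded in UTF-8, UTF-16 or UTF-32, with or without a BOM. The sample document is invented for the example.

```python
import json

doc = {"name": "café", "values": [1, 2, 3]}
text = json.dumps(doc, ensure_ascii=False)

encodings = (
    "utf-8", "utf-8-sig",
    "utf-16", "utf-16-le", "utf-16-be",   # "utf-16" prepends a BOM
    "utf-32", "utf-32-le", "utf-32-be",   # "utf-32" prepends a BOM
)

# Each value is the document decoded from bytes, with no explicit
# .decode() call and no encoding argument.
decoded = {enc: json.loads(text.encode(enc)) for enc in encodings}
```

Every entry in `decoded` compares equal to `doc`.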

    @ncoghlan
    Contributor

    Thanks for tackling this Serhiy!

    I removed bpo-13916 from the dependency list, as while that's a reasonable suggestion, I don't think this fix is conditional on that change.

    @methane
    Member

    methane commented Jun 7, 2018

    New changeset bb6366b by INADA Naoki (Anthony Sottile) in branch 'master':
    bpo-17909: Document that json.load can accept a binary IO (GH-7366)
    bb6366b
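
The documented behaviour can be sketched as follows: `json.load()` reads from a binary file object as well as a text one. `io.BytesIO` stands in for a file opened in `"rb"` mode, and the payload is made up for the example.

```python
import io
import json

# A binary stream holding UTF-16 encoded JSON (with BOM).
stream = io.BytesIO('{"ok": true, "items": [1, 2]}'.encode("utf-16"))

# json.load() reads the bytes and detects the encoding itself.
data = json.load(stream)
```

After this, `data` is `{"ok": True, "items": [1, 2]}`.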

    @miss-islington
    Contributor

    New changeset f38ace6 by Miss Islington (bot) in branch '3.7':
    bpo-17909: Document that json.load can accept a binary IO (GH-7366)
    f38ace6

    @miss-islington
    Contributor

    New changeset 21f2553 by Miss Islington (bot) in branch '3.6':
    bpo-17909: Document that json.load can accept a binary IO (GH-7366)
    21f2553

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022