classification
Title: Autodetecting JSON encoding
Type: enhancement Stage: resolved
Components: Library (Lib), Unicode Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: 12892 Superseder:
Assigned To: ncoghlan Nosy List: Julian, akira, berker.peksag, cvrebert, ezio.melotti, gsnedders, haypo, jleedev, martin.panter, matrixise, ncoghlan, pitrou, python-dev, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-05-05 13:10 by serhiy.storchaka, last changed 2016-09-10 10:24 by ncoghlan. This issue is now closed.

Files
File name Uploaded Description
json_detect_encoding_2.patch serhiy.storchaka, 2014-05-15 12:38
json_detect_encoding_3.patch serhiy.storchaka, 2016-06-22 16:57
Messages (10)
msg188442 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-05 13:10
RFC 4627 specifies a method for determining the encoding (one of UTF-8, UTF-16(BE|LE) or UTF-32(BE|LE)) of encoded JSON text. The proposed preliminary patch (it doesn't include the documentation yet) allows the load() and loads() functions to accept bytes data when it is encoded with a standard Unicode encoding. Data with a BOM is also accepted (this isn't specified in RFC 4627, but is widely used).

There is only one case where the method can misfire. The serialized string "\x00..." encoded in UTF-16LE may be erroneously detected as UTF-32LE. This case violates two rules of RFC 4627: a string was serialized instead of an object or an array, and the control character U+0000 was not escaped. Standard-conforming encoded JSON is always detected correctly.

This patch requires the "surrogatepass" error handler for utf-16/32 (see issue12892 and issue13916).
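
The detection described in this message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the attached patch: the function name `detect_json_encoding` and the table name `_RFC4627_PATTERNS` are hypothetical, the zero-byte table is the one given in RFC 4627 section 3, and the BOM check is the extension mentioned above. The last call reproduces the UTF-16LE misfire.

```python
import codecs

# Zero/non-zero pattern of the first four bytes, as tabulated in RFC 4627 section 3.
# True means "this byte is 0x00". (Hypothetical helper table, not from the patch.)
_RFC4627_PATTERNS = {
    (True, True, True, False): "utf-32-be",   # 00 00 00 xx
    (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
    (False, True, True, True): "utf-32-le",   # xx 00 00 00
    (False, True, False, True): "utf-16-le",  # xx 00 xx 00
}


def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of encoded JSON text (sketch only)."""
    # BOMs are not covered by RFC 4627 but occur in practice. UTF-32 must be
    # tested first because the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if len(data) >= 4:
        zeros = tuple(byte == 0 for byte in data[:4])
        return _RFC4627_PATTERNS.get(zeros, "utf-8")
    # Too short for the four-byte table: fall back to the default encoding.
    return "utf-8"


print(detect_json_encoding('{"a": 1}'.encode("utf-16-le")))  # utf-16-le
print(detect_json_encoding('[1, 2]'.encode("utf-32-be")))    # utf-32-be

# The misfire: a bare string starting with an unescaped U+0000, in UTF-16LE,
# begins with the bytes 22 00 00 00 and so matches the UTF-32LE row.
print(detect_json_encoding('"\x00"'.encode("utf-16-le")))    # utf-32-le
```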
msg218608 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-05-15 12:38
All dependencies for this issue are resolved now.

Here is an updated patch, synchronized with tip.
msg218616 - (view) Author: Chris Rebert (cvrebert) * Date: 2014-05-15 16:07
You'll need to also update the "Character Encodings" subsection of the json docs.
msg218640 - (view) Author: Akira Li (akira) * Date: 2014-05-16 02:39
Neither the JSON standard (ECMA-404) [1] nor the new JSON RFC 7159 [2] mentions
encoding detection.

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
[2] https://tools.ietf.org/html/rfc7159#section-8.1

From the rfc:

> JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.  The default
  encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
  interoperable in the sense that they will be read successfully by the
  maximum number of implementations; there are many implementations
  that cannot successfully read texts in other encodings (such as
  UTF-16 and UTF-32).

  Implementations MUST NOT add a byte order mark to the beginning of a
  JSON text.  In the interests of interoperability, implementations
  that parse JSON texts MAY ignore the presence of a byte order mark
  rather than treating it as an error.
msg218641 - (view) Author: Chris Rebert (cvrebert) * Date: 2014-05-16 04:20
I agree that the state of encoding detection in the new RFC seems unclear, given that the old RFC prefaced the part about the encoding detection with:
> Since the first two characters of a JSON text will always be ASCII
> characters

But in the new RFC:
> Appendix A.  Changes from RFC 4627
[...]
>    o  Changed the definition of "JSON text" so that it can be any JSON
>       value, removing the constraint that it be an object or array.

Thus,
> "ಠ_ಠ"
whose 2nd character is decidedly non-ASCII, is now a valid JSON text (i.e. standalone JSON document).

There seems to have been a thread about encoding detection in the RFC 7159 working group, but I don't have the time to read through it all:

> Re: [Json] JSON: remove gap between Ecma-404 and IETF draft
> http://www.ietf.org/mail-archive/web/json/current/msg01936.html

It eventually leads to a dedicated sub-thread:

> [Json] Encoding detection (Was: Re: JSON: remove gap between Ecma-404 and IETF draft)
> http://www.ietf.org/mail-archive/web/json/current/msg01959.html
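
To make the problem concrete, the following sketch shows which of the first four bytes are zero when the standalone string above is encoded in each candidate encoding. The helper `zero_pattern` is a hypothetical name introduced here for illustration; the patterns it produces are meant to be compared against the table in RFC 4627 section 3.

```python
def zero_pattern(data: bytes) -> str:
    """Render the first four bytes as '00' (zero) or 'xx' (non-zero)."""
    return " ".join("00" if byte == 0 else "xx" for byte in data[:4])


# The standalone string from the message: a quote, U+0CA0, '_', U+0CA0, a quote.
text = '"\u0ca0_\u0ca0"'

encodings = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
patterns = {enc: zero_pattern(text.encode(enc)) for enc in encodings}

for enc, pattern in patterns.items():
    print(f"{enc:10} {pattern}")
```

UTF-16LE gives `xx 00 xx xx` and UTF-16BE gives `00 xx xx xx`. Neither pattern appears in the RFC 4627 table, which assumed the second character would be ASCII.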
msg230053 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-10-27 01:06
If you adjusted the detect_encoding() logic according to Pete Cordell’s table at the bottom of <http://www.ietf.org/mail-archive/web/json/current/msg01959.html>, it might work for standalone strings.

However since the RFC encourages UTF-8 for best interoperability, I wonder if any of this autodetection is necessary. It might be simpler to just assume UTF-8, or use the “utf-8-sig” codec. Or are there real cases where detecting UTF-16 or -32 would be useful?
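
The simpler alternative raised here could look something like the sketch below. It assumes the input is always UTF-8 and relies on the standard “utf-8-sig” codec to drop an optional BOM; the wrapper name `loads_utf8_only` is hypothetical and is not part of the json module.

```python
import json


def loads_utf8_only(data: bytes):
    """Parse JSON bytes, assuming UTF-8 with an optional leading BOM."""
    # "utf-8-sig" decodes UTF-8 and strips a leading BOM if one is present.
    return json.loads(data.decode("utf-8-sig"))


plain = '{"key": "value"}'.encode("utf-8")
with_bom = '{"key": "value"}'.encode("utf-8-sig")

print(loads_utf8_only(plain))
print(loads_utf8_only(with_bom))
```

With this approach, UTF-16 or UTF-32 input is not decoded at all: it fails with a ValueError, either while decoding or while parsing.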
msg273908 - (view) Author: Stéphane Wirtel (matrixise) * Date: 2016-08-30 10:36
Hi Serhiy,

I have reviewed your patch; it seems to be OK.
msg275611 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-09-10 10:07
Having hit the json.loads() problem recently when porting a project to Python 3, I'm keen to see this land for 3.6.

Accordingly, assigning to myself to review and merge Serhiy's patch - if it proves necessary, we can tweak the details of the encoding detection during beta.
msg275612 - (view) Author: Roundup Robot (python-dev) Date: 2016-09-10 10:16
New changeset e9e1bf9ec2ac by Nick Coghlan in branch 'default':
Issue #17909: Accept binary input in json.loads
https://hg.python.org/cpython/rev/e9e1bf9ec2ac
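
A short usage sketch of the resulting behaviour, assuming Python 3.6 or later, where json.loads() accepts bytes in addition to str. The sample document and variable names are made up for illustration.

```python
import json

# Sample document (illustrative only); "\u00e9" is a non-ASCII character.
doc = {"name": "caf\u00e9", "ids": [1, 2, 3]}
text = json.dumps(doc, ensure_ascii=False)

encodings = (
    "utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be",  # no BOM
    "utf-8-sig", "utf-16", "utf-32",                              # with BOM
)

# json.loads() is given raw bytes; the encoding is detected automatically.
results = {enc: json.loads(text.encode(enc)) for enc in encodings}

for enc, value in results.items():
    print(f"{enc:10} {value == doc}")
```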
msg275614 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-09-10 10:18
Thanks for tackling this Serhiy!

I removed issue 13916 from the dependency list, as while that's a reasonable suggestion, I don't think this fix is conditional on that change.
History
Date User Action Args
2016-09-10 10:24:12 ncoghlan set status: open -> closed
stage: commit review -> resolved
2016-09-10 10:23:39 ncoghlan link issue22555 dependencies
2016-09-10 10:21:47 ncoghlan link issue10976 superseder
2016-09-10 10:18:20 ncoghlan set resolution: fixed
dependencies: - disallow the "surrogatepass" handler for non utf-* encodings
messages: + msg275614
2016-09-10 10:16:45 python-dev set nosy: + python-dev
messages: + msg275612
2016-09-10 10:07:43 ncoghlan set assignee: serhiy.storchaka -> ncoghlan
messages: + msg275611
2016-08-30 10:38:52 matrixise set stage: patch review -> commit review
2016-08-30 10:36:24 matrixise set nosy: + matrixise
messages: + msg273908
2016-06-22 16:57:38 serhiy.storchaka set files: + json_detect_encoding_3.patch
versions: + Python 3.6, - Python 3.5
2016-05-03 19:41:56 gsnedders set nosy: + gsnedders
2015-03-28 03:21:56 berker.peksag set nosy: + berker.peksag
2014-10-27 01:06:14 martin.panter set messages: + msg230053
2014-10-25 01:14:10 martin.panter set nosy: + martin.panter
2014-05-16 04:20:11 cvrebert set messages: + msg218641
2014-05-16 02:39:43 akira set nosy: + akira
messages: + msg218640
2014-05-15 16:07:28 cvrebert set messages: + msg218616
2014-05-15 12:39:43 serhiy.storchaka set files: - json_detect_encoding.patch
2014-05-15 12:38:58 serhiy.storchaka set files: + json_detect_encoding_2.patch
messages: + msg218608
2014-05-15 07:26:13 haypo set nosy: + haypo
2014-03-29 01:40:25 cvrebert set nosy: + cvrebert
2014-03-04 12:42:50 jleedev set nosy: + jleedev
2013-12-02 03:18:12 Julian set nosy: + Julian
2013-12-01 00:08:35 pitrou set versions: + Python 3.5, - Python 3.4
2013-11-30 11:07:13 pitrou set nosy: + ncoghlan
2013-08-10 14:31:38 serhiy.storchaka set stage: patch review
2013-05-05 13:11:10 serhiy.storchaka set dependencies: + UTF-16 and UTF-32 codecs should reject (lone) surrogates, disallow the "surrogatepass" handler for non utf-* encodings
2013-05-05 13:10:30 serhiy.storchaka create