This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Wire protocol encoding for the JSON module
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Clay Gerrard, barry, chrism, cvrebert, eric.araujo, ezio.melotti, gregory.p.smith, jleedev, kdwyer, martin.panter, ncoghlan, pitrou, serhiy.storchaka, socketpair, terry.reedy, vstinner
Priority: normal Keywords:

Created on 2013-11-30 02:30 by ncoghlan, last changed 2022-04-11 14:57 by admin.

Messages (26)
msg204764 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-30 02:30
In the Python 3 transition, we had to make a choice regarding whether we treated the JSON module as a text transform (with load[s] reading Unicode code points and dump[s] producing them), or as a text encoding (with load[s] reading binary sequences and dump[s] producing them).

To minimise the changes to the module API, the decision was made to treat it as a text transform, with the text encoding handled externally.

This API design decision doesn't appear to have worked out that well in the web development context, since JSON is typically encountered as a UTF-8 encoded wire protocol, not as already decoded text.

It also makes the module inconsistent with most of the other modules that offer "dumps" APIs, as those *are* specifically about wire protocols (Python 3.4):

>>> import json, marshal, pickle, plistlib, xmlrpc.client
>>> json.dumps('hello')
'"hello"'
>>> marshal.dumps('hello')
b'\xda\x05hello'
>>> pickle.dumps('hello')
b'\x80\x03X\x05\x00\x00\x00helloq\x00.'
>>> plistlib.dumps('hello')
b'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n<plist version="1.0">\n<string>hello</string>\n</plist>\n'

The only module with a dumps function that (like the json module) returns a string is the XML-RPC client module:

>>> xmlrpc.client.dumps(('hello',))
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'

And that's nonsensical, since that XML-RPC API *accepts an encoding argument*, which it now silently ignores:

>>> xmlrpc.client.dumps(('hello',), encoding='utf-8')
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'
>>> xmlrpc.client.dumps(('hello',), encoding='utf-16')
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'

I now believe that an "encoding" parameter should have been added to the json.dump API in the Py3k transition (defaulting to UTF-8), allowing all of the dump/load APIs in the standard library to be consistently about converting to and from a binary wire protocol.

Unfortunately, I don't have a solution to offer at this point (since backwards compatibility concerns rule out the simple solution of just changing the return type). I just wanted to get it on record as a problem (and internal inconsistency within the standard library for dump/load protocols) with the current API.
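
To make the status quo concrete, here is a minimal sketch of the two-step round trip that Python 3 code currently has to perform. The helper names `to_wire` and `from_wire` are hypothetical, and UTF-8 is assumed as the wire encoding:

```python
import json

def to_wire(obj):
    # Hypothetical helper: serialise to str first, then encode for the wire.
    return json.dumps(obj).encode('utf-8')

def from_wire(data):
    # Hypothetical helper: decode the wire bytes first, then parse the text.
    return json.loads(data.decode('utf-8'))

payload = to_wire({'greeting': 'hello'})
assert isinstance(payload, bytes)
assert from_wire(payload) == {'greeting': 'hello'}
```

The encode and decode steps are the part that marshal, pickle and plistlib handle internally.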
msg204765 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-30 02:35
The other simple solution would be to add <name>b variants of the affected APIs. That's a bit ugly though, especially since it still has the problem of making it difficult to write correct cross-version code (although that problem is likely to exist regardless).
msg204799 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-11-30 11:07
Still, JSON itself is not a wire protocol; HTTP is. http://www.json.org states that "JSON is a text format" and the grammar description talks about "UNICODE characters", not bytes. The ECMA spec states that "JSON text is a sequence of Unicode code points".

RFC 4627 is a bit more affirmative, though, and says that "JSON text SHALL be encoded in Unicode [sic]. The default encoding is UTF-8".

Related issues:
- issue #10976: json.loads() raises TypeError on bytes object
- issue #17909 (+ patch!): autodetecting JSON encoding

> The other simple solution would be to add <name>b variants of the affected APIs.

"dumpb" is not very pretty and can easily be misread as "dumb" :-)
"dump_bytes" looks better to me.
msg204805 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-30 11:59
I propose closing this issue as a duplicate of issue10976.
msg204811 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-30 14:08
Not sure yet if we should merge the two issues, although they're the serialisation and deserialisation sides of the same problem.

Haskell seems to have gone with the approach of a separate "jsonb" API for the case where you want the wire protocol behaviour; such a solution may work for us as well.
msg204864 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-01 00:24
I'm -1 for a new module doing almost the same thing. Let's add distinct APIs in the existing json module.
msg204873 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-12-01 01:55
The problem with adding new APIs with different names to the JSON module is that it breaks symmetry with other wire protocols. The quartet of module level load, loads, dump and dumps functions has become a de facto standard API for wire protocols.

If it wasn't for that API convention, the status quo would be substantially less annoying (and confusing) than it currently is.

The advantage of a separate "jsonb" module is that it becomes easy to say "json is the text transform that dumps and loads from a Unicode string, jsonb is the wire protocol that dumps and loads a UTF encoded byte sequence".

Backporting as simplejsonb would also work in a straightforward fashion (since one PyPI package can include multiple top level Python modules).

The same approach would also extend to fixing the xmlrpc module to handle the encoding step properly (if anyone was so inclined).
msg204904 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-01 10:36
> The problem with adding new APIs with different names to the JSON
> module is that it breaks symmetry with other wire protocols. The
> quartet of module level load, loads, dump and dumps functions has
> become a de facto standard API for wire protocols.

Breaking symmetry is terribly less silly than having a second module
doing almost the same thing, though.

> The advantage of a separate "jsonb" module is that it becomes easy to
> say "json is the text transform that dumps and loads from a Unicode
> string, jsonb is the wire protocol that dumps and loads a UTF encoded
> byte sequence".

This is a terribly lousy design.
msg204939 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-01 15:55
I agree that adding a new module is a very bad idea.

I think that reviving the encoding parameter is the least wrong way. json.dumps() should return bytes when the encoding argument is specified and str otherwise. json.dump() should write binary data when the encoding argument is specified and text otherwise. This is not a perfect design, but it has precedents in the XML modules.
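
A rough sketch of what that behaviour could look like, written as a wrapper around the existing function. The name `dumps_proposed` is hypothetical and this is not the stdlib signature; json.dumps() in Python 3 has no encoding parameter:

```python
import json

def dumps_proposed(obj, encoding=None, **kwargs):
    # Hypothetical wrapper illustrating the proposal: the return type
    # depends on whether an encoding argument was supplied.
    text = json.dumps(obj, **kwargs)
    if encoding is None:
        return text                  # str, as json.dumps() returns today
    return text.encode(encoding)     # bytes when an encoding is given

assert dumps_proposed('hello') == '"hello"'
assert dumps_proposed('hello', encoding='utf-8') == b'"hello"'
```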
msg204960 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-12-01 21:03
Changing return type based on argument *values* is still a bad idea in
general.

It also makes it hard to plug the API in to generic code that is designed
to work with any dump/load based serialisation protocol.

MvL suggested a json.bytes submodule (rather than a separate top level
module) in the other issue and that sounds reasonable to me, especially
since json is already implemented as a package.
msg204963 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-01 21:21
> MvL suggested a json.bytes submodule (rather than a separate top level
> module) in the other issue and that sounds reasonable to me, especially
> since json is already implemented as a package.

I don't really find it reasonable to add a phantom module entirely for
the purpose of exposing an API more similar to the Python 2 one. I don't
think this design pattern has already been used.

If we add a json_bytes method, it will be simple enough for folks to add
the appropriate rules in their compat module (and/or for six to expose
it).
msg204976 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-12-01 23:09
The parallel API would have to be:

json.dump_bytes
json.dumps_bytes
json.load_bytes
json.loads_bytes

That is hardly an improvement over:

json.bytes.dump
json.bytes.dumps
json.bytes.load
json.bytes.loads

It doesn't need to be documented as a completely separate module, it can
just be a subsection in the json module docs with a reference to the
relevant RFC.

The confusion is inherent in the way the RFC was written, this is just an
expedient way to resolve that: the json module implements the standard, the
bytes submodule implements the RFC.
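
As an illustration only, the four functions of such a namespace could be sketched as below. The `json_bytes` object is a hypothetical stand-in for the proposed json.bytes submodule (which does not exist), and UTF-8 is assumed throughout:

```python
import io
import json
import types

def _dumps(obj, **kwargs):
    return json.dumps(obj, **kwargs).encode('utf-8')

def _loads(data, **kwargs):
    return json.loads(data.decode('utf-8'), **kwargs)

def _dump(obj, fp, **kwargs):
    # fp is assumed to be a binary file object
    fp.write(_dumps(obj, **kwargs))

def _load(fp, **kwargs):
    return _loads(fp.read(), **kwargs)

# Hypothetical stand-in for a "json.bytes" namespace
json_bytes = types.SimpleNamespace(
    dump=_dump, dumps=_dumps, load=_load, loads=_loads)

buf = io.BytesIO()
json_bytes.dump({'a': 1}, buf)
buf.seek(0)
assert json_bytes.load(buf) == {'a': 1}
```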

"Namespaces are a honking great idea; let's do more of those"
msg204978 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-01 23:19
> The parallel API would have to be:
> 
> json.dump_bytes
> json.dumps_bytes
> json.load_bytes
> json.loads_bytes

No, only one function dump_bytes() is needed, and it would return a
bytes object ("dumps" meaning "dump string", already). loads() can be
polymorphic without creating a new function.

I don't think the functions taking file objects are used often enough to
warrant a second API to deal with binary files.

> It doesn't need to be documented as a completely separate module, it can
> just be a subsection in the json module docs with a reference to the
> relevant RFC.

It's still completely weird and unusual.

> "Namespaces are a honking great idea; let's do more of those"

And also "flat is better than nested".

Especially when you're proposing that one API be at level N, and the
other, closely related API be at level N+1.
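
A minimal sketch of that flatter alternative, assuming UTF-8 only. Both names are hypothetical: `dump_bytes` is the single new function, and `loads_any` stands in for a polymorphic loads():

```python
import json

def dump_bytes(obj, **kwargs):
    # Hypothetical bytes-returning counterpart to json.dumps()
    return json.dumps(obj, **kwargs).encode('utf-8')

def loads_any(data, **kwargs):
    # Hypothetical polymorphic loads(): accepts str, bytes or bytearray
    if isinstance(data, (bytes, bytearray)):
        data = data.decode('utf-8')
    return json.loads(data, **kwargs)

assert loads_any(dump_bytes({'a': [1, 2]})) == {'a': [1, 2]}
assert loads_any('{"a": [1, 2]}') == {'a': [1, 2]}
```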
msg205023 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-02 16:19
> Changing return type based on argument *values* is still a bad idea in
> general.

However load() and loads() do this. ;)

> It also makes it hard to plug the API in to generic code that is designed
> to work with any dump/load based serialisation protocol.

For dumps() it will be simple -- `lambda x: json.dumps(x, encoding='utf-8')`. For loads() it will be even simpler -- loads() will accept both strings and bytes.

Note that dumps() with the encoding parameter will be more 2.x compatible than current implementation. This will help in writing compatible code.
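
To sketch how such an adapter plugs into generic code: the `roundtrip` function and the `json_wire` namespace below are hypothetical, and an explicit .encode()/.decode() stands in for the proposed encoding parameter, which json.dumps() does not currently accept:

```python
import json
import pickle
from types import SimpleNamespace

def roundtrip(serializer, obj):
    # Generic code that assumes dumps() returns bytes and loads() accepts them.
    data = serializer.dumps(obj)
    assert isinstance(data, bytes)
    return serializer.loads(data)

# Hypothetical adapter giving json the same bytes-based interface
json_wire = SimpleNamespace(
    dumps=lambda x: json.dumps(x).encode('utf-8'),
    loads=lambda b: json.loads(b.decode('utf-8')),
)

assert roundtrip(pickle, [1, 'two']) == [1, 'two']
assert roundtrip(json_wire, [1, 'two']) == [1, 'two']
```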
msg205415 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-12-07 00:08
> Changing return type based on argument *values* is still a bad idea in
> general.

I understand the proposal to be changing the return type based on argument *presence*. It strikes me as a convenient abbreviation for making a separate encoding call and definitely (specifically?) less bad than a separate module or separate functions.
msg205416 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-07 00:11
To give another data point: returning a different type based on argument value is also what the open() function does, more or less.

(that said, I would slightly favour a separate dump_bytes(), myself)
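
The open() comparison can be checked directly. This small sketch uses a temporary file and only shows that the mode value selects the type of the returned object:

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, 'w') as f:
        text_type = type(f)      # text mode yields a text wrapper
    with open(path, 'wb') as f:
        binary_type = type(f)    # binary mode yields a buffered byte stream
finally:
    os.remove(path)

assert text_type is io.TextIOWrapper
assert binary_type is io.BufferedWriter
```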
msg205530 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-12-08 08:55
Upstream simplejson (of which json is an earlier snapshot) has an encoding parameter on its dump and dumps methods.  Let's NOT break compatibility with that API.

Our users use these modules interchangeably today, upgrading from stdlib json to simplejson when they need more features or speed without having to change their code.

simplejson's dumps(encoding=) parameter tells the module what encoding to decode bytes objects found within the data structure as (whereas Python 3.3's builtin json module being older doesn't even support that use case and raises a TypeError when bytes are encountered within the structure being serialized).

http://simplejson.readthedocs.org/en/latest/

A json.dump_bytes() function implemented as:

def dump_bytes(*args, **kwargs):
  return dumps(*args, **kwargs).encode('utf-8')

makes some sense, but it is really trivial for anyone to write that .encode(...) themselves.

A dump_bytes_to_file method that acts like dump() and calls .encode('utf-8') on all str objects before passing them to the write call is also doable, but it seems easier to just let people use an existing io wrapper to do that for them, as they already are.

As for load/loads, it is easy to allow that to accept bytes as input and assume it comes utf-8 encoded.  simplejson already does this.  json does not.
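
For illustration, the existing io wrapper approach might look like the following sketch, where an io.BytesIO stands in for a real binary file or socket file object:

```python
import io
import json

binary_stream = io.BytesIO()   # stands in for a binary file object
text_stream = io.TextIOWrapper(binary_stream, encoding='utf-8')

# json.dump() writes str; the wrapper encodes it on the way through.
json.dump({'name': 'caf\u00e9'}, text_stream, ensure_ascii=False)
text_stream.flush()            # push buffered text down to the bytes layer

wire_data = binary_stream.getvalue()
assert wire_data == b'{"name": "caf\xc3\xa9"}'
```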
msg205531 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-12-08 09:00
So why not put a dump_bytes into upstream simplejson first, then pull in a modern simplejson?

There might be some default flag values pertaining to new features that need changing for stdlib backwards compatible behavior but otherwise I expect it's a good idea.
msg271700 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-07-30 18:12
One problem is that a JSON decoder is an FSM whose input is one symbol at a time rather than one byte. AFAIK, Python still does not have an FSM for decoding UTF-8 sequences, so iterative decoding of JSON will require more changes than expected.
msg271701 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-07-30 18:32
In real life, I can confirm that porting from Python 2 to Python 3 is almost automatic except for JSON-related fixes.
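
For reference, the codecs module does expose incremental decoders that buffer a partial multi-byte sequence between calls. Whether that is sufficient for an iterative JSON parser is a separate question; the sketch below only shows the byte-to-symbol step on a split UTF-8 sequence:

```python
import codecs

decoder = codecs.getincrementaldecoder('utf-8')()

data = '"\u00e9"'.encode('utf-8')     # four bytes: quote, 0xC3, 0xA9, quote
chunks = [data[:2], data[2:]]         # split inside the two-byte sequence

pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))  # raises if bytes are left over

# The first chunk yields only the quote; 0xC3 is held until 0xA9 arrives.
assert pieces[0] == '"'
assert ''.join(pieces) == '"\u00e9"'
```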
msg271775 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-08-01 08:17
I'm currently migrating a project that predates requests, and ended up needing to replace several "json.loads" calls with a "_load_json" helper that is just an alias for json.loads in Python 2, and defined as this in Python 3:

    def _load_json(data):
        return json.loads(data.decode())

To get that case to "just work", all I would have needed is for json.loads to accept bytes input and assume it is UTF-8 encoded, the same way simplejson does. Since there aren't any type ambiguities associated with that, I think it would make sense for us to go ahead and implement at least that much for Python 3.6.

By contrast, if I'd been doing *encoding*, I don't think there's anything the Python 3 standard library could have changed *on its own* to make things just work - I would have needed to change my code somehow.

However, a new "dump_bytes" API could still be beneficial on that front as long as it was also added to simplejson: code that needed to run in the common Python 2/3 subset could use "simplejson.dump_bytes", while 3.6+ only code could just use the standard library version.

Having dump_bytes() next to dumps() in the documentation would also provide a better hook for explaining the difference between JSON-as-text-encoding (with "str" output) and JSON-as-wire-encoding (with "bytes" output after encoding the str representation as UTF-8).

In both cases, I think it would make sense to leave the non-UTF-8 support to simplejson and have the stdlib version be UTF-8 only.
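
A small sketch of the text versus wire distinction, using a hypothetical UTF-8-only `dump_bytes` helper along the lines discussed in this thread (it is not a stdlib function):

```python
import json

def dump_bytes(obj, **kwargs):
    # Hypothetical UTF-8-only helper: serialise, then encode.
    return json.dumps(obj, **kwargs).encode('utf-8')

as_text = json.dumps('caf\u00e9', ensure_ascii=False)   # str output
as_wire = dump_bytes('caf\u00e9', ensure_ascii=False)   # bytes output

assert as_text == '"caf\u00e9"'
assert as_wire == b'"caf\xc3\xa9"'

# With the default ensure_ascii=True the text is pure ASCII, so the
# encoding step only changes the type, not the content.
assert dump_bytes('caf\u00e9') == b'"caf\\u00e9"'
```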
msg271776 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-08-01 08:25
Does dump_bytes() return bytes (similar to dumps()) or write to binary stream (similar to dump())?
msg271778 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-08-01 10:24
dump_bytes() would be a binary counterpart to dumps()

The dump() case is already handled more gracefully, as the implicit encoding to UTF-8 can live on the file-like object, rather than needing to be handled by the JSON encoder.

I'm still not 100% sure on its utility though - it's only "json.loads assuming binary input is UTF-8 encoded text would be way more helpful than the current behaviour" that I'm confident about. If the assumption is wrong, you'll likely fail JSON deserialisation anyway, and when it's right, the common subset of Python 2 & 3 has been expanded in a useful way.

So perhaps we should split the question into two issues? A new one for accepting binary data as an input to json.loads, and make this one purely about whether or not to offer a combined serialise-and-encode operation for the wire protocol use case?
msg272726 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-08-15 07:21
After hitting this problem again in another nominally single-source compatible Python 2/3 project, I created #27765 to specifically cover accepting UTF-8 encoded bytes in json.loads()
msg275617 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-09-10 10:25
For 3.6, the decoding case has been handled via Serhiy's autodetection patch in issue 17909
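
A quick check of that behaviour, assuming Python 3.6 or later, where json.loads() accepts bytes and detects UTF-8, UTF-16 and UTF-32 input:

```python
import json

doc = '{"a": 1}'

# Encode the same document three ways and let json.loads() detect each.
decoded = {
    encoding: json.loads(doc.encode(encoding))
    for encoding in ('utf-8', 'utf-16', 'utf-32')
}

assert all(value == {'a': 1} for value in decoded.values())
```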
msg289203 - (view) Author: Clay Gerrard (Clay Gerrard) Date: 2017-03-08 05:59
And for the *encoding* case?  Can you just add the encoding argument back to json.dumps?  Have it default to None because of backwards compatibility in python3 and continue to return strings by default...

... and then *everyone* that ever wants to *serialize* an object to json because they want to put it on a wire or w/e will hopefully someday learn that when you call json.dumps you *always* set encoding='utf-8' and it will always return utf-8 encoded bytes (which is the same thing it would have done in py2 regardless)?

Is it confusing for the py3 encoding argument to mean something different than py2?  Probably?  The encoding argument in py2 was there to tell the Encoder how to decode keys and values whose strings were actually utf-8 encoded bytes.  But w/e, py3 doesn't have that problem - so py3 can unambiguously hijack dumps' encoding param to mean bytes!  Then, sure, maybe the fact I can write:

    sock.send(json.dumps(obj, encoding='utf-8'))

... in either language is just a happy coincidence - but it'd be useful nevertheless.

Or I could be wrong.  I've not been thinking about this for 3 years.  But I have bumped into this a couple of times in the years since starting to dream of python 3.2^H4^H5^H6^H7 support - but until then I do seem to frequently forget json.dumps(obj).decode('utf-8') so maybe my suggestion isn't really any better!?
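
For comparison, here is a sketch of what the send path looks like in Python 3 today, with the explicit .encode() that the suggested encoding argument would replace. It uses a local socket pair purely for demonstration and assumes the small payload arrives in a single recv() call:

```python
import json
import socket

left, right = socket.socketpair()
try:
    # Today's spelling: serialise to str, then encode before sending.
    left.sendall(json.dumps({'op': 'ping'}).encode('utf-8'))
    received = right.recv(4096)
finally:
    left.close()
    right.close()

message = json.loads(received.decode('utf-8'))
assert message == {'op': 'ping'}
```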
History
Date | User | Action | Args
2022-04-11 14:57:54 | admin | set | github: 64036
2017-03-08 05:59:28 | Clay Gerrard | set | nosy: + Clay Gerrard; messages: + msg289203
2016-09-10 10:25:28 | ncoghlan | set | messages: + msg275617
2016-08-15 07:21:41 | ncoghlan | set | messages: + msg272726
2016-08-01 10:24:05 | ncoghlan | set | messages: + msg271778
2016-08-01 08:25:29 | serhiy.storchaka | set | messages: + msg271776
2016-08-01 08:17:33 | ncoghlan | set | messages: + msg271775
2016-07-30 18:32:41 | socketpair | set | messages: + msg271701
2016-07-30 18:12:50 | socketpair | set | nosy: + socketpair; messages: + msg271700
2016-07-29 21:45:31 | kdwyer | set | nosy: + kdwyer
2014-10-25 01:03:22 | martin.panter | set | nosy: + martin.panter
2014-05-15 07:26:10 | vstinner | set | nosy: + vstinner
2014-03-29 01:36:51 | cvrebert | set | nosy: + cvrebert
2014-03-04 12:46:40 | jleedev | set | nosy: + jleedev
2014-02-15 14:33:14 | ezio.melotti | set | nosy: + ezio.melotti; type: enhancement
2013-12-08 09:00:11 | gregory.p.smith | set | messages: + msg205531
2013-12-08 08:55:30 | gregory.p.smith | set | nosy: + gregory.p.smith; messages: + msg205530
2013-12-07 00:11:27 | pitrou | set | messages: + msg205416
2013-12-07 00:08:44 | terry.reedy | set | nosy: + terry.reedy; messages: + msg205415
2013-12-06 17:46:00 | eric.araujo | set | nosy: + eric.araujo
2013-12-02 16:19:03 | serhiy.storchaka | set | messages: + msg205023
2013-12-01 23:19:29 | pitrou | set | messages: + msg204978
2013-12-01 23:09:55 | ncoghlan | set | messages: + msg204976
2013-12-01 21:21:32 | pitrou | set | messages: + msg204963
2013-12-01 21:03:44 | ncoghlan | set | messages: + msg204960
2013-12-01 15:55:34 | serhiy.storchaka | set | messages: + msg204939
2013-12-01 10:36:17 | pitrou | set | messages: + msg204904
2013-12-01 01:55:02 | ncoghlan | set | messages: + msg204873
2013-12-01 00:24:14 | pitrou | set | messages: + msg204864
2013-11-30 15:22:30 | barry | set | nosy: + barry
2013-11-30 14:08:10 | ncoghlan | set | messages: + msg204811
2013-11-30 11:59:43 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg204805
2013-11-30 11:07:56 | pitrou | set | nosy: + pitrou; messages: + msg204799
2013-11-30 02:35:33 | ncoghlan | set | messages: + msg204765
2013-11-30 02:30:45 | ncoghlan | create |