Issue 19837: Wire protocol encoding for the JSON module


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64036

classification

Title:	Wire protocol encoding for the JSON module
Type:	enhancement	Stage:
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Clay Gerrard, barry, chrism, cvrebert, eric.araujo, ezio.melotti, gregory.p.smith, jleedev, kdwyer, martin.panter, ncoghlan, pitrou, serhiy.storchaka, socketpair, terry.reedy, vstinner
Priority:	normal	Keywords:

Created on 2013-11-30 02:30 by ncoghlan, last changed 2022-04-11 14:57 by admin.

Messages (26)
msg204764 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-11-30 02:30
In the Python 3 transition, we had to make a choice regarding whether we treated the JSON module as a text transform (with load[s] reading Unicode code points and dump[s] producing them), or as a text encoding (with load[s] reading binary sequences and dump[s] producing them).

To minimise the changes to the module API, the decision was made to treat it as a text transform, with the text encoding handled externally.

This API design decision doesn't appear to have worked out that well in the web development context, since JSON is typically encountered as a UTF-8 encoded wire protocol, not as already decoded text.

It also makes the module inconsistent with most of the other modules that offer "dumps" APIs, as those are specifically about wire protocols (Python 3.4):

    >>> import json, marshal, pickle, plistlib, xmlrpc.client
    >>> json.dumps('hello')
    '"hello"'
    >>> marshal.dumps('hello')
    b'\xda\x05hello'
    >>> pickle.dumps('hello')
    b'\x80\x03X\x05\x00\x00\x00helloq\x00.'
    >>> plistlib.dumps('hello')
    b'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n<plist version="1.0">\n<string>hello</string>\n</plist>\n'

The only module with a dumps function that (like the json module) returns a string, is the XML-RPC client module:

    >>> xmlrpc.client.dumps(('hello',))
    '<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'

And that's nonsensical, since that XML-RPC API accepts an encoding argument, which it now silently ignores:

    >>> xmlrpc.client.dumps(('hello',), encoding='utf-8')
    '<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'
    >>> xmlrpc.client.dumps(('hello',), encoding='utf-16')
    '<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'

I now believe that an "encoding" parameter should have been added to the json.dump API in the Py3k transition (defaulting to UTF-8), allowing all of the dump/load APIs in the standard library to be consistently about converting to and from a binary wire protocol.

Unfortunately, I don't have a solution to offer at this point (since backwards compatibility concerns rule out the simple solution of just changing the return type). I just wanted to get it on record as a problem (and internal inconsistency within the standard library for dump/load protocols) with the current API.
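The distinction drawn above between a text transform and a wire encoding can be illustrated with a minimal sketch, assuming Python 3. The `payload` value is made up for illustration; only documented `json` and `pickle` calls are used.

```python
import json
import pickle

payload = {"greeting": "hello"}  # illustrative data, not from the issue

# json.dumps() acts as a text transform: the result is a str.
text = json.dumps(payload)

# Producing the wire form requires a separate, explicit encoding step.
wire = text.encode("utf-8")

# pickle.dumps(), by contrast, returns bytes directly.
pickled = pickle.dumps(payload)

print(type(text).__name__, type(wire).__name__, type(pickled).__name__)
```

The explicit `.encode("utf-8")` call is the step the message argues should arguably have been part of the API itself.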
msg204765 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-11-30 02:35
The other simple solution would be to add <name>b variants of the affected APIs. That's a bit ugly though, especially since it still has the problem of making it difficult to write correct cross-version code (although that problem is likely to exist regardless)
msg204799 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-11-30 11:07
Still, JSON itself is not a wire protocol; HTTP is. http://www.json.org states that "JSON is a text format" and the grammar description talks about "UNICODE characters", not bytes. The ECMA spec states that "JSON text is a sequence of Unicode code points".

RFC 4627 is a bit more affirmative, though, and says that "JSON text SHALL be encoded in Unicode [sic]. The default encoding is UTF-8".

Related issues:
- issue #10976: json.loads() raises TypeError on bytes object
- issue #17909 (+ patch!): autodetecting JSON encoding

> The other simple solution would be to add <name>b variants of the affected APIs.

"dumpb" is not very pretty and can easily be misread as "dumb" :-) "dump_bytes" looks better to me.
msg204805 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-11-30 11:59
I propose closing this issue as a duplicate of issue10976.
msg204811 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-11-30 14:08
Not sure yet if we should merge the two issues, although they're the serialisation and deserialisation sides of the same problem. Haskell seems to have gone with the approach of a separate "jsonb" API for the case where you want the wire protocol behaviour; such a solution may work for us as well.
msg204864 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-01 00:24
I'm -1 for a new module doing almost the same thing. Let's add distinct APIs in the existing json module.
msg204873 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-01 01:55
The problem with adding new APIs with different names to the JSON module is that it breaks symmetry with other wire protocols. The quartet of module level load, loads, dump and dumps functions has become a de facto standard API for wire protocols. If it wasn't for that API convention, the status quo would be substantially less annoying (and confusing) than it currently is. The advantage of a separate "jsonb" module is that it becomes easy to say "json is the text transform that dumps and loads from a Unicode string, jsonb is the wire protocol that dumps and loads a UTF encoded byte sequence". Backporting as simplejsonb would also work in a straightforward fashion (since one PyPI package can include multiple top level Python modules). The same approach would also extend to fixing the xmlrpc module to handle the encoding step properly (if anyone was so inclined).
msg204904 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-01 10:36
> The problem with adding new APIs with different names to the JSON
> module is that it breaks symmetry with other wire protocols. The
> quartet of module level load, loads, dump and dumps functions has
> become a de facto standard API for wire protocols.

Breaking symmetry is terribly less silly than having a second module doing almost the same thing, though.

> The advantage of a separate "jsonb" module is that it becomes easy to
> say "json is the text transform that dumps and loads from a Unicode
> string, jsonb is the wire protocol that dumps and loads a UTF encoded
> byte sequence".

This is a terribly lousy design.
msg204939 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-01 15:55
I agree that adding a new module is a very bad idea. I think that reviving the encoding parameter is the least wrong way. json.dumps() should return bytes when the encoding argument is specified and str otherwise. json.dump() should write binary data when the encoding argument is specified and text otherwise. This is not a perfect design, but it has precedents in the XML modules.
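A minimal sketch of the behaviour proposed above, written as a wrapper rather than a change to the standard library. The wrapper and its `encoding` parameter are hypothetical; the stdlib `json.dumps()` in Python 3 does not accept such an argument.

```python
import json

def dumps(obj, encoding=None, **kwargs):
    """Hypothetical wrapper: return str by default, bytes when encoding is given."""
    text = json.dumps(obj, **kwargs)
    if encoding is None:
        return text
    # The return type changes with the presence of the argument.
    return text.encode(encoding)

print(repr(dumps([1, 2])))
print(repr(dumps([1, 2], encoding="utf-8")))
```

The return type depends on whether `encoding` is passed, which is exactly the property debated in the following messages.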
msg204960 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-01 21:03
Changing return type based on argument values is still a bad idea in general. It also makes it hard to plug the API in to generic code that is designed to work with any dump/load based serialisation protocol. MvL suggested a json.bytes submodule (rather than a separate top level module) in the other issue and that sounds reasonable to me, especially since json is already implemented as a package.
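The concern about generic code can be made concrete with a small sketch. The `roundtrip` helper is hypothetical and assumes, as the message does, that a dump/load serialisation module returns bytes from `dumps()`.

```python
import json
import pickle

def roundtrip(serializer, obj):
    """Hypothetical generic helper for any module exposing dumps()/loads()."""
    wire = serializer.dumps(obj)
    if not isinstance(wire, bytes):
        # A str result cannot be written to a binary channel as-is.
        raise TypeError(
            f"{serializer.__name__}.dumps() returned "
            f"{type(wire).__name__}, not bytes"
        )
    return serializer.loads(wire)

print(roundtrip(pickle, {"a": [1, 2]}))
try:
    roundtrip(json, {"a": [1, 2]})
except TypeError as exc:
    print(exc)
```

With pickle the helper works unchanged; with json it trips on the str return value, so the caller has to special-case the module.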
msg204963 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-01 21:21
> MvL suggested a json.bytes submodule (rather than a separate top level
> module) in the other issue and that sounds reasonable to me, especially
> since json is already implemented as a package.

I don't really find it reasonable to add a phantom module entirely for the purpose of exposing an API more similar to the Python 2 one. I don't think this design pattern has already been used.

If we add a json_bytes method, it will be simple enough for folks to add the appropriate rules in their compat module (and/or for six to expose it).
msg204976 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-01 23:09
The parallel API would have to be:

    json.dump_bytes
    json.dumps_bytes
    json.load_bytes
    json.loads_bytes

That is hardly an improvement over:

    json.bytes.dump
    json.bytes.dumps
    json.bytes.load
    json.bytes.loads

It doesn't need to be documented as a completely separate module, it can just be a subsection in the json module docs with a reference to the relevant RFC.

The confusion is inherent in the way the RFC was written, this is just an expedient way to resolve that: the json module implements the standard, the bytes submodule implements the RFC.

"Namespaces are a honking great idea; let's do more of those"
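As a rough sketch of what the namespaced quartet might look like, the following emulates a `json.bytes` submodule with a `SimpleNamespace` named `json_bytes`. No such submodule exists in the standard library; the name, the helper functions, and the UTF-8-only behaviour are assumptions for illustration.

```python
import io
import json
from types import SimpleNamespace

def _dumps(obj, **kwargs):
    # Serialise to text, then encode as UTF-8 (assumed wire encoding).
    return json.dumps(obj, **kwargs).encode("utf-8")

def _loads(data, **kwargs):
    # Decode UTF-8 bytes, then parse the resulting text.
    return json.loads(data.decode("utf-8"), **kwargs)

def _dump(obj, fp, **kwargs):
    # fp is expected to be a binary file-like object.
    fp.write(_dumps(obj, **kwargs))

def _load(fp, **kwargs):
    return _loads(fp.read(), **kwargs)

# Hypothetical stand-in for the proposed json.bytes namespace.
json_bytes = SimpleNamespace(dump=_dump, dumps=_dumps, load=_load, loads=_loads)

buffer = io.BytesIO()
json_bytes.dump({"a": 1}, buffer)
buffer.seek(0)
print(json_bytes.load(buffer))
```

Because the four names match the usual load/loads/dump/dumps quartet, such a namespace could be passed to generic code that expects a bytes-based serialiser.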
msg204978 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-01 23:19
> The parallel API would have to be:
>
> json.dump_bytes
> json.dumps_bytes
> json.load_bytes
> json.loads_bytes

No, only one function dump_bytes() is needed, and it would return a bytes object ("dumps" meaning "dump string", already). loads() can be polymorphic without creating a new function. I don't think the functions taking file objects are used often enough to warrant a second API to deal with binary files.

> It doesn't need to be documented as a completely separate module, it can
> just be a subsection in the json module docs with a reference to the
> relevant RFC.

It's still completely weird and unusual.

> "Namespaces are a honking great idea; let's do more of those"

And also "flat is better than nested". Especially when you're proposing that one API be at level N, and the other, closely related API be at level N+1.
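A polymorphic loads() of the kind suggested above could look roughly like this. The wrapper is hypothetical, and the assumption that binary input is UTF-8 is made for the sketch only.

```python
import json

def loads(data, **kwargs):
    """Hypothetical polymorphic loads(): accepts str, bytes, or bytearray."""
    if isinstance(data, (bytes, bytearray)):
        # Assumption for this sketch: binary input is UTF-8 encoded.
        data = data.decode("utf-8")
    return json.loads(data, **kwargs)

print(loads('{"a": 1}'))
print(loads(b'{"a": 1}'))
```

The return type is the same for both input types, so no new function name is needed on the decoding side.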
msg205023 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-02 16:19
> Changing return type based on argument values is still a bad idea in
> general.

However load() and loads() do this. ;)

> It also makes it hard to plug the API in to generic code that is designed
> to work with any dump/load based serialisation protocol.

For dumps() it will be simple -- `lambda x: json.dumps(x, encoding='utf-8')`. For loads() it will be even simpler -- loads() will accept both strings and bytes.

Note that dumps() with the encoding parameter will be more 2.x compatible than the current implementation. This will help in writing compatible code.
msg205415 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2013-12-07 00:08
> Changing return type based on argument values is still a bad idea in general.

I understand the proposal to be changing the return based on argument presence. It strikes me as a convenient abbreviation for making a separate encoding call and definitely (specifically?) less bad than a separate module or separate functions.
msg205416 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-07 00:11
To give another data point: returning a different type based on argument value is also what the open() function does, more or less. (That said, I would slightly favour a separate dump_bytes(), myself.)
msg205530 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2013-12-08 08:55
Upstream simplejson (of which json is an earlier snapshot) has an encoding parameter on its dump and dumps methods. Let's NOT break compatibility with that API. Our users use these modules interchangeably today, upgrading from stdlib json to simplejson when they need more features or speed without having to change their code.

simplejson's dumps(encoding=) parameter tells the module what encoding to decode bytes objects found within the data structure as (whereas Python 3.3's builtin json module, being older, doesn't even support that use case and raises a TypeError when bytes are encountered within the structure being serialized).

http://simplejson.readthedocs.org/en/latest/

A json.dump_bytes() function implemented as:

    def dump_bytes(*args, **kwargs):
        return dumps(*args, **kwargs).encode('utf-8')

makes some sense, but it is really trivial for anyone to write that .encode(...) themselves.

A dump_bytes_to_file method that acts like dump() and calls .encode('utf-8') on all strs before passing them to the write call is also doable, but it seems easier to just let people use an existing io wrapper to do that for them, as they already are.

As for load/loads, it is easy to allow that to accept bytes as input and assume it comes utf-8 encoded. simplejson already does this. json does not.
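The "existing io wrapper" approach mentioned above can be sketched as follows, assuming Python 3. The `io.BytesIO` object stands in for any binary stream (a file opened in "wb" mode, for instance), and the sample data is illustrative.

```python
import io
import json

# Stand-in for a binary destination such as a file opened with mode "wb".
binary_stream = io.BytesIO()

# The text wrapper performs the UTF-8 encoding, so json.dump() can keep
# writing str chunks without knowing about the wire encoding.
text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
json.dump({"name": "café"}, text_stream, ensure_ascii=False)
text_stream.flush()  # push buffered text through to the binary stream

wire = binary_stream.getvalue()
print(wire)
```

Here the encoding lives on the file-like object rather than in the JSON encoder, which is why the dump() case needs no new API.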
msg205531 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2013-12-08 09:00
So why not put a dump_bytes into upstream simplejson first, then pull in a modern simplejson? There might be some default flag values pertaining to new features that need changing for stdlib backwards compatible behavior but otherwise I expect it's a good idea.
msg271700 - (view)	Author: Марк Коренберг (socketpair) *	Date: 2016-07-30 18:12
One problem is that decoding JSON is an FSM where each input is one symbol rather than one byte. AFAIK, Python still does not have an FSM for decoding UTF-8 sequences, so iterative decoding of JSON will require more changes than expected.
msg271701 - (view)	Author: Марк Коренберг (socketpair) *	Date: 2016-07-30 18:32
In real life, I can confirm that porting from Python 2 to Python 3 is almost automatic except for JSON-related fixes.
msg271775 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-08-01 08:17
I'm currently migrating a project that predates requests, and ended up needing to replace several "json.loads" calls with a "_load_json" helper that is just an alias for json.loads in Python 2, and defined as this in Python 3:

    def _load_json(data):
        return json.loads(data.decode())

To get that case to "just work", all I would have needed is for json.loads to accept bytes input, and assume it is UTF-8 encoded, the same way simplejson does. Since there aren't any type ambiguities associated with that, I think it would make sense for us to go ahead and implement at least that much for Python 3.6.

By contrast, if I'd been doing encoding, I don't think there's anything the Python 3 standard library could have changed on its own to make things just work - I would have needed to change my code somehow.

However, a new "dump_bytes" API could still be beneficial on that front as long as it was also added to simplejson: code that needed to run in the common Python 2/3 subset could use "simplejson.dump_bytes", while 3.6+ only code could just use the standard library version. Having dump_bytes() next to dumps() in the documentation would also provide a better hook for explaining the difference between JSON-as-text-encoding (with "str" output) and JSON-as-wire-encoding (with "bytes" output after encoding the str representation as UTF-8).

In both cases, I think it would make sense to leave the non-UTF-8 support to simplejson and have the stdlib version be UTF-8 only.
msg271776 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-08-01 08:25
Does dump_bytes() return bytes (similar to dumps()) or write to a binary stream (similar to dump())?
msg271778 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-08-01 10:24
dump_bytes() would be a binary counterpart to dumps().

The dump() case is already handled more gracefully, as the implicit encoding to UTF-8 can live on the file-like object, rather than needing to be handled by the JSON encoder.

I'm still not 100% sure on its utility though - it's only "json.loads assuming binary input is UTF-8 encoded text would be way more helpful than the current behaviour" that I'm confident about. If the assumption is wrong, you'll likely fail JSON deserialisation anyway, and when it's right, the common subset of Python 2 & 3 has been expanded in a useful way.

So perhaps we should split the question into two issues? A new one for accepting binary data as an input to json.loads, and make this one purely about whether or not to offer a combined serialise-and-encode operation for the wire protocol use case?
msg272726 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-08-15 07:21
After hitting this problem again in another nominally single-source compatible Python 2/3 project, I created #27765 to specifically cover accepting UTF-8 encoded bytes in json.loads()
msg275617 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-09-10 10:25
For 3.6, the decoding case has been handled via Serhiy's autodetection patch in issue 17909
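The resulting behaviour can be checked with a short sketch, assuming Python 3.6 or later, where json.loads() accepts bytes and detects UTF-8, UTF-16, or UTF-32 input. The sample document is illustrative.

```python
import json

doc = '{"status": "ok", "count": 3}'  # illustrative document

# str input works as before.
from_text = json.loads(doc)

# bytes input is accepted from Python 3.6; the encoding is autodetected.
from_utf8 = json.loads(doc.encode("utf-8"))
from_utf16 = json.loads(doc.encode("utf-16"))

print(from_text == from_utf8 == from_utf16)
```

The encoding side is unchanged: json.dumps() still returns str, so producing bytes still takes an explicit `.encode()` call.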
msg289203 - (view)	Author: Clay Gerrard (Clay Gerrard)	Date: 2017-03-08 05:59
And for the encoding case?

Can you just add the encoding argument back to json.dumps? Have it default to None because of backwards compatibility in python3 and continue to return strings by default...

... and then everyone that ever wants to serialize an object to json because they want to put it on a wire or w/e will hopefully someday learn that when you call json.dumps you always set encoding='utf-8' and it will always return utf-8 encoded bytes (which is the same thing it would have done in py2 regardless)?

Is it confusing for the py3 encoding argument to mean something different than py2? Probably? The encoding argument in py2 was there to tell the Encoder how to decode keys and values whose strings were actually utf-8 encoded bytes. But w/e, py3 doesn't have that problem - so py3 can unambiguously hijack dumps' encoding param to mean bytes!

Then, sure, maybe the fact I can write:

    sock.send(json.dumps(obj, encoding='utf-8'))

... in either language is just a happy coincidence - but it'd be useful nevertheless.

Or I could be wrong. I've not been thinking about this for 3 years. But I have bumped into this a couple of times in the years since starting to dream of python 3.2^H4^H5^H6^H7 support - but until then I do seem to frequently forget json.dumps(obj).encode('utf-8') so maybe my suggestion isn't really any better!?

History
Date	User	Action	Args
2022-04-11 14:57:54	admin	set	github: 64036
2017-03-08 05:59:28	Clay Gerrard	set	nosy: + Clay Gerrard messages: + msg289203
2016-09-10 10:25:28	ncoghlan	set	messages: + msg275617
2016-08-15 07:21:41	ncoghlan	set	messages: + msg272726
2016-08-01 10:24:05	ncoghlan	set	messages: + msg271778
2016-08-01 08:25:29	serhiy.storchaka	set	messages: + msg271776
2016-08-01 08:17:33	ncoghlan	set	messages: + msg271775
2016-07-30 18:32:41	socketpair	set	messages: + msg271701
2016-07-30 18:12:50	socketpair	set	nosy: + socketpair messages: + msg271700
2016-07-29 21:45:31	kdwyer	set	nosy: + kdwyer
2014-10-25 01:03:22	martin.panter	set	nosy: + martin.panter
2014-05-15 07:26:10	vstinner	set	nosy: + vstinner
2014-03-29 01:36:51	cvrebert	set	nosy: + cvrebert
2014-03-04 12:46:40	jleedev	set	nosy: + jleedev
2014-02-15 14:33:14	ezio.melotti	set	nosy: + ezio.melotti type: enhancement
2013-12-08 09:00:11	gregory.p.smith	set	messages: + msg205531
2013-12-08 08:55:30	gregory.p.smith	set	nosy: + gregory.p.smith messages: + msg205530
2013-12-07 00:11:27	pitrou	set	messages: + msg205416
2013-12-07 00:08:44	terry.reedy	set	nosy: + terry.reedy messages: + msg205415
2013-12-06 17:46:00	eric.araujo	set	nosy: + eric.araujo
2013-12-02 16:19:03	serhiy.storchaka	set	messages: + msg205023
2013-12-01 23:19:29	pitrou	set	messages: + msg204978
2013-12-01 23:09:55	ncoghlan	set	messages: + msg204976
2013-12-01 21:21:32	pitrou	set	messages: + msg204963
2013-12-01 21:03:44	ncoghlan	set	messages: + msg204960
2013-12-01 15:55:34	serhiy.storchaka	set	messages: + msg204939
2013-12-01 10:36:17	pitrou	set	messages: + msg204904
2013-12-01 01:55:02	ncoghlan	set	messages: + msg204873
2013-12-01 00:24:14	pitrou	set	messages: + msg204864
2013-11-30 15:22:30	barry	set	nosy: + barry
2013-11-30 14:08:10	ncoghlan	set	messages: + msg204811
2013-11-30 11:59:43	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg204805
2013-11-30 11:07:56	pitrou	set	nosy: + pitrou messages: + msg204799
2013-11-30 02:35:33	ncoghlan	set	messages: + msg204765
2013-11-30 02:30:45	ncoghlan	create