classification
Title: `separators` argument to json.dumps() behaves unexpectedly across 2.x vs 3.x
Type: behavior Stage: resolved
Components: Versions: Python 3.4, Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Tom.Christie, georg.brandl, r.david.murray
Priority: normal Keywords:

Created on 2014-10-30 16:35 by Tom.Christie, last changed 2014-10-30 20:23 by r.david.murray. This issue is now closed.

Messages (11)
msg230274 - (view) Author: Tom Christie (Tom.Christie) Date: 2014-10-30 16:35
This is one of those behavioural issues that is a borderline bug.

The separators argument to `json.dumps()` behaves differently across python 2 and 3.

* In python 2 it should be provided as a bytestring, and can cause a UnicodeDecodeError otherwise.
* In python 3 it should be provided as unicode, and can cause a TypeError otherwise.

Examples:

    Python 2.7.2
    >>> print json.dumps({'snowman': '☃'}, separators=(':', ','), ensure_ascii=False)
    {"snowman","☃"}
    >>> print json.dumps({'snowman': '☃'}, separators=(u':', u','), ensure_ascii=False)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

And:

    Python 3.4.0
    >>> print(json.dumps({'snowman': '☃'}, separators=(':', ','), ensure_ascii=False))
    {"snowman","☃"}
    >>> print(json.dumps({'snowman': '☃'}, separators=(b':', b','), ensure_ascii=False))
    <...>
    TypeError: sequence item 2: expected str instance, bytes found

Technically this isn't out of line with the documentation - in both cases it uses `separators=(':', ',')` which is indeed the correct type in both v2 and v3. However it's unexpected behaviour that it changes types between versions, without being called out.

Working on a codebase with `from __future__ import unicode_literals` this is particularly unexpected because we get a `UnicodeDecodeError` when running code that otherwise looks correct.

It's also slightly awkward to fix because it's a bit of a weird branch condition.

The fix would probably be to forcibly coerce it to the correct type regardless of whether it is supplied as unicode or a bytestring, or at least to do so for python 2.7.
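
A minimal sketch of the coercion proposed above, written so that it runs on either major version. `coerce_separators` and `PY2` are hypothetical names used for illustration; nothing like this exists in the `json` module, and the sketch assumes the separators are pure ASCII.

```python
import json
import sys

PY2 = sys.version_info[0] == 2


def coerce_separators(separators):
    """Return the separators as the interpreter's native `str` type.

    Hypothetical helper, not part of the json module. Assumes the
    separators contain only ASCII characters.
    """
    coerced = []
    for sep in separators:
        if PY2 and not isinstance(sep, str):
            # Python 2: unicode -> bytestring, so non-ASCII bytestring
            # data is not implicitly decoded when the output is joined.
            sep = sep.encode("ascii")
        elif not PY2 and isinstance(sep, bytes):
            # Python 3: bytes -> str, because json.dumps() only emits text.
            sep = sep.decode("ascii")
        coerced.append(sep)
    return tuple(coerced)


compact = json.dumps({"a": 1}, separators=coerce_separators((b",", b":")))
```

Doing the conversion at the call site keeps the behaviour of `json.dumps()` itself unchanged, which matters given that the issue was closed as not a bug.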

Possibly related to http://bugs.python.org/issue22701 but wasn't able to understand if that ticket was in fact a different user error.
msg230275 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-10-30 17:32
IMO the snowman should be a Unicode string in the second example for Python 2.7.
msg230276 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-10-30 17:33
> in the second example

or even, in both examples.
msg230279 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-30 18:03
And that works, including with the future import.  I don't remember if this is a bug we've fixed since 2.7.2, but I don't think so.

In Python3, json explicitly does not support bytes.
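
As an illustration of the Python 3 behaviour described above, the following sketch (assuming a Python 3 interpreter; the variable names are illustrative only) shows `json.dumps()` refusing a bytes value outright rather than guessing an encoding.

```python
import json

snowman_bytes = b"\xe2\x98\x83"  # UTF-8 encoding of U+2603 (snowman)

try:
    json.dumps({"snowman": snowman_bytes}, ensure_ascii=False)
    bytes_rejected = False
except TypeError:
    # Python 3 raises TypeError because bytes is not a JSON-serializable type.
    bytes_rejected = True
```

On Python 2 the same literal is an ordinary `str`, so the call would not fail in this way.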
msg230289 - (view) Author: Tom Christie (Tom.Christie) Date: 2014-10-30 19:12
Not too fussed if this is addressed or not, but I think this is closed a little prematurely.

I don't think there's a problem under Python 3, that's entirely reasonable.

However under Python 2, `json.dumps()` will normally handle *either* bytes or unicode transparently for you (just altering the return type accordingly).

If you happen to be using unicode separators, then the normally lax behaviour of "either unicode or bytes" stops being the case.
msg230291 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-30 19:20
But only if you use non-ascii in the binary input, in which case you get an encoding error, which is a correct error.
msg230296 - (view) Author: Tom Christie (Tom.Christie) Date: 2014-10-30 19:38
> But only if you use non-ascii in the binary input, in which case you get an encoding error, which is a correct error.

Kind of, except that this (python 2.7) works just fine:

    >>> data = {'snowman': '☃'}
    >>> json.dumps(data, ensure_ascii=False)
    '{"snowman": "\xe2\x98\x83"}'

Whereas this raises an exception:

    >>> json.dumps(data, separators=(u':', u','), ensure_ascii=False)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

If it was the same in both cases then I wouldn't consider it a problem.
As it is, introducing the `separators` parameter changes the behaviour.

Anyways, I'll get off my high horse now. :)
msg230298 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-30 19:49
No, it is introducing the unicode that is the problem.  Your first example is entirely binary.  It is only when you *mix* binary and unicode that you have encoding problems (because python doesn't know the encoding of the binary data...well, more precisely it doesn't have one).

This confusion is a large part of why python3 exists :)
msg230299 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-30 20:00
Or, to put it another way, we agree with you that both cases should behave the same: using binary data in a json dumps call should raise an error. And in python3 they do. But in python2 there is a confusion as to what is text and what is binary, and so sometimes things work that shouldn't. In python2 a binary string with non-ascii characters is accepted by the dumps call...it shouldn't be since json is defined as a text protocol. But it is baked into the python2 string model that such binary does work, because in python2 it was assumed that the programmer was responsible for making sure that the encoding of all their binary strings was consistent. But to mix unicode and binary, you *must* make the encoding of the binary strings explicit, otherwise there's no way to correctly compose the binary data with the text data.
So, as soon as (but only as soon as) you mix unicode with your non-ascii data, your program blows up.

Thus python3.
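
To make the point about explicit encodings concrete, here is a small sketch that decodes the binary data before it meets any text. It assumes the bytes are known to be UTF-8; that assumption is the caller's responsibility and is not something `json` can infer.

```python
import json

raw = b"\xe2\x98\x83"        # bytes the caller knows to be UTF-8
text = raw.decode("utf-8")   # make the encoding explicit before mixing with text

# Every input is now text, so unicode separators cause no implicit decoding.
payload = json.dumps({u"snowman": text}, separators=(u",", u":"), ensure_ascii=False)
```

Once the data is decoded, the type of the separators no longer matters as long as they are ASCII.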
msg230300 - (view) Author: Tom Christie (Tom.Christie) Date: 2014-10-30 20:16
> So, as soon as (but only as soon as) you mix unicode with your non-ascii data, your program blows up.

Indeed. For context tho, in my example of running into this, the unicode literals used as separators weren't even in the same package as the non-ASCII binary strings. (JSONRenderer in Django REST framework, being exercised by some third party test code. End result: non-obvious exception.)

Anyways, okay with this resolution, although I am now using a compat branch to ensure that we use binary separators in py2 to continue to get the more lax rendering style.
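
A compat branch of the kind mentioned above might look roughly like the sketch below. The constant name `SHORT_SEPARATORS` is an assumption made for illustration and is not quoted from any project's source.

```python
import json
import sys

if sys.version_info[0] == 2:
    # Python 2: bytestring separators keep the lax "bytes or unicode" behaviour.
    SHORT_SEPARATORS = (b",", b":")
else:
    # Python 3: json.dumps() works with text only.
    SHORT_SEPARATORS = (",", ":")

rendered = json.dumps({"a": [1, 2]}, separators=SHORT_SEPARATORS)
```

Defining the tuple once in a compat module keeps `from __future__ import unicode_literals` in calling modules from silently turning the separators into unicode.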
msg230301 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-30 20:23
Yes, that third party problem is a prime example of exactly why this needed to be fixed, but it required python3 to fix it.
History
Date                 User            Action  Args
2014-10-30 20:23:40  r.david.murray  set     messages: + msg230301
2014-10-30 20:16:09  Tom.Christie    set     messages: + msg230300
2014-10-30 20:00:07  r.david.murray  set     messages: + msg230299
2014-10-30 19:49:58  r.david.murray  set     messages: + msg230298
2014-10-30 19:38:27  Tom.Christie    set     messages: + msg230296
2014-10-30 19:20:18  r.david.murray  set     messages: + msg230291
2014-10-30 19:12:29  Tom.Christie    set     messages: + msg230289
2014-10-30 18:03:49  r.david.murray  set     status: open -> closed; nosy: + r.david.murray; messages: + msg230279; resolution: not a bug; stage: resolved
2014-10-30 17:33:16  georg.brandl    set     messages: + msg230276
2014-10-30 17:32:40  georg.brandl    set     nosy: + georg.brandl; messages: + msg230275
2014-10-30 16:35:50  Tom.Christie    create