classification
Title: json.dumps has different behaviour if encoding='utf-8' or encoding='utf8'
Type: behavior Stage: patch review
Components: Unicode Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ivan.Pozdeev, benjamin.peterson, bob.ippolito, ezio.melotti, mcepl, nhatcher, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2018-04-10 09:21 by nhatcher, last changed 2018-08-15 13:57 by mcepl.

Pull Requests
URL Status Linked
PR 6523 open nhatcher, 2018-04-18 19:44
Messages (5)
msg315164 - (view) Author: Nicolás Hatcher (nhatcher) * Date: 2018-04-10 09:21
Hey, I'm new here, so please let me know if I am doing anything incorrectly!

I _think_ `json.dumps(o, ensure_ascii=False)` is doing the wrong thing when `o` has both unicode and str keys/values. For instance:

```
import json
o = {u"greeting": "hi", "currency": "€"}
json.dumps(o, ensure_ascii=False, encoding="utf8")  # works
json.dumps(o, ensure_ascii=False)  # fails (default encoding is 'utf-8')
```

The first `dumps` call works while the second fails. The reason is:

https://github.com/python/cpython/blob/2.7/Lib/json/encoder.py#L198

This decodes every str only when the encoding is not exactly 'utf-8'. With the default 'utf-8' nothing is decoded, so in the mixed case (unicode and str) the call blows up. A workaround is to use any of the aliases for 'utf-8', like 'utf8' or 'u8'.
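
Here is a simplified model of that branch, written only to illustrate the string comparison. `maybe_decode` is a made-up name and this is not the code in `encoder.py`. It uses `bytes` so that it behaves the same on Python 2 (where `bytes` is `str`) and on Python 3:

```python
def maybe_decode(value, encoding="utf-8"):
    """Toy model: decode byte strings unless encoding is literally 'utf-8'."""
    # Plain string comparison, so aliases such as "utf8" or "u8" do not match.
    if encoding != "utf-8" and isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A UTF-8 byte string, like "€" typed in a UTF-8 Python 2 source file.
euro = u"\u20ac".encode("utf-8")

print(repr(maybe_decode(euro, "utf-8")))  # stays a byte string
print(repr(maybe_decode(euro, "utf8")))   # becomes text
```

Only the spelling of the encoding name changes between the two calls, yet one result is bytes and the other is text.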

I would be crazy happy to provide a PR if this is really an issue.
Let me know if extra clarification is needed.
Nicolás
msg315270 - (view) Author: Ivan Pozdeev (Ivan.Pozdeev) * Date: 2018-04-13 22:20
Treating 'utf-8' and its aliases differently (when they specifically mean Python's encoding, rather than something else's) is definitely an issue.

You shouldn't hardcode a list of aliases though; rather use existing facilities to resolve them. From quick googling, e.g. `codecs.lookup(<encoding>).name` can get the canonical name.
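
A minimal sketch of what that could look like, assuming a helper named `is_utf8` (a hypothetical name, not something in the `json` module). `codecs.lookup` raises `LookupError` for unknown names, which the helper treats as "not UTF-8":

```python
import codecs


def is_utf8(encoding):
    """Return True if `encoding` resolves to Python's UTF-8 codec."""
    try:
        # CodecInfo.name is the canonical codec name, e.g. 'utf-8'.
        return codecs.lookup(encoding).name == "utf-8"
    except LookupError:
        # Unknown encoding names are simply reported as not UTF-8.
        return False


for name in ("utf-8", "utf8", "u8", "UTF_8", "latin-1"):
    print("%s -> %s" % (name, is_utf8(name)))
```

This avoids maintaining a list of aliases by hand, at the cost of one codec lookup per check.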


Make sure to follow https://devguide.python.org/pullrequest when doing the PR; a test case will likely be needed, too.
msg315478 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-19 05:47
In simplejson:

>>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False, encoding="utf8")
u'{"currency": "\u20ac", "greeting": "hi"}'
>>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False)
u'{"currency": "\u20ac", "greeting": "hi"}'

I think it makes sense to fix the case for "utf-8".
msg315890 - (view) Author: Nicolás Hatcher (nhatcher) * Date: 2018-04-29 12:08
Hi Serhiy,

I am OK with that change. I think it makes much more sense, but I also think it will break people's code, at least with the simplest fix, in which:

>>> json.dumps("g", ensure_ascii=False)
u'"g"'

Which is again compatible with simplejson.
Although the documentation is not clear on this point, there might be code out there relying on this behaviour.
Is that acceptable?
msg315895 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-29 13:36
You could decode only non-ascii strings.
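
A rough sketch of that idea, assuming a hypothetical helper `decode_if_non_ascii` (not part of the `json` module). ASCII-only byte strings are returned unchanged, so callers that currently get a `str` back would keep getting one:

```python
def decode_if_non_ascii(value, encoding="utf-8"):
    """Decode a byte string only when it contains non-ASCII bytes."""
    if isinstance(value, bytes):  # `bytes` is `str` on Python 2
        try:
            value.decode("ascii")
        except UnicodeDecodeError:
            # Non-ASCII content: decode with the requested encoding.
            return value.decode(encoding)
    # Text, or pure-ASCII bytes: leave untouched.
    return value


print(repr(decode_if_non_ascii(b"hi")))
print(repr(decode_if_non_ascii(u"\u20ac".encode("utf-8"))))
```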

But I'm not sure that it is worth changing anything in 2.7. This could be treated as a new feature. I leave this to Benjamin, the release manager of 2.7.
History
Date User Action Args
2018-08-15 13:57:33 mcepl set nosy: + mcepl
2018-04-29 13:36:39 serhiy.storchaka set nosy: + benjamin.peterson; messages: + msg315895
2018-04-29 12:08:30 nhatcher set messages: + msg315890
2018-04-19 05:47:35 serhiy.storchaka set nosy: + bob.ippolito, serhiy.storchaka; messages: + msg315478
2018-04-18 19:44:34 nhatcher set keywords: + patch; stage: patch review; pull_requests: + pull_request6217
2018-04-13 22:20:39 Ivan.Pozdeev set nosy: + Ivan.Pozdeev; messages: + msg315270
2018-04-10 09:21:14 nhatcher create