Title: json.dumps has different behaviour if encoding='utf-8' or encoding='utf8'
Type: behavior Stage: patch review
Components: Unicode Versions: Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ivan.Pozdeev, bob.ippolito, ezio.melotti, nhatcher, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2018-04-10 09:21 by nhatcher, last changed 2018-04-19 05:47 by serhiy.storchaka.

Pull Requests
URL Status Linked Edit
PR 6523 open nhatcher, 2018-04-18 19:44
Messages (3)
msg315164 - Author: Nicolás Hatcher (nhatcher) * Date: 2018-04-10 09:21
Hey, I'm new here, so please let me know if I'm doing anything wrong!

I _think_ `json.dumps(o, ensure_ascii=False)` is doing the wrong thing when `o` has both unicode and str keys/values. For instance:

import json
o = {u"greeting": "hi", "currency": "€"}
json.dumps(o, ensure_ascii=False, encoding="utf8")  # works
json.dumps(o, ensure_ascii=False)  # fails (default encoding is "utf-8")

The first `dumps` will work while the second will fail. The reason is:

The encoder decodes any str only if the encoding is not exactly 'utf-8'. With the default 'utf-8' nothing is decoded, so the mixed case (unicode and str) blows up. A workaround is to use any of the aliases for 'utf-8', like 'utf8' or 'u8'.
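
To make the effect of that comparison concrete, here is a toy model written for Python 3, where `bytes` plays the role of Python 2's `str`. The function `prepare_strings` is hypothetical and only mimics the behaviour described above; it is not the actual `json` code:

```python
def prepare_strings(values, encoding="utf-8"):
    """Toy model of the check described above; NOT the real json source.

    Byte strings are decoded only when `encoding` is not literally "utf-8".
    """
    if encoding != "utf-8":  # literal comparison, so aliases do not match
        return [v.decode(encoding) if isinstance(v, bytes) else v
                for v in values]
    return list(values)


# Python 3 `bytes` stands in for Python 2 `str` here.
mixed = [u"greeting", b"hi", b"currency", u"\u20ac".encode("utf-8")]

with_alias = prepare_strings(mixed, encoding="utf8")     # every item decoded
with_default = prepare_strings(mixed, encoding="utf-8")  # left as a mix

print([type(v).__name__ for v in with_alias])
print([type(v).__name__ for v in with_default])
```

With 'utf8' every element ends up as text, so combining them is safe; with 'utf-8' the byte strings are left untouched next to the unicode ones, which is the mixed case that fails.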

I would be crazy happy to provide a PR if this is really an issue.
Let me know if extra clarification is needed.
msg315270 - Author: Ivan Pozdeev (Ivan.Pozdeev) * Date: 2018-04-13 22:20
Treating 'utf-8' and its aliases differently (when they specifically mean Python's encoding, rather than something else's) is definitely an issue.

You shouldn't hardcode a list of aliases though; rather use existing facilities to resolve them. From quick googling, e.g. `codecs.lookup(<encoding>).name` can get the canonical name.
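
A minimal sketch of that suggestion, assuming the goal is only to decide whether a user-supplied encoding name means UTF-8. The helper name `is_utf8` is hypothetical and is not part of the `json` module or of the PR:

```python
import codecs


def is_utf8(encoding):
    """Return True if `encoding` is 'utf-8' or an alias Python resolves to it."""
    # codecs.lookup() raises LookupError for unknown encoding names.
    return codecs.lookup(encoding).name == "utf-8"


for name in ("utf-8", "utf8", "u8", "UTF-8", "latin-1"):
    print(name, is_utf8(name))
```

Comparing the canonical name this way avoids maintaining a hand-written alias list, at the cost of one codec lookup per call.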

Make sure to follow when doing the PR; a test case will likely be needed, too.
msg315478 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-19 05:47
In simplejson:

>>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False, encoding="utf8")
u'{"currency": "\u20ac", "greeting": "hi"}'
>>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False)
u'{"currency": "\u20ac", "greeting": "hi"}'

I think it makes sense to fix the case for "utf-8".
Date User Action Args
2018-04-19 05:47:35 serhiy.storchaka set nosy: + bob.ippolito, serhiy.storchaka
messages: + msg315478
2018-04-18 19:44:34 nhatcher set keywords: + patch
stage: patch review
pull_requests: + pull_request6217
2018-04-13 22:20:39 Ivan.Pozdeev set nosy: + Ivan.Pozdeev
messages: + msg315270
2018-04-10 09:21:14 nhatcher create