classification
Title: json dump fails for mixed-type keys when sort_keys is specified
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Aaron Hall, jedwards, josh.r, naught101, r.david.murray, tanzer@swing.co.at, zachrahan
Priority: normal Keywords: patch

Created on 2015-10-22 08:01 by tanzer@swing.co.at, last changed 2021-01-08 04:58 by naught101.

Pull Requests
URL Status Linked Edit
PR 8011 open jedwards, 2018-06-29 17:24
PR 15691 closed python-dev, 2019-09-08 02:36
Messages (15)
msg253324 - (view) Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-10-22 08:01
In Python 3, trying to json-dump a dict with keys of different types fails with a TypeError when sort_keys is specified:

python2.7
===========

Python 2.7.10 (default, May 29 2015, 10:02:30) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.dumps({1 : 42, "foo" : "bar", None : "nada"}, sort_keys = True)
'{"null": "nada", "1": 42, "foo": "bar"}'

python3.5
============

Python 3.5.0 (default, Oct  5 2015, 12:03:13) 
[GCC 4.8.5] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.dumps({1 : 42, "foo" : "bar", None : "nada"}, sort_keys = True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.5/json/__init__.py", line 237, in dumps
    **kw).encode(obj)
  File "/usr/lib64/python3.5/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.5/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
TypeError: unorderable types: str() < int()

Note that the documentation explicitly allows keys of different, if basic, types:

  If skipkeys is True (default: False), then dict keys that are not of a basic type (str, int, float, bool, None) will be skipped instead of raising a TypeError.

As all the keys are dumped as strings, a simple solution would be to sort after converting to strings. Looking closely at the output of Python 2, the sort order is a bit strange!
msg253360 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-10-23 02:45
The Python 2 sort order is a result of the "arbitrary but consistent fallback comparison" (omitting details, it's comparing the names of the types), thus the "strange" sort order. Python 3 (justifiably) said that incomparable types should be incomparable rather than silently behaving in non-intuitive ways, hiding errors.

Python is being rather generous by allowing non-string keys, because the  JSON spec ( http://json.org/ ) only allows the keys ("names" in JSON parlance) to be strings. So you're already a bit in the weeds as far as compliant JSON goes if you have non-string keys.

Since mixed type keys lack meaningful sort order, I'm not sure it's wrong to reject attempts to sort them. Converting to string is as arbitrary and full of potential for silently incorrect comparisons as the Python 2 behavior, and reintroducing it seems like a bad idea.
msg253362 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-10-23 02:56
As a workaround (should you absolutely need to sort keys by some arbitrary criteria), you can initialize a collections.OrderedDict from the sorted items of your original dict (using whatever key function you like), then dump without using sort_keys=True. For example, your suggested behavior (treat all keys as str) could be achieved by the user by replacing:

    json.dumps(mydict, sort_keys=True)

with:

    json.dumps(collections.OrderedDict(sorted(mydict.items(), key=str)))

Particularly in 3.5 (where OrderedDict is a C builtin), this shouldn't incur too much additional overhead (`sort_keys` has to make a sorted list intermediate anyway), and the output is the same, without introducing implicit hiding of errors.
msg253363 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-10-23 03:01
Oops, minor flaw with that. It's str-ifying the tuples, not the keys, which could (in some cases) cause issues with keys whose reprs have different quoting. So you'd end up with lambdas. Boo. Anyway, corrected version (which would probably not be one-lined in real code):

json.dumps(collections.OrderedDict(sorted(mydict.items(), key=lambda x: str(x[0]))))
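
A minimal sketch of the corrected workaround as a reusable function, for readers who want something runnable. The name `dumps_sorted_by_str` is hypothetical and not part of the json module. It only reorders the top-level dict, and it sorts by `str(key)`, which is not always the text that ends up in the output: `str(None)` is "None" while the emitted key is "null".

```python
import json
from collections import OrderedDict


def dumps_sorted_by_str(mapping, **kwargs):
    """Dump `mapping` with its top-level keys ordered by str(key).

    Hypothetical helper for illustration; nested dicts keep their
    existing order.
    """
    ordered = OrderedDict(sorted(mapping.items(), key=lambda item: str(item[0])))
    # sort_keys is deliberately not passed, so the encoder never compares keys.
    return json.dumps(ordered, **kwargs)


mydict = {1: 42, "foo": "bar", None: "nada"}
print(dumps_sorted_by_str(mydict))
# → {"1": 42, "null": "nada", "foo": "bar"}
```

Note that "null" lands before "foo" here because the sort saw "None", not "null". The result is reproducible, but it is not sorted by the emitted key text.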
msg253368 - (view) Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-10-23 08:09
Josh Rosenberg wrote at Fri, 23 Oct 2015 02:45:51 +0000:

> The Python 2 sort order is a result of the "arbitrary but consistent
> fallback comparison" (omitting details, it's comparing the names of
> the types), thus the "strange" sort order.

Thanks. I knew that.

> Python 3 (justifiably) said that incomparable types should be
> incomparable rather than silently behaving in non-intuitive ways,
> hiding errors.

"justifiably" is debatable. I consider the change ill-conceived.

Displaying a dictionary (or just its keys) in a readable, or just
reproducible, way is useful in many contexts. Python 3 broke that for
very little, INMNSHO, gain.

I consider "hiding errors" a myth, to say it politely!

> Python is being rather generous by allowing non-string keys, because
> the  JSON spec ( http://json.org/ ) only allows the keys ("names" in
> JSON parlance) to be strings. So you're already a bit in the weeds as
> far as compliant JSON goes if you have non-string keys.

There are two possibilities:

1) Accepting non-string keys is intended. Then `sort_keys` shouldn't
   break like it does.

   As far as JSON goes, the output of `json.dump[s]` contains string keys.

2) Accepting non-string keys is a bug. Then `json.dump[s]` should be
   changed to not accept them.

Mixing both approaches is the worst of all worlds.

> Since mixed type keys lack meaningful sort order, I'm not sure it's
> wrong to reject attempts to sort them.

The documentation says:

    If sort_keys is True (default False), then the output of dictionaries
    will be sorted by key; this is useful for regression tests to ensure
    that JSON serializations can be compared on a day-to-day basis.

**Reproducible** is the keyword here.

**Readability** is another one. Even if the sort order is "strange",
it is much better than random order, if you are looking for a specific
key.

For the record, it was a test failing under Python 3.5 that triggered
this bug report.

> > As all they keys are dumped as strings, a simple solution would be to
> > sort after converting to strings.
> Converting to string is as
> arbitrary and full of potential for silently incorrect comparisons as
> the Python 2 behavior, and reintroducing it seems like a bad idea.

json.dumps already does the conversion::

    >>> json.dumps({1 : 42, "foo" : "bar", None : "nada"})
    '{"foo": "bar", "1": 42, "null": "nada"}'

Another run::

    >>> json.dumps({1 : 42, "foo" : "bar", None : "nada"})
    '{"1": 42, "foo": "bar", "null": "nada"}'

That difference is exactly the reason for `sort_keys`.
msg253369 - (view) Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-10-23 08:14
Josh Rosenberg wrote at Fri, 23 Oct 2015 02:56:30 +0000:

> As a workaround (should you absolutely need to sort keys by some
> arbitrary criteria), you can initialize a collections.OrderedDict from
> the sorted items of your original dict (using whatever key function
> you like), then dump without using sort_keys=True.

Sigh...

I already implemented a workaround but it's not as simple as you
think — the dictionary in question is nested.

The problem is that this is just another unnecessary difficulty when
trying to move to Python 3.x.
msg297246 - (view) Author: (zachrahan) Date: 2017-06-29 02:50
This one just bit me too. It seems that if JSON serialization accepts non-string dict keys, it should make sure to accept them in all circumstances. Currently, there is an error *only* with mixed-type dicts, *only* when sort_keys=True.

In addition, the error raised in such cases is especially unhelpful. Running the following:
json.dumps({3:1, 'foo':'bar'}, sort_keys=True)

produces a stack trace that terminates in a function defined in C, with this error:
TypeError: '<' not supported between instances of 'str' and 'int'

That error doesn't give non-experts very much to go on...!

The fix is reasonably simple: coerce dict keys to strings *before* trying to sort the keys, not after. The only fuss in making such a patch is that the behavior has to be fixed in both _json.c and in json/encoder.py.

The only other consistent behavior would be to disallow non-string keys, but that behavior is at this point very well entrenched. So it only makes sense that encoding should be patched to not fail in corner cases.
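
The "coerce before sorting" idea can be sketched in user code, without touching the encoder. This is only an illustration under stated assumptions: `json_key_text` and `dumps_coerce_then_sort` are hypothetical names, only the top-level dict is reordered, and the code relies on plain dicts preserving insertion order (Python 3.7+).

```python
import json


def json_key_text(key):
    """Return the text json.dumps would emit for a basic-type dict key."""
    if isinstance(key, str):
        return key
    if key is None or isinstance(key, (bool, int, float)):
        # json.dumps renders True as 'true', None as 'null', 1 as '1', ...
        return json.dumps(key)
    raise TypeError(
        "keys must be str, int, float, bool or None, not "
        + type(key).__name__
    )


def dumps_coerce_then_sort(mapping, **kwargs):
    """Dump `mapping` with top-level keys ordered by their emitted text."""
    # Only strings are compared during the sort, so mixed key types are fine.
    ordered = dict(sorted(mapping.items(), key=lambda item: json_key_text(item[0])))
    return json.dumps(ordered, **kwargs)


print(dumps_coerce_then_sort({3: 1, "foo": "bar"}))
# → {"3": 1, "foo": "bar"}
```

The original keys are kept and only the sort key is coerced, so two distinct keys such as 1 and "1" are both still emitted, as they would be without sorting.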
msg317187 - (view) Author: Aaron Hall (Aaron Hall) * Date: 2018-05-20 16:49
Now that dicts are sortable, does that make the sort_keys argument redundant?

Should this bug be changed to "won't fix"?
msg317216 - (view) Author: (zachrahan) Date: 2018-05-21 07:29
Well, "wontfix" would be appropriate in the context of deprecating the sort_keys option (over the course of however many releases) and documenting that the new procedure for getting JSON output in a specific order is to ensure that the input dict was created in that order.

Certainly for regression testing, sort_keys is no longer needed, but that's not the only reason people are using that option. (It's certainly not why I use the option -- my use stems from sort_keys improving human readability of the JSON.)

But outside of deprecating sort_keys wholesale, it is still a bug that sort_keys=True can cause an error on input that would otherwise be valid for json.dump[s].
msg317240 - (view) Author: Christian Tanzer (tanzer@swing.co.at) Date: 2018-05-21 14:52
Aaron Hall wrote at Sun, 20 May 2018 16:49:06 +0000:

> Now that dicts are sortable, does that make the sort_keys argument redundant?
>
> Should this bug be changed to "won't fix"?

https://bugs.python.org/issue25457#msg317216 is as good an answer as I
could give.

Considering that I opened the bug more than 2.5 years ago, it doesn't
really matter though.
msg317241 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-05-21 15:26
I'm fairly certain (though not 100%, obviously :) that coercing first and then sorting would be accepted if someone wants to create a PR for this.
msg317243 - (view) Author: Aaron Hall (Aaron Hall) * Date: 2018-05-21 17:26
From a design standpoint, I'm fairly certain the sort_keys argument was created due to Python's dicts being arbitrarily ordered.

Coercing to strings before sorting is unsatisfactory because, e.g., numbers converted to strings sort lexicographically instead of by numeric value.

>>> import json
>>> json.dumps({i:i**2 for i in range(15)}, sort_keys=True)
'{"0": 0, "1": 1, "2": 4, "3": 9, "4": 16, "5": 25, "6": 36, "7": 49, "8": 64, "9": 81, "10": 100, "11": 121, "12": 144, "13": 169, "14": 196}'
>>> json.dumps({str(i):i**2 for i in range(15)}, sort_keys=True)
'{"0": 0, "1": 1, "10": 100, "11": 121, "12": 144, "13": 169, "14": 196, "2": 4, "3": 9, "4": 16, "5": 25, "6": 36, "7": 49, "8": 64, "9": 81}'

Changing the order of operations is just going to create more issues, IMHO.

Now that users can sort their dicts prior to providing them to the function, e.g.:

>>> json.dumps({str(i):i**2 for i in range(15)})
'{"0": 0, "1": 1, "2": 4, "3": 9, "4": 16, "5": 25, "6": 36, "7": 49, "8": 64, "9": 81, "10": 100, "11": 121, "12": 144, "13": 169, "14": 196}'

we could deprecate the argument, or just keep it as-is for hysterical raisins.

Regardless, I'd close this as "won't fix".
msg317246 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-05-21 18:21
json keys *are* strings, so the fact that we support other object types as keys and coerce them to strings is an "extra feature" of python, and is actually a somewhat questionable feature.  The reproducible use case is solved by the fact that dicts are now ordered, with no extra work on the part of the programmer.  Likewise, if you want custom sorting you can ensure your dict is ordered the way you want it to be, as you indicate.  The remaining use case for sort_keys, then (and one for which it is *commonly* used) is sorting the keys lexicographically, and for that, sorting the coerced strings is correct per the json standard (in which all keys are required to be strings).

Note that sort_keys will not be removed for backward compatibility reasons, so the only question is whether or not the increased functionality of coercing first is worth the trouble to implement.  I'm actually only +0 on it, since I don't consider it good practice to json-ize dicts that have non-string keys.  The reason I'm + is because it would increase backward compatibility with python2 (not the ordering of the output, we can't make that work, but in the fact that it would no longer raise an error in python3).

We'll see if other core developers agree or disagree.
msg320706 - (view) Author: James Edwards (jedwards) * Date: 2018-06-29 07:08
This came up in a StackOverflow question[1] today, so I took a stab at addressing the error.  The changes don't restore the 2.x behavior, but just do as R. David Murray suggested and coerce the keys to strings prior to sorting to prevent the error.

The changes in _json.c and json.encoder are handled slightly differently in the case of skipkeys.

Both create a list of (coerced_key, value) pairs, sort it (when specified), and use that in place of the PyDict_Items / .items().

When skipkeys=True and invalid (uncoercible) keys are found, the C code will just not append that item to the coerced_items list while the Python code uses None to signal that item should be filtered out.

(That being said, I'm not a huge fan of the approach I used in the Python code and may rewrite using .append instead of a generator.)

The C code could definitely use a review when it comes to reference counts.
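
For readers who do not want to dig through the commit, the pair-list approach described above might look roughly like this in pure Python. This is an independent sketch written for illustration, not the code from the PR: the function name `coerced_items` is borrowed from the description, the `.append` style is used instead of a generator, and the special spellings the encoder uses for NaN and infinite floats are ignored.

```python
def coerced_items(mapping, sort_keys=False, skipkeys=False):
    """Return a list of (coerced_key, value) pairs for `mapping`.

    Illustrative sketch only; float keys that are NaN or infinite are
    not given the spelling the real encoder would use.
    """
    items = []
    for key, value in mapping.items():
        if isinstance(key, str):
            text = key
        elif key is True:
            text = "true"
        elif key is False:
            text = "false"
        elif key is None:
            text = "null"
        elif isinstance(key, int):
            text = int.__repr__(key)
        elif isinstance(key, float):
            text = float.__repr__(key)
        elif skipkeys:
            # Uncoercible key: leave the item out of the list.
            continue
        else:
            raise TypeError(
                "keys must be str, int, float, bool or None, not "
                + type(key).__name__
            )
        items.append((text, value))
    if sort_keys:
        # Every key is a str by now, so the comparison cannot fail.
        items.sort(key=lambda pair: pair[0])
    return items


print(coerced_items({1: 42, "foo": "bar", None: "nada"}, sort_keys=True))
# → [('1', 42), ('foo', 'bar'), ('null', 'nada')]
```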

Fork commit: https://github.com/jheiv/cpython/commit/8d3612f56a137da0d26b83d00507ff2f11bca9bb

[1] https://stackoverflow.com/questions/51093268/why-am-i-getting-typeerror-unorderable-types-str-int-in-this-code
msg384635 - (view) Author: (naught101) Date: 2021-01-08 04:58
I want to do something like this:

    hashlib.md5(json.dumps(d, sort_keys=True).encode())

So I can check if a dict's contents are the same as a DB version, but I can't guarantee that all the keys are strings, so it breaks, annoyingly. I would very much like the apply-default-function-then-sort approach. Until then, my work-around is this:

    def deep_stringize_dict_keys(item):
        """Converts all keys to strings in a nested dictionary"""
        if isinstance(item, dict):
            return {str(k): deep_stringize_dict_keys(v) for k, v in item.items()}

        if isinstance(item, list):
            # This will check only ONE layer deep for nested dictionaries inside lists.
            # If you need deeper than that, you're probably doing something stupid.
            if any(isinstance(v, dict) for v in item):
                return [deep_stringize_dict_keys(v) if isinstance(v, dict) else v
                        for v in item]

        # I don't care about tuples, since they don't exist in JSON

        return item

Maybe it can be of some use for others.
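
For the hashing use case specifically, a variant of the same idea can be sketched as follows. The names `normalize` and `stable_digest` are hypothetical. It assumes every key is of a basic type (str, int, float, bool, None), converts keys to the text json would emit rather than `str(key)`, and recurses through lists and tuples at any depth. In Python 3, `hashlib.md5` needs bytes, so the JSON text is encoded first.

```python
import hashlib
import json


def normalize(item):
    """Return a copy of `item` in which every dict key is a str.

    Assumes keys are of a basic type; non-str keys are converted with
    json.dumps, so None becomes 'null' and True becomes 'true'.
    """
    if isinstance(item, dict):
        return {
            (k if isinstance(k, str) else json.dumps(k)): normalize(v)
            for k, v in item.items()
        }
    if isinstance(item, (list, tuple)):
        return [normalize(v) for v in item]
    return item


def stable_digest(d):
    """MD5 hex digest of `d` that does not depend on key insertion order."""
    text = json.dumps(normalize(d), sort_keys=True)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


a = {1: "x", "b": [{None: 2, 3: 4}]}
b = {"b": [{3: 4, None: 2}], 1: "x"}
print(stable_digest(a) == stable_digest(b))
# → True
```

One caveat: if a dict contains both 1 and "1" as keys, they collapse into a single key after normalization, so this is only suitable when such collisions cannot occur.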
History
Date User Action Args
2021-01-08 04:58:36 naught101 set nosy: + naught101; messages: + msg384635
2019-09-08 02:36:31 python-dev set pull_requests: + pull_request15382
2019-09-06 20:13:55 josh.r link issue38046 superseder
2018-06-29 17:24:02 jedwards set keywords: + patch; stage: patch review; pull_requests: + pull_request7617
2018-06-29 07:08:41 jedwards set nosy: + jedwards; messages: + msg320706
2018-05-21 18:21:42 r.david.murray set messages: + msg317246
2018-05-21 17:26:33 Aaron Hall set messages: + msg317243
2018-05-21 15:26:37 r.david.murray set nosy: + r.david.murray; messages: + msg317241; versions: + Python 3.8
2018-05-21 14:52:37 tanzer@swing.co.at set messages: + msg317240
2018-05-21 07:29:20 zachrahan set messages: + msg317216
2018-05-20 16:49:06 Aaron Hall set nosy: + Aaron Hall; messages: + msg317187
2017-06-29 02:50:49 zachrahan set nosy: + zachrahan; messages: + msg297246; versions: + Python 3.7, - Python 3.5
2015-10-23 08:14:22 tanzer@swing.co.at set messages: + msg253369
2015-10-23 08:09:53 tanzer@swing.co.at set messages: + msg253368
2015-10-23 03:01:57 josh.r set messages: + msg253363
2015-10-23 02:56:30 josh.r set messages: + msg253362
2015-10-23 02:45:51 josh.r set nosy: + josh.r; messages: + msg253360
2015-10-22 08:01:01 tanzer@swing.co.at create