classification
Title: json.dumps not parsable by json.loads (on Linux only)
Type: behavior
Stage:
Components: Library (Lib), Unicode, Windows
Versions: Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: Brian.Merrell, belopolsky, ezio.melotti, haypo, merrellb, pitrou, rhettinger, tchrist
Priority: normal
Keywords:

Created on 2011-03-13 23:17 by Brian.Merrell, last changed 2011-10-09 23:24 by rhettinger.

Messages (7)
msg130779 - Author: Brian Merrell (Brian.Merrell) Date: 2011-03-13 23:17
The following works on Win7x64 Python 2.6.5 and breaks on Ubuntu 10.04x64-2.6.5.  This raises three issues:

1)  Shouldn't anything generated by json.dumps be parsed by json.loads?
2)  It appears this is an invalid unicode character.  Shouldn't this be caught by decode("utf8")?
3)  Why does Windows raise no issue with this while Linux does?

import json
unicode_bytes = '\xed\xa8\x80'
unicode_string = unicode_bytes.decode("utf8")
json_encoded = json.dumps({"my_key": unicode_string})
json.loads(json_encoded)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
    obj, end = self._scanner.iterscan(s, **kw).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
    value, end = iterscan(s, idx=end, context=context).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
    return scanstring(match.string, match.end(), encoding, strict)
ValueError: Invalid \uXXXX escape: line 1 column 14 (char 14)
msg130846 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2011-03-14 16:19
> It appears this is an invalid unicode character.
> Shouldn't this be caught by decode("utf8")

It should and it is in Python 3.x:

>>> b'\xed\xa8\x80'.decode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

Python 2.7 behavior seems to be a bug.

>>> '\xed\xa8\x80'.decode("utf8")
u'\uda00'

Note also the following difference:

In 3.x:

>>> b'\xed\xa8\x80'.decode("utf8", 'replace')
'��'

In 2.7:

>>> '\xed\xa8\x80'.decode("utf8", 'replace')
u'\uda00'

I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip just fine in 2.x and there is likely to be existing code that relies on this.

>  Shouldn't anything generated by json.dumps be parsed by json.loads?

This on the other hand should probably be fixed by either rejecting lone surrogates in json.dumps or accepting them in json.loads or both.  The last alternative would be consistent with the common wisdom of being conservative in what you produce but liberal in what you accept.
msg130862 - Author: Brian Merrell (Brian.Merrell) Date: 2011-03-14 17:31
> I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip just fine in 2.x and there is likely to be existing code that relies on this.

I generally agree but am then at a loss as to how to detect and deal with lone surrogates(eg "ignore", "replace", etc) in 2.x when interacting with services/libraries (such as Python's own json.loads) that take a stricter view.
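
For what it is worth, on Python 3 the strict UTF-8 codec itself can be used to detect and drop lone surrogates. The following is only a sketch under that assumption (it does not help on 2.x, where the codec lets surrogates through), and the helper names first_surrogate_index and drop_surrogates are made up for illustration:

```python
def first_surrogate_index(text):
    """Return the index of the first character UTF-8 cannot encode, or None."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        # exc.start is the position of the offending character in text
        return exc.start
    return None


def drop_surrogates(text):
    """Silently remove characters that the strict UTF-8 codec rejects."""
    return text.encode("utf-8", "ignore").decode("utf-8")
```

A caller could run first_surrogate_index() on each string before serializing it and decide what to do when it returns a position instead of None.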

>>  Shouldn't anything generated by json.dumps be parsed by json.loads?

> This on the other hand should probably be fixed by either rejecting lone surrogates in json.dumps or accepting them in json.loads or both.  The last alternative would be consistent with the common wisdom of being conservative in what you produce but liberal in what you accept.

We seem to be in the worst of both worlds right now as I've generated and stored a lot of json that can not be read back in.  Could the JSON library simply leverage Python's Unicode interpreter instead of performing its own validation?  We could pass it "ignore", "replace", etc.  Regardless, I think we certainly need to remove the strict JSON loads() validation especially when it isn't enforced by dumps().
msg130889 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-03-14 20:09
> We seem to be in the worst of both worlds right now 
> as I've generated and stored a lot of json that can 
> not be read back in

This is unfortunate.  The dumps() should have never worked in the first place.

I don't think that loads() should be changed to accommodate the dumps() error though.  JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load.

To fix the data you've already created (data that other compliant JSON readers wouldn't be able to parse), I think you need to reprocess those files to make them valid:

   bs.decode('utf-8', errors='ignore').encode('utf-8')

Then we need to fix dumps so that it doesn't silently create invalid JSON.

> This on the other hand should probably be 
> fixed by either rejecting lone surrogates 
> in json.dumps or accepting them in json.loads or both.

Rejection is the right way to go.  For the most part,
it is never helpful to create invalid JSON files that
other readers can't and shouldn't read.
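
As an illustration of the rejection approach, a wrapper along these lines could refuse such strings before they are ever written. This is a minimal Python 3 sketch, not what the json module does; strict_dumps is a hypothetical name, and it only walks dicts, lists and tuples:

```python
import json


def strict_dumps(obj, **kwargs):
    """Like json.dumps, but raise ValueError if any string has a lone surrogate."""

    def check(value):
        if isinstance(value, str):
            try:
                value.encode("utf-8")  # strict codec rejects surrogates
            except UnicodeEncodeError as exc:
                raise ValueError("lone surrogate in %r" % value) from exc
        elif isinstance(value, dict):
            for key, item in value.items():
                check(key)
                check(item)
        elif isinstance(value, (list, tuple)):
            for item in value:
                check(item)

    check(obj)
    return json.dumps(obj, **kwargs)
```

With this, strict_dumps({"my_key": "\uda00"}) raises ValueError instead of producing output, while ordinary data is serialized exactly as json.dumps would.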
msg130891 - Author: Brian (merrellb) Date: 2011-03-14 20:21
On Mon, Mar 14, 2011 at 4:09 PM, Raymond Hettinger <report@bugs.python.org> wrote:

>
> Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:
>
> > We seem to be in the worst of both worlds right now
> > as I've generated and stored a lot of json that can
> > not be read back in
>
> This is unfortunate.  The dumps() should have never worked in the first
> place.
>
> I don't think that loads() should be changed to accommodate the dumps()
> error though.  JSON is UTF-8 by definition and it is a useful feature that
> invalid UTF-8 won't load.
>

I may be wrong, but it appeared that json actually encoded the data as the
string "\uda00" (i.e. six ASCII bytes), which is slightly different from
putting the UTF-8 encoding of that code point into the JSON itself.  Not sure
if this is relevant, but it seems less severe than actually invalid UTF-8
coding in the bytes.

Unfortunately, I don't believe the suggested decode/encode round trip does
anything on Python 2.x, as only Python 3.x encode/decode flags this as invalid.

msg133662 - Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-13 12:37
print(repr(json.loads(json.dumps({u"my_key": u'\uda00'}))['my_key'])):
 - displays u'\uda00' in Python 2.7, 3.2 and 3.3
 - raises a ValueError('Invalid \uXXXX escape: ...') on loads() in Python 2.6
 - raises a ValueError('Unpaired high surrogate: ...') on loads() in Python 3.1

json version changed in Python 2.7: see the issue #4136.

See also this important change in simplejson:
http://code.google.com/p/simplejson/source/detail?r=113

We only fix security bugs in Python 2.6, not bugs. I don't think that this issue is a security bug in Python 2.6.

We might change Python 3.1 behaviour.
msg144646 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-29 22:22
RFC 4627 doesn't say much about lone surrogates:
   A string is a sequence of zero or more Unicode characters [UNICODE].
   [...]

   All Unicode characters may be placed within the
   quotation marks except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

   Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A though
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".
   [...]

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

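The twelve-character form in the RFC excerpt above follows from the usual UTF-16 surrogate arithmetic. A small sketch of that computation (the function name to_surrogate_escape is only illustrative):

```python
def to_surrogate_escape(code_point):
    """Return the \\uXXXX\\uXXXX escape for a code point outside the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits select the low surrogate
    return "\\u%04X\\u%04X" % (high, low)
```

For the G clef, to_surrogate_escape(0x1D11E) gives \uD834\uDD1E, matching the RFC example. A lone surrogate is one half of such a pair appearing without the other.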

Raymond> JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load.

Even if the input strings are not encodable in UTF-8 because they contain lone surrogates, they can still be converted to an \uXXXX escape, and the resulting JSON document will be valid UTF-8.
AFAIK json always uses \uXXXX, so it doesn't produce invalid UTF-8 documents.
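
A quick way to check this, sketched here for Python 3 with the default ensure_ascii=True (the variable names are just for illustration):

```python
import json

# Default ensure_ascii=True: the lone surrogate is written as a \uXXXX escape.
escaped = json.dumps({"my_key": "\uda00"})
escaped_bytes = escaped.encode("utf-8")  # succeeds: the document is pure ASCII

# With ensure_ascii=False the raw code point is kept, and UTF-8 cannot encode it.
raw = json.dumps({"my_key": "\uda00"}, ensure_ascii=False)
try:
    raw.encode("utf-8")
    raw_is_encodable = True
except UnicodeEncodeError:
    raw_is_encodable = False
```

So the escaped document is valid UTF-8 as a byte stream even though the string it describes contains a lone surrogate; only the unescaped variant fails to encode.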

While decoding, both json.loads('"\xed\xa0\x80"') and json.loads('"\ud800"') result in u'\ud800', but the first is not a valid UTF-8 document because it contains an invalid UTF-8 byte sequence that represents a lone surrogate, whereas the second one contains only ASCII bytes and it's therefore valid.
Python 2.7 should probably reject '"\xed\xa0\x80"', but since its UTF-8 codec is somewhat permissive already, I'm not sure it makes much sense changing the behavior now.  Python 3 doesn't have this problem because it works only with unicode strings, so you can't pass invalid UTF-8 byte sequences.

OTOH the Unicode standard says that lone surrogates shouldn't be passed around, so we might decide to replace them with the replacement char U+FFFD, raise an error, or even provide a way to decide what should be done with them (something like the errors argument of codecs).
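
To make the last option concrete, here is a rough Python 3 sketch of what a codec-style errors argument could look like. The function handle_surrogates and its policy names are hypothetical and not part of the json module:

```python
import re

# In a Python 3 str, any code point in U+D800..U+DFFF is an unpaired surrogate.
_SURROGATE = re.compile("[\ud800-\udfff]")


def handle_surrogates(text, errors="strict"):
    """Apply a codec-like error policy to lone surrogates in text."""
    if errors == "strict":
        match = _SURROGATE.search(text)
        if match:
            raise ValueError("lone surrogate at index %d" % match.start())
        return text
    if errors == "replace":
        return _SURROGATE.sub("\ufffd", text)  # U+FFFD REPLACEMENT CHARACTER
    if errors == "ignore":
        return _SURROGATE.sub("", text)
    raise ValueError("unknown error policy: %r" % errors)
```

An encoder could run each string through such a function before escaping it, with "strict" as the default so that invalid data is reported rather than silently stored.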
History
Date                 User           Action  Args
2011-10-09 23:24:10  rhettinger     set     priority: high -> normal; assignee: rhettinger
2011-10-09 23:20:43  ezio.melotti   set     nosy: + pitrou, tchrist; versions: - Python 2.6
2011-09-29 22:22:22  ezio.melotti   set     messages: + msg144646
2011-04-13 12:37:44  haypo          set     messages: + msg133662
2011-04-13 08:34:13  ezio.melotti   set     nosy: + ezio.melotti
2011-04-13 08:30:49  ezio.melotti   set     files: - unnamed
2011-03-14 20:21:05  merrellb       set     files: + unnamed; messages: + msg130891; nosy: + merrellb
2011-03-14 20:09:35  rhettinger     set     priority: normal -> high; nosy: + rhettinger; messages: + msg130889
2011-03-14 17:31:41  Brian.Merrell  set     nosy: belopolsky, haypo, Brian.Merrell; messages: + msg130862
2011-03-14 16:19:06  belopolsky     set     nosy: + haypo, belopolsky; messages: + msg130846; versions: + Python 2.7
2011-03-13 23:17:19  Brian.Merrell  create