
Author belopolsky
Recipients Brian.Merrell, belopolsky, vstinner
Date 2011-03-14.16:19:06
Message-id <1300119547.45.0.353310353078.issue11489@psf.upfronthosting.co.za>
Content
> It appears this is an invalid unicode character.
> Shouldn't this be caught by decode("utf8")

It should be, and it is in Python 3.x:

>>> b'\xed\xa8\x80'.decode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

The Python 2.7 behavior seems to be a bug:

>>> '\xed\xa8\x80'.decode("utf8")
u'\uda00'

Note also the following difference:

In 3.x:

>>> b'\xed\xa8\x80'.decode("utf8", 'replace')
'��'

In 2.7:

>>> '\xed\xa8\x80'.decode("utf8", 'replace')
u'\uda00'

I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip just fine in 2.x, and there is likely to be existing code that relies on this.
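
For comparison, here is a minimal sketch of how the 2.x-style round trip can be requested explicitly in Python 3, assuming the 'surrogatepass' error handler is available (it is a standard handler for the UTF-8 codec in recent 3.x releases). The variable names are illustrative only.

```python
# UTF-8-style byte sequence for the lone surrogate U+DA00 (from the report).
data = b'\xed\xa8\x80'

# Strict decoding rejects the sequence, as shown for 3.x above.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogatepass' lets the surrogate through, similar to what 2.7 does by default.
text = data.decode("utf-8", "surrogatepass")

# Encoding with the same handler gives back the original bytes.
round_tripped = text.encode("utf-8", "surrogatepass")
```

With this handler the lenient behavior is opt-in, so code that depends on round-tripping lone surrogates can keep doing so while the default stays strict.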

> Shouldn't anything generated by json.dumps be parsed by json.loads?

This, on the other hand, should probably be fixed by rejecting lone surrogates in json.dumps, by accepting them in json.loads, or both. Doing both would be consistent with the common wisdom of being conservative in what you produce but liberal in what you accept.
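
The "reject in json.dumps" option could be approximated at the application level along these lines. This is only a sketch for Python 3: `strict_dumps` is a hypothetical helper, not part of the json module, and it relies on the strict UTF-8 encoder refusing surrogate code points.

```python
import json


def strict_dumps(obj):
    """Serialize obj to JSON, refusing strings that contain lone surrogates.

    Hypothetical helper for illustration; not a json module API.
    """
    # With ensure_ascii=False, surrogates are emitted unescaped, so a strict
    # UTF-8 encode of the result fails if any are present in the input.
    text = json.dumps(obj, ensure_ascii=False)
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError("lone surrogate in JSON input") from exc
    return text


# A lone surrogate nested anywhere in the structure is rejected.
try:
    strict_dumps(["\uda00"])
    rejected = False
except ValueError:
    rejected = True
```

Checking the serialized text rather than walking the object keeps the sketch short and covers nested containers and dictionary keys, at the cost of one extra encoding pass.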