On Mon, Mar 14, 2011 at 4:09 PM, Raymond Hettinger <report@bugs.python.org> wrote:

Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:

> We seem to be in the worst of both worlds right now
> as I've generated and stored a lot of json that can
> not be read back in

This is unfortunate.  The dumps() call should never have worked in the first place.

I don't think that loads() should be changed to accommodate the dumps() error, though.  JSON is UTF-8 by definition, and it is a useful feature that invalid UTF-8 won't load.
 
I may be wrong, but it appeared that json actually encoded the data as the escape sequence "\uda00" (i.e. six ASCII bytes), which is slightly different from emitting invalid UTF-8 in the encoded JSON itself.  Not sure if this is relevant, but it seems less severe than actually invalid UTF-8 in the bytes.
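For what it's worth, current CPython 3 can be used to check this: the serializer escapes a lone surrogate as the six-character ASCII sequence rather than writing invalid raw UTF-8 bytes (this snippet only illustrates that point, not the original Python 2 behavior):

```python
import json

# A lone surrogate such as U+DA00 is emitted as the six ASCII
# characters \uda00, not as raw (invalid) UTF-8 bytes.
s = json.dumps('\uda00')
print(s)                        # "\uda00"
print(len(s.encode('utf-8')))   # 8 bytes total, including both quotes
```

So the stored files are valid ASCII/UTF-8 at the byte level; the problem is the unpaired surrogate the escape decodes to.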
 

To fix the data you've already created (data that other compliant JSON readers wouldn't be able to parse), I think you need to reprocess those files to make them valid:

  bs.decode('utf-8', errors='ignore').encode('utf-8')
Unfortunately, I don't believe this does anything on Python 2.x, since only Python 3.x's encode/decode flags lone surrogates as invalid.
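On Python 3, one possible repair is to parse the stored JSON (Python 3's loader accepts lone-surrogate escapes), strip the unpaired surrogates from the decoded strings via a UTF-8 round trip with errors='ignore', and re-serialize.  A minimal sketch along those lines (the function names are illustrative, not from the stdlib):

```python
import json

def strip_lone_surrogates(value):
    # Drop unpaired surrogate code points from every string in the
    # decoded structure; a UTF-8 round trip with errors='ignore'
    # silently discards them on Python 3.
    if isinstance(value, str):
        return value.encode('utf-8', errors='ignore').decode('utf-8')
    if isinstance(value, list):
        return [strip_lone_surrogates(v) for v in value]
    if isinstance(value, dict):
        return {strip_lone_surrogates(k): strip_lone_surrogates(v)
                for k, v in value.items()}
    return value

def repair_json(raw):
    # Parse, clean, and re-dump so the result is valid for strict readers.
    return json.dumps(strip_lone_surrogates(json.loads(raw)))
```

This is lossy by design: the lone surrogates carry no recoverable character, so dropping them is the only way to make the output readable by strict decoders.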

 
----------
nosy: +rhettinger
priority: normal -> high

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue11489>
_______________________________________