This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author Eric Appelt
Recipients Eric Appelt, ezio.melotti, rhettinger, serhiy.storchaka, zach.ware
Date 2016-10-28.02:28:58
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1477621743.45.0.713585514803.issue28541@psf.upfronthosting.co.za>
In-reply-to
Content
Thanks for the feedback. I agree that the comment is incorrect for several iterations of the loop that really don't need to be tested at all for '5'. I read the previous issue 17909 more carefully along with RFC 4627, 7159, and EMCA 404 to properly understand the context here.

I notice that although the issue 17909 patch was written in order to implement the patterns in RFC 4627, it is also valid for the additional pattern introduced for utf-16-le in RFC 7159 as described in [1].

Specifically, in RFC 4627 the only valid pattern for utf-16-le begins with XX 00 XX 00 since the first two characters must represent ASCII codepoints. With strings the second character can be higher codepoints allowing for XX 00 XX XX or XX 00 00 XX. In the implementation from issue 17909 the pattern XX 00 XX is first detected and results in utf-16-le, and then the 4th byte is checked for the pattern XX 00 00 XX or XX 00 00 00 to result in utf-16-le or utf-32 respectively.

In the issue 17909 patch the special case of a single character JSON document (i.e. '5') is also specifically accounted for. This case is not mentioned in [1].

So for everything I can try (or think of), this implementation can correctly determine the encoding of any valid JSON document according to RFC 7159. This is good since the documentation [2] makes no distinction that json.loads() would only accept JSON documents in bytes if they adhere to 4627.

The encoding failure mode described in issue 17909 is still an invalid JSON document according to RFC 7159 as the control character \u0000 is not escaped.

So I think overall the implementation is perfectly valid for RFC 7159 / EMCA 404 as suggested by the standard library documentation [2].

To me it seems reasonable to have a unit test to specifically check that the pattern XX 00 00 XX works. For the goal of hitting 100% test coverage as measured by line, I believe that the 2-byte case still needs to be tested.

I have attached a patch that covers these cases along with (maybe too) verbose comments explaining the edge cases. I also changed the comments in the json/__init__.py module itself to clarify the detection patterns as implemented.

I'm not sure how helpful this is to the improvement in of the standard library, but it has been very educational to be so far.

[1] http://www.ietf.org/mail-archive/web/json/current/msg01959.html
[2] https://docs.python.org/3.7/library/json.html
History
Date User Action Args
2016-10-28 02:29:03Eric Appeltsetrecipients: + Eric Appelt, rhettinger, ezio.melotti, zach.ware, serhiy.storchaka
2016-10-28 02:29:03Eric Appeltsetmessageid: <1477621743.45.0.713585514803.issue28541@psf.upfronthosting.co.za>
2016-10-28 02:29:03Eric Appeltlinkissue28541 messages
2016-10-28 02:29:00Eric Appeltcreate