classification
Title: Improve test coverage for json library - loading bytes
Type: enhancement Stage: resolved
Components: Tests Versions: Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Eric Appelt, ezio.melotti, python-dev, rhettinger, serhiy.storchaka, zach.ware
Priority: normal Keywords: patch

Created on 2016-10-27 03:06 by Eric Appelt, last changed 2016-10-30 21:10 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
mywork.patch Eric Appelt, 2016-10-27 03:06 review
mywork2.patch Eric Appelt, 2016-10-27 12:29 review
mywork3.patch Eric Appelt, 2016-10-28 02:28 review
mywork4.patch Eric Appelt, 2016-10-28 21:16 review
Messages (9)
msg279521 - (view) Author: Eric Appelt (Eric Appelt) * Date: 2016-10-27 03:06
Increase test coverage of the json library, specifically the detect_encoding() function in the __init__ module, which is used to handle the autodetection of the encoding of a bytes object passed to json.loads().

This function was added in issue 17909 which extended the json.loads() function to accept bytes.

Note that this is a small patch as I am following section 5 of the developer guide and I am trying to acquaint myself with the process as a first time contributor. I found this missing coverage just by semi-randomly looking at individual modules of interest. Please let me know if I have made any general mistakes in constructing this ticket. Thanks!
msg279531 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-27 08:40
Isn't the test added in issue17909 enough?
msg279537 - (view) Author: Eric Appelt (Eric Appelt) * Date: 2016-10-27 12:29
I looked back and something is clearly wrong with my coverage reporting setup, sorry :(

When I move the test introduced in issue 17909, 'test_bytes_decode' from the module Lib/test/test_json/test_unicode.py to Lib/test/test_json/test_decode.py that particular test is properly traced and I see that the function is mostly covered, so I need to figure out what is going wrong in my setup.

The test already present is more comprehensive as it includes BOM, so I believe what I wrote is generally not helpful.

I did notice that the existing test does not cover the edge-case of a 2-byte utf-16 sequence that can be generated for a valid JSON object represented by a single codepoint, for example '5' or '7'. I created a new patch to augment the existing test to cover this special case rather than adding a test.
msg279543 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-27 15:55
The patch in issue17909 was written to implement encoding detecting described in RFC 4627 [1]. And the test uses RFC 4627 conforming data. A single codepoint "5" is not valid in RFC 4627, but is valid in RFC 7159 [2].

The comment in your patch is not accurate since '5' is encoded to 1 byte with utf-8 and 4 bytes with utf-32*.

[1] https://tools.ietf.org/html/rfc4627
[2] https://tools.ietf.org/html/rfc7159
msg279573 - (view) Author: Eric Appelt (Eric Appelt) * Date: 2016-10-28 02:28
Thanks for the feedback. I agree that the comment is incorrect for several iterations of the loop that really don't need to be tested at all for '5'. I read the previous issue 17909 more carefully along with RFC 4627, 7159, and EMCA 404 to properly understand the context here.

I notice that although the issue 17909 patch was written in order to implement the patterns in RFC 4627, it is also valid for the additional pattern introduced for utf-16-le in RFC 7159 as described in [1].

Specifically, in RFC 4627 the only valid pattern for utf-16-le begins with XX 00 XX 00 since the first two characters must represent ASCII codepoints. With strings the second character can be higher codepoints allowing for XX 00 XX XX or XX 00 00 XX. In the implementation from issue 17909 the pattern XX 00 XX is first detected and results in utf-16-le, and then the 4th byte is checked for the pattern XX 00 00 XX or XX 00 00 00 to result in utf-16-le or utf-32 respectively.

In the issue 17909 patch the special case of a single character JSON document (i.e. '5') is also specifically accounted for. This case is not mentioned in [1].

So for everything I can try (or think of), this implementation can correctly determine the encoding of any valid JSON document according to RFC 7159. This is good since the documentation [2] makes no distinction that json.loads() would only accept JSON documents in bytes if they adhere to 4627.

The encoding failure mode described in issue 17909 is still an invalid JSON document according to RFC 7159 as the control character \u0000 is not escaped.

So I think overall the implementation is perfectly valid for RFC 7159 / EMCA 404 as suggested by the standard library documentation [2].

To me it seems reasonable to have a unit test to specifically check that the pattern XX 00 00 XX works. For the goal of hitting 100% test coverage as measured by line, I believe that the 2-byte case still needs to be tested.

I have attached a patch that covers these cases along with (maybe too) verbose comments explaining the edge cases. I also changed the comments in the json/__init__.py module itself to clarify the detection patterns as implemented.

I'm not sure how helpful this is to the improvement in of the standard library, but it has been very educational to be so far.

[1] http://www.ietf.org/mail-archive/web/json/current/msg01959.html
[2] https://docs.python.org/3.7/library/json.html
msg279591 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-28 07:37
In general LGTM, but I added few minor comments on Rietveld. Click on the "review" near your patch. You should also receive an email notification, but it can be moved to your Spam folder.
msg279628 - (view) Author: Eric Appelt (Eric Appelt) * Date: 2016-10-28 21:16
I was able to inspect the review, and implemented your suggestions. This is attached as a new patch. Please let me know if this is not the correct workflow.

Thank you for the prompt review and help as I learn the python tracking and review process!
msg279748 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-10-30 21:00
New changeset ea4cc65fc0fc by Serhiy Storchaka in branch '3.6':
Issue #28541: Improve test coverage for encoding detection in json library.
https://hg.python.org/cpython/rev/ea4cc65fc0fc

New changeset e9169a8c0692 by Serhiy Storchaka in branch 'default':
Issue #28541: Improve test coverage for encoding detection in json library.
https://hg.python.org/cpython/rev/e9169a8c0692
msg279749 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-30 21:01
Thank you for your contribution Eric.
History
Date User Action Args
2016-10-30 21:10:18serhiy.storchakasetstatus: open -> closed
assignee: serhiy.storchaka
resolution: fixed
stage: patch review -> resolved
2016-10-30 21:01:17serhiy.storchakasetmessages: + msg279749
2016-10-30 21:00:39python-devsetnosy: + python-dev
messages: + msg279748
2016-10-28 21:16:07Eric Appeltsetfiles: + mywork4.patch

messages: + msg279628
2016-10-28 07:37:31serhiy.storchakasetmessages: + msg279591
2016-10-28 02:29:03Eric Appeltsetfiles: + mywork3.patch

messages: + msg279573
2016-10-27 15:55:39serhiy.storchakasetmessages: + msg279543
2016-10-27 12:29:40Eric Appeltsetfiles: + mywork2.patch

messages: + msg279537
2016-10-27 08:40:39serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg279531
2016-10-27 03:20:28zach.waresetnosy: + rhettinger, ezio.melotti, zach.ware
stage: patch review
type: enhancement

versions: + Python 3.7
2016-10-27 03:06:57Eric Appeltcreate