Title: json.load() can raise UnicodeDecodeError, but this is not documented
Type: behavior Stage: patch review
Components: Documentation Versions: Python 3.10, Python 3.9, Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: docs@python, eric.smith, kamilturek, mattheww, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2021-02-27 18:24 by mattheww, last changed 2021-04-08 06:12 by rhettinger.

Pull Requests
URL Status Linked Edit
PR 25173 closed rhettinger, 2021-04-04 00:52
Messages (5)
msg387780 - Author: Matthew Woodcraft (mattheww) Date: 2021-02-27 18:24
The documentation for json.load() and json.loads() says:

« If the data being deserialized is not a valid JSON document, a JSONDecodeError will be raised. »

But this is not currently entirely true: if the data is provided in bytes form and is not properly encoded in one of the three accepted encodings, UnicodeDecodeError is raised instead.

(I have no opinion on whether the documentation or the behaviour should be changed.)
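
A minimal sketch of the two outcomes described above. The byte strings are illustrative inputs chosen for this example, not taken from the report; both are detected as UTF-8, and the second contains the byte 0xff, which is never valid in UTF-8.

```python
import json

# Correctly encoded UTF-8, but not a complete JSON document.
try:
    json.loads(b'{"a": ')
except json.JSONDecodeError as exc:
    malformed_error = type(exc).__name__

# Structurally fine JSON, but the bytes cannot be decoded as UTF-8.
try:
    json.loads(b'{"a": "\xff"}')
except UnicodeDecodeError as exc:
    encoding_error = type(exc).__name__

print(malformed_error, encoding_error)
```

The same distinction should apply to json.load() when the file object was opened in binary mode, since it reads the data and hands it to json.loads().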
msg387794 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2021-02-27 23:25
As a rule, we don't try to document every exception that can be raised. I could go either way on documenting encoding errors in the json module, although it seems pretty clear that an encoding error is possible in this case.
msg387795 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-02-27 23:43
Normally, we don't (or can't) enumerate all possible exceptions. But in this case, it is worth expanding the documentation so that a person can know which of two common input errors they need to catch:

"If the data being deserialized is not valid UTF-8, a UnicodeDecodeError will be raised, and if the decoded file is not a valid JSON document, a JSONDecodeError will be raised".
msg387851 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-03-01 10:26
json.loads() also accepts data encoded in UTF-16 and UTF-32.
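
As a rough illustration of this point, the sketch below encodes one document in several ways and parses each result. The sample document and the list of encodings are assumptions made for the example; the variants without a byte order mark rely on json.loads() inferring the encoding from the leading bytes.

```python
import json

document = '{"name": "caf\u00e9"}'
expected = {"name": "caf\u00e9"}

# "utf-16" and "utf-32" prepend a byte order mark; the -le/-be forms do not.
encodings = ("utf-8", "utf-16", "utf-32", "utf-16-le", "utf-32-be")

# Parse the same document from each byte representation.
results = {enc: json.loads(document.encode(enc)) for enc in encodings}

for enc, value in results.items():
    print(enc, value == expected)
```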
msg387879 - Author: Matthew Woodcraft (mattheww) Date: 2021-03-01 20:10
Further, "is not valid UTF-8" isn't quite true because the decoding is done with 'surrogatepass' set.
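
A small sketch of what 'surrogatepass' changes, assuming the CPython behaviour described here. The input is the three-byte UTF-8-style encoding of the lone surrogate U+D800 wrapped in JSON quotes; it is an illustrative value, not one from the report. A strict UTF-8 decode rejects it, while json.loads() accepts it.

```python
import json

# UTF-8-style bytes for the lone surrogate U+D800, as a JSON string.
data = b'"\xed\xa0\x80"'

# A strict UTF-8 decode refuses surrogate code points.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# json.loads() decodes bytes with the 'surrogatepass' error handler.
value = json.loads(data)

# ascii() avoids an encoding error when printing the lone surrogate.
print(strict_ok, ascii(value))
```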

In practice I don't think many users will care which of the two exceptions they get for which inputs, but it's useful to know how broad your catch has to be if you're using load() on possibly-invalid inputs.
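
One way to make the catch broad enough, sketched under the assumption that the caller does not need to tell the two failures apart: both JSONDecodeError and UnicodeDecodeError are subclasses of ValueError, so a single except clause covers them. The helper name parse_or_none is hypothetical and exists only for this example.

```python
import json


def parse_or_none(data):
    """Hypothetical helper: return the parsed value, or None for bad input."""
    try:
        return json.loads(data)
    except ValueError:
        # Covers json.JSONDecodeError and UnicodeDecodeError alike.
        return None


# Both exception types share ValueError as a base class.
assert issubclass(json.JSONDecodeError, ValueError)
assert issubclass(UnicodeDecodeError, ValueError)

print(parse_or_none(b'{"a": 1}'))
print(parse_or_none(b'{"a": '))
print(parse_or_none(b"\xff"))
```

Listing the two exceptions explicitly, as in except (json.JSONDecodeError, UnicodeDecodeError), is the narrower alternative when other ValueError sources should propagate.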
Date                 User              Action  Args
2021-04-08 06:12:00  rhettinger        set     assignee: rhettinger ->
2021-04-04 00:52:34  rhettinger        set     keywords: + patch
                                               stage: patch review
                                               pull_requests: + pull_request23914
2021-03-03 21:45:25  kamilturek        set     nosy: + kamilturek
2021-03-01 20:10:48  mattheww          set     messages: + msg387879
2021-03-01 10:26:57  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg387851
2021-02-27 23:43:36  rhettinger        set     assignee: docs@python -> rhettinger
                                               messages: + msg387795
                                               nosy: + rhettinger
2021-02-27 23:25:43  eric.smith        set     versions: - Python 3.6, Python 3.7
                                               nosy: + eric.smith, docs@python
                                               messages: + msg387794
                                               assignee: docs@python
                                               components: + Documentation, - Library (Lib)
2021-02-27 18:24:21  mattheww          create