classification
Title: json.loads() raises TypeError on bytes object
Type: enhancement
Stage: needs patch
Components: Library (Lib)
Versions: Python 3.3
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Balthazar.Rouberol, antlong, barry, docs@python, eric.araujo, ezio.melotti, georg.brandl, hhas, pitrou, r.david.murray, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2011-01-21 19:01 by hhas, last changed 2012-04-27 15:28 by serhiy.storchaka.

Files
File name | Uploaded | Description
json.diff | hhas, 2011-01-21 19:01 |
Messages (22)
msg126772 - (view) Author: (hhas) Date: 2011-01-21 19:01
json.loads() accepts strings but raises TypeError on bytes objects. The documentation and API indicate that both should work. A review of the json/__init__.py code shows that the loads() function's 'encoding' argument is ignored and no decoding takes place before the object is passed to JSONDecoder.decode().

Tested on Python 3.1.2 and Python 3.2rc1; fails on both. 

Example:

#################################################

#!/usr/local/bin/python3.2

import json

print(json.loads('123'))
# 123

print(json.loads(b'123'))
# /Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/json/decoder.py:325:  
#   TypeError: can't use a string pattern on a bytes-like object

print(json.loads(b'123', encoding='utf-8'))
# /Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/json/decoder.py:325:  
#   TypeError: can't use a string pattern on a bytes-like object

#################################################

Patch attached.
msg126782 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-21 20:35
Hmm.  According to issue 4136, all bytes support was supposed to have been removed.
msg126785 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-21 20:46
Indeed, the documentation (and function docstring) needs fixing instead. It's a pity we didn't remove the useless `encoding` parameter.
msg126786 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-01-21 20:54
Georg: Is there still time to deprecate the encoding parameter in 3.2?
msg126788 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-21 21:38
I've committed a doc fix in r88137.
msg126831 - (view) Author: (hhas) Date: 2011-01-22 12:28
Doc fix works for me.
msg126986 - (view) Author: Anthony Long (antlong) Date: 2011-01-25 03:38
Works for me: Python 2.7 on Snow Leopard.
msg126997 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-25 11:42
Anthony: this is a Python 3-only problem.
msg133645 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-04-13 07:23
It's now too late for 3.2; should this be done for 3.3?
msg133672 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-04-13 15:40
If you’re talking about deprecating the obsolete encoding argument (maybe it’s time for a new bug report), +1.
msg145343 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2011-10-11 13:44
I'll just mention that the elimination of bytes handling is a bit unfortunate, since this idiom, which works in Python 2, no longer works:

fp = urlopen(url)
json_data = json.load(fp)

/me sad
msg145345 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-11 13:51
> I'll just mention that the elimination of bytes handling is a bit
> unfortunate, since this idiom which works in Python 2 no longer works:
> 
> fp = urlopen(url)
> json_data = json.load(fp)

What if the returned JSON uses a charset other than UTF-8?
msg159359 - (view) Author: Balthazar Rouberol (Balthazar.Rouberol) Date: 2012-04-26 08:20
I know this does not fix anything at the core, but the following workaround lets you use json.loads() on bytes with Python 3.2 (and maybe 3.1?):

Replace
json.loads(raw_data)

with

raw_data = raw_data.decode('utf-8')  # or whichever encoding the data actually uses
json.loads(raw_data)
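
A minimal sketch of the same workaround, applied to the file-object idiom mentioned earlier in the thread (json.load() on a binary stream). The helper name `load_json_from_binary` and the UTF-8 default are assumptions made for illustration; neither is part of the json module.

```python
import io
import json


def load_json_from_binary(fp, encoding="utf-8"):
    """Read bytes from a binary file-like object, decode them, then parse.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    raw = fp.read()              # bytes, e.g. what urlopen(url).read() returns
    text = raw.decode(encoding)  # the explicit decoding step discussed above
    return json.loads(text)


# An in-memory binary stream stands in for a network response here.
fake_response = io.BytesIO('{"name": "café", "n": 123}'.encode("utf-8"))
result = load_json_from_binary(fake_response)
```

Passing the encoding explicitly keeps the caller responsible for the charset question raised above, rather than hiding a guess inside the helper.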
msg159360 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-26 08:34
> What if the returned JSON uses a charset other than utf-8 ?

According to RFC 4627: "JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8." RFC 4627 also offers a way to autodetect other Unicode encodings.
msg159364 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-26 13:03
Well, adding support for bytes objects using the spec from RFC 4627 (or at least with utf-8 as a default) may be an enhancement for 3.3.
msg159366 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-26 14:07
Things are a little more complicated. '123' is not valid JSON text according to RFC 4627 (the top-level element can only be an object or an array). This means that the autodetection algorithm will not always work for such non-standard data.

If we can parse binary data, then there must be a way to generate binary data in at least one of the Unicode encodings.

By the way, the documentation should link to RFC 4627 and explain how the current implementation differs from it.
msg159368 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-26 14:21
> Things are a little more complicated. '123' is not a valid JSON
> according to RFC 4627 (the top-level element can only be an object or
> an array). This means that the autodetection algorithm will not always
> work for such non-standard data.

The autodetection algorithm needn't examine all of the first 4 bytes. If the
first 2 bytes are non-zero, you have UTF-8 data. Otherwise, the JSON text
will be at least 4 bytes long (since it's either UTF-16 or UTF-32).
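
The detection scheme discussed here could be sketched roughly as follows. The null-byte patterns follow section 3 of RFC 4627; the function names `detect_json_encoding` and `loads_bytes` are hypothetical, byte order marks are not handled, and the fallback for short inputs (fewer than 4 bytes containing a null) is an assumption of this sketch rather than something the RFC specifies.

```python
import json


def detect_json_encoding(data):
    """Guess the Unicode encoding of JSON bytes from the null-byte pattern.

    Hypothetical helper. Patterns from RFC 4627, section 3:
        00 00 00 xx  UTF-32BE        xx 00 00 00  UTF-32LE
        00 xx 00 xx  UTF-16BE        xx 00 xx 00  UTF-16LE
        xx xx xx xx  UTF-8
    """
    head = data[:4]
    # Shortcut from the thread: two non-zero leading bytes mean UTF-8.
    if len(head) < 2 or (head[0] != 0 and head[1] != 0):
        return "utf-8"
    if len(head) == 4:
        if head[:3] == b"\x00\x00\x00":
            return "utf-32-be"
        if head[1:] == b"\x00\x00\x00":
            return "utf-32-le"
    # Assumption: anything else containing a null in the first two bytes is UTF-16.
    return "utf-16-be" if head[0] == 0 else "utf-16-le"


def loads_bytes(data):
    """Decode JSON bytes with the detected encoding, then parse the text."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

The sketch relies on the first two characters of the text being ASCII, which holds for RFC 4627 documents (they start with an object or an array) but not necessarily for the non-standard inputs discussed above.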
msg159388 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-26 15:48
I mean a string that starts with '\u0000', i.e. b'"\x00...'.
msg159391 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-26 16:12
On Thursday 26 April 2012 at 15:48 +0000, Serhiy Storchaka wrote:
> 
> I mean a string that starts with '\u0000'. b'"\x00...'.

According to the RFC, that should be escaped:

   All Unicode characters may be placed within the
   quotation marks except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

And indeed:

>>> json.loads('"\u0000"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/opt/lib/python3.2/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/opt/lib/python3.2/json/decoder.py", line 351, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/opt/lib/python3.2/json/decoder.py", line 367, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1 column 1 (char 1)
>>> json.loads('"\\u0000"')
'\x00'
msg159395 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-26 16:21
According to the current implementation, this is acceptable:

>>> json.loads('"\u0000"', strict=False)
'\x00'
msg159454 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-27 14:06
> According to current implementation this is acceptable.

Then perhaps auto-detection can be restricted to strict mode? Non-strict mode would always use utf-8.
Or we can just skip auto-detection altogether (I don't think many people produce utf-16 or utf-32 JSON; that would be a waste of bandwidth for no obvious benefit).
msg159469 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-27 15:28
A related question concerns errors. How should the user be informed if decoding with the detected encoding fails? Should we leave the UnicodeDecodeError as it is or convert it to ValueError? And if there is a syntax error in the JSON, the exception will refer to a position in the decoded string; should we translate it to a position in the original binary string?
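
One possible answer to both questions, as a hedged sketch rather than a proposal from the thread. It assumes a later Python (3.5+), where json.JSONDecodeError exposes the character position as `pos`; the helper names are hypothetical. Note that UnicodeDecodeError is already a subclass of ValueError, so callers catching ValueError see decoding failures without any conversion.

```python
import json


def char_to_byte_offset(text, pos, encoding):
    """Map a character index in `text` to a byte offset in its encoded form.

    Assumes `encoding` names a codec that emits no byte order mark
    (e.g. "utf-8" or "utf-16-le", not plain "utf-16").
    """
    return len(text[:pos].encode(encoding))


def loads_bytes_reporting_bytes(data, encoding="utf-8"):
    """Parse JSON bytes; report syntax errors by byte offset (hypothetical helper)."""
    # A UnicodeDecodeError raised here is itself a ValueError and simply propagates.
    text = data.decode(encoding)
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        offset = char_to_byte_offset(text, exc.pos, encoding)
        raise ValueError("%s at byte offset %d" % (exc.msg, offset)) from exc
```

Re-encoding the prefix costs time proportional to the error position, which seems acceptable on an error path; whether byte offsets are actually more useful to callers than character offsets remains the open question.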
History
Date | User | Action | Args
2012-04-27 15:28:06 | serhiy.storchaka | set | messages: + msg159469
2012-04-27 14:06:12 | pitrou | set | messages: + msg159454
2012-04-26 16:21:34 | serhiy.storchaka | set | messages: + msg159395
2012-04-26 16:12:44 | pitrou | set | messages: + msg159391
2012-04-26 15:48:23 | serhiy.storchaka | set | messages: + msg159388
2012-04-26 15:09:07 | eric.araujo | set | title: json.loads() throws TypeError on bytes object -> json.loads() raises TypeError on bytes object
2012-04-26 14:21:40 | pitrou | set | messages: + msg159368
2012-04-26 14:07:45 | serhiy.storchaka | set | messages: + msg159366
2012-04-26 13:03:55 | pitrou | set | versions: + Python 3.3, - Python 3.2; messages: + msg159364; assignee: docs@python -> (unassigned); components: + Library (Lib), - Documentation; type: behavior -> enhancement; stage: needs patch
2012-04-26 08:34:31 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg159360
2012-04-26 08:20:56 | Balthazar.Rouberol | set | nosy: + Balthazar.Rouberol; messages: + msg159359
2011-10-11 13:51:37 | pitrou | set | messages: + msg145345
2011-10-11 13:44:47 | barry | set | nosy: + barry; messages: + msg145343
2011-04-13 15:40:46 | eric.araujo | set | messages: + msg133672; versions: - Python 3.1
2011-04-13 07:23:28 | ezio.melotti | set | nosy: + ezio.melotti; messages: + msg133645
2011-01-25 11:42:30 | r.david.murray | set | nosy: georg.brandl, hhas, pitrou, eric.araujo, r.david.murray, docs@python, antlong; messages: + msg126997
2011-01-25 03:38:49 | antlong | set | nosy: + antlong; messages: + msg126986
2011-01-22 12:28:33 | hhas | set | nosy: georg.brandl, hhas, pitrou, eric.araujo, r.david.murray, docs@python; messages: + msg126831
2011-01-21 21:38:06 | pitrou | set | nosy: georg.brandl, hhas, pitrou, eric.araujo, r.david.murray, docs@python; messages: + msg126788
2011-01-21 20:54:35 | eric.araujo | set | nosy: + eric.araujo, georg.brandl; messages: + msg126786
2011-01-21 20:46:48 | pitrou | set | nosy: + docs@python; messages: + msg126785; assignee: docs@python; components: + Documentation, - Library (Lib)
2011-01-21 20:35:32 | r.david.murray | set | nosy: + r.david.murray, pitrou; messages: + msg126782
2011-01-21 19:01:47 | hhas | create |