This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Expose parse_string in JSONDecoder
Type: enhancement
Stage:
Components: Library (Lib)
Versions: Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: bob.ippolito
Nosy List: Adrián Orive, Levi Cameron, bob.ippolito, ezio.melotti, methane, oberstet, rhettinger, serhiy.storchaka
Priority: normal
Keywords:

Created on 2017-04-05 08:22 by oberstet, last changed 2022-04-11 14:58 by admin.

Files
File name    Uploaded                        Description
scanner.py   Adrián Orive, 2018-01-10 13:52  Modified JSON module's scanner.py
decoder.py   Adrián Orive, 2018-01-10 13:53  Modified JSON module's decoder.py
__init__.py  Adrián Orive, 2018-01-10 13:54  Modified JSON module's __init__.py
Messages (14)
msg291167 - (view) Author: Tobias Oberstein (oberstet) Date: 2017-04-05 08:22
Though the JSONDecoder already has all the hooks internally to allow for a custom parse_string (https://github.com/python/cpython/blob/master/Lib/json/decoder.py#L330), this currently is not exposed in the constructor JSONDecoder.__init__.

It would be nice to expose it. Currently, I need to hack around it: https://gist.github.com/oberstet/fa8b8e04b8d532912bd616d9db65101a
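
For context, a minimal sketch of one way to do this kind of hack today (illustrative only, not necessarily what the linked gist does): assign parse_string on the instance and rebuild the scanner with the pure-Python py_make_scanner, because the C-accelerated scanner never consults that attribute.

    import json
    import json.scanner
    from json.decoder import scanstring

    def my_parse_string(s, end, strict=True):
        # reuse the stock scanstring, then post-process the decoded value
        value, end = scanstring(s, end, strict)
        return value, end  # placeholder for a custom transformation

    class PatchedDecoder(json.JSONDecoder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.parse_string = my_parse_string
            # __init__ already built a scanner around the default parse_string,
            # so the scanner must be rebuilt; the pure-Python scanner is needed
            # because the C one ignores this attribute entirely.
            self.scan_once = json.scanner.py_make_scanner(self)

    json.loads('["hello"]', cls=PatchedDecoder)
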
msg291172 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-05 09:38
The JSONDecoder constructor already has too many parameters. Adding new ones will decrease usability. For such an uncommon case I think overriding a method in a subclass is the best solution.
msg291173 - (view) Author: Tobias Oberstein (oberstet) Date: 2017-04-05 09:58
I agree, my use case is probably exotic: transparent roundtripping of binaries over JSON, using a leading \0 byte marker to distinguish plain strings from base64-encoded binaries.
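
As an illustration of that convention (a sketch only; the helper names are made up, not the project's API): binary values travel as base64 text prefixed with a \0 marker and are unwrapped again after decoding.

    import base64
    import json

    def wrap(value):
        # bytes become "\x00" + base64 text; plain strings pass through untouched
        if isinstance(value, bytes):
            return "\x00" + base64.b64encode(value).decode("ascii")
        return value

    def unwrap(value):
        if isinstance(value, str) and value.startswith("\x00"):
            return base64.b64decode(value[1:])
        return value

    payload = json.dumps({"blob": wrap(b"\x01\x02\x03"), "text": "plain"})
    decoded = {k: unwrap(v) for k, v in json.loads(payload).items()}
    # decoded == {"blob": b"\x01\x02\x03", "text": "plain"}
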

FWIW, I do think, however, that adding a "parse_string" kw param to the ctor of JSONDecoder would at least fit the current approach: there are parse_xxx parameters for all the other things already.

If overriding string parsing has to be done via subclassing, while all the others stay with the kw parameter approach, that could be slightly confusing too, because it loses consistency.

Switching everything to subclassing/overriding for _all_ parse_XXX is, I guess, a no-go, because it would break existing stuff?

> For me, in my situation, it'll be messy anyway, because I need to support Py2 and 3, and CPy and PyPy .. I just filed the issue for "completeness".
msg291180 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-05 15:10
I agree with Serhiy that the JSON module is already complex enough that adding more features will have a negative net effect on usability.

Bob, what do you think?
msg291186 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2017-04-05 16:00
I agree with that sentiment. If we were to support this use case, I would rather put together a coherent way to augment the parsing/encoding of anything than bolt it onto what we have.
msg292707 - (view) Author: Levi Cameron (Levi Cameron) Date: 2017-05-02 00:27
A less exotic use case than oberstet's: converting code from Python 2 -> 3 and trying to maintain the Python 2 behaviour of returning all strings as bytes rather than unicode strings.

oberstet's solution works but is very much tied to the current implementation.
msg292782 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2017-05-02 15:46
That's not a very convincing argument. Python 2 only returns byte strings if the input is a byte string and the contents of the string are all ASCII. Facilitating that sort of behavior in 3 would probably cause more issues than it solves.
msg309763 - (view) Author: Adrián Orive (Adrián Orive) Date: 2018-01-10 13:52
I found the same problem. My case seems to be less exotic, as what I'm trying to do is parse some of these strings into decimal.Decimal or datetime.datetime formats. Returning a decimal as a string is becoming quite common in REST APIs to ensure there are no floating-point errors.

This is not a simple "missing parameter" problem:

1) JSONDecoder has 6 parse_XXX attributes (parse_int, parse_float, parse_constant, parse_string, parse_object, parse_array) and only the first 3 of those are offered as parameters. The last three fall into a different category, as they are not actually parsers but part of the scanner logic, but the first 3 are simple JSON types, so why keep only 3 parsers plus the 2 additional object hooks instead of providing a full set of parsers (arrays, strings, keys)?

2) The JSONDecoder.__init__ method calls the json.scanner.make_scanner function, so even when subclassing JSONDecoder and modifying some attributes after calling super().__init__, it will not work; the scanner needs to be rebuilt (see the sketch after this list).

3) make_scanner is implemented in both C (c_make_scanner) and Python (py_make_scanner); the latter is used as a fallback in case the former could not be imported. The behaviour of the C and Python versions IS NOT CONSISTENT.
  - c_make_scanner IGNORES JSONDecoder's parse_string attribute. This also applies to the parse_array and parse_object attributes.
  - py_make_scanner ONLY uses it for JSON object values; keys have json.decoder.scanstring hardcoded.

4) ONLY make_scanner IS BEING "EXPORTED" (__all__ = ['make_scanner']), so learning about the existence of the two versions requires digging deep into json's code. This also applies to json.decoder's scanstring, JSONObject and JSONArray.
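
A quick way to observe points 2) and 3), assuming the current CPython json internals described above:

    import json
    import json.scanner
    from json.decoder import scanstring

    def shouty(s, end, strict=True):
        value, end = scanstring(s, end, strict)
        return value.upper(), end

    dec = json.JSONDecoder()
    dec.parse_string = shouty
    print(dec.decode('{"key": "value"}'))
    # {'key': 'value'} - the scanner built in __init__ (normally the C one)
    #                    never looks at the parse_string attribute

    dec.scan_once = json.scanner.py_make_scanner(dec)
    print(dec.decode('{"key": "value"}'))
    # {'key': 'VALUE'} - only values are affected; keys still go through the
    #                    hardcoded json.decoder.scanstring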


The second point would be solved by providing all the needed params, as that would mean you don't need to modify the attributes after calling JSONDecoder.__init__. This makes more sense than moving the make_scanner call out of the __init__ method, as it is clearly part of the initialization. It has to be noted, however, that moving the make_scanner call from __init__ to the raw_decode method, despite making less sense, would only be a performance degradation for the default JSONDecoder, as the rest are only used once.

The fourth point would be solved if both the first and the third points are solved, as these methods (c_make_scanner, py_make_scanner, scanstring, JSONObject and JSONArray) would be implementation details and would not be needed by the user, so not exporting them would be the right choice.

So my proposal focuses on fixing the first and third points, keeping in mind that it needs to be backwards compatible:

The process of decoding a JSON string into a Python object can be conceptually divided into two steps: interpreting the characters, and then transforming them into the corresponding Python objects. The first step is what the scanner is doing with the character matching, the number regex, scanstring, JSONObject and JSONArray. The second step is what the parse_int, parse_float, parse_constant, object_hook and object_pairs_hook attributes are for. Dividing these two steps is important, as the first one is an implementation detail and can stay hardcoded (keeping the C and Python versions consistent), while the second one is where the user is given some hooks to slightly modify its behaviour.

Adding additional hooks for arrays, strings and objects' keys would give users every customization tool available. This change, plus refactoring the first step to use names that cannot be confused with these hooks or parsers, would solve all the points described above.

The following files represent an operational version of the json module with these changes applied. encoder.py and tool.py have not been modified.

It has to be taken into account that some C accelerations have been disabled, as the C _json module hasn't been modified and thus differs in either operation or method signature from the new version. If these changes get the community's approval and are going to be applied to the standard library, then in addition to adapting the C _json module to this new version, lines 123 and 311, marked with '# SWAP:', also need to be modified in order to use the C accelerations.
msg309765 - (view) Author: Adrián Orive (Adrián Orive) Date: 2018-01-10 13:53
Second file
msg309766 - (view) Author: Adrián Orive (Adrián Orive) Date: 2018-01-10 13:54
Third file
msg309819 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2018-01-11 18:21
Generally speaking, parsing some things as decimal or datetime are schema dependent. It's unlikely that you would want to parse every string that looks enough like a decimal as a decimal, or that you would want to pay the cost of checking every string in the whole document to see if it's a decimal. This use case is probably better served using something like object_pairs_hook where you have some context available.
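
For example, a schema-aware hook along these lines (the field names are hypothetical) converts only the string fields that are known to carry decimals:

    import json
    from decimal import Decimal

    DECIMAL_FIELDS = {"price", "total"}  # hypothetical schema knowledge

    def pairs_hook(pairs):
        return {k: Decimal(v) if k in DECIMAL_FIELDS and isinstance(v, str) else v
                for k, v in pairs}

    json.loads('{"item": "book", "price": "12.99"}', object_pairs_hook=pairs_hook)
    # {'item': 'book', 'price': Decimal('12.99')}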

Ultimate flexibility is not the goal of this interface. It has grown a bit too much of that over time. At this point I'm a lot more interested in proposals that remove options rather than add them.

In order to provide maximal flexibility it would be much nicer to have a streaming interface available (like SAX for XML parsing), but that is not what this is.
msg309820 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-11 18:34
I concur with Bob.
msg413338 - (view) Author: Tobias Oberstein (oberstet) Date: 2022-02-16 16:09
> It's unlikely that you would want to parse every string that looks enough like a decimal as a decimal, or that you would want to pay the cost of checking every string in the whole document to see if it's a decimal.

fwiw, yes, that's what I do, and yes, it needs to check every string

https://github.com/crossbario/autobahn-python/blob/bc98e4ea5a2a81e41209ea22d9acc53258fb96be/autobahn/wamp/serializer.py#L410

> Returning a decimal as a string is becoming quite common in REST APIs to ensure there is no floating point errors.

exactly. it is simply required if money values are involved.

since JSON doesn't have a native Decimal, strings need to be used (the only scalar JSON type that can encode the needed arbitrary-precision decimals)
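
For instance (illustrative only), a money value survives a JSON roundtrip exactly when it is carried as a string:

    import json
    from decimal import Decimal

    amount = Decimal("19.99")
    payload = json.dumps({"amount": str(amount)})      # '{"amount": "19.99"}'
    restored = Decimal(json.loads(payload)["amount"])  # Decimal('19.99'), exact
    # by contrast, json.dumps({"amount": float(amount)}) goes through binary floats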

CBOR has a tagged decimal fraction encoding, as described in RFC 7049, section 2.4.3.

fwiw, we've added roundtrip and crosstrip testing between CBOR <=> JSON in our hacked Python JSON, and it works

https://github.com/crossbario/autobahn-python/blob/bc98e4ea5a2a81e41209ea22d9acc53258fb96be/autobahn/wamp/test/test_wamp_serializer.py#L235
msg413378 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-02-17 02:41
> Generally speaking, parsing some things as decimal or datetime are schema dependent.

Totally agree with this.

> In order to provide maximal flexibility it would be much nicer to have a streaming interface available (like SAX for XML parsing), but that is not what this is.

I think it is too difficult and complicated.
I think a post-processing approach (e.g. dataclass_json, pydantic) is enough.
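
A minimal sketch of that post-processing approach (no decoder hooks, just converting fields after json.loads; the field name is hypothetical):

    import json
    from decimal import Decimal

    raw = json.loads('{"price": "12.99"}')
    order = {"price": Decimal(raw["price"])}  # convert after parsing, per schema
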
History
Date                 User              Action  Args
2022-04-11 14:58:44  admin             set     github: 74178
2022-02-17 02:41:38  methane           set     nosy: + methane; messages: + msg413378
2022-02-16 16:09:44  oberstet          set     messages: + msg413338
2018-01-11 18:34:01  serhiy.storchaka  set     messages: + msg309820
2018-01-11 18:21:12  bob.ippolito      set     messages: + msg309819
2018-01-10 13:54:12  Adrián Orive      set     files: + __init__.py; messages: + msg309766
2018-01-10 13:53:47  Adrián Orive      set     files: + decoder.py; messages: + msg309765
2018-01-10 13:52:34  Adrián Orive      set     files: + scanner.py; nosy: + Adrián Orive; messages: + msg309763
2017-05-02 15:46:42  bob.ippolito      set     messages: + msg292782
2017-05-02 00:27:25  Levi Cameron      set     nosy: + Levi Cameron; messages: + msg292707
2017-04-05 16:00:18  bob.ippolito      set     messages: + msg291186
2017-04-05 15:25:10  rhettinger        set     components: + Library (Lib); versions: + Python 3.7
2017-04-05 15:10:03  rhettinger        set     assignee: bob.ippolito; messages: + msg291180; nosy: + bob.ippolito
2017-04-05 09:58:50  oberstet          set     messages: + msg291173
2017-04-05 09:38:40  serhiy.storchaka  set     nosy: + rhettinger, ezio.melotti, serhiy.storchaka; messages: + msg291172
2017-04-05 08:22:05  oberstet          create