This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: JSON streaming
Type: enhancement Stage:
Components: Library (Lib) Versions:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: phr, serhiy.storchaka
Priority: normal Keywords:

Created on 2020-05-14 10:35 by phr, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description Edit
jsonstream.py phr, 2020-05-14 10:35 json stream reading function
jsonstream.py phr, 2020-05-14 10:44 same as above but with explanatory comment added
Messages (5)
msg368823 - (view) Author: paul rubin (phr) Date: 2020-05-14 10:35
This is a well-explored issue in other contexts: https://en.wikipedia.org/wiki/JSON_streaming

There is also a patch for it in json.tool, for release in 3.9: https://bugs.python.org/issue31553

Basically it's often convenient to have a file containing a list of json docs, one per line.  However, there is no convenient way to read them back in one by one, since json.load(filehandle) barfs with an "Extra data" error as soon as it hits the second document after the first.

It would be great if the json module itself had a function to handle this.  I have an awful hack that I use myself, that is not suitable for a production library, but I'll attach it to show what functionality I'm suggesting.  I hope this is simple enough to not need a PEP.  Thanks!
msg368824 - (view) Author: paul rubin (phr) Date: 2020-05-14 10:49
Note: the function in my attached file expects no separation at all between the json docs (rather than a newline between them).  That was ok for the application I wrote it for some time back, but I forgot about it when first writing this rfe, so I thought I'd better clarify.
msg368826 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-05-14 11:09
If you want to read json objects encoded one per line (JSON Lines or NDJSON), you can do this with just two lines of code:

    for line in file:
        yield json.loads(line)

This format is not formally standardized, but it is popular because supporting it in any programming language is trivial.

If you want to use a more complex format, I am afraid it is not popular enough to be supported in the stdlib. You can try to search for a third-party library which supports your flavour of multi-object JSON format, or write your own code if this format is specific to your application.
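For what it's worth, the two-line idiom above can be fleshed out into a self-contained helper (the name iter_json_lines is just for illustration, not a proposed API):

```python
import io
import json

def iter_json_lines(file):
    """Yield one parsed JSON document per line (JSON Lines / NDJSON)."""
    for line in file:
        line = line.strip()
        if line:  # tolerate blank lines between documents
            yield json.loads(line)

# Round trip: write two documents, one per line, then read them back.
buf = io.StringIO()
for doc in [{"a": 1}, [1, 2, 3]]:
    buf.write(json.dumps(doc) + "\n")
buf.seek(0)
docs = list(iter_json_lines(buf))  # [{"a": 1}, [1, 2, 3]]
```

This works because json.dumps with default arguments never emits a literal newline, so one line always holds exactly one document.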
msg368827 - (view) Author: paul rubin (phr) Date: 2020-05-14 11:17
It's coming back to me: I think I used the no-separator format because I made the multi-document input files by calling json.dump after opening the file in append mode.  That seems pretty natural.  I figured the wikipedia article and the json.tool patch just released were evidence that there is interest in this.  The approach of writing newlines between the docs and iterating through lines is probably workable, though.  I don't know why I didn't do that before; I might not have been sure that json docs never contain newlines.

Really it would be nice if json.load could read back anything that json.dump could write out (including with the indent parameter), but that's potentially more complicated and might conflict with the json spec.
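One possible sketch of that, using the documented json.JSONDecoder.raw_decode method, which parses a single value from a string and reports the index where it stopped (the helper name iter_concatenated_json is mine, purely for illustration):

```python
import json

def iter_concatenated_json(text):
    """Yield documents from a string of concatenated JSON values,
    tolerating any whitespace between them (so output produced by
    json.dump with indent= also parses)."""
    decoder = json.JSONDecoder()
    idx, n = 0, len(text)
    while idx < n:
        # Skip inter-document whitespace before trying to decode.
        while idx < n and text[idx] in " \t\r\n":
            idx += 1
        if idx >= n:
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

# Documents appended back-to-back, written with indent, still parse:
text = json.dumps({"a": 1}, indent=2) + json.dumps([1, 2], indent=2)
docs = list(iter_concatenated_json(text))  # [{"a": 1}, [1, 2]]
```

This reads the whole input into memory first, so it is only a sketch of the semantics, not of a true streaming reader over a file handle.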
msg368828 - (view) Author: paul rubin (phr) Date: 2020-05-14 11:21
Also, I didn't know about ndjson (I just looked at it, ndjson.org), but its existence and formalization are even more evidence that this is useful.  I'll check what the two different python modules linked from that site do that's different from your example of iterating through the file by lines.
History
Date User Action Args
2022-04-11 14:59:31  admin             set  github: 84803
2020-05-14 11:21:22  phr               set  messages: + msg368828
2020-05-14 11:17:58  phr               set  messages: + msg368827
2020-05-14 11:09:31  serhiy.storchaka  set  nosy: + serhiy.storchaka
                                            messages: + msg368826
2020-05-14 10:49:42  phr               set  messages: + msg368824
2020-05-14 10:44:42  phr               set  files: + jsonstream.py
2020-05-14 10:35:58  phr               create