classification
Title: extend json.tool --json-lines to ignore empty rows
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ZeD, bob.ippolito, ezio.melotti, lukasz.langa, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2021-11-29 15:49 by ZeD, last changed 2021-12-11 10:59 by serhiy.storchaka.

Pull Requests
URL Status Linked
PR 29858 open ZeD, 2021-11-30 07:10
Messages (6)
msg407289 - Author: Vito De Tullio (ZeD) * Date: 2021-11-29 15:49
It would be useful for json.tool to tolerate empty lines when handling JSON Lines input.
Generally speaking, this tolerance is already present in parsers such as srsly and jsonlines.

Actual behavior:

# happy scenario
$ echo -e '{"foo":1}\n{"bar":2}' | python3.10 -mjson.tool --json-lines
{
    "foo": 1
}
{
    "bar": 2
}
$

# spurious EOL at EOF
$ echo -e '{"foo":1}\n{"bar":2}\n' | python3.10 -mjson.tool --json-lines
{
    "foo": 1
}
{
    "bar": 2
}
Expecting value: line 2 column 1 (char 1)
$

# two groups of "rows" in jsonl <- my current usecase
$ echo -e '{"foo":1}\n\n{"bar":2}' | python3.10 -mjson.tool --json-lines
{
    "foo": 1
}
Expecting value: line 2 column 1 (char 1)
$


My desired outcome is to preserve the EOLs, so as to get something like:

# happy scenario
$ echo -e '{"foo":1}\n{"bar":2}' | python3.10 -mjson.tool --json-lines
{
    "foo": 1
}
{
    "bar": 2
}
$

# spurious EOL at EOF
$ echo -e '{"foo":1}\n{"bar":2}\n' | python3.10 -mjson.tool --json-lines
{
    "foo": 1
}
{
    "bar": 2
}

$

# two groups of "rows" in jsonl
$ echo -e '{"foo":1}\n\n{"bar":2}' | python3.10 -mjson.tool --json-lines
{
    "foo": 1
}

{
    "bar": 2
}
$
msg407898 - Author: Alex Waygood (AlexWaygood) * (Python triager) Date: 2021-12-07 08:17
I am changing the "version" field to 3.11, as enhancement proposals are generally only considered for unreleased versions of Python.
msg407904 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-12-07 09:40
Both JSON Lines (https://jsonlines.org/) and Newline Delimited JSON (http://ndjson.org/) formats require that Each Line is a Valid JSON Value.

If you want to ignore empty lines you can filter them out with `sed /^$/d`.
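The same filtering can be done in Python before formatting. This is only a minimal sketch, not part of json.tool: `format_json_lines` is a hypothetical helper name, and unlike `sed /^$/d` it also skips whitespace-only lines.

```python
import json


def format_json_lines(lines, indent=4):
    """Yield pretty-printed JSON for every non-blank input line.

    Hypothetical helper for illustration; json.tool itself has no such option.
    """
    for line in lines:
        if not line.strip():
            # Blank (or whitespace-only) line: drop it instead of parsing it.
            continue
        yield json.dumps(json.loads(line), indent=indent)


for chunk in format_json_lines(['{"foo":1}\n', '\n', '{"bar":2}\n']):
    print(chunk)
```

Any line that is neither blank nor valid JSON still raises json.JSONDecodeError, so genuinely malformed input is not hidden.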
msg408005 - Author: Vito De Tullio (ZeD) * Date: 2021-12-08 09:44
My final goal is to preserve the empty lines. I think I can do some bash magic, but it would probably be something more complex than a simple sed call.

FWIW on https://jsonlines.org/#line-separator-is-n I see "The last character in the file may be a line separator, and it will be treated the same as if there was no line separator present.".
And on https://github.com/ndjson/ndjson-spec#32-parsing I see "The parser MAY silently ignore empty lines, e.g. \n\n. This behavior MUST be documented and SHOULD be configurable by the user of the parser.".

While I understand that this choice is in a grey area, I think this is a known "dialect" of the JSONL specs.
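For illustration, the "preserve empty lines" behavior described in this issue could look roughly like the sketch below. It is an assumption about the desired semantics, not the code from the linked pull request, and `format_preserving_blanks` is a hypothetical name.

```python
import json


def format_preserving_blanks(lines, indent=4):
    """Pretty-print each JSON line, passing blank lines through unchanged.

    Hypothetical helper; it only sketches the behavior requested in this issue.
    """
    for line in lines:
        if not line.strip():
            # Keep the empty row so groups of records stay visually separated.
            yield ""
        else:
            yield json.dumps(json.loads(line), indent=indent)


for chunk in format_preserving_blanks(['{"foo":1}\n', '\n', '{"bar":2}\n']):
    print(chunk)
```

With the two-group input from the original report, this prints the first object, an empty line, and then the second object.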
msg408273 - Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-12-10 23:59
I agree with Serhiy that in general the fact that json.tool crashes here is useful: similarly to an exception in Python code, it can inform the user that some data they feed to json.tool is invalid.

At the same time, I find it a bit obnoxious that the current implementation doesn't allow for the *final character* of the input to be a newline (or "characters" in case of \r\n... but it should still only ignore *a single effective newline*).

I mean, if the user starts spewing newlines in the middle of the file... that might easily be an error. If the file ends with 5 empty lines, that might easily be an error. But, if the file really is:

{'line': 1, 'data': ...}\n
{'line': 2, 'data': ...}\n
{'line': 3, 'data': ...}\n

I think that should be pragmatically accepted by json.tool, especially since many text editors now add newline characters at file ends.
msg408293 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-12-11 10:59
The current implementation allows for the final character of the input to be a newline. It does not allow double newlines. In the original example 

   echo -e '{"foo":1}\n{"bar":2}\n'

the echo command adds a newline to the output (which already contains the trailing newline), so the result ends with two newlines. Use the -n option to stop echo from adding a newline.
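The difference can be reproduced in Python. This sketch assumes that each input line is passed to json.loads on its own, which is a simplification of what json.tool does with --json-lines; `try_parse` is a hypothetical helper.

```python
import io
import json

payload = '{"foo":1}\n{"bar":2}\n'  # what `echo -n -e ...` would emit
echoed = payload + "\n"             # plain `echo -e ...` appends one more newline


def try_parse(text):
    """Parse text line by line; return (parsed values, error message or None)."""
    values = []
    for line in io.StringIO(text):
        try:
            values.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Stop at the first line that is not a valid JSON value.
            return values, str(exc)
    return values, None


print(try_parse(payload))  # single trailing newline: no error
print(try_parse(echoed))   # extra empty line: reports a decode error
```

A single trailing newline does not produce an extra line, so only the doubled newline triggers the error.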

I am afraid that if we add support for empty lines, we will soon get requests to support comments, encoding cookies, single-quoted strings, unquoted keys, hexadecimal integers and other possible JSON extensions.
History
Date User Action Args
2021-12-11 10:59:28 serhiy.storchaka set messages: + msg408293
2021-12-10 23:59:03 lukasz.langa set nosy: + lukasz.langa; messages: + msg408273
2021-12-08 09:54:22 AlexWaygood set nosy: - AlexWaygood
2021-12-08 09:44:53 ZeD set messages: + msg408005
2021-12-07 09:40:01 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg407904
2021-12-07 08:17:48 AlexWaygood set nosy: + rhettinger, AlexWaygood, ezio.melotti, bob.ippolito; messages: + msg407898; versions: + Python 3.11, - Python 3.10
2021-11-30 07:10:36 ZeD set keywords: + patch; stage: patch review; pull_requests: + pull_request28086
2021-11-29 15:49:18 ZeD create