
classification
Title: json.dumps - allow compression
Type: enhancement
Stage: resolved
Components:
Versions: Python 2.7

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.smith, liad100, rushter
Priority: normal
Keywords:

Created on 2018-08-13 11:34 by liad100, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (8)
msg323475 - Author: liad (liad100) Date: 2018-08-13 11:34
The list of arguments of json.dump() can be seen here: https://docs.python.org/2/library/json.html

Notice that there is no way to apply compression.

For example, pandas allows you to do:
        df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
                  compression='gzip',
                  quoting=QUOTE_NONNUMERIC)

I want to be able to compress when I do:
    with open('products.json', 'w') as outfile:
        json.dump(data, outfile, sort_keys=True)


Please add the ability to compress using json.dump()
msg323476 - Author: (rushter) Date: 2018-08-13 11:49
You can use the gzip module.

import gzip, json

with gzip.GzipFile('products.json', 'w') as outfile:
    outfile.write(json.dumps(data, sort_keys=True).encode('utf-8'))
msg323477 - Author: liad (liad100) Date: 2018-08-13 12:24
The gzip module may work for saving a file locally, but consider, for example, this snippet that uploads JSON to Google Storage:

import datalab.storage as storage
storage.Bucket('mybucket').item(path).write_to(json.dumps(response), 'application/json')

Your suggestion won't work here unless I save the file locally and only then upload it... That's a bit of a problem when your files are 100 GB+.

I still think json.dump() should support compression.
msg323515 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2018-08-14 12:49
There are too many types of compression for this to be built into json. There's zlib, gzip, bzip2, zip, and no doubt dozens of others. How would we choose one, or several? Which would we leave out? How would we pass in the many parameters to all of these compression algorithms?

As rushter points out, if your current compression library only compresses a file on disk, you should switch to a streaming compressor, like zlib.compressobj or gzip.GzipFile.

If Google Storage won't let you pass a streaming compressor as a parameter, then that should be a feature request to Google. I suspect you can actually pass a streaming compressor to it, but I haven't investigated their API.
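
A minimal sketch of that approach, reusing the datalab snippet from msg323477: the JSON is gzip-compressed in memory via gzip.GzipFile over a BytesIO buffer, so no temporary file is needed. Whether write_to() accepts raw bytes for a gzip-compressed payload is an assumption about that API, not something verified here.

import gzip
import io
import json

import datalab.storage as storage

# Serialize and gzip-compress entirely in memory -- no file on disk.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(json.dumps(response, sort_keys=True).encode('utf-8'))

# Upload the compressed bytes; assumes write_to() accepts bytes and that
# the consumer knows the payload is gzip-compressed JSON.
storage.Bucket('mybucket').item(path).write_to(buf.getvalue(), 'application/json')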

In any event, compression does not belong in the json library.
msg323517 - Author: liad (liad100) Date: 2018-08-14 13:16
True, there are endless kinds of compression, just like there are endless file formats. Still, there are some built-ins, like conversion from string to JSON. For example, you don't support converting JSON to ORC files. The same argument could have been raised there: how would we choose which conversions to support? Still, a choice was made and some basic conversion behavior is supported.

You are claiming that it's all or nothing, which I don't think is the right approach.

Many are now moving their storage to cloud platforms. The storage is just what it sounds like - storage. It doesn't offer any programming service; what you stream is what you will have. Streaming huge files without compression means bleeding money for no reason. Saving the files to disk, compressing them and then uploading them can be very slow, and the idea is to have machines with big memory and little local storage - if you have to save huge files locally you also need big storage, which costs more money.

Regarding Google, there is a pending request for those who choose to use the GoogleCloudPlatform package, but not everyone uses that:

https://github.com/GoogleCloudPlatform/google-cloud-python/issues/5791

Not to mention that there are dozens of other service providers, so even if Google supports it, that doesn't answer the need for other storage providers.

I still claim that this is a basic, legitimate request that could be handled by the json.dump() function.

gzip is fine. It is also supported by pandas and is well known.
msg323520 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2018-08-14 13:43
If you really want to see this added to Python, then I suggest you put together a proposal on what the API would be, what options you would or wouldn't support, and post that on python-ideas. But I'll warn you that your chances of success are low.

Your better chance of success is to write a wrapper around json and whatever compression library you want to use, and post that to PyPI to see if it gets traction. I believe you can do what you want without a temporary disk file.
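
A rough sketch of such a wrapper; the name dumps_gzip and its signature are made up for illustration, not an existing API, and data stands in for the object from the original report.

import gzip
import io
import json

def dumps_gzip(obj, compresslevel=9, **json_kwargs):
    # Serialize obj to JSON, gzip-compress it in memory, and return bytes.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=compresslevel) as gz:
        gz.write(json.dumps(obj, **json_kwargs).encode('utf-8'))
    return buf.getvalue()

# Compressed, sorted-key JSON ready to be uploaded or written anywhere.
payload = dumps_gzip(data, sort_keys=True)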
msg323522 - Author: liad (liad100) Date: 2018-08-14 14:09
I'm sure I will find a workaround.
I posted this for others who will face the same issue as me.
There are many who use cloud storage, but not many work with PB-size files. This is likely to change in the near future as more and more companies start to process huge amounts of data.

I'm not sure what you mean by designing an API. I think you're scaling this up needlessly. It's simply adding an optional parameter that triggers gzip compression. That's it. Nothing sophisticated.

Something like:

json.dump(data, outfile, sort_keys=True, compression='gzip')

compression - Optional. A string representing the compression to use in the output. The only allowed value is ‘gzip’.
msg323873 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2018-08-22 11:32
I'm not asking to be difficult; I'm asking because a full specification would be required in order to implement this.

For example, you're excluding the "compresslevel" parameter, so presumably you'd want the default of 9, which is the slowest option. I'm not sure this is a good idea. Shouldn't the caller be able to specify this? If not, why not?

In any event, I think the best thing to do is to let the caller use any compression they want, and allow them full control over the compression parameters.
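
For illustration only (this is not part of json): keeping compression on the caller's side leaves those knobs available, e.g. a streaming zlib.compressobj (mentioned earlier in this thread) consuming the encoder's output chunk by chunk with an explicit level, with data again standing in for the object from the original report.

import json
import zlib

# Explicit level 6 (the gzip module defaults to 9, the slowest);
# wbits = 16 + 15 selects the gzip container format.
compressor = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
chunks = []
for piece in json.JSONEncoder(sort_keys=True).iterencode(data):
    chunks.append(compressor.compress(piece.encode('utf-8')))
chunks.append(compressor.flush())
compressed = b''.join(chunks)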

Unless, of course, there's any documented standard for json compression (even a de facto standard would be interesting to know about).
History
Date                 User        Action  Args
2022-04-11 14:59:04  admin       set     github: 78574
2018-08-22 11:32:52  eric.smith  set     messages: + msg323873
2018-08-14 14:09:12  liad100     set     messages: + msg323522
2018-08-14 13:43:58  eric.smith  set     messages: + msg323520
2018-08-14 13:16:41  liad100     set     messages: + msg323517
2018-08-14 12:49:15  eric.smith  set     status: open -> closed; nosy: + eric.smith; messages: + msg323515; resolution: wont fix; stage: resolved
2018-08-13 12:24:08  liad100     set     messages: + msg323477
2018-08-13 11:49:24  rushter     set     nosy: + rushter; messages: + msg323476
2018-08-13 11:34:23  liad100     create