
classification
Title: GzipFile.write should be buffered
Type: performance Stage:
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9, Python 3.8, Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: marcelm, rhpvorderman
Priority: normal Keywords:

Created on 2021-10-06 08:42 by rhpvorderman, last changed 2022-04-11 14:59 by admin.

Messages (1)
msg403289 - Author: Ruben Vorderman (rhpvorderman) Date: 2021-10-06 08:42
Please consider the following code snippet:

    import gzip
    import sys

    with gzip.open(sys.argv[1], "rt") as in_file_h:
        with gzip.open(sys.argv[2], "wt", compresslevel=1) as out_file_h:
            for line in in_file_h:
                # Do processing on line here
                modified_line = line
                # End processing
                out_file_h.write(modified_line)

This is very slow because write is called once for every line. The current implementation of write is here:
https://github.com/python/cpython/blob/c379bc5ec9012cf66424ef3d80612cf13ec51006/Lib/gzip.py#L272

It:
- Checks if the file is not closed
- Checks if the correct mode is set
- Checks if the file is not closed (again, but in a different way)
- Checks if the data is bytes, bytearray or something that supports the buffer protocol
- Gets the length
- Compresses the data
- Updates the size and offset
- Updates the checksum

Doing all of this for every line written is very costly and adds a lot of overhead in Python calls. We spend a lot of time in Python and much less in the fast C zlib code that does the actual compression.
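To get a rough feel for this overhead, the sketch below times one write call per line against a single write of the joined data. It uses binary mode and an in-memory io.BytesIO sink so that only GzipFile.write is measured. The function names and the test data are made up for illustration, and the timings will vary with the machine and the Python version.

```python
import gzip
import io
import time


def write_per_line(lines, compresslevel=1):
    """Call GzipFile.write once per line, as in the loop in the snippet above."""
    sink = io.BytesIO()
    with gzip.GzipFile(fileobj=sink, mode="wb", compresslevel=compresslevel) as f:
        for line in lines:
            f.write(line)
    return sink.getvalue()


def write_joined(lines, compresslevel=1):
    """Join all lines first so that GzipFile.write is called only once."""
    sink = io.BytesIO()
    with gzip.GzipFile(fileobj=sink, mode="wb", compresslevel=compresslevel) as f:
        f.write(b"".join(lines))
    return sink.getvalue()


# Made-up test data: many short lines, similar to a line-oriented text file.
lines = [b"line %d of some made-up test data\n" % i for i in range(100_000)]

for func in (write_per_line, write_joined):
    start = time.perf_counter()
    compressed = func(lines)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed:.3f} s, {len(compressed)} compressed bytes")
```

Both functions produce a stream that decompresses to the same data, so any difference in run time comes from how often write is called rather than from the amount of data compressed.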

This problem is already solved on the read side. A _GzipReader object is used for reading. It is wrapped in an io.BufferedReader, which serves as the underlying buffer for GzipFile.read. This way, lines are read quite fast from a GzipFile without the checksum etc. being updated on every line read.
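The effect of that layering can be shown without touching gzip internals. In the sketch below, CountingRawReader is a hypothetical stand-in for a raw decompressing reader (it is not the real _GzipReader); it only counts how often the raw layer is asked for data while lines are read through an io.BufferedReader.

```python
import io


class CountingRawReader(io.RawIOBase):
    """Hypothetical raw reader that counts how often readinto is called."""

    def __init__(self, data):
        self._source = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # A real reader would decompress and update the checksum here.
        self.calls += 1
        return self._source.readinto(b)


data = b"".join(b"line %d\n" % i for i in range(10_000))
raw = CountingRawReader(data)
buffered = io.BufferedReader(raw)

# Iterating the buffered reader yields one line at a time.
n_lines = sum(1 for _ in buffered)
print(f"{n_lines} lines read with {raw.calls} calls into the raw reader")
```

The raw layer is called once per buffer refill instead of once per line, which is why the per-call bookkeeping stays cheap on the read side.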

A similar solution should be written for write.
I volunteer (I have done some other work on gzip.py already), although I cannot give an ETA at this time.
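
Until write is buffered inside GzipFile itself, one possible workaround is to put an io.BufferedWriter in front of the GzipFile from user code. This is only a sketch of that idea, not the proposed patch: open_buffered_gzip is a hypothetical helper, and the 128 KiB buffer size is an arbitrary assumption.

```python
import gzip
import io
import os
import tempfile


def open_buffered_gzip(path, compresslevel=1, buffer_size=128 * 1024):
    """Hypothetical helper: binary gzip writer with a buffer in front of GzipFile."""
    raw = gzip.GzipFile(path, mode="wb", compresslevel=compresslevel)
    # Small writes collect in the buffer; GzipFile.write only runs when it is flushed.
    return io.BufferedWriter(raw, buffer_size=buffer_size)


path = os.path.join(tempfile.mkdtemp(), "example.txt.gz")
lines = [b"record %d\n" % i for i in range(50_000)]

# Closing the BufferedWriter flushes it and then closes the GzipFile.
with open_buffered_gzip(path) as out_file_h:
    for line in lines:
        out_file_h.write(line)

# Read the file back to check that nothing was lost.
with gzip.open(path, "rb") as in_file_h:
    round_trip = in_file_h.read()
print(round_trip == b"".join(lines))
```

For text output, an io.TextIOWrapper can be stacked on top of the returned object.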
History
Date User Action Args
2022-04-11 14:59:50 admin set github: 89550
2021-10-06 08:59:49 marcelm set nosy: + marcelm
2021-10-06 08:42:56 rhpvorderman set type: performance; components: + Library (Lib); versions: + Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10, Python 3.11
2021-10-06 08:42:34 rhpvorderman create