classification
Title: Python 3 shelve.DbfilenameShelf is generating 164 times larger files than Python 2.7 when storing dicts
Type: resource usage Stage:
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jeffrey.Kintscher, Paweł Miech
Priority: normal Keywords:

Created on 2020-07-08 07:32 by Paweł Miech, last changed 2020-08-07 04:50 by Jeffrey.Kintscher.

Files
File name Uploaded Description
test_anydbm.py Paweł Miech, 2020-07-08 07:32 script to reproduce problem
test_anydbm.py Paweł Miech, 2020-07-08 10:51
Messages (2)
msg373284 - Author: Paweł Miech (Paweł Miech) Date: 2020-07-08 07:32
I'm porting some code from Python 2.7 to Python 3.8. Some of it uses shelve.DbfilenameShelf to store nested dictionaries that contain sets. I found that, compared with Python 2.7, Python 3.8 shelve generates files that are approximately 164 times larger on disk. The Python 3.8 file is 2,027,520 bytes, while the Python 2.7 file is 12,288 bytes.

Code sample:
Filename: test_anydbm.py

#!/usr/bin/env python
import datetime
import shelve
import sys
import time
from os import path


def main():
    print(sys.version)
    fname = 'shelf_test_{}'.format(datetime.datetime.now().isoformat())
    bucket = shelve.DbfilenameShelf(fname, "n")
    now = time.time()
    limit = 1000
    key = 'some key > some key > other'
    top_dict = {}
    to_store = {
        1: {
            'page_item_numbers': set(),
            'products_on_page': None
        }
    }
    for i in range(limit):
        to_store[1]['page_item_numbers'].add(i)
        top_dict[key] = to_store
        bucket[key] = top_dict
    end = time.time()
    db_file = False
    try:
        fsize = path.getsize(fname)
    except Exception as e:
        print("file not found? {}".format(e))
        try:
            fsize = path.getsize(fname + '.db')
            db_file = True
        except Exception as e:
            print("file not found? {}".format(e))
            fsize = None
    print("Stored {} in {} filesize {}".format(limit, end - now, fsize))
    print(fname)
    bucket.close()
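    bucket = shelve.DbfilenameShelf(fname, flag="r")
    if db_file:
        fname += '.db'
    print("In file {} {}".format(fname, len(list(bucket.items()))))
    bucket.close()


if __name__ == '__main__':
    # Entry point: without this call the script defines main() but prints nothing.
    main()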

Output when running it in a Docker image:

Dockerfile:
FROM python:2-jessie
VOLUME /scripts
CMD scripts/test_anydbm.py

2.7.16 (default, Jul 10 2019, 03:39:20) 
[GCC 4.9.2]
Stored 1000 in 0.0814290046692 filesize 12288
shelf_test_2020-07-08T07:26:23.778769
In file shelf_test_2020-07-08T07:26:23.778769 1


So the file size here is 12,288 bytes.

And now running the same thing in Python 3:

Dockerfile:

FROM python:3.8-slim-buster
VOLUME /scripts
CMD scripts/test_anydbm.py

3.8.3 (default, Jun  9 2020, 17:49:41) 
[GCC 8.3.0]
Stored 1000 in 0.02681446075439453 filesize 2027520
shelf_test_2020-07-08T07:27:18.068638
In file shelf_test_2020-07-08T07:27:18.068638 1

Notice the file size: 2,027,520 bytes.

Why is this happening? Is this a bug? If I'd like to fix it, do you have any ideas about what causes it?
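
One way to narrow down the cause might be to check which dbm backend each interpreter picked, since shelve delegates storage to whatever the dbm package (anydbm in Python 2) finds. Below is a minimal sketch for Python 3, assuming the standard library's dbm.whichdb is enough to identify the backend. The helper name describe_shelf and the file name shelf_probe are made up for illustration and are not part of the attached script.

```python
import dbm
import glob
import os
import shelve
import tempfile


def describe_shelf(fname):
    """Return (backend_name, total_bytes) for the shelf stored under fname.

    Some backends add a suffix (.db, .dat, .dir, ...) or create several
    files, so every file starting with fname is counted.
    """
    backend = dbm.whichdb(fname)  # e.g. 'dbm.gnu', 'dbm.ndbm' or 'dbm.dumb'
    paths = glob.glob(glob.escape(fname) + "*")
    total = sum(os.path.getsize(p) for p in paths)
    return backend, total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fname = os.path.join(tmp, "shelf_probe")  # hypothetical file name
        with shelve.open(fname, "n") as shelf:
            shelf["key"] = {1: {"page_item_numbers": set(range(1000))}}
        print(describe_shelf(fname))
```

If the two Docker images report different backends, part of the size difference may come from the storage format rather than from shelve itself.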
msg373306 - Author: Paweł Miech (Paweł Miech) Date: 2020-07-08 10:51
OK, so I see this is an issue that involves the way pickle serializes Python set objects. An updated script to reproduce it is attached. Apparently, sets become much larger when pickled in Python 3.
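
To check the pickle theory separately from the dbm layer, the pickled size of the same payload can be measured directly. This is only a sketch, not the attached script: the names payload and pickled_sizes are illustrative, and it reports the sizes for every protocol the running interpreter supports without assuming which one shelve uses by default.

```python
import pickle

# Same shape as the value stored in the reproduction script.
payload = {
    1: {
        "page_item_numbers": set(range(1000)),
        "products_on_page": None,
    }
}


def pickled_sizes(obj):
    """Map each supported pickle protocol number to the pickled size in bytes."""
    return {
        proto: len(pickle.dumps(obj, protocol=proto))
        for proto in range(pickle.HIGHEST_PROTOCOL + 1)
    }


sizes = pickled_sizes(payload)
for proto, size in sorted(sizes.items()):
    print("protocol {}: {} bytes".format(proto, size))
```

Running this under both interpreters and comparing the numbers with the on-disk file sizes should show how much of the growth pickle alone can account for. shelve also accepts a protocol argument, so a specific protocol can be forced when opening the shelf.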
History
Date User Action Args
2020-08-07 04:50:12 Jeffrey.Kintscher set nosy: + Jeffrey.Kintscher
2020-07-08 10:51:18 Paweł Miech set files: + test_anydbm.py; messages: + msg373306
2020-07-08 07:32:46 Paweł Miech create