This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: intern filenames in bytecode
Type: resource usage Stage:
Components: Interpreter Core Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Mike.Solomon, benjamin.peterson, jcea, nadeem.vawda, python-dev, vstinner
Priority: normal Keywords: patch

Created on 2011-05-26 19:30 by Mike.Solomon, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
codegen.patch Mike.Solomon, 2011-05-26 19:30 review
unnamed Mike.Solomon, 2011-05-28 00:51
Messages (8)
msg136997 - (view) Author: Mike Solomon (Mike.Solomon) Date: 2011-05-26 19:30
I work on a large app and we noticed that a surprising portion of our heap was filenames embedded in the bytecode.

This one-line patch to intern filenames reduces our on-disk size by about 15% and brings down our heap and in-memory object count by a similar percentage.
msg137000 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2011-05-26 19:47
How exactly does it bring down your disk space?
msg137010 - (view) Author: Mike Solomon (Mike.Solomon) Date: 2011-05-26 22:17
If you have a file with, say, a hundred functions, and each function contains
the full path of that file on disk, your pyc file will contain about
(100*(path_size+overhead)) bytes. In some cases, this is pretty
significant.
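
The estimate above can be sketched in Python. This is only an illustration, not a measurement of any real pyc file: the path, the generated module, and the 5-byte per-string overhead (`ASSUMED_OVERHEAD`) are all assumptions made up for the example.

```python
import types


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


# Hypothetical file path and a generated module with 100 trivial functions.
path = "/srv/app/lib/python/some/deeply/nested/package/module.py"
source = "\n".join(f"def f{i}(): pass" for i in range(100))
module_code = compile(source, path, "exec")

# One code object per function, plus one for the module itself.
n_code_objects = sum(1 for _ in iter_code_objects(module_code))

# Assumed per-string serialization overhead in bytes; not a measured value.
ASSUMED_OVERHEAD = 5
estimate = n_code_objects * (len(path.encode("utf-8")) + ASSUMED_OVERHEAD)

print(n_code_objects, "code objects; roughly", estimate,
      "bytes of filenames if each code object stores its own copy")
```

The longer the path and the more functions per file, the larger the share of the file taken up by repeated filenames.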

msg137024 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2011-05-27 03:47
2011/5/26 Mike Solomon <report@bugs.python.org>:
>
> Mike Solomon <msolo@gmail.com> added the comment:
>
> If you have a file with say a hundred functions, and each function contains
> the full path of that file on disk, your pyc file will contain about
> (100*(path_size+overhead)) bytes. In some cases, this is pretty
> significant.

I see. The support for saving interned strings no longer exists in Py3
where this "feature" would have to be added.
msg137052 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-27 14:07
New changeset 27359a4e0f8c by Benjamin Peterson in branch 'default':
try to use the same str object for all code filenames when compiling or unmarshalling (#12190)
http://hg.python.org/cpython/rev/27359a4e0f8c
msg137053 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2011-05-27 14:10
As you can see, I've implemented a similar solution in 3.3. It should give the same memory savings, but not the disk space savings. (Those would require reintroducing the marshal feature for interned strings.)
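
One way to observe the in-memory effect on CPython 3.3 or later is to compile a small module and check whether its code objects refer to one filename object. This is a minimal sketch, not part of the changeset: the source text, the path, and the `iter_code_objects` helper are hypothetical, and it only looks at compilation, not at unmarshalling.

```python
import types


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


# Hypothetical module source: two top-level functions, one with a nested function.
source = (
    "def outer():\n"
    "    def inner():\n"
    "        return 1\n"
    "    return inner\n"
    "\n"
    "def other():\n"
    "    return 2\n"
)
module_code = compile(source, "/hypothetical/path/to/module.py", "exec")

filenames = [co.co_filename for co in iter_code_objects(module_code)]
# Count distinct str objects by identity, not by value.
distinct_objects = {id(name) for name in filenames}

print(len(filenames), "code objects,", len(distinct_objects), "filename object(s)")
```

If the filenames are shared, the number of distinct objects stays at one no matter how many functions the module defines.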
msg137100 - (view) Author: Mike Solomon (Mike.Solomon) Date: 2011-05-28 00:51
The in-memory fix is really the most important - the disk space was a bonus
and an easy metric to gather.

Unfortunately, our app won't be upgrading to Python 3.x.

msg137101 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2011-05-28 00:56
Okay, I'll close.
History
Date User Action Args
2022-04-11 14:57:17 admin set github: 56399
2011-05-28 00:56:40 benjamin.peterson set status: open -> closed; resolution: fixed; messages: + msg137101
2011-05-28 00:51:35 Mike.Solomon set files: + unnamed; messages: + msg137100
2011-05-27 14:10:03 benjamin.peterson set messages: + msg137053
2011-05-27 14:07:47 python-dev set nosy: + python-dev; messages: + msg137052
2011-05-27 03:47:27 benjamin.peterson set messages: + msg137024
2011-05-26 22:20:13 neologix set files: - unnamed
2011-05-26 22:17:38 Mike.Solomon set files: + unnamed; messages: + msg137010
2011-05-26 21:41:45 jcea set nosy: + jcea
2011-05-26 19:51:34 pitrou set nosy: + vstinner; type: performance -> resource usage; versions: + Python 3.3, - Python 2.6, Python 2.7
2011-05-26 19:47:42 benjamin.peterson set nosy: + benjamin.peterson; messages: + msg137000
2011-05-26 19:40:27 nadeem.vawda set nosy: + nadeem.vawda
2011-05-26 19:30:12 Mike.Solomon create