Message 154645 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Mark.Shannon
Recipients	Jim.Jewett, Mark.Shannon, benjamin.peterson, giampaolo.rodola, gregory.p.smith, jcea, jcon, pitrou, pjenvey, rhettinger, terry.reedy, vstinner
Date	2012-02-29.16:24:25
SpamBayes Score	2.220446e-16
Marked as misclassified	No
Message-id	<4F4E513C.2010906@hotpy.org>
In-reply-to	<1330528537.29.0.642752292673.issue13903@psf.upfronthosting.co.za>

Content
Jim Jewett wrote: > Jim Jewett <jimjjewett@gmail.com> added the comment: > > As of Feb 28, 2012, the PEP mentions an additional optimization of storing the values in an array indexed by (effectively) key insertion order, rather than key position. ("Alternative Implementation") > > It states that this would reduce memory usage for the values array by 1/3. 1/3 is a worst-case measurement; average is 1/2. (At savings of less than 1/3, the keys would resize, to initial savings of 2/3. And yes, that means in practice, the average savings would be greater than half, because the frequency of dicts of size N decreases with N.) > I presume you mean to allocate a values array of size == actual keys rather than size == usable keys. Making the values array smaller than the the number of usable keys causes a number of issues: 1. The values_size will need to be kept in the dict, not in the keys, which would make the dict larger for size 3 or 5, the same size for size 2 or 4, but it would save memory for sizes 6-8 (see below). So it might not save memory at all, it depends on the distribution of the sizes of shared-key dicts. 2. dk_usable would need to be reverted to dk_fill (c.f ma_fill) in order to quickly allocate values arrays. This would slow down the resizing check, which is now done before insertion, so might be slower. (not really a problem, but it might slow down inserts) 3. An extra check needs to be performed for each read to ensure the index is in-bounds (again not really a problem, but it might slow down reads) Comparative sizes of values array (including size field): Items Proposed Alternative Other Alternative Delta (size == usuable) (size == keys (+1)) 1 4 3 2 -1 2 4 3 3 0 3 4 3 4 +1 4 8 5 5 0 5 8 5 6 +1 6 16 10 7 -3 7 16 10 8 -2 8 16 10 9 -1 9 16 10 10 0 10 16 10 11 +1 12 32 21 13 -8 14 32 21 15 -6 The memory savings of the two alternative implementations are broadly similar. Clearly, either of the alternatives are attractive in terms of memory use. I think it is definitely worth investigating further, but I would like to get the proposed implementation into CPython first. > It states that the keys table would need an additional "values_size" field, but in the absence of dummies, this is just ma_used. values_size != ma_used values_size is the size of the values array, not the number of values. Don't forget deletions or the case where items are inserted in a different order from that of another dict sharing the same keys. Although there are no dummies, (key != NULL, value == NULL) is a legal pair representing a missing or deleted value. Cheers, Mark.

Jim Jewett wrote:
> Jim Jewett <jimjjewett@gmail.com> added the comment:
> 
> As of Feb 28, 2012, the PEP mentions an additional optimization of storing the values in an array indexed by (effectively) key insertion order, rather than key position. ("Alternative Implementation")
> 
> It states that this would reduce memory usage for the values array by 1/3.  1/3 is a worst-case measurement; average is 1/2.  (At savings of less than 1/3, the keys would resize, to initial savings of 2/3.  And yes, that means in practice, the average savings would be greater than half, because the frequency of dicts of size N decreases with N.)
> 

I presume you mean to allocate a values array of size == actual keys
rather than size == usable keys.

Making the  values array smaller than the the number of usable keys
causes a number of issues:

1. The values_size will need to be kept in the dict, not in the keys, 
which would make the dict larger for size 3 or 5, the same size for
size 2 or 4, but it would save memory for sizes 6-8 (see below).
So it might not save memory at all, it depends on the distribution of 
the sizes of shared-key dicts.
2. dk_usable would need to be reverted to dk_fill (c.f ma_fill) in order
to quickly allocate values arrays. This would slow down the resizing 
check, which is now done before insertion, so might be slower.
(not really a problem, but it might slow down inserts)
3. An extra check needs to be performed for each read to ensure the 
index is in-bounds
(again not really a problem, but it might slow down reads)

Comparative sizes of values array (including size field):

Items Proposed    Alternative    Other Alternative    Delta
                (size == usuable) (size == keys (+1))
   1      4           3                 2              -1	
   2      4           3                 3               0
   3      4           3                 4              +1
   4      8           5                 5               0
   5      8           5                 6              +1
   6     16          10                 7              -3
   7     16          10                 8              -2
   8     16          10                 9              -1
   9     16          10                10               0
  10     16          10                11              +1
  12     32          21                13              -8
  14     32          21                15              -6

The memory savings of the two alternative implementations are broadly
similar.

Clearly, either of the alternatives are attractive in terms of memory 
use. I think it is definitely worth investigating further, but I would 
like to get the proposed implementation into CPython first.

> It states that the keys table would need an additional "values_size" field, but in the absence of dummies, this is just ma_used.

values_size != ma_used
values_size is the size of the values array, not the number of values.

Don't forget deletions or the case where items are inserted in a 
different order from that of another dict sharing the same keys.
Although there are no dummies, (key != NULL, value == NULL) is a legal
pair representing a missing or deleted value.

Cheers,
Mark.

History
Date	User	Action	Args
2012-02-29 16:24:26	Mark.Shannon	set	recipients: + Mark.Shannon, rhettinger, terry.reedy, gregory.p.smith, jcea, pitrou, vstinner, giampaolo.rodola, pjenvey, benjamin.peterson, jcon, Jim.Jewett
2012-02-29 16:24:26	Mark.Shannon	link	issue13903 messages
2012-02-29 16:24:25	Mark.Shannon	create