Correct sizeof support for StringIO #59695

serhiy-storchaka · 2012-07-29T19:53:37Z

BPO	15490
Nosy	@loewis, @jcea, @pitrou, @benjaminp, @serhiy-storchaka
Files	stringio_sizeof-3.3_3.patch: Patch for 3.3 stringio_sizeof-2.7_3.patch: Patch for 2.7

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = None
created_at = <Date 2012-07-29.19:53:36.961>
labels = ['type-bug', 'library', 'expert-IO']
title = 'Correct __sizeof__ support for StringIO'
updated_at = <Date 2013-11-30.20:16:31.239>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2013-11-30.20:16:31.239>
actor = 'alexandre.vassalotti'
assignee = 'serhiy.storchaka'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'IO']
creation = <Date 2012-07-29.19:53:36.961>
creator = 'serhiy.storchaka'
dependencies = []
files = ['27240', '27242']
hgrepos = []
issue_num = 15490
keywords = ['patch', 'needs review']
message_count = 16.0
messages = ['166807', '166808', '166812', '166813', '166814', '166865', '166868', '167010', '167034', '168242', '170565', '170569', '170594', '170599', '170604', '170863']
nosy_count = 7.0
nosy_names = ['loewis', 'jcea', 'pitrou', 'benjamin.peterson', 'stutzbach', 'aliles', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue15490'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

serhiy-storchaka · 2012-07-29T19:53:37Z

Here is a patch that implements correct __sizeof__ for C implementation of io.StringIO.

I haven't tested the patch on narrow build.

pitrou · 2012-07-29T19:58:29Z

Here is a patch that implements correct __sizeof__ for C implementation of io.StringIO.

For some value of "correct", since the internal accumulator could hold
alive some unicode strings.

I haven't tested the patch on narrow build.

There's no narrow build anymore on 3.3.

serhiy-storchaka · 2012-07-29T20:54:42Z

For some value of "correct", since the internal accumulator could hold
alive some unicode strings.

This is not a concern of __sizeof__, because these lists and strings are
managed by GC.

There's no narrow build anymore on 3.3.

I mean 2.7 and 3.2. I have written test for such case, but can't check if it
correct.

pitrou · 2012-07-29T21:04:07Z

> For some value of "correct", since the internal accumulator could hold
> alive some unicode strings.

This is not a concern of __sizeof__, because these lists and strings are
managed by GC.

It is, since they are private and not known by the user.

loewis · 2012-07-29T21:11:23Z

The question really is what memory blocks can "leak out" of the object, and those don't really belong to the container. Those shouldn't be accounted for, since somebody else may get hold of them, and pass them to sys.sizeof.

For the PyAccu, AFAICT, objects cannot leak out of it (except for gc.getobjects in debug mode). So I think the memory allocated by the accu really belongs to the StringIO, and needs to be accounted there. The same goes for readnl, writenl, and weakreflist.

On a related note, I wonder why the tp_clear and tp_traverse functions for stringio are so short. Shouldn't it also visit "decoder"?

serhiy-storchaka · 2012-07-30T08:14:11Z

For the PyAccu, AFAICT, objects cannot leak out of it (except for gc.getobjects in debug mode).

Not only in debug mode.

>>> import io, gc
>>> s=io.StringIO()
>>> s.write('12345')
5
>>> s.write('qwerty')
6
>>> for o in gc.get_objects():
...     if "'123" in repr(o) and len(repr(o)) < 1000:
...         print(type(o), repr(o))
... 
<class 'list'> ['12345', 'qwerty']
<class 'list'> ['o', 'gc', 'get_objects', 'repr', 'o', "'123", 1000, 'len', 'repr', 'o', 'print', 'type', 'o', 'repr', 'o']
<class 'tuple'> ("'123", 1000, None)

Someone can summarize sys.getsizeof() for all managed by GC objects and therefore count internal objects twice.

I think the standard library should provide a method for recursive calculation of memory (in some sense, perhaps using different strategies), but that should wait up to 3.4. __sizeof__ should count non-object memory that is not available for GC. As I understand it.

loewis · 2012-07-30T09:01:17Z

> For the PyAccu, AFAICT, objects cannot leak out of it (except for
> gc.getobjects in debug mode).

Not only in debug mode.

I see. I meant sys.getobjects, which is in debug mode only, but
I now see that gc.get_objects will get the list (but not the strings)
of the Accu.

That still leaves readnl and writenl.

I think the standard library should provide a method for recursive
calculation of memory (in some sense, perhaps using different
strategies), but that should wait up to 3.4.

Actually, this is (and should be) a separate project:

http://guppy-pe.sourceforge.net/

__sizeof__ should count non-object memory that is not available for
GC. As I understand it.

I think you misunderstand slightly; the GC relevance is only a side
effect. __sizeof__ should only account for memory that isn't separately
accessible. The notion of "separately accessible" is somewhat weak, since
it may depend on the container.

Non-PyObject blocks clearly need to be accounted for.

PyObject blocks normally don't need to be accounted for, as they are typically
accessible separately, by the following means:

gc.get_objects (for GC objects)
gc.get_referents (for contained non-GC objects)

There are also irregular ways to get objects:

in debug mode only: sys.getobjects
customer access functions or attributes

For memory accounting, it would really be best if the latter two categories
wouldn't exist, i.e. if all non-GC objects were still visible through
tp_traverse.

serhiy-storchaka · 2012-07-31T15:52:33Z

Do I understand correctly that for each Python subobject must be one of the following:

or it is visited in tp_traverse;
or it is accounted in __sizeof__ together with all its subobjects (if they cannot leak out)?

PyAccu can contains not only Unicode objects, but at least an instance of a Unicode subtype, which can has a reference to StringIO object, creating a cycle. Does this mean that PyAccu should be visited in tp_traverse and not be accounted in __sizeof__? Or should we tighten the control and ensure that PyAccu contains only exact Unicode objects?

pitrou · 2012-07-31T19:34:54Z

Does this mean that PyAccu should be visited in tp_traverse and not be
accounted in __sizeof__? Or should we tighten the control and ensure
that PyAccu contains only exact Unicode objects?

IMO, the latter.

serhiy-storchaka · 2012-08-14T20:57:32Z

Patch updated. Now private pyobjects (readnl, accu) counted.

Note all three patches rather different.

serhiy-storchaka · 2012-09-16T18:40:01Z

It would be good if someone looked at the patches. I'm not sure that they are good enough. Perhaps they are too complicated and we shouldn't worry about duplicates?

pitrou · 2012-09-16T18:52:08Z

Well, I do think that it's not very important to make __sizeof__ strictly exact according to some specification. I'm also not sure we want to complicate maintenance when the calculation becomes tedious or difficult.

To me __sizeof__ is mostly an educational tool which gives a hint at the memory impact of an object, but cannot really convey a precise information except for trivial atomic types.

loewis · 2012-09-17T11:18:29Z

I disagree that sizeof cannot work well for variable-sized types. It works very well for strings, lists, tuple, dicts, and other "regular" containers. I agree that it is not important that it is absolutely correct (in some sense) for every object, but it shouldn't lose "big" chunks of data. A bug where it misses four bytes is much less important than a bug where it misses N bytes (for an object-specific value N that can grow indefinitely).

As for the specific patch, I don't think any action should be taken before the 3.3 release. I would personally prefer if the computations where done in Py_ssize_t, not PyObject* (i.e. the result of the recursive call should be unwrapped).

serhiy-storchaka · 2012-09-17T12:26:17Z

It can't work well if we count internal Python objects that can be shared. This is because the "work well" concept is not well defined. And because the implementation of a certain defined calculation algorithm can be completely unmaintainable, that is not well.

If we wrote in the StringIO the same 1 MB string twice, should the __sizeof__() return about 1) 2 MB, 2) 1 MB or 3) size of empty stream if there are external links to this shared string? Patch implements the second strategy, it can be simplified to the first one or complicated even more to a third one. Even more complication will need to take into account the sharing of eol string ('\r' and '\n' always shared, '\r\n' may be).

I would personally prefer if the computations where done in Py_ssize_t, not PyObject*

I too. But on platforms with 64-bit pointers and 32-bit sizes we can allocate total more than PY_SIZE_MAX bytes (hey, I remember the DOS memory models with 16-bit size_t and 32-bit pointers). Even faster we get an overflow if allow the repeated counting of shared objects. What to do with overflow? Return PY_SIZE_MAX or ignore the possibility of errors?

loewis · 2012-09-17T12:51:15Z

Am 17.09.2012 14:26, schrieb Serhiy Storchaka:

> I would personally prefer if the computations where done in
> Py_ssize_t, not PyObject*

I too. But on platforms with 64-bit pointers and 32-bit sizes we can
allocate total more than PY_SIZE_MAX bytes (hey, I remember the DOS
memory models with 16-bit size_t and 32-bit pointers). Even faster we
get an overflow if allow the repeated counting of shared objects.
What to do with overflow? Return PY_SIZE_MAX or ignore the
possibility of errors?

It can never overflow. We cannot allocate more memory than SIZE_MAX;
this is (mostly) guaranteed by the C standard. I don't know whether
you deliberately brought up the obscure case of 64-bit pointers and
32-bit sizes. If there are such systems, we don't support them.

serhiy-storchaka · 2012-09-21T06:51:15Z

Patches updated. Now the computations done in size_t.

Note that the patches for the different versions differ.

serhiy-storchaka added stdlib Python modules in the Lib dir topic-IO type-bug An unexpected behavior, bug, or error labels Jul 29, 2012

serhiy-storchaka self-assigned this Aug 5, 2012

serhiy-storchaka removed their assignment Aug 15, 2012

serhiy-storchaka self-assigned this Dec 29, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correct sizeof support for StringIO #59695

Correct sizeof support for StringIO #59695

serhiy-storchaka commented Jul 29, 2012

serhiy-storchaka commented Jul 29, 2012

pitrou commented Jul 29, 2012

serhiy-storchaka commented Jul 29, 2012

pitrou commented Jul 29, 2012

loewis mannequin commented Jul 29, 2012

serhiy-storchaka commented Jul 30, 2012

loewis mannequin commented Jul 30, 2012

serhiy-storchaka commented Jul 31, 2012

pitrou commented Jul 31, 2012

serhiy-storchaka commented Aug 14, 2012

serhiy-storchaka commented Sep 16, 2012

pitrou commented Sep 16, 2012

loewis mannequin commented Sep 17, 2012

serhiy-storchaka commented Sep 17, 2012

loewis mannequin commented Sep 17, 2012

serhiy-storchaka commented Sep 21, 2012

Correct __sizeof__ support for StringIO #59695

Correct __sizeof__ support for StringIO #59695

Comments

serhiy-storchaka commented Jul 29, 2012

serhiy-storchaka commented Jul 29, 2012

pitrou commented Jul 29, 2012

serhiy-storchaka commented Jul 29, 2012

pitrou commented Jul 29, 2012

loewis mannequin commented Jul 29, 2012

serhiy-storchaka commented Jul 30, 2012

loewis mannequin commented Jul 30, 2012

serhiy-storchaka commented Jul 31, 2012

pitrou commented Jul 31, 2012

serhiy-storchaka commented Aug 14, 2012

serhiy-storchaka commented Sep 16, 2012

pitrou commented Sep 16, 2012

loewis mannequin commented Sep 17, 2012

serhiy-storchaka commented Sep 17, 2012

loewis mannequin commented Sep 17, 2012

serhiy-storchaka commented Sep 21, 2012

Correct sizeof support for StringIO #59695

Correct sizeof support for StringIO #59695