Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce memory usage for ipaddress object instances #67292

Closed
sbromberger mannequin opened this issue Dec 23, 2014 · 31 comments
Closed

Reduce memory usage for ipaddress object instances #67292

sbromberger mannequin opened this issue Dec 23, 2014 · 31 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@sbromberger
Copy link
Mannequin

sbromberger mannequin commented Dec 23, 2014

BPO 23103
Nosy @malemburg, @rhettinger, @ncoghlan, @pitrou, @bitdancer, @ericsnowcurrently, @serhiy-storchaka, @MojoVampire
Dependencies
  • bpo-23133: Pickling of ipaddress classes
  • Files
  • ipaddress_lightweight.patch
  • ipaddress_lightweight_2.patch
  • ipaddress_lightweight_3.patch
  • ipaddress_lightweight_4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-03-07.21:45:37.543>
    created_at = <Date 2014-12-23.09:57:26.671>
    labels = ['type-feature', 'library']
    title = 'Reduce memory usage for ipaddress object instances'
    updated_at = <Date 2015-03-08.11:44:39.794>
    user = 'https://bugs.python.org/sbromberger'

    bugs.python.org fields:

    activity = <Date 2015-03-08.11:44:39.794>
    actor = 'pitrou'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-03-07.21:45:37.543>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2014-12-23.09:57:26.671>
    creator = 'sbromberger'
    dependencies = ['23133']
    files = ['37539', '37560', '37563', '38047']
    hgrepos = []
    issue_num = 23103
    keywords = ['patch']
    message_count = 31.0
    messages = ['233038', '233039', '233043', '233044', '233049', '233052', '233056', '233057', '233061', '233062', '233064', '233066', '233067', '233069', '233070', '233074', '233083', '233084', '233119', '233171', '233194', '233197', '233200', '233202', '235563', '237443', '237459', '237464', '237466', '237528', '237530']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'rhettinger', 'ncoghlan', 'pitrou', 'pmoody', 'r.david.murray', 'python-dev', 'eric.snow', 'serhiy.storchaka', 'josh.r', 'sbromberger']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue23103'
    versions = ['Python 3.5']

    @sbromberger
    Copy link
    Mannequin Author

    sbromberger mannequin commented Dec 23, 2014

    ipaddress.ip_address instances should be flyweight. There's really no reason to make them mutable:

    >>> a = ipaddress.ip_address("10.1.2.3")
    >>> b = ipaddress.ip_address("10.1.2.3")
    >>> id(a)
    140066533772368
    >>> id(b)
    140066504622544

    Especially with IPv6 and large numbers of addresses, it would be helpful not to have separate instances with the same underlying properties.

    @sbromberger sbromberger mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Dec 23, 2014
    @sbromberger
    Copy link
    Mannequin Author

    sbromberger mannequin commented Dec 23, 2014

    Confirmed on 3.4.0; likely exists in previous versions as well.

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Dec 23, 2014

    What is your proposal? WeakValueDictionary mapping raw bytes object to single instance of ipaddress that is queried from ipaddress's __new__? No built-in has quite that extensive a level of caching and aggressive deduplication to my knowledge.

    @sbromberger
    Copy link
    Mannequin Author

    sbromberger mannequin commented Dec 23, 2014

    What is your proposal? WeakValueDictionary mapping raw bytes object to single instance of ipaddress that is queried from ipaddress's __new__? No built-in has quite that extensive a level of caching and aggressive deduplication to my knowledge.

    I was thinking along those lines, but since v4 addresses are just UInt32 and v6 are UInt128, somehow mapping just the "address" information if we can do that via weakref dicts. Right now these objects can't really be used to represent addresses parsed from logs with tens of millions of entries generated by a couple thousand machines. It would be nice to have a memory-efficient way of doing that without having to duplicate the concept in a third-party package.

    There are a few other changes I'd like to see/make as well - this package is too focused on external functions for manipulating/creating addresses, when standard object/method structure would serve it better and be more intuitive. There is also no discernible need to modify an IP address' value. They should be thought of as immutables to the extent we can enforce that policy.

    @serhiy-storchaka
    Copy link
    Member

    IP address instances already are immutable and flyweight. There are no mutating methods. And a number of address attributes (packed, is_loopback, etc) are calculated on fly from integer representation.

    But IP address objects can be lighter.

    1. Use __slots__.
    2. Every instance has the _version attribute. Why this is not class attribute?
    3. Memory consumption can be even less if IP addresses would int subclasses. But this changes the API (in particular adds the __index__ method) and I doubt that we should do this.

    @sbromberger
    Copy link
    Mannequin Author

    sbromberger mannequin commented Dec 23, 2014

    1. However:
    >>> b = ipaddress.IPv4Address('1.2.3.4')
    >>> a = ipaddress.IPv4Address('1.2.3.4')
    >>> id(a)
    4428380928
    >>> id(b)
    4428381768
    >>> a == b
    True
    >>> b._ip += 6
    >>> id(b)
    4428381768
    >>> b
    IPv4Address('1.2.3.10')
    1. Isn’t _version already a class attribute? It’s set in _Basev4/Basev6.

    2. Memory consumption seems particularly wasteful in the current implementation: it takes 56 bytes (via sys.getsizeof) to store what amounts to 4 bytes of data.

    I thought about subclassing Int as well, and I agree it’s the wrong approach. There are too many int methods that don’t make sense in the context of IP addresses.

    @bitdancer
    Copy link
    Member

    If ipaddress objects are already immutable, it seems to me that having a registry that makes it so there is only one instance per ip value isn't a crazy idea. I've written such code in other contexts and it's not complicated. We have caching of identifier strings and "common" integer constants for a reason...this seems a logical extension of that policy in the realm of a specific library data type.

    The opposite argument is that it may be better left up to the application that has to handle lots of ips to do the caching according to what it knows to be an optimum pattern. So consider me +0 on the idea.

    @sbromberger
    Copy link
    Mannequin Author

    sbromberger mannequin commented Dec 23, 2014

    The opposite argument is that it may be better left up to the application that has to handle lots of ips to do the caching according to what it knows to be an optimum pattern.

    I'd agree with you wholeheartedly if ipaddress wasn't part of stdlib, but the issue is that people generally gravitate to using stdlib over other packages when given the choice, and having something that behaves reasonably* in stdlib makes sense wherever possible.

    *in this case, reasonably means, I think, "I shouldn't have to worry that there's 1300% overhead for using IPv4Address() over a native 32-bit representation of an IP address." This matters little when you're just creating a few of these, but best practices regarding reuse might persuade folks to use these classes for all sorts of things. Doing things sanely at scale by default doesn't necessarily preclude further optimization if some developer thinks it's warranted.

    @rhettinger
    Copy link
    Contributor

    Adding __slots__ is reasonable.

    Adding caching to fold together repeated addresses is dubious (it comes with its own space and time cost but won't benefit all users) and, as David said, the decision on how to handle this is up to the user. Also, it is rather late in the game to be challenging the design of the module. IIRC, this code or some variant of it already had heavy use prior to landing in the standard library (it also took into account what was learned with various other ipaddr modules). The module docs say that the library was designed for speed and makes no mention of optimizing for space.

    For the most part, once an API is deployed, users can be expected to exercise every feature whether intended or not. Right now, you can add attributes to two otherwise indistinct ipaddresses. The OP proposes to break that code. Right now, if a user has concerns about memory use due to duplicate address instances, they can easily created their own interning scheme (a = intern_dict.setdefault(a, a)). If the OP get his wish, then those users will be worse off (due to double interning).

    In other words, this ship has already sailed.

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Dec 23, 2014

    So, just to be clear, checking the implementation (as of 3.4):

    1. All the ipaddress classes leave __slots__ unset. So the overhead is actually higher than 56 bytes per instance; the __dict__ for an IPv4Address (ignoring the actual keys and values in the dict, just looking at the size of the dict itself), on Python 3.4, 64 bit Windows, is 56 + 288 = 344 bytes. Based on the size of the dict, I'm guessing some instance attributes aren't being set eagerly in __init__, so it's not getting the reduced memory usage from shared key dictionaries either.
    2. It looks like as of 3.4, _version and _max_prefixlen were both instance attributes; the current code base seems to have made _max_prefixlen a class attribute, but left _version as an instance attribute (anything set on self is an instance attribute; doesn't matter if it's set in __init__ of a base class)
    3. Even if we switch to using __slots__, remove support for weak references and user added attributes, and store nothing but the raw address encoded as an int, you're never going to get true "flyweight" objects in CPython. The lowest instance cost you can get for a class defined in Python (on a system with 64 bit pointers) inheriting from object, is 40 bytes, plus 8 bytes per slot (plus the actual cost of the int object, which is another 16 bytes for an IPv4 address on a 64 bit OS). If you want it lighter-weight than that, either you need to inherit from another built-in that it even lighter-weight, or you need to implement it in C. If you built IPv4Address on top of int for instance, you could get the instance cost down to 16 bytes. Downside to inheriting from int is that you'll interact with many other objects as if you were an int, which can cause a lot of unexpected behaviors (as others have noted).

    Basically, there is no concept of "flyweight" object in Python. Aside from implementing it in C or inheriting from int, you're going to be using an absolute minimum (on 64 bit build) of 48 bytes for the basic object structure, plus another 16 for the int itself, 64 bytes total, or about 16x the "real" data being stored. Using a WeakValueDictionary would actually require another 8 bytes per instance (a slot for __weakref__), and the overhead of the dictionary would likely be higher than the memory saved (unless you're regularly creating duplicate IP addresses and storing them for the long haul, but I suspect this is a less common use case than processing many different IP addresses once or twice).

    I do think it would be a good idea to use __slots__ properly to get the memory down below 100 bytes per instance, but adding automatic "IP address interning" is probably not worth the effort. In the longer term, a C implementation of ipaddress might be worth it, not for improvements in computational performance, but to get the memory usage down for applications that need to make millions of instances.

    @malemburg
    Copy link
    Member

    Josh, you might want to have a look at the lightning talk on this page and the associated slides...

    http://www.egenix.com/library/presentations/PyCon-UK-2014-When-performance-matters/

    After having done the tests, using __slots__ is what I consider the flyweight pattern in Python.

    @sbromberger
    Copy link
    Mannequin Author

    sbromberger mannequin commented Dec 23, 2014

    As a test, I tried the following (taken mostly from http://codesnipers.com/?q=python-flyweights):

    class Foo(object):
        _Foo = weakref.WeakValueDictionary()
        def __new__(cls, addr):
            obj = Foo._Foo.get(addr, None)
            if obj is None:
                obj = object.__new__(cls)
                Foo._Foo[addr] = obj
                obj.addr = addr
            return obj

    I created 10 million instances of Foo(34) in an array. Total space taken: ~80 MB. Times: CPU times: user 6.93 s, sys: 48.7 ms, total: 6.98 s
    Wall time: 6.98 s

    I then created 10 million instances of a non-flyweight object, assigning an int to an instance variable:

    class Bar(object):
        pass

    Total space taken: ~1.4 GB. Times:
    CPU times: user 7.64 s, sys: 794 ms, total: 8.44 s
    Wall time: 8.44 s

    This corresponds (roughly) to the space taken by 10 million IPAddr objects.

    So it appears, informally, that caching / flyweight results in modest time and significant memory savings.

    I understand that the ship has sailed for a stdlib implementation, but these results are compelling enough for me to create a separate package.

    @sbromberger
    Copy link
    Mannequin Author

    sbromberger mannequin commented Dec 23, 2014

    Sorry for the additional followup, but I re-ran the test with approximately real-world parameters. In this second test, I created 10 million Foo() instances with random values from 1-2000 (using random.randint). This corresponds to "2000 hosts generating 10 million logs".

    Memory use is ~80 MB. Times: CPU times: user 27.5 s, sys: 101 ms, total: 27.6 s Wall time: 27.6 s

    For the nonoptimized run (10m instances of 2000 random values):
    Memory use is ~1.8GB. Times: CPU times: user 28.2 s, sys: 1.29 s, total: 29.5 s Wall time: 29.5 s

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Dec 24, 2014

    Marc-Andre: Oh, I know. I proselytize to coworkers on the virtues of using __slots__ for classes that will have many instances created (particularly since entirely too much of our stuff is still Py2.7, so we don't have the "free" savings from shared key dictionaries). Heck, I particularly enjoy inheriting from an inlined collections.namedtuple and an empty __slots__(which does add 8 bytes to the size over declaring the __slots__ directly, but gets you as close to truly immutable instances as you can get when you need them, not just "we're all adults here" immutable instances).

    I'm just pointing out that if he thinks 56 bytes per object is too large, he's going to be disappointed with what Python has to offer. General purpose user-defined Python objects don't optimize for low memory usage, and even __slots__ only gets you so far.

    @sbromberger
    Copy link
    Mannequin Author

    sbromberger mannequin commented Dec 24, 2014

    I'm just pointing out that if he thinks 56 bytes per object is too large, he's going to be disappointed with what Python has to offer. General purpose user-defined Python objects don't optimize for low memory usage, and even __slots__ only gets you so far.

    "He" thinks that 1300% overhead is a bit too much for a simple object that can be represented by a 32-bit number, and "he" has been using python for several years and understands, generally, what the language has to offer (though not nearly so well as some of the respondents here). It may be in the roundoff at n < 1e5, but when you start using these objects for real-world analysis, the overhead becomes problematic. I note with some amusement that the overhead is several orders of magnitude greater than those of the protocols these objects represent :)

    In any case - I have no issues with the decision not to make these objects memory-efficient in stdlib and consequently with closing this out. Rolling my own package appears to be the best solution in any case.

    Thanks for the discussion.

    @sbromberger sbromberger mannequin closed this as completed Dec 24, 2014
    @serhiy-storchaka
    Copy link
    Member

    Here is a patch which makes IPv4Address and IPv6Address more lightweight (for other classes the size is not so important). It makes _version the class attribute and adds __slots__.

    @ericsnowcurrently
    Copy link
    Member

    @sbromberger, there's no need for your own package. Just use something like this:

    _addr_cache = {}
    
    def ipaddr(addr):
        try:
            return _addr_cache[addr]
        except KeyError:
            _addr_cache[addr] = ipaddress.ipaddress(addr)
            return _addr_cache[addr]

    You could even throw weakrefs in there if your use case demanded it.

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Dec 24, 2014

    Serhiy: I believe you need to add a bunch of __slots__ = () to various base classes in the module, even though they lack member variables. Every link in the inheritance chain must declare __slots__, or the child class will have __dict__ and __weakref__.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 26, 2014

    So, there are several aspects to this request:

    • add a __slots__: this is Serhiy's patch
    • add some weak interning of IP addresses: I believe it is ok to add it as an optimization, provided it doesn't otherwise change the API or the classes' behaviour

    Seth: feel free to write a draft patch of interning, in order to get actual performance/memory consumption numbers. Also, perhaps open a specific issue for it, so we don't have many patches on the same issue?

    @ncoghlan
    Copy link
    Contributor

    I've retitled this issue to be specifically about reducing the memory consumption of the existing types in the IP Address module, as that's a change that isn't easily implemented externally, and shouldn't have any negative side effects under intended usage (although we should probably consider keeping __weakref__ support when adding the __slots__ definitions - I'm OK with taking away arbitrary attribute support, but far more wary of removing the existing weakref support)

    By contrast, interning for even more aggressive memory usage reduction is something that can be implemented relatively easily externally, and is also something that is really hard to tune in the general case. "Small integers" and "strings that look like identifiers" are a win because of the way the language itself works, but there's no similar general purpose heuristic that applies for caching of IP addresses.

    @ncoghlan ncoghlan changed the title ipaddress should be Flyweight Reduce memory usage for ipaddress object instances Dec 29, 2014
    @serhiy-storchaka
    Copy link
    Member

    I believe you need to add a bunch of __slots__ = () to various base classes in the module, even though they lack member variables.

    Done. Here is updated patch.

    I don't think that IP addresses need weak references more than base types as integers or strings. Caching can be implemented without weak references (some caching already is used for networks), and I afraid that supporting weak references list will spent more memory than saved with caching.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 30, 2014

    I don't think that IP addresses need weak references more than base types as integers or strings.

    People may already be taking weak references, so it's a matter of compatibility.
    (and weakrefs can actually help implement an interning scheme as proposed here)

    @serhiy-storchaka
    Copy link
    Member

    Updated patch preserves weak references support.

    @serhiy-storchaka
    Copy link
    Member

    As far as adding __slots__ breaks pickling with protocols < 2, bpo-23133 can be considered as a dependency.

    @serhiy-storchaka
    Copy link
    Member

    Synchronized with tip.

    @serhiy-storchaka
    Copy link
    Member

    Nick, what is your thoughts about the patch?

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 7, 2015

    +1 from me, although since we're committing to preserving the weakref support for compatibility reasons now, I'm wondering if we should also add a test for it.

    @serhiy-storchaka
    Copy link
    Member

    OK. But I hesitate that weakref support is useful for IP addresses.

    @serhiy-storchaka serhiy-storchaka self-assigned this Mar 7, 2015
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 7, 2015

    New changeset 88a5c1698ca4 by Serhiy Storchaka in branch 'default':
    Issue bpo-23103: Reduced the memory consumption of IPv4Address and IPv6Address.
    https://hg.python.org/cpython/rev/88a5c1698ca4

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 8, 2015

    For the record, I don't think the weakref support is likely to be particularly useful either, and if we'd used __slots__ from the beginning, I expect we would have left it out.

    It's specifically the idea of *taking weakref support away* when weakrefs currently work that seems dubious to me.

    It may be worth asking the question on Python dev, as it's possible I'm being overly cautious, and it would be reasonable to drop the weakref support in 3.5.0, while mentioning in the 3.5 porting notes that we'd be open to the idea of adding weakref support back in a 3.5.x maintenance release if anyone reports its removal as a regression from 3.4.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 8, 2015

    -1 on dropping weakref support. It's a feature, and it costs almost nothing.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants