Intern certain integral floats for memory savings and performance #58589

Closed
kristjanvalur mannequin opened this issue Mar 21, 2012 · 20 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@kristjanvalur
Mannequin

kristjanvalur mannequin commented Mar 21, 2012

BPO 14381
Nosy @rhettinger, @jcea, @mdickinson, @pitrou, @kristjanvalur, @voidspace, @ericsnowcurrently, @serhiy-storchaka
Files
  • internfloat.patch
  • internfloat.patch: Second patch, with using only 0.0 and 1.0
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-03-26.14:56:36.998>
    created_at = <Date 2012-03-21.17:47:22.712>
    labels = ['interpreter-core', 'type-feature']
    title = 'Intern certain integral floats for memory savings and performance'
    updated_at = <Date 2012-04-20.13:41:16.698>
    user = 'https://github.com/kristjanvalur'

    bugs.python.org fields:

    activity = <Date 2012-04-20.13:41:16.698>
    actor = 'kristjan.jonsson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-03-26.14:56:36.998>
    closer = 'kristjan.jonsson'
    components = ['Interpreter Core']
    creation = <Date 2012-03-21.17:47:22.712>
    creator = 'kristjan.jonsson'
    dependencies = []
    files = ['24986', '24994']
    hgrepos = []
    issue_num = 14381
    keywords = ['patch']
    message_count = 20.0
    messages = ['156502', '156504', '156509', '156564', '156566', '156568', '156569', '156571', '156578', '156623', '156637', '156677', '156825', '158709', '158712', '158752', '158800', '158809', '158811', '158827']
    nosy_count = 8.0
    nosy_names = ['rhettinger', 'jcea', 'mark.dickinson', 'pitrou', 'kristjan.jonsson', 'michael.foord', 'eric.snow', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = None
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue14381'
    versions = ['Python 3.3']

    @kristjanvalur
    Mannequin Author

    kristjanvalur mannequin commented Mar 21, 2012

Michael Foord reminded me of this issue recently. It was discussed on python-dev a few years back and met with limited enthusiasm. I can say from live production experience with EVE that this simple change saved us a lot of memory. Integral floating-point numbers are surprisingly common, falling out of mathematical operations and appearing when reading tabular data; 0.0 and 1.0 in particular.
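In pure-Python terms, the interning being proposed amounts to a small table of shared float objects consulted at creation time. A minimal sketch (the `intern_float` name and the cached range here are illustrative, not from the patch):

```python
# Illustrative sketch only -- the real patch does this in C inside float
# object creation. A fixed table of shared floats for small integral values.
_CACHE = {v: float(v) for v in range(-10, 11)}  # empirically chosen range

def intern_float(x):
    """Return a shared float object for cached integral values."""
    f = float(x)
    if f.is_integer():
        cached = _CACHE.get(int(f))
        if cached is not None:
            # Shared object; note this conflates -0.0 with 0.0, since
            # int(-0.0) == 0 -- a subtlety the thread comes back to.
            return cached
    return f  # everything else gets a fresh float as usual
```

With this in place, `intern_float("1.0") is intern_float(1)` holds, while non-integral values pass through untouched.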

    @kristjanvalur kristjanvalur mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Mar 21, 2012
    @serhiy-storchaka
    Member

Doesn't this decrease performance?

Integral floats falling out of mathematical floating-point calculations seems unlikely. I suggest interning integral floats only in the str -> float and int -> float conversions. An exception can be made for zero, which can result from operations with zero and which is simple to test for.

    @mdickinson
    Member

    This looks like a duplicate of bpo-4024.

    @kristjanvalur
    Mannequin Author

    kristjanvalur mannequin commented Mar 22, 2012

    Yes, there is a measurable performance decrease in pybench arithmetic tests.

    Integers don't fall out of arithmetic that often, true. But integral floats are incredibly common in tabular data. In a game such as Eve Online, configuration data contains a lot of 0.0, 1.0, -1.0 and so on. This patch saved us many megabytes on the server. I can check again...

    >>> [sys.getrefcount(float(i)) for i in range(-10, 11)]
    [777, 4, 38, 9, 215, 691, 627, 185, 98, 603, 73180, 62111, 8326, 6225, 6357, 11737, 2906, 1393, 3142, 1145, 5601]
    >>> sum([sys.getrefcount(float(i)) for i in range(-10, 11)])
    185340
    >>> 
    This is on an idle server.  A server with lots of stuff going on will have this:
    [16715, 184, 1477, 34, 1505, 27102, 3878, 1344, 6248, 974, 595889, 313062, 124072, 120054, 65585, 138667, 13265, 2499, 15677, 3175, 24821]
    >>> sum([sys.getrefcount(float(i)) for i in range(-10, 11)])
    1465155

About half of the interned floats are 0.0.
On a 64-bit machine, with each float taking 24 bytes, this is about 35 MB net.

An alternative could be to add a function for manual interning of floating-point data, which one could use e.g. when reading tabular data.
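The 35 MB figure checks out against the busy-server refcount survey quoted above. (The list as printed actually sums to slightly more than the separately printed total, which is unsurprising when the two calls run on a live server; the estimate lands near 35 MB either way.)

```python
# Re-derive the memory estimate from the busy-server refcounts above.
refcounts = [16715, 184, 1477, 34, 1505, 27102, 3878, 1344, 6248, 974,
             595889, 313062, 124072, 120054, 65585, 138667, 13265, 2499,
             15677, 3175, 24821]
total_refs = sum(refcounts)   # float objects that interning would collapse
                              # into 21 shared ones
bytes_per_float = 24          # size of a CPython float on a 64-bit build
saved_mb = total_refs * bytes_per_float / 1e6
print(total_refs, round(saved_mb, 1))
```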

    @pitrou
    Member

    pitrou commented Mar 22, 2012

    Integers don't fall out of arithmetic that often, true. But integral
    floats are incredibly common in tabular data. In a game such as Eve
    Online, configuration data contains a lot of 0.0, 1.0, -1.0 and so
    on. This patch saved us many megabytes on the server.

    Can't you do your own interning when reading configuration data? You can even do it in pure Python.
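For instance, a data loader can dedupe floats as it parses, so each distinct value is allocated once per file rather than once per cell. A sketch of that idea (the `load_rows` helper is hypothetical, not from the thread):

```python
# Hypothetical pure-Python loader that interns floats while reading
# comma-separated rows: repeated values like 0.0 and 1.0 end up as one
# shared object instead of one allocation per cell.
def load_rows(lines, cache=None):
    if cache is None:
        cache = {}
    rows = []
    for line in lines:
        row = []
        for field in line.split(","):
            f = float(field)
            # setdefault keeps the first object seen for each value; note
            # that -0.0 would collapse onto 0.0 here, since the two hash
            # and compare equal as dict keys.
            row.append(cache.setdefault(f, f))
        rows.append(row)
    return rows
```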

    If CPython starts interning some integral floats, I think only {-1.0, -0.0, 0.0, 1.0} should apply. The rest is too application-specific.

    @kristjanvalur
    Mannequin Author

    kristjanvalur mannequin commented Mar 22, 2012

    The -10..11 range was determined empirically. As you see from the values, only -10 shows up as significant... In fact, just storing 0 to 5 would capture the bulk of the savings.

Right, I'll do some tests with the hardcoded values you mentioned.
Btw, I don't think -0.0 is worth it; that value is a typical arithmetic result and not something you typically get from input data.

Also, while interning in Python is certainly possible, perhaps it would make sense to implement it in C. Perhaps in the math module, or perhaps the 'intern' builtin could even be extended for this purpose? It could be documented to intern any immutable object that it sees fit to intern, leaving the details to the implementation.

    @pitrou
    Member

    pitrou commented Mar 22, 2012

Right, I'll do some tests with the hardcoded values you mentioned.
Btw, I don't think -0.0 is worth it; that value is a typical
arithmetic result and not something you typically get from input data.

    I was suggesting -0.0 in the hope that it might make the patch a bit
    simpler, but I was probably wrong: you still have to test the float's
    sign to distinguish between the two values.
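Concretely, the sign test is needed because equality cannot tell the two zeros apart; in Python terms (the helper name is illustrative):

```python
import math

def is_positive_zero(f):
    # 0.0 == -0.0 is True, so an equality check alone cannot pick out
    # +0.0; copysign exposes the sign bit that == ignores.
    return f == 0.0 and math.copysign(1.0, f) > 0.0

print(is_positive_zero(0.0), is_positive_zero(-0.0))  # True False
```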

Also, while interning in Python is certainly possible, perhaps it
would make sense to implement it in C.

    I think it makes sense in the float implementation if it doesn't
    significantly decrease performance. I suggest you post the benchmark
    numbers (see http://hg.python.org/benchmarks/ ).

    Adding a separate builtin/stdlib function would be overkill IMO.

    @serhiy-storchaka
    Member

Try moving your changes from PyFloat_FromDouble to PyFloat_FromString. Look at memory and performance.

    @kristjanvalur
    Mannequin Author

    kristjanvalur mannequin commented Mar 22, 2012

Interning only in PyFloat_FromString() would be too restrictive. In our case, for example, the floats in question don't come from text files.

I'm adding a new file with the changes suggested by Antoine: only 0.0 and 1.0 are interned. It turns out that the "is it integral" test is expensive, and special-casing it to check for 0.0 and 1.0 with direct float comparison makes the performance hit go away. Here are the figures:

    D:\pydev\hg\cpython2>pcbuild10\python.exe Tools\pybench\pybench.py -s interned8 -c normal
    -------------------------------------------------------------------------------
    PYBENCH 2.1
    -------------------------------------------------------------------------------

    • using CPython 3.3.0a1+ (default, Mar 22 2012, 15:39:50) [MSC v.1600 32 bit (Intel)]
    • disabled garbage collection
    • system check interval set to maximum: 2147483647
    • using timer: time.clock

    -------------------------------------------------------------------------------
    Benchmark: interned8
    -------------------------------------------------------------------------------

    Rounds: 10
    Warp:   10
    Timer:  time.clock
    
    Machine Details:
       Platform ID:    Windows-7-6.1.7601-SP1
       Processor:      Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
    
    Python:
       Implementation: CPython
       Executable:     D:\pydev\hg\cpython2\pcbuild10\python.exe
       Version:        3.3.0a1+
       Compiler:       MSC v.1600 32 bit (Intel)
       Bits:           32bit
       Build:          Mar 22 2012 15:39:50 (#default)
       Unicode:        UCS4
    

    -------------------------------------------------------------------------------
    Comparing with: normal
    -------------------------------------------------------------------------------

    Rounds: 10
    Warp:   10
    Timer:  time.clock
    
    Machine Details:
       Platform ID:    Windows-7-6.1.7601-SP1
       Processor:      Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
    
    Python:
       Implementation: CPython
       Executable:     D:\pydev\hg\cpython2\pcbuild10\python.exe
       Version:        3.3.0a1+
       Compiler:       MSC v.1600 32 bit (Intel)
       Bits:           32bit
       Build:          Mar 22 2012 13:53:45 (#default)
       Unicode:        UCS4
    

Test                          minimum run-time          average run-time
                              this / other / diff       this / other / diff
-------------------------------------------------------------------------------
    BuiltinFunctionCalls: 59ms 59ms -0.7% 60ms 60ms -0.3%
    BuiltinMethodLookup: 40ms 40ms -0.2% 40ms 40ms -0.2%
    CompareFloats: 47ms 46ms +2.3% 47ms 46ms +2.3%
    CompareFloatsIntegers: 125ms 126ms -0.8% 129ms 131ms -2.0%
    CompareIntegers: 76ms 76ms -0.2% 76ms 76ms -0.3%
    CompareInternedStrings: 60ms 60ms -0.0% 61ms 61ms +0.2%
    CompareLongs: 44ms 44ms -0.4% 44ms 44ms -0.4%
    CompareStrings: 59ms 58ms +2.1% 59ms 58ms +1.8%
    ComplexPythonFunctionCalls: 59ms 60ms -1.7% 60ms 61ms -2.5%
    ConcatStrings: 49ms 55ms -10.7% 49ms 55ms -10.6%
    CreateInstances: 67ms 66ms +1.1% 68ms 68ms -0.1%
    CreateNewInstances: 50ms 50ms +1.3% 51ms 50ms +0.4%
    CreateStringsWithConcat: 70ms 71ms -1.3% 70ms 72ms -1.7%
    DictCreation: 45ms 46ms -1.7% 46ms 46ms -1.5%
    DictWithFloatKeys: 81ms 84ms -2.8% 81ms 84ms -2.7%
    DictWithIntegerKeys: 52ms 55ms -5.4% 52ms 55ms -5.1%
    DictWithStringKeys: 47ms 51ms -6.6% 48ms 51ms -6.6%
    ForLoops: 42ms 42ms +0.2% 42ms 42ms +0.2%
    IfThenElse: 63ms 59ms +5.5% 63ms 59ms +5.6%
    ListSlicing: 39ms 39ms +0.0% 39ms 39ms -0.4%
    NestedForLoops: 67ms 67ms -0.2% 68ms 68ms +0.0%
    NestedListComprehensions: 63ms 62ms +1.2% 64ms 63ms +1.2%
    NormalClassAttribute: 102ms 100ms +1.3% 102ms 101ms +1.4%
    NormalInstanceAttribute: 62ms 62ms +0.2% 63ms 63ms +0.2%
    PythonFunctionCalls: 60ms 60ms +0.8% 61ms 60ms +1.3%
    PythonMethodCalls: 74ms 74ms -0.0% 74ms 74ms -0.3%
    Recursion: 103ms 102ms +1.4% 104ms 102ms +1.3%
    SecondImport: 109ms 109ms -0.4% 110ms 110ms -0.6%
    SecondPackageImport: 111ms 112ms -0.6% 113ms 113ms +0.2%
    SecondSubmoduleImport: 154ms 153ms +0.7% 155ms 155ms -0.2%
    SimpleComplexArithmetic: 56ms 54ms +3.7% 57ms 56ms +1.9%
    SimpleDictManipulation: 94ms 93ms +0.3% 94ms 94ms -0.2%
    SimpleFloatArithmetic: 45ms 44ms +3.2% 45ms 45ms +1.7%
    SimpleIntFloatArithmetic: 57ms 56ms +1.8% 57ms 57ms +0.1%
    SimpleIntegerArithmetic: 56ms 56ms +1.2% 57ms 56ms +1.0%
    SimpleListComprehensions: 53ms 52ms +1.8% 54ms 53ms +2.4%
    SimpleListManipulation: 49ms 53ms -8.0% 50ms 53ms -6.7%
    SimpleLongArithmetic: 41ms 40ms +1.6% 41ms 41ms +0.1%
    SmallLists: 66ms 68ms -2.8% 67ms 68ms -2.5%
    SmallTuples: 76ms 72ms +5.5% 76ms 72ms +5.8%
    SpecialClassAttribute: 144ms 140ms +2.8% 145ms 142ms +2.1%
    SpecialInstanceAttribute: 64ms 64ms +0.7% 65ms 64ms +1.1%
    StringMappings: 126ms 132ms -4.2% 127ms 132ms -4.0%
    StringPredicates: 71ms 70ms +2.3% 72ms 70ms +2.5%
    StringSlicing: 66ms 67ms -0.6% 67ms 67ms -1.1%
    TryExcept: 44ms 43ms +1.5% 44ms 44ms +0.7%
    TryFinally: 49ms 53ms -7.4% 49ms 54ms -8.1%
    TryRaiseExcept: 24ms 22ms +8.6% 25ms 23ms +8.9%
    TupleSlicing: 61ms 61ms -0.7% 61ms 62ms -0.5%
    WithFinally: 65ms 65ms -0.3% 65ms 65ms -0.0%
    WithRaiseExcept: 62ms 62ms -0.4% 62ms 63ms -0.6%
    -------------------------------------------------------------------------------
    Totals: 3448ms 3454ms -0.2% 3474ms 3487ms -0.4%

    (this=interned8, other=normal)

    D:\pydev\hg\cpython2>
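In Python terms, the change the benchmark measures is roughly the following (the actual patch does this in C inside PyFloat_FromDouble; the function names here are illustrative):

```python
def should_intern_general(f):
    # First patch: intern any integral value in the cached range.
    # The integrality test costs extra work on every float created.
    return f.is_integer()

def should_intern_fast(f):
    # Second patch: only the two dominant values, via two cheap direct
    # comparisons (note that -0.0 also compares equal to 0.0 here).
    return f == 0.0 or f == 1.0
```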

    @pitrou
    Member

    pitrou commented Mar 22, 2012

    Ok. In the current state it doesn't look detrimental, although I'm not convinced it's that useful either.
    That said, the implementation needs a bit of work, I don't think it's ok for float_is_positive_zero() to use the Py_LONG_LONG casting thing.

    @serhiy-storchaka
    Member

PyBench contains a lot of different tests and doesn't run each of them for very long, so random fluctuations of a few percent are possible. We need a more stable, realistic test that works with floats.

    $ ./python -m timeit 'x,y,d=1,0,0.01
    for i in range(628):
      x,y=x-d*y,y+d*x'

    Without patch: 1000 loops, best of 3: 220 usec per loop

    With patch: 1000 loops, best of 3: 232 usec per loop

The difference is about 5%.

    @rhettinger
    Contributor

    FWIW, I'm -1 on making the change. Any memory savings would be microscopic for most Python programs. And anything that slows down float creation (even a teeny bit) is detrimental to all Python programs. We're better off not adding special case logic unless there is a clear benefit (there isn't).

    @kristjanvalur
    Mannequin Author

    kristjanvalur mannequin commented Mar 26, 2012

    Why do you think it isn't safe, Antoine?
    We only support IEEE platforms, and IEEE defines positive 0.0 to be all bits cleared.

I personally don't think that even a small measurable performance degradation in creating new floats is problematic, since that is not where typical CPU time is spent. But this is the second time I've made this suggestion (the first was on python-dev), and in both cases I've run up against the opinion that it is a "rare" problem, along with worries about "performance." Yet this seemed not to be an issue when ints and longs were unified.

    So, while I was recently prompted to resubmit this suggestion, I feel no need to push the matter further and am just letting it rest here, then. I'm sure someone will think it is a good idea again in the future and can then reopen it.

    @kristjanvalur kristjanvalur mannequin closed this as completed Mar 26, 2012
    @mdickinson
    Member

    Why do you think it isn't safe, Antoine?

    It violates C's strict aliasing rules; Google for 'C strict aliasing' or 'C type punning' for more information. This isn't just a theoretical concern: gcc is known to make optimizations based on the assumption that strict aliasing isn't violated.
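The bit pattern the disputed cast tries to read can be inspected safely from Python with `struct`, which also shows why the test works on IEEE 754 platforms: +0.0 is the all-zero pattern, and -0.0 differs only in the sign bit.

```python
import struct

def bits(f):
    # Reinterpret the 8 bytes of a C double as an unsigned integer --
    # the well-defined analogue of the *(long long *)&f read at issue.
    return struct.unpack("<Q", struct.pack("<d", f))[0]

print(hex(bits(0.0)))   # 0x0
print(hex(bits(-0.0)))  # 0x8000000000000000 (sign bit only)
```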

    @mdickinson
    Member

    We only support IEEE platforms.

    I don't think that's true, BTW.

    @kristjanvalur
    Mannequin Author

    kristjanvalur mannequin commented Apr 19, 2012

    return *(PY_LONG_LONG*)&fval == 0;
    There is no aliasing, because there are no pointer variables in existence.
    If we did this:

double *pfval = &fval;
PY_LONG_LONG *pl = (PY_LONG_LONG *)pfval;
return *pl == 0;

    Then we would have aliasing. Because "pfval" in this example doesn't exist but is merely a temporary, there is no aliasing.

As for IEEE compatibility, I don't think we could have our own floating point formatting library if we didn't make that assumption, but I might be wrong about that. On the other hand, I don't think there is a supported Python architecture that defines positive zero as anything other than all bits zero. And should such a platform be found, it is easy enough to disable this code for it.

    @mdickinson
    Member

Because "pfval" in this example doesn't exist but is merely a temporary, there is no aliasing.

    Maybe aliasing wasn't the right word to use, then. The exact rule being violated is C99 6.5, para. 7 (or C11 6.5 para. 7 if you prefer):

    """
    An object shall have its stored value accessed only by an lvalue expression that has one of the following types: ...

    """

    In any case, I don't think it's relevant whether the values involved are named variables or temporaries.

I don't think we could have our own floating point formatting library if we didn't make that assumption

    There are configure-time checks that determine whether that floating-point formatting library is used or not, including (amongst some other things) whether the underlying floating-point type is IEEE 754. Officially, Python shouldn't assume IEEE 754 floating-point anywhere---the core should build fine without it, and there's lots of code in the core to deal with non-IEEE 754 platforms.

And should such a platform be found, it is easy enough to disable this code for it.

    I think that's the wrong way around: the code should only be enabled in the first place when we're sure that the platform is IEEE 754. That shouldn't be hard to build into the patch, since the checks for IEEE 754 are already all over the place for other purposes.
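For reference, the parameters those checks look for are visible at runtime through `sys.float_info`; on an IEEE 754 binary64 platform they take the familiar values:

```python
import sys

# Standard IEEE 754 double-precision (binary64) parameters.
print(sys.float_info.radix)     # 2
print(sys.float_info.mant_dig)  # 53
print(sys.float_info.max_exp)   # 1024
```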

    @kristjanvalur
    Mannequin Author

    kristjanvalur mannequin commented Apr 20, 2012

    Interesting.
I declare that this rule does not apply here, since the code is a deliberate hack: we are pretending that a certain address points to an integer and examining that integer.
If you insist on following the standard, you could do

double cmp = 0;
return memcmp(&cmp, &fval, sizeof(fval)) == 0;

but on all real machines this is the same as:

PY_LONG_LONG cmp = 0;
return memcmp(&cmp, &fval, sizeof(fval)) == 0;

which again is the same as
return *(PY_LONG_LONG*)&fval == 0;
technically speaking, even if the standard doesn't agree. You could implement this with inline assembly and so cheerfully sidestep the standard.

When you're hacking, you're hacking, and the standard isn't your friend :)

As for IEEE, sure, whichever way that check is oriented is fine, although in this day and age I find it rather amusing that the logic treats IEEE support as the exception rather than the rule.

    Anyway, this proposal has been rejected due to lack of interest or some misguided idea of performance, so the point is moot.

    @mdickinson
    Member

    I declare that this rule does not apply here ...

    Clearly the gcc developers disagree. :-)

    iwasawa:~ mdickinson$ cat test2.c
    int is_positive_zero(double f) {
      return *(long long*)&f == 0;
    }
    iwasawa:~ mdickinson$ gcc -fstrict-aliasing -O3 -Wall -Wextra -Wstrict-aliasing -c test2.c
test2.c: In function ‘is_positive_zero’:
    test2.c:2: warning: dereferencing type-punned pointer will break strict-aliasing rules

    @kristjanvalur
    Mannequin Author

    kristjanvalur mannequin commented Apr 20, 2012

    Dang. I yield!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022