
List object memory allocator #70570

Closed
catalin-manciu mannequin opened this issue Feb 18, 2016 · 17 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

catalin-manciu mannequin commented Feb 18, 2016

BPO 26382
Nosy @terryjreedy, @vstinner, @methane, @florinpapa, @catalin-manciu
Files
  • listobject_CPython3.patch: Patch for CPython 3.x
  • listobject_CPython2.patch: Patch for CPython 2.7.x
  • listobject_CPython2-2.patch
  • listobject_CPython3-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-02-20.11:43:44.353>
    created_at = <Date 2016-02-18.14:26:51.405>
    labels = ['interpreter-core', 'performance']
    title = 'List object memory allocator'
    updated_at = <Date 2017-02-20.14:11:04.401>
    user = 'https://github.com/catalin-manciu'

    bugs.python.org fields:

    activity = <Date 2017-02-20.14:11:04.401>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-02-20.11:43:44.353>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2016-02-18.14:26:51.405>
    creator = 'catalin.manciu'
    dependencies = []
    files = ['41953', '41954', '46063', '46066']
    hgrepos = []
    issue_num = 26382
    keywords = ['patch']
    message_count = 17.0
    messages = ['260459', '260467', '260513', '260514', '260517', '260518', '260548', '260550', '260678', '284173', '284200', '284201', '284221', '284222', '284239', '284242', '288206']
    nosy_count = 6.0
    nosy_names = ['terry.reedy', 'vstinner', 'methane', 'alecsandru.patrascu', 'florin.papa', 'catalin.manciu']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26382'
    versions = ['Python 2.7']

    catalin-manciu mannequin commented Feb 18, 2016

    Hi All,

    This is Catalin from the Server Scripting Languages Optimization Team at Intel Corporation. I would like to submit a patch that replaces the 'malloc' allocator used by the list object (Objects/listobject.c) with the small object allocator (obmalloc.c) and simplifies the 'list_resize' function by removing a redundant check and properly handling resizing to zero.

    Replacing PyMem_* calls with PyObject_* inside the list implementation is beneficial because many PyMem_* calls request sizes that are better handled by the small object allocator. For example, when running Tools/pybench.py -w 1, the list implementation makes a total of 48,295,840 allocation requests (either through 'PyMem_MALLOC' directly or through 'PyMem_RESIZE'), of which 42,581,993 (88%) request sizes that the small object allocator can handle (512 bytes or less).

    The changes to 'list_resize' further improve performance by removing a redundant check and handling the 'resize to zero' case separately. 'PyList_New' defines the 'empty' state of a list as having the 'ob_item' pointer NULL and the 'ob_size' and 'allocated' members equal to 0. Previously, when called with a size of zero, 'list_resize' would set 'ob_size' and 'allocated' to zero, but it would also call 'PyMem_RESIZE', which by design calls 'realloc' with a size of 1, thus allocating an unnecessary byte and pointing 'ob_item' at the newly obtained address. The proposed implementation simply frees the buffer pointed to by 'ob_item' and sets 'ob_size', 'allocated' and 'ob_item' to zero when it receives a 'resize to zero' request.
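    The resize-to-zero behaviour described above can be sketched on a simplified stand-in for the list object (the struct and function below are an illustrative model with field names mirroring listobject.c, not the actual CPython code; the growth formula is also simplified):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Toy stand-in for CPython's list object. */
    typedef struct {
        void **ob_item;
        long ob_size;
        long allocated;
    } toy_list;

    /* New behaviour described in the patch: a request for size 0 frees the
     * buffer and restores the canonical empty state used by PyList_New,
     * instead of going through realloc() for a 1-byte allocation. */
    static int toy_list_resize(toy_list *l, long newsize)
    {
        if (newsize == 0) {
            free(l->ob_item);
            l->ob_item = NULL;
            l->ob_size = 0;
            l->allocated = 0;
            return 0;
        }
        /* Usual over-allocation path (growth formula simplified here). */
        long new_allocated = newsize + (newsize >> 3) + 6;
        void **items = realloc(l->ob_item, new_allocated * sizeof(void *));
        if (items == NULL)
            return -1;
        l->ob_item = items;
        l->ob_size = newsize;
        l->allocated = new_allocated;
        return 0;
    }

    int main(void)
    {
        toy_list l = {NULL, 0, 0};
        assert(toy_list_resize(&l, 10) == 0);
        assert(l.allocated >= 10 && l.ob_item != NULL);
        assert(toy_list_resize(&l, 0) == 0);
        /* Empty state matches what PyList_New produces. */
        assert(l.ob_item == NULL && l.ob_size == 0 && l.allocated == 0);
        return 0;
    }
    ```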

    Hardware and OS Configuration
    =============================
    Hardware: Intel XEON (Haswell-EP) 36 Cores / Intel XEON (Broadwell-EP) 36 Cores

    BIOS settings: Intel Turbo Boost Technology: false
    Hyper-Threading: false

    OS: Ubuntu 14.04.2 LTS

    OS configuration: Address Space Layout Randomization (ASLR) disabled to reduce run-to-run variation, via echo 0 > /proc/sys/kernel/randomize_va_space
    CPU frequency fixed at 2.3 GHz

    GCC version: 5.1.0

    Benchmark: Grand Unified Python Benchmark from
    https://hg.python.org/benchmarks/

    Measurements and Results
    ========================
    A. Repository:
    GUPB Benchmark:
    hg id : 9923b81a1d34 tip
    hg --debug id -i : 9923b81a1d346891f179f57f8780f86dcf5cf3b9

    CPython3:
        hg id : 733a902ac816 tip
        hg id -r 'ancestors(.) and tag()': 737efcadf5a6 (3.4) v3.4.4
        hg --debug id -i : 733a902ac816bd5b7b88884867ae1939844ba2c5
    
    CPython2:
        hg id : 5715a6d9ff12 (2.7)
        hg id -r 'ancestors(.) and tag()': 6d1b6a68f775 (2.7) v2.7.11
        hg --debug id -i : 5715a6d9ff12053e81f7ad75268ac059b079b351
    

    B. Results:
    CPython2 and CPython3 sample results, measured on a Haswell and a Broadwell platform, can be viewed in Tables 1, 2, 3 and 4. The first column (Benchmark) is the benchmark name and the second (%D) is the speedup as a percentage relative to the unpatched version.

    Table 1. CPython 3 results on Intel XEON (Haswell-EP) @ 2.3 GHz

    Benchmark %D
    ----------------------------------
    unpickle_list 20.27
    regex_effbot 6.07
    fannkuch 5.87
    mako_v2 5.19
    meteor_contest 4.31
    simple_logging 3.98
    nqueens 3.40
    json_dump_v2 3.14
    fastpickle 2.16
    django_v3 2.03
    tornado_http 1.90
    pathlib 1.84
    fastunpickle 1.81
    call_simple 1.75
    nbody 1.60
    etree_process 1.58
    go 1.54
    call_method_unknown 1.53
    2to3 1.26
    telco 1.04
    etree_generate 1.02
    json_load 0.85
    etree_parse 0.81
    call_method_slots 0.73
    etree_iterparse 0.68
    call_method 0.65
    normal_startup 0.63
    silent_logging 0.56
    chameleon_v2 0.56
    pickle_list 0.52
    regex_compile 0.50
    hexiom2 0.47
    pidigits 0.39
    startup_nosite 0.17
    pickle_dict 0.00
    unpack_sequence 0.00
    formatted_logging -0.06
    raytrace -0.06
    float -0.18
    richards -0.37
    spectral_norm -0.51
    chaos -0.65
    regex_v8 -0.72

    Table 2. CPython 3 results on Intel XEON (Broadwell-EP) @ 2.3 GHz

    Benchmark %D
    ----------------------------------
    unpickle_list 15.75
    nqueens 5.24
    mako_v2 5.17
    unpack_sequence 4.44
    fannkuch 4.42
    nbody 3.25
    meteor_contest 2.86
    regex_effbot 2.45
    json_dump_v2 2.44
    django_v3 2.26
    call_simple 2.09
    tornado_http 1.74
    regex_compile 1.40
    regex_v8 1.16
    spectral_norm 0.89
    2to3 0.76
    chameleon_v2 0.70
    telco 0.70
    normal_startup 0.64
    etree_generate 0.61
    etree_process 0.55
    hexiom2 0.51
    json_load 0.51
    call_method_slots 0.48
    formatted_logging 0.33
    call_method 0.28
    startup_nosite -0.02
    fastunpickle -0.02
    pidigits -0.20
    etree_parse -0.23
    etree_iterparse -0.27
    richards -0.30
    silent_logging -0.36
    pickle_list -0.42
    simple_logging -0.82
    float -0.91
    pathlib -0.99
    go -1.16
    raytrace -1.16
    chaos -1.26
    fastpickle -1.72
    call_method_unknown -2.94
    pickle_dict -4.73

    Table 3. CPython 2 results on Intel XEON (Haswell-EP) @ 2.3 GHz

    Benchmark %D
    ----------------------------------
    unpickle_list 15.89
    json_load 11.53
    fannkuch 7.90
    mako_v2 7.01
    meteor_contest 4.21
    nqueens 3.81
    fastunpickle 3.56
    django_v3 2.91
    call_simple 2.72
    call_method_slots 2.45
    slowpickle 2.23
    call_method 2.21
    html5lib_warmup 1.90
    chaos 1.89
    html5lib 1.81
    regex_v8 1.81
    tornado_http 1.66
    2to3 1.56
    json_dump_v2 1.49
    nbody 1.38
    rietveld 1.26
    formatted_logging 1.12
    regex_compile 0.99
    spambayes 0.92
    pickle_list 0.87
    normal_startup 0.82
    pybench 0.74
    slowunpickle 0.71
    raytrace 0.67
    startup_nosite 0.59
    float 0.47
    hexiom2 0.46
    slowspitfire 0.46
    pidigits 0.44
    etree_process 0.44
    etree_generate 0.37
    go 0.27
    telco 0.24
    regex_effbot 0.12
    etree_iterparse 0.06
    bzr_startup 0.04
    richards 0.03
    etree_parse 0.00
    unpack_sequence 0.00
    call_method_unknown -0.26
    pathlib -0.57
    fastpickle -0.64
    silent_logging -0.94
    simple_logging -1.10
    chameleon_v2 -1.25
    pickle_dict -1.67
    spectral_norm -3.25

    Table 4. CPython 2 results on Intel XEON (Broadwell-EP) @ 2.3 GHz

    Benchmark %D
    ----------------------------------
    unpickle_list 15.44
    json_load 11.11
    fannkuch 7.55
    meteor_contest 5.51
    mako_v2 4.94
    nqueens 3.49
    html5lib_warmup 3.15
    html5lib 2.78
    call_simple 2.35
    silent_logging 2.33
    json_dump_v2 2.14
    startup_nosite 2.09
    bzr_startup 1.93
    fastunpickle 1.93
    slowspitfire 1.91
    regex_v8 1.79
    rietveld 1.74
    pybench 1.59
    nbody 1.57
    regex_compile 1.56
    pathlib 1.51
    tornado_http 1.33
    normal_startup 1.21
    2to3 1.14
    chaos 1.00
    spambayes 0.85
    etree_process 0.73
    pickle_list 0.70
    float 0.69
    hexiom2 0.51
    slowpickle 0.44
    call_method_unknown 0.42
    slowunpickle 0.37
    pickle_dict 0.25
    etree_parse 0.20
    go 0.19
    django_v3 0.12
    call_method_slots 0.12
    spectral_norm 0.05
    call_method 0.01
    unpack_sequence 0.00
    raytrace -0.08
    pidigits -0.11
    richards -0.16
    etree_generate -0.23
    regex_effbot -0.26
    telco -0.28
    simple_logging -0.32
    etree_iterparse -0.38
    formatted_logging -0.50
    fastpickle -1.08
    chameleon_v2 -1.74

    @catalin-manciu catalin-manciu mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Feb 18, 2016
    @vstinner

    Instead of modifying individual files, I proposed modifying PyMem_Malloc to use the PyObject_Malloc allocator: issue bpo-26249.

    But the patch for Python 2 still makes sense.

    catalin-manciu mannequin commented Feb 19, 2016

    Hi Victor,

    This patch follows the same idea as your proposal, but it's focused on a single object type. I think doing this incrementally is the safer approach, allowing us to have finer control over the new areas where we enable allocating using the small object allocator and to detect where this replacement might be detrimental to performance.

    @vstinner

    Catalin Gabriel Manciu: "(...) allowing us to have finer control over
    the new areas where we enable allocating using the small object
    allocator and detect where this replacement might be detrimental to
    the performance"

    Ah, interesting: do you think that it's possible that my change can
    *slow down* Python? I don't think so, but I'm interested in feedback
    on my patch :-) You may try to run benchmarks with my patch.

    catalin-manciu mannequin commented Feb 19, 2016

    Theoretically, an object type that consistently allocates more than the small object threshold would perform a bit slower, because it would first jump to the small object allocator, do the size comparison and then jump to malloc. There would be a small overhead if PyMem_* were redirected to PyObject_* in this (hypothetical) case, and the initial choice of PyMem_* over PyObject_* might have been determined by knowledge of that overhead: many think of PyMem_* as the lower-level allocator and PyObject_* as the higher-level one. Of course, PyMem_Raw* should be used in such cases, but it's not as widely adopted as the other two.

    I will post some benchmark results on your issue page as soon as I get them.
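    The dispatch being discussed can be modelled roughly as follows (a toy sketch: the 512-byte cutoff matches pymalloc's SMALL_REQUEST_THRESHOLD in obmalloc.c, but the function names and instrumentation here are illustrative, not CPython's actual code):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    #define SMALL_REQUEST_THRESHOLD 512  /* pymalloc's cutoff */

    static size_t small_hits, fallback_hits;  /* instrumentation for the example */

    /* Toy stand-in for the small-object allocator's fast path. */
    static void *toy_pymalloc(size_t n)
    {
        small_hits++;
        return malloc(n);  /* real pymalloc carves blocks out of arenas */
    }

    /* Toy model of PyObject_Malloc's routing: requests above the threshold
     * pay one extra size comparison before falling through to malloc(),
     * which is the "small overhead" described above. */
    static void *toy_object_malloc(size_t n)
    {
        if (n != 0 && n <= SMALL_REQUEST_THRESHOLD)
            return toy_pymalloc(n);
        fallback_hits++;
        return malloc(n);
    }

    int main(void)
    {
        void *a = toy_object_malloc(64);    /* served by the small-object path */
        void *b = toy_object_malloc(4096);  /* falls through to malloc() */
        assert(a != NULL && b != NULL);
        assert(small_hits == 1 && fallback_hits == 1);
        free(a);
        free(b);
        return 0;
    }
    ```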

    @vstinner

    "Theoretically, an object type that consistently allocates more than the small object threshold would perform a bit slower because it would first jump to the small object allocator, do the size comparison and then jump to malloc."

    I expect that the cost of the extra check is *very* cheap (completely negligible) compared to the cost of a call to malloc().

    To have an idea of the cost of the Python code around system allocators, you can take a look at the Performance section of my PEP-445 which added an indirection to all Python allocators:
    https://www.python.org/dev/peps/pep-0445/#performances

    I was unable to measure an overhead on macro benchmarks (perf.py). The overhead on microbenchmarks was really hard to measure because it was so low that the benchmarks were very unstable.
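    The indirection PEP 445 added amounts to routing every allocation through a swappable table of function pointers, roughly like this (a minimal model of the `PyMemAllocatorEx` idea; the `toy_*` names are illustrative, not the actual CPython API):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Minimal model of PEP 445's PyMemAllocatorEx: allocations go through
     * a replaceable table of function pointers, adding one indirect call. */
    typedef struct {
        void *ctx;
        void *(*malloc_fn)(void *ctx, size_t size);
        void (*free_fn)(void *ctx, void *ptr);
    } toy_allocator;

    static void *default_malloc(void *ctx, size_t size) { (void)ctx; return malloc(size); }
    static void default_free(void *ctx, void *ptr) { (void)ctx; free(ptr); }

    static toy_allocator current = {NULL, default_malloc, default_free};

    /* Hooks (e.g. a debug allocator) are installed by swapping the table,
     * analogous to PyMem_SetAllocator(). */
    static void toy_set_allocator(toy_allocator a) { current = a; }

    static void *toy_mem_malloc(size_t n) { return current.malloc_fn(current.ctx, n); }
    static void toy_mem_free(void *p) { current.free_fn(current.ctx, p); }

    /* Example hook: count allocations. */
    static size_t counted;
    static void *counting_malloc(void *ctx, size_t size) { (void)ctx; counted++; return malloc(size); }

    int main(void)
    {
        void *p = toy_mem_malloc(32);        /* default table, one indirect call */
        toy_mem_free(p);

        toy_allocator counting = {NULL, counting_malloc, default_free};
        toy_set_allocator(counting);         /* install the hook */
        void *q = toy_mem_malloc(32);
        assert(counted == 1);
        toy_mem_free(q);
        return 0;
    }
    ```

    The measurement above suggests the cost of that one extra indirect call is lost in the noise next to the underlying malloc().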

    @terryjreedy

    My impression is that we do not do such performance enhancements to 2.7 for the same reason we would not do them to current 3.x -- the risk of breakage. Have I misunderstood?

    @vstinner

    Terry J. Reedy added the comment:

    My impression is that we do not do such performance enhancements to 2.7 for the same reason we would not do them to current 3.x -- the risk of breakage. Have I misunderstood?

    Breakage of what? The change looks very safe.

    catalin-manciu mannequin commented Feb 22, 2016

    Our Haswell-EP OpenStack Swift setup shows a 1% improvement in throughput rate using CPython 2.7 (5715a6d9ff12) with this patch.

    methane commented Dec 28, 2016

    Update patch for Python 2.7

    @methane methane added the 3.7 (EOL) end of life label Dec 28, 2016
    @vstinner

    I don't understand your change: in Python 3.6, PyMem now uses exactly the
    same allocator as PyObject.

    methane commented Dec 29, 2016

    I know the PyMem and PyObject allocators are the same by default. But it's configurable.
    How should I choose the right allocator?

    methane commented Dec 29, 2016

    Maybe PyObject_MALLOC remains only for backward compatibility?

    @vstinner

    I know the PyMem and PyObject allocators are the same by default. But it's configurable. How should I choose the right allocator?

    The two functions always use the same allocator:
    https://docs.python.org/dev/using/cmdline.html#envvar-PYTHONMALLOC

    Sorry, but which issue are you trying to fix here? Can you please elaborate on your use case?

    As I wrote before, only Python 2 should be modified now (if you consider that it's worth it; the speedup is small).

    methane commented Dec 29, 2016

    OK. I didn't know the PyMem and PyObject allocators are always the same.
    There's no reason to change Python 3.
    How about Python 2?

    Off topic: I'd like to know which of the PyMem and PyObject allocators is preferred
    when writing new code.

    @methane methane removed the 3.7 (EOL) end of life label Dec 29, 2016
    @methane methane closed this as completed Feb 20, 2017
    @vstinner

    FYI the Python 3.6 change in PyMem_Malloc() required implementing a new, complex check on the GIL. Search for "PyMem_Malloc() now fails if the GIL is not held" in my following blog post:
    https://haypo.github.io/contrib-cpython-2016q1.html

    Requiring that the GIL is held is a backward incompatible change. I suggest running your code with PYTHONMALLOC=debug on Python 3.6 ;-)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022