Title: Garbage collection of unused input sections from CPython binaries
Type: performance Stage:
Components: Build Versions: Python 3.6, Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: alecsandru.patrascu, benjamin.peterson, gregory.p.smith, lemburg, martin.panter, pitrou, scoder, serhiy.storchaka, skrah, steve.dower, vstinner, yselivanov, zach.ware
Priority: normal Keywords: patch

Created on 2016-02-04 15:56 by alecsandru.patrascu, last changed 2020-11-04 21:39 by brett.cannon.

File name Uploaded Description Edit
cpython2-deadcode-v01.patch alecsandru.patrascu, 2016-02-04 15:56 review
cpython3-deadcode-v01.patch alecsandru.patrascu, 2016-02-04 15:56 review
gc-removed-cpython2.txt alecsandru.patrascu, 2016-02-04 21:30
gc-removed-cpython3.txt alecsandru.patrascu, 2016-02-04 21:30
alecsandru.patrascu, 2016-02-05 12:50
gc-removed-zones-cpython3.txt alecsandru.patrascu, 2016-02-08 13:48
Messages (11)
msg259572 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-02-04 15:56
Hi all,

This is Alecsandru from the Dynamic Scripting Languages Optimization Team at Intel Corporation. I would like to submit a patch that enables garbage collection of unused input sections from the CPython2 and CPython3 binaries by using the "--gc-sections" linker flag, which decides which input sections are used by examining symbols and relocations. For this to work, GCC must place each function or data item into its own section in the output file, so dedicated compiler flags are used as well. With this technique, both interpreters gain about 1% on average, with a few small regressions.

1. Get the CPython source code
    hg clone cpython
    cd cpython
    hg update 2.7 (for CPython2)

2. Build the binary
    a) Default:
    b) Unused input sections patch
        Copy the attached patch files
        hg import --no-commit cpython3-deadcode-v01.patch (for CPython3)
        hg import --no-commit cpython2-deadcode-v01.patch (for CPython2)

Hardware and OS Configuration
Hardware:           Intel XEON (Haswell-EP) 18 Cores

BIOS settings:      Intel Turbo Boost Technology: false
                    Hyper-Threading: false                  

OS:                 Ubuntu 14.04.3 LTS Server

OS configuration:   Address Space Layout Randomization (ASLR) disabled to reduce
                    run-to-run variation (echo 0 > /proc/sys/kernel/randomize_va_space)
                    CPU frequency set fixed at 2.6GHz

GCC version:        GCC version 4.9.2

Benchmark:          Grand Unified Python Benchmark from 

Measurements and Results
Sample results for CPython2 and CPython3, measured using GUPB on a Haswell platform, are shown in Tables 1 and 2. The first column (Benchmark) gives the benchmark name and the second (%S) the speedup in percent relative to the default build; higher is better.

Table 1. CPython3 results:
Benchmark           %S
telco               11
etree_parse         7
call_simple         6
etree_iterparse     5
regex_v8            4
meteor_contest      3
etree_process       3
call_method_unknown 3
json_dump_v2        3
formatted_logging   2
hexiom2             2
chaos               2
richards            2
django_v3           2
nbody               2
etree_generate      2
pickle_list         1
go                  1
nqueens             1
call_method         1
mako_v2             1
raytrace            1
chameleon_v2        1
silent_logging      0
fastunpickle        0
2to3                0
float               0
regex_effbot        0
pidigits            0
json_load           0
simple_logging      0
normal_startup      0
startup_nosite      0
fastpickle          0
tornado_http        0
regex_compile       0
fannkuch            0
spectral_norm       0
pickle_dict         0
unpickle_list       0
call_method_slots   0
pathlib             -2
unpack_sequence     -2

Table 2. CPython2 results:
Benchmark           %S
simple_logging      4
formatted_logging   3
slowpickle          2
silent_logging      2
pickle_dict         1
chameleon_v2        1
hg_startup          1
pickle_list         1
call_method_unknown 1
pidigits            1
regex_effbot        1
regex_v8            1
html5lib            0
normal_startup      0
regex_compile       0
etree_parse         0
spambayes           0
html5lib_warmup     0
unpack_sequence     0
richards            0
rietveld            0
startup_nosite      0
raytrace            0
etree_iterparse     0
json_dump_v2        0
fastpickle          0
slowspitfire        0
slowunpickle        0
call_simple         0
float               0
2to3                0
bzr_startup         0
json_load           0
hexiom2             0
chaos               0
unpickle_list       0
call_method_slots   0
tornado_http        0
fastunpickle        0
etree_process       0
spectral_norm       0
meteor_contest      0
pybench             0
go                  0
etree_generate      0
mako_v2             0
django_v3           0
fannkuch            0
nbody               0
nqueens             0
telco               -1
call_method         -2
pathlib             -3
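
The message does not say how the "average of 1%" was computed. As a minimal sketch, assuming a plain unweighted arithmetic mean over all listed benchmarks, the tables above can be summarized as follows. The histograms are transcribed from Tables 1 and 2; the names CPYTHON3, CPYTHON2 and mean_speedup are illustrative and not part of the patch or of GUPB.

```python
# Speedup histograms transcribed from Tables 1 and 2:
# {speedup in percent: number of benchmarks with that speedup}.
CPYTHON3 = {11: 1, 7: 1, 6: 1, 5: 1, 4: 1, 3: 4, 2: 7, 1: 7, 0: 18, -2: 2}
CPYTHON2 = {4: 1, 3: 1, 2: 2, 1: 8, 0: 38, -1: 1, -2: 1, -3: 1}


def mean_speedup(histogram):
    """Return the unweighted arithmetic mean speedup, in percent."""
    total = sum(speedup * count for speedup, count in histogram.items())
    benchmarks = sum(histogram.values())
    return total / benchmarks


if __name__ == "__main__":
    for name, table in (("CPython3", CPYTHON3), ("CPython2", CPYTHON2)):
        print(f"{name}: {sum(table.values())} benchmarks, "
              f"mean speedup {mean_speedup(table):.2f}%")
```

Other summaries (a geometric mean, or a mean over a subset of benchmarks) would give different figures, so this is only one possible reading of the numbers.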

Thank you,
msg259576 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 16:42
I'm surprised about the speedups. Is there a logical reason for them?
msg259581 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 17:32
I thought this was the usual telco benchmark instability, but with the patch _decimal *does* seem to be faster in other areas, too.
msg259593 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-02-04 20:42
I realize now that I should have explained the background of this patch in more detail. I'll do so now, so that the effect of these flags is clear to everyone.

This issue came to light after running the coverage target over various workloads, for both CPython2 and CPython3. The coverage results show functions in the interpreter that are never called during the lifetime of the process. These functions still occupy space in the resulting binary, so the CPU is forced to jump across longer offsets than necessary. For production-level binaries it is a good idea to remove these unused functions, as they bring no benefit. This takes two steps. First, every function or data item must be placed in its own section (the GCC flags -ffunction-sections and -fdata-sections do this). Second, the linker, which has the entire picture of every function and data item, can see which functions are never referenced in the current build and discard them (the --gc-sections flag does this).
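
The second step can be pictured as a reachability pass over the sections. The following is a toy model only, not how GNU ld is implemented: the section names and the REFERENCES graph are hypothetical, and the roots argument is an assumption standing in for the entry symbol and any other symbols the linker is required to keep.

```python
from collections import deque

# Hypothetical input sections and the sections their relocations refer to.
# With -ffunction-sections/-fdata-sections each function or data item gets
# its own section, so the graph is per function rather than per object file.
REFERENCES = {
    ".text.main": [".text.run_interpreter"],
    ".text.run_interpreter": [".text.eval_frame", ".data.small_ints"],
    ".text.eval_frame": [],
    ".data.small_ints": [],
    ".text.never_called": [".text.helper_of_never_called"],
    ".text.helper_of_never_called": [],
}


def collect_sections(references, roots):
    """Return (kept, removed): sections reachable from roots, and the rest."""
    kept = set()
    queue = deque(roots)
    while queue:
        section = queue.popleft()
        if section in kept:
            continue
        kept.add(section)
        # Follow every reference (relocation) out of this section.
        queue.extend(references.get(section, ()))
    removed = set(references) - kept
    return kept, removed


if __name__ == "__main__":
    kept, removed = collect_sections(REFERENCES, [".text.main"])
    print("kept:   ", sorted(kept))
    print("removed:", sorted(removed))
```

In this model a function that nothing in the build references is dropped together with its private helpers, unless it is itself treated as a root, which is the situation the questions about exported API functions later in this thread are concerned with.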

This functionality is neither unique nor new; it is used by default in other interpreters, such as V8/Node.js in their Release target, to achieve exactly the same goal. Another example of this functionality being used behind the scenes is Microsoft's compiler, which applies it automatically in its interprocedural optimization phase.

To summarize, the main reason for this speedup is the reduction of the code path length and having the useful functions close together, so that the CPU will be able to prefetch them in advance and use them instead of throwing them away because they are not used.
msg259594 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-04 21:06
Can we get the list of removed functions?

Some functions are not used in interpreter, but they provide API for extensions.
msg259596 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-02-04 21:30
Sure, I attached them as files because they have too many lines to post here (~90 in total).

The linker can report which functions and data items were removed, but I intentionally left that out in order not to clutter the build trace. If you think it would be useful for the user to see it, I can add it to the patch as well.
msg259597 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 21:35
Le 04/02/2016 21:42, Alecsandru Patrascu a écrit :
> To summarize, the main reason for this speedup is the reduction of
> the code path length and having the useful functions close together,
> so that the CPU will be able to prefetch them in advance and use
> them instead of throwing them away because they are not used.

I'm expecting this patch to have an impact on executable or library
size, but not really on runtime performance, as the CPU instruction
cache only fetches whichever pieces of code are actually called.  In
other words, unused sections of code should remain cold wrt. the CPU
caches.  Apart from more or less random aliasing effects (and perhaps
TLB effects, but those should be very minor) I'm surprised that it has
positive performance effects.  But since you work at Intel, perhaps you
know things that I don't ;-)

Also any name starting with Py_ or _Py_ is an API that may be called by
third-party code, so it shouldn't be removed at all...
msg259598 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 21:53
> Also any name starting with Py_ or _Py_ is an API that may be called by third-party code, so it shouldn't be removed at all...

Right. You cannot remove the following functions, they are part of the
public C API (Include/pymem.h).

/usr/bin/ld: Removing unused section '.text.PyMem_RawMalloc' in file
/usr/bin/ld: Removing unused section '.text.PyMem_RawCalloc' in file
/usr/bin/ld: Removing unused section '.text.PyMem_RawRealloc' in file
/usr/bin/ld: Removing unused section '.text.PyMem_RawFree' in file
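
A build log like the one above can be screened automatically for names that look like C API functions. This is a sketch under stated assumptions: the log lines are taken to have the format quoted above (GNU ld emits such lines when asked to report discarded sections, for example with --print-gc-sections), only the section prefixes listed in the pattern are considered, and any name starting with "Py" or "_Py" is treated as potentially public, which is broader than the Py_/_Py_ rule mentioned earlier. The function name removed_public_names is illustrative.

```python
import re

# Matches lines such as:
#   /usr/bin/ld: Removing unused section '.text.PyMem_RawMalloc' in file ...
# Assumption: only these common section prefixes are of interest.
REMOVED_SECTION = re.compile(
    r"Removing unused section '\.(?:text|data|rodata|bss)\.([^']+)'"
)

# Sample log: the first four lines are quoted from this thread (file names
# are truncated there); the last line is a hypothetical internal helper.
SAMPLE_LOG = """\
/usr/bin/ld: Removing unused section '.text.PyMem_RawMalloc' in file
/usr/bin/ld: Removing unused section '.text.PyMem_RawCalloc' in file
/usr/bin/ld: Removing unused section '.text.PyMem_RawRealloc' in file
/usr/bin/ld: Removing unused section '.text.PyMem_RawFree' in file
/usr/bin/ld: Removing unused section '.text.hypothetical_static_helper' in file
"""


def removed_public_names(linker_output):
    """Return removed names that start with Py or _Py, in log order."""
    names = []
    for line in linker_output.splitlines():
        match = REMOVED_SECTION.search(line)
        if match and match.group(1).startswith(("Py", "_Py")):
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    for name in removed_public_names(SAMPLE_LOG):
        print("possibly public API removed:", name)
```

A hit does not by itself prove that the main interpreter lost the symbol: the log has to be read together with the name of the output file being linked, which is truncated in the lines quoted here.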
msg259657 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-02-05 12:50
I've rerun the experiments on larger workloads, such as our OpenStack Swift cluster, and it works without any issues.

Also, I've attached an archive with a simple external module for CPython3 that uses PyMem_RawMalloc. The output is OK, and it's copied below.

u@palecsandru:~/w/experimente/c_ext3$ /home/u/w/cpython3_deadcode/python build_ext --inplace
running build_ext
building 'mytest' extension
creating build
creating build/temp.linux-x86_64-3.6
gcc -pthread -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fdata-sections -ffunction-sections -Wl,--gc-sections -fPIC -I/home/u/w/cpython3_deadcode/Include -I/home/u/w/cpython3_deadcode -c mytest.c -o build/temp.linux-x86_64-3.6/mytest.o
gcc -pthread -shared build/temp.linux-x86_64-3.6/mytest.o -o /home/u/w/experimente/c_ext3/

u@palecsandru:~/w/experimente/c_ext3$ ll
total 40
drwxrwxr-x  3 u u  4096 Feb  5 14:29 ./
drwxr-xr-x 12 u u  4096 Feb  5 14:00 ../
drwxrwxr-x  3 u u  4096 Feb  5 14:29 build/
-rw-rw-r--  1 u u   619 Feb  5 14:16 mytest.c
-rwxrwxr-x  1 u u 17856 Feb  5 14:29*
-rw-rw-r--  1 u u   132 Feb  5 14:15

u@palecsandru:~/w/experimente/c_ext3$ /home/u/w/cpython3_deadcode/python
Python 3.6.0a0 (default:87dfadd61e0d+, Feb  5 2016, 14:22:57) 
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mytest
>>> mytest.mytest()
msg259768 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-02-07 07:16
Maybe I am missing something, but I don’t see how you could load your module if it uses PyMem_RawMalloc. Perhaps PyMem_RawMalloc has been removed from some other executable (e.g. Parser/pgen), rather than the main Python executable?
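
One way to check this from the Python side, independently of the test extension, is to ask the dynamic loader whether the running interpreter still exports a given symbol. This is a sketch that assumes ctypes is available in the patched build; is_exported is an illustrative helper name, not an existing API.

```python
import ctypes


def is_exported(name):
    """Return True if the running interpreter exports a C symbol called name."""
    # ctypes.pythonapi resolves attributes through the dynamic loader, so a
    # missing symbol raises AttributeError, which hasattr() turns into False.
    return hasattr(ctypes.pythonapi, name)


if __name__ == "__main__":
    for symbol in ("PyMem_RawMalloc", "PyMem_RawCalloc",
                   "PyMem_RawRealloc", "PyMem_RawFree"):
        print(symbol, "exported" if is_exported(symbol) else "missing")
```

Run with the patched interpreter, a "missing" result for any of these names would mean that extensions using it could not be loaded.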
msg259844 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-02-08 13:48
I attached the list for CPython3 (gc-removed-zones-cpython3.txt), now split into two sections (core and parser) to make it clearer what is removed and from where. As you can see, the module works because the API that modules can use remains untouched.

What I was trying to say and prove before is that these GCC/LD flags are safe to use in CPython (and for any other software project) and will not break any compatibility with existing or future modules.
Date User Action Args
2020-11-04 21:39:41 brett.cannon set nosy: - brett.cannon
2016-02-27 16:50:57 alecsandru.patrascu set nosy: + lemburg, gregory.p.smith, scoder, zach.ware, steve.dower
2016-02-08 13:48:18 alecsandru.patrascu set files: + gc-removed-zones-cpython3.txt; messages: + msg259844
2016-02-07 07:16:14 martin.panter set nosy: + martin.panter; messages: + msg259768
2016-02-05 12:50:48 alecsandru.patrascu set files: +; messages: + msg259657
2016-02-04 21:53:16 vstinner set messages: + msg259598
2016-02-04 21:35:27 pitrou set messages: + msg259597
2016-02-04 21:30:52 alecsandru.patrascu set files: + gc-removed-cpython3.txt
2016-02-04 21:30:45 alecsandru.patrascu set files: + gc-removed-cpython2.txt; messages: + msg259596
2016-02-04 21:06:56 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg259594
2016-02-04 20:42:44 alecsandru.patrascu set messages: + msg259593
2016-02-04 17:32:26 skrah set messages: + msg259581
2016-02-04 16:42:00 pitrou set nosy: + pitrou; messages: + msg259576
2016-02-04 16:05:52 SilentGhost set nosy: + brett.cannon, vstinner, benjamin.peterson, skrah, yselivanov
2016-02-04 15:56:38 alecsandru.patrascu set files: + cpython3-deadcode-v01.patch
2016-02-04 15:56:28 alecsandru.patrascu create