msg259572 - (view) |
Author: Alecsandru Patrascu (alecsandru.patrascu) * |
Date: 2016-02-04 15:56 |
Hi all,
This is Alecsandru from the Dynamic Scripting Languages Optimization Team at Intel Corporation. I would like to submit a patch that enables garbage collection of unused input sections in the CPython2 and CPython3 binaries, using the "--gc-sections" linker flag, which decides which input sections are used by examining symbols and relocations. For this to work, GCC must place each function or data item into its own section in the output file, so the dedicated compiler flags -ffunction-sections and -fdata-sections are used as well. With this technique, an average gain of 1% is obtained in both interpreters, with a few small regressions.
Steps:
======
1. Get the CPython source code
hg clone https://hg.python.org/cpython cpython
cd cpython
hg update 2.7 (for CPython2)
2. Build the binary
a) Default:
./configure
make
b) Unused input sections patch
Copy the attached patch files
hg import --no-commit cpython3-deadcode-v01.patch (for CPython3)
hg import --no-commit cpython2-deadcode-v01.patch (for CPython2)
./configure
make
Hardware and OS Configuration
=============================
Hardware: Intel XEON (Haswell-EP) 18 Cores
BIOS settings: Intel Turbo Boost Technology: false
Hyper-Threading: false
OS: Ubuntu 14.04.3 LTS Server
OS configuration: Address Space Layout Randomization (ASLR) disabled to reduce
run-to-run variation, using echo 0 > /proc/sys/kernel/randomize_va_space
CPU frequency set fixed at 2.6GHz
GCC version: 4.9.2
Benchmark: Grand Unified Python Benchmark from
https://hg.python.org/benchmarks/
Measurements and Results
========================
CPython2 and CPython3 sample results, measured using GUPB on a Haswell platform, are shown in Tables 1 and 2. The first column (Benchmark) gives the benchmark name and the second (%S) gives the speedup, in percent, compared with the default build; a higher value is better.
Table 1. CPython3 results:
Benchmark %S
----------------------
telco 11
etree_parse 7
call_simple 6
etree_iterparse 5
regex_v8 4
meteor_contest 3
etree_process 3
call_method_unknown 3
json_dump_v2 3
formatted_logging 2
hexiom2 2
chaos 2
richards 2
django_v3 2
nbody 2
etree_generate 2
pickle_list 1
go 1
nqueens 1
call_method 1
mako_v2 1
raytrace 1
chameleon_v2 1
silent_logging 0
fastunpickle 0
2to3 0
float 0
regex_effbot 0
pidigits 0
json_load 0
simple_logging 0
normal_startup 0
startup_nosite 0
fastpickle 0
tornado_http 0
regex_compile 0
fannkuch 0
spectral_norm 0
pickle_dict 0
unpickle_list 0
call_method_slots 0
pathlib -2
unpack_sequence -2
Table 2. CPython2 results:
Benchmark %S
----------------------
simple_logging 4
formatted_logging 3
slowpickle 2
silent_logging 2
pickle_dict 1
chameleon_v2 1
hg_startup 1
pickle_list 1
call_method_unknown 1
pidigits 1
regex_effbot 1
regex_v8 1
html5lib 0
normal_startup 0
regex_compile 0
etree_parse 0
spambayes 0
html5lib_warmup 0
unpack_sequence 0
richards 0
rietveld 0
startup_nosite 0
raytrace 0
etree_iterparse 0
json_dump_v2 0
fastpickle 0
slowspitfire 0
slowunpickle 0
call_simple 0
float 0
2to3 0
bzr_startup 0
json_load 0
hexiom2 0
chaos 0
unpickle_list 0
call_method_slots 0
tornado_http 0
fastunpickle 0
etree_process 0
spectral_norm 0
meteor_contest 0
pybench 0
go 0
etree_generate 0
mako_v2 0
django_v3 0
fannkuch 0
nbody 0
nqueens 0
telco -1
call_method -2
pathlib -3
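For reference, the per-suite averages can be recomputed from the two tables. The sketch below assumes a plain arithmetic mean over the rounded %S values as printed, which may not be how the 1% figure was originally derived; the variable names are made up for this illustration.

```python
from statistics import mean

# Rounded %S values transcribed from Table 1 (CPython3), in table order.
cpython3_speedups = (
    [11, 7, 6, 5, 4]      # telco .. regex_v8
    + [3] * 4             # meteor_contest .. json_dump_v2
    + [2] * 7             # formatted_logging .. etree_generate
    + [1] * 7             # pickle_list .. chameleon_v2
    + [0] * 18            # silent_logging .. call_method_slots
    + [-2, -2]            # pathlib, unpack_sequence
)

# Rounded %S values transcribed from Table 2 (CPython2), in table order.
cpython2_speedups = (
    [4, 3, 2, 2]          # simple_logging .. silent_logging
    + [1] * 8             # pickle_dict .. regex_v8
    + [0] * 38            # html5lib .. nqueens
    + [-1, -2, -3]        # telco, call_method, pathlib
)

print(f"CPython3: {len(cpython3_speedups)} benchmarks, mean {mean(cpython3_speedups):.2f}%")
print(f"CPython2: {len(cpython2_speedups)} benchmarks, mean {mean(cpython2_speedups):.2f}%")
```

With the values as printed, this gives about 1.44% over 43 benchmarks for CPython3 and about 0.25% over 53 benchmarks for CPython2. Because the inputs are already rounded to whole percents, these means are only approximate.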
Thank you,
Alecsandru
|
msg259576 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2016-02-04 16:42 |
I'm surprised about the speedups. Is there a logical reason for them?
|
msg259581 - (view) |
Author: Stefan Krah (skrah) * |
Date: 2016-02-04 17:32 |
I thought this was the usual telco benchmark instability, but with the patch _decimal *does* seem to be faster in other areas, too.
|
msg259593 - (view) |
Author: Alecsandru Patrascu (alecsandru.patrascu) * |
Date: 2016-02-04 20:42 |
I realize now that I should have explained the background of this patch a bit more. I'll do so now, so that the effect of these flags is clear to everyone.
This issue was revealed after running the coverage target over various workloads, for both CPython2 and CPython3. The coverage data shows that there are functions in the interpreter that are never called over the lifespan of the interpreter. Moreover, these functions occupy space in the resulting binary file, and the CPU is forced to jump to longer offsets than required. Furthermore, for production-level binaries, it is a good idea to remove these stubs, as they bring no benefit. To do this, in the first step, every function or data item must be placed in its own section (this is what the GCC flags -ffunction-sections and -fdata-sections do). In the second step, the linker comes into play: because it has the entire picture of every function and data item, it can see which functions are never referenced in the current build and discard them (this is what the --gc-sections flag does).
This functionality is neither unique nor new; it is used by default in other interpreters, such as V8/Node.js in their Release target, to achieve exactly the same goal. Another example of behind-the-scenes use of this functionality is Microsoft's compiler, which does it automatically in its interprocedural optimization phase.
To sum up, the main reason for this speedup is the reduction of the code path length and having the useful functions close together, so that the CPU can prefetch them in advance and use them, instead of throwing prefetched code away because it is not used.
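The two-step mechanism described above can be pictured as a reachability walk over sections. The sketch below is a deliberately simplified model, not how ld is implemented: the section names, the reference graph, and the function live_sections are all hypothetical, and a real linker also honours things like the entry symbol and linker-script KEEP directives.

```python
from collections import deque


def live_sections(references, roots):
    """Return the set of sections reachable from `roots`.

    `references` maps a section name to the sections it refers to,
    standing in for the relocations the linker would examine.
    """
    live = set()
    queue = deque(roots)
    while queue:
        section = queue.popleft()
        if section in live:
            continue
        live.add(section)
        queue.extend(references.get(section, ()))
    return live


# Hypothetical layout: one section per function or data item, as produced
# by -ffunction-sections and -fdata-sections.
references = {
    ".text.main": [".text.run", ".data.config"],
    ".text.run": [".text.helper"],
    ".text.helper": [],
    ".data.config": [],
    ".text.unused_api": [".text.unused_helper"],
    ".text.unused_helper": [],
}

# Only the entry point is a root: everything unreachable is discarded.
kept = live_sections(references, roots=[".text.main"])
removed = sorted(set(references) - kept)
print(removed)

# If an exported function is also treated as a root, it and everything
# it references survive.
kept_with_export = live_sections(
    references, roots=[".text.main", ".text.unused_api"]
)
print(sorted(set(references) - kept_with_export))
```

In this model, whether a function nobody calls inside the binary survives depends entirely on whether it is counted as a root. This is the question raised for C API functions that only extension modules call.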
|
msg259594 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-02-04 21:06 |
Can we get the list of removed functions?
Some functions are not used in the interpreter itself, but they provide an API for extensions.
|
msg259596 - (view) |
Author: Alecsandru Patrascu (alecsandru.patrascu) * |
Date: 2016-02-04 21:30 |
Sure, I attached them as files because they have too many lines to post here (~90 in total).
The linker can report which data items and functions were removed, but I intentionally left that output out so as not to clutter the build trace. If you think it would be useful for the user to see it, I can add it to the patch as well.
|
msg259597 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2016-02-04 21:35 |
On 04/02/2016 21:42, Alecsandru Patrascu wrote:
>
> To compress all of the above, the main reason for this speedup is the
> reduction of the code path length and having the useful function
> close together, so that the CPU will be able to prefetch them in
> advance and use them instead of throwing them away because they are
> not used.
I'm expecting this patch to have an impact on executable or library
size, but not really on runtime performance, as the CPU instruction
cache only fetches whichever pieces of code are actually called. In
other words, unused sections of code should remain cold wrt. the CPU
caches. Apart from more or less random aliasing effects (and perhaps
TLB effects, but those should be very minor) I'm surprised that it has
positive performance effects. But since you work at Intel, perhaps you
know things that I don't ;-)
Also any name starting with Py_ or _Py_ is an API that may be called by
third-party code, so it shouldn't be removed at all...
|
msg259598 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2016-02-04 21:53 |
> Also any name starting with Py_ or _Py_ is an API that may be called by third-party code, so it shouldn't be removed at all...
Right. You cannot remove the following functions, they are part of the
public C API (Include/pymem.h).
/usr/bin/ld: Removing unused section '.text.PyMem_RawMalloc' in file
'Objects/obmalloc.o'
/usr/bin/ld: Removing unused section '.text.PyMem_RawCalloc' in file
'Objects/obmalloc.o'
/usr/bin/ld: Removing unused section '.text.PyMem_RawRealloc' in file
'Objects/obmalloc.o'
/usr/bin/ld: Removing unused section '.text.PyMem_RawFree' in file
'Objects/obmalloc.o'
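Messages in this format are what GNU ld prints when asked to list discarded sections (its --print-gc-sections option). Below is a small sketch for scanning such a log for names that look like C API. It assumes the "Removing unused section" wording shown above and treats any name starting with Py or _Py as potentially public, which is a heuristic chosen for this example and is wider than the Py_/_Py_ rule quoted above. The helper names are hypothetical.

```python
import re

# Matches one linker message, even when mail wrapping split it over two lines.
REMOVED_RE = re.compile(r"Removing unused section '([^']+)' in file\s+'([^']+)'")

# Section name prefixes produced by -ffunction-sections / -fdata-sections.
SECTION_PREFIXES = (".text.", ".data.", ".rodata.", ".bss.")


def removed_sections(log_text):
    """Return (section, object_file) pairs found in a linker log."""
    flattened = " ".join(log_text.split())  # undo line wrapping
    return REMOVED_RE.findall(flattened)


def symbol_name(section):
    """Strip the section-type prefix: '.text.foo' -> 'foo'."""
    for prefix in SECTION_PREFIXES:
        if section.startswith(prefix):
            return section[len(prefix):]
    return section


def looks_public(name):
    """Heuristic (an assumption of this sketch): Py* and _Py* names may be API."""
    return name.startswith(("Py", "_Py"))


sample_log = """\
/usr/bin/ld: Removing unused section '.text.PyMem_RawMalloc' in file
'Objects/obmalloc.o'
/usr/bin/ld: Removing unused section '.text.PyMem_RawFree' in file
'Objects/obmalloc.o'
"""

for section, obj in removed_sections(sample_log):
    name = symbol_name(section)
    if looks_public(name):
        print(f"possible API removed: {name} ({obj})")
```

The log does not say which output binary each message belongs to, so a hit here only flags a name for closer inspection.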
|
msg259657 - (view) |
Author: Alecsandru Patrascu (alecsandru.patrascu) * |
Date: 2016-02-05 12:50 |
I have rerun the experiments on larger workloads, such as our OpenStack Swift cluster, and everything works without any issues.
I have also attached an archive with a simple external module for CPython3 that uses PyMem_RawMalloc. The output is correct and is copied below.
u@palecsandru:~/w/experimente/c_ext3$ /home/u/w/cpython3_deadcode/python setup.py build_ext --inplace
running build_ext
building 'mytest' extension
creating build
creating build/temp.linux-x86_64-3.6
gcc -pthread -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fdata-sections -ffunction-sections -Wl,--gc-sections -fPIC -I/home/u/w/cpython3_deadcode/Include -I/home/u/w/cpython3_deadcode -c mytest.c -o build/temp.linux-x86_64-3.6/mytest.o
gcc -pthread -shared build/temp.linux-x86_64-3.6/mytest.o -o /home/u/w/experimente/c_ext3/mytest.cpython-36m-x86_64-linux-gnu.so
u@palecsandru:~/w/experimente/c_ext3$ ll
total 40
drwxrwxr-x 3 u u 4096 Feb 5 14:29 ./
drwxr-xr-x 12 u u 4096 Feb 5 14:00 ../
drwxrwxr-x 3 u u 4096 Feb 5 14:29 build/
-rw-rw-r-- 1 u u 619 Feb 5 14:16 mytest.c
-rwxrwxr-x 1 u u 17856 Feb 5 14:29 mytest.cpython-36m-x86_64-linux-gnu.so*
-rw-rw-r-- 1 u u 132 Feb 5 14:15 setup.py
u@palecsandru:~/w/experimente/c_ext3$ /home/u/w/cpython3_deadcode/python
Python 3.6.0a0 (default:87dfadd61e0d+, Feb 5 2016, 14:22:57)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mytest
>>> mytest.mytest()
'test'
>>>
|
msg259768 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2016-02-07 07:16 |
Maybe I am missing something, but I don’t see how you could load your module if it uses PyMem_RawMalloc. Perhaps PyMem_RawMalloc has been removed from some other executable (e.g. Parser/pgen), rather than the main Python executable?
|
msg259844 - (view) |
Author: Alecsandru Patrascu (alecsandru.patrascu) * |
Date: 2016-02-08 13:48 |
I attached the list for CPython3 (gc-removed-zones-cpython3.txt), now split into two sections (core and parser) to make it clearer what is removed and from where. As you can see, the module works because the API that can be used by modules remains untouched.
What I was trying to say and prove before is that these GCC/LD flags are safe to use in CPython (and for any other software project) and will not break any compatibility with existing or future modules.
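One way to check this on a given build, besides loading a test extension, is to ask the dynamic loader from inside the interpreter whether a C API symbol is still visible. This is a minimal sketch using the standard ctypes module; the helper name exports_symbol is hypothetical, and the result depends on how the particular interpreter was built and linked.

```python
import ctypes


def exports_symbol(name):
    """Return True if the running interpreter exposes `name` to the loader.

    ctypes.pythonapi looks the symbol up on first attribute access and
    raises AttributeError when it is missing, so hasattr() is enough.
    """
    return hasattr(ctypes.pythonapi, name)


for name in ("PyMem_RawMalloc", "PyMem_RawCalloc", "PyMem_RawRealloc", "PyMem_RawFree"):
    print(name, "exported" if exports_symbol(name) else "MISSING")
```

Run with the patched interpreter, any "MISSING" line would point at an API function that extension modules could no longer resolve.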
|
Date | User | Action | Args
2022-04-11 14:58:27 | admin | set | github: 70473
2020-11-04 21:39:41 | brett.cannon | set | nosy: - brett.cannon
2016-02-27 16:50:57 | alecsandru.patrascu | set | nosy: + lemburg, gregory.p.smith, scoder, zach.ware, steve.dower
2016-02-08 13:48:18 | alecsandru.patrascu | set | files: + gc-removed-zones-cpython3.txt; messages: + msg259844
2016-02-07 07:16:14 | martin.panter | set | nosy: + martin.panter; messages: + msg259768
2016-02-05 12:50:48 | alecsandru.patrascu | set | files: + c_ext3.zip; messages: + msg259657
2016-02-04 21:53:16 | vstinner | set | messages: + msg259598
2016-02-04 21:35:27 | pitrou | set | messages: + msg259597
2016-02-04 21:30:52 | alecsandru.patrascu | set | files: + gc-removed-cpython3.txt
2016-02-04 21:30:45 | alecsandru.patrascu | set | files: + gc-removed-cpython2.txt; messages: + msg259596
2016-02-04 21:06:56 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg259594
2016-02-04 20:42:44 | alecsandru.patrascu | set | messages: + msg259593
2016-02-04 17:32:26 | skrah | set | messages: + msg259581
2016-02-04 16:42:00 | pitrou | set | nosy: + pitrou; messages: + msg259576
2016-02-04 16:05:52 | SilentGhost | set | nosy: + brett.cannon, vstinner, benjamin.peterson, skrah, yselivanov
2016-02-04 15:56:38 | alecsandru.patrascu | set | files: + cpython3-deadcode-v01.patch
2016-02-04 15:56:28 | alecsandru.patrascu | create |