classification
Title: Link Time Optimizations support for GCC and CLANG
Type: performance Stage: patch review
Components: Build Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution: fixed
Dependencies: 26787 26788 Superseder:
Assigned To: Nosy List: alecsandru.patrascu, brett.cannon, gregory.p.smith, inada.naoki, lemburg, pitrou, python-dev, r.david.murray, scoder, skrah, steve.dower, vstinner, zach.ware
Priority: normal Keywords: patch

Created on 2015-11-23 08:59 by alecsandru.patrascu, last changed 2016-06-03 00:27 by gregory.p.smith.

Files
File name                 Uploaded
lto-cpython2-v01.patch    alecsandru.patrascu, 2015-11-23 09:00
lto-cpython3-v01.patch    alecsandru.patrascu, 2015-11-23 09:00
lto-cpython2-v02.patch    alecsandru.patrascu, 2015-11-23 14:44
lto-cpython3-v02.patch    alecsandru.patrascu, 2015-11-23 14:44
lto-cpython2-v03.patch    alecsandru.patrascu, 2016-01-04 15:19
lto-cpython3-v03.patch    alecsandru.patrascu, 2016-01-04 15:19
lto-cpython2-v04.patch    alecsandru.patrascu, 2016-01-20 09:27
lto-cpython3-v04.patch    alecsandru.patrascu, 2016-01-20 09:28
Messages (45)
msg255140 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-11-23 08:59
Title: Link Time Optimizations support for GCC and CLANG

Hi All,

This is Alecsandru from the Server Scripting Languages Optimization team at Intel Corporation. I would like to submit a patch that adds support for Link Time Optimization (LTO) when using GCC and CLANG to compile CPython2 and CPython3. LTO is a compiler-assisted optimization technique that is performed at link time.

Combined with Profile Guided Optimization (PGO), which is enabled by running "make profile-opt", LTO gave a speedup of up to 11% over PGO alone on the Grand Unified Python Benchmark (GUPB), with a few regressions. Compared with a default build, PGO+LTO gave a performance gain as high as 26%. In addition, we are seeing a 2% boost in throughput from our OpenStack Swift setup compared with PGO only. Our GUPB performance evaluation was conducted on Intel SkyLake/Broadwell systems running CentOS/Ubuntu, with CLANG/LLVM and GCC 4.*/5.*. Our OpenStack Swift measurements were done on various systems with XEON and Avoton processors.

Steps:
======

1. Get the CPython source code
    hg clone https://hg.python.org/cpython cpython
    cd cpython
    hg update 2.7 (for CPython2)

2. Build the binary
    a) Default:
        ./configure
        make
    
    b) PGO:
        ./configure
        make profile-opt
        
    c) PGO+LTO:
        Copy the attached patch files
        hg import --no-commit lto-cpython3-v01.patch (for CPython3)
        hg import --no-commit lto-cpython2-v01.patch (for CPython2)
        ./configure
        make profile-opt

        
Hardware and OS Configuration
=============================
Hardware:           Intel XEON (Broadwell-DE) 8 Cores

BIOS settings:      Intel Turbo Boost Technology: false
                    Hyper-Threading: false                  

OS:                 Ubuntu 14.04.3 LTS Server

OS configuration:   Address Space Layout Randomization (ASLR) disabled to reduce
                    run-to-run variation: echo 0 > /proc/sys/kernel/randomize_va_space
                    CPU frequency fixed at 2.6GHz

GCC version:        GCC version 4.9.2

Benchmark:          Grand Unified Python Benchmark from 
                    https://hg.python.org/benchmarks/
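
Before a benchmark run, the ASLR setting listed above can be checked from a script. This is a minimal sketch and not part of the patch: the helper names `describe_aslr` and `current_aslr` are hypothetical, and the code only interprets the value that the Linux kernel exposes in /proc/sys/kernel/randomize_va_space (0 = disabled, 1 = partial, 2 = full).

```python
from pathlib import Path

ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the Linux sysctl kernel.randomize_va_space.
ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap base, VDSO)",
    2: "full (partial plus heap)",
}


def describe_aslr(raw_value):
    """Map the raw file content (a small integer as text) to a description."""
    return ASLR_MODES.get(int(raw_value.strip()), "unknown")


def current_aslr():
    """Describe this machine's setting, or return None if the file is unreadable."""
    try:
        return describe_aslr(ASLR_PATH.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("ASLR:", current_aslr())
```

On a machine configured as described above, the script should report "disabled"; on systems without that file it reports None.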

                    
Measurements and Results
========================
A. Repository:
    GUPB Benchmark:
        hg id :  2979f5ce6a0c tip
        hg --debug id -i : 2979f5ce6a0cee994d5485401945d8457bb0afac

    CPython3:
        hg id : 21a28f6de358
        hg id -r 'ancestors(.) and tag()': 374f501f4567 (3.5) v3.5.0
        hg --debug id -i : 21a28f6de3582833652c958b8fd6ae8448b61c7c

    CPython2:
        hg id : a37ea1d56e98 (2.7)
        hg id -r 'ancestors(.) and tag()': 15c95b7d81dc (2.7) v2.7.10
        hg --debug id -i : a37ea1d56e98eb158750d3e495a5cf524e8c3980


B. Results: 
Sample CPython2 and CPython3 results, measured on a Broadwell platform, are shown in Tables 1 and 2. The first column (Benchmark) gives the benchmark name, the second (%D) the speedup of PGO+LTO compared with the default build, and the third (%PGO) the speedup compared with PGO alone; a higher value is better.

Table 1. CPython2 results:
Benchmark           %D      %PGO
--------------------------------
raytrace            18      3
chaos               16      5
django_v2           16      6
mako                16      6
pathlib             15      3
simple_logging      15      1
slowpickle          15      5
django              14      4
go                  14      4
richards            13      -1
float               12      4
slowunpickle        12      4
etree_process       11      3
fastunpickle        11      6
formatted_logging   11      3
nqueens             11      1
regex_compile       11      3
etree_iterparse     10      4
mako_v2             10      3
telco               10      5
pybench             9       1
hexiom2             9       1
html5lib_warmup     9       3
meteor_contest      9       4
pickle_list         9       5
2to3                8       2
bzr_startup         8       2
chameleon           8       0
etree_generate      8       2
regex_v8            8       3
silent_logging      8       1
fannkuch            7       1
html5lib            7       3
json_load           7       -5
tornado_http        7       3
call_method_slots   6       3
json_dump_v2        6       -4
spambayes           6       2
unpickle_list       6       0
etree_parse         5       3
fastpickle          5       4
rietveld            5       1
call_method         4       -1
normal_startup      4       2
startup_nosite      4       2
slowspitfire        3       0
ssbench             4       2
call_method_unknown 1       -6
json_dump           1       -4
nbody               1       1
pidigits            1       -10
pickle_dict         0       -1
regex_effbot        0       -2
spectral_norm       0       -3
call_simple         -3      -3
unpack_sequence     -6      -2


Table 2. CPython3 results:
Benchmark           %D      %PGO
--------------------------------
formatted_logging   26      11
raytrace            24      8
simple_logging      24      6
richards            22      3
chaos               21      7
go                  21      11
hexiom2             21      8
nbody               21      9
etree_generate      19      5
etree_process       19      5
call_method_slots   18      3
fastunpickle        18      0
pathlib             18      5
regex_compile       18      8
float               17      8
nqueens             17      7
call_method         16      3
etree_iterparse     16      9
json_dump           16      -4
json_load           16      5
silent_logging      15      8
2to3                14      5
fannkuch            14      8
call_simple         12      0
meteor_contest      12      7
call_method_unknown 11      -1
spectral_norm       11      4
json_dump_v2        10      3
telco               10      5
fastpickle          9       -4
etree_parse         8       1
normal_startup      8       3
startup_nosite      7       3
unpack_sequence     7       3
regex_v8            6       4
unpickle_list       5       3
pickle_list         1       -10
pidigits            1       -11
regex_effbot        -2      2
pickle_dict         -3      -10
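
The tables do not state the exact formula behind the percentages. As a rough sketch, assuming each value is the relative run-time improvement of the optimized build over a baseline build, a %D or %PGO entry could be computed as below. The function name `speedup_percent` and the timings in the usage note are hypothetical and are not taken from the measurements.

```python
def speedup_percent(baseline_seconds, optimized_seconds):
    """Percentage speedup of the optimized build relative to the baseline.

    Positive values mean the optimized build is faster, negative values
    mean a regression. The formula is an assumption, not the one
    documented for the tables.
    """
    return (baseline_seconds / optimized_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical timings: default build 10.0 s, PGO+LTO build 8.0 s.
    print("%+.0f%%" % speedup_percent(10.0, 8.0))
```

With this definition, the same benchmark would be run on the default, PGO, and PGO+LTO builds; the first pairing gives the %D column and the second gives the %PGO column.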

Thank you,
Alecsandru
msg255148 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-11-23 11:36
LTO only exists on recent versions of gcc, so the configure script should probably do some version checking.

Also we can't enable it by default as 1) it makes compile times much longer 2) there are some bugs in some gcc versions (see e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=753134).
msg255150 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-11-23 12:18
LTO has existed in GCC since version 4.5, but it is true that only recent versions (>=4.8) perform it well. It is not enabled by default in this patch; it is only available when building with PGO support. Running just "make" will not activate the LTO flags.

Do you see it as a configure option (using, for example, an explicit --with-lto flag) rather than using it automatically?
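
The version check suggested above can be illustrated as follows. The real check would live in the configure script, so this Python version is only a sketch of the logic: the names `parse_gcc_version` and `lto_recommended` are hypothetical, and the 4.8 threshold is the one mentioned in this message.

```python
import re

# Threshold taken from the discussion: LTO exists since GCC 4.5,
# but is only reported to work well from 4.8 onwards.
MIN_GCC_FOR_LTO = (4, 8)


def parse_gcc_version(text):
    """Extract (major, minor) from a version string such as "4.9.2" or "5"."""
    match = re.match(r"\s*(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("unrecognized version string: %r" % text)
    return int(match.group(1)), int(match.group(2) or 0)


def lto_recommended(version_text):
    """True if the given GCC version meets the assumed minimum for LTO."""
    return parse_gcc_version(version_text) >= MIN_GCC_FOR_LTO


if __name__ == "__main__":
    for version in ("4.5.1", "4.9.2", "5.2.1"):
        print(version, lto_recommended(version))
```

The version string could come from `gcc -dumpversion`; how CLANG versions should be handled is left open here.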
msg255165 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-11-23 14:44
Meanwhile I've added the patches (v02) for LTO enabled only if the "./configure --with-lto" command is issued.
msg255167 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-11-23 14:46
On 23/11/2015 13:18, Alecsandru Patrascu wrote:
> 
> Do you see it as a configure option (using, for example, an
> explicit --with-lto flag) rather than using it automatically?

That would be nice. This way people can easily test different
combinations of flags.
msg255168 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-11-23 14:47
On 23/11/2015 15:44, Alecsandru Patrascu wrote:
> 
> Meanwhile I've added the patches (v02) for LTO enabled only if the "./configure --with-lto" command is issued.

Cool, thanks!
msg257042 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-12-26 17:46
I'm adding Brett, Gregory, Stefan and Victor as nosy because this issue might be interesting for them also.
msg257463 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-01-04 15:19
Hello, I've added an updated set of patches (v03) for the current CPython2 and CPython3 codebase. Also made some small changes to reduce the number of places where the flags are set.
msg258627 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2016-01-19 22:22
I'm a bit concerned that the flags are being added unconditionally to CFLAGS and LDFLAGS (when configured --with-lto), which means extensions are forced into it as well.  I think it would be better to use CFLAGS_NODIST and to add LDFLAGS_NODIST.  Unfortunately, 2.7 doesn't have even CFLAGS_NODIST; I suspect it may be time to backport that.
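
One way to see the concern is to look at the flags an installed interpreter records, since extension builds commonly reuse them. The sketch below assumes that behaviour and uses the standard `sysconfig` module; the helper names `lto_flags` and `inherited_lto_flags` are hypothetical.

```python
import sysconfig


def lto_flags(flag_string):
    """Return the -flto-style options found in a compiler or linker flag string."""
    return [flag for flag in (flag_string or "").split() if flag.startswith("-flto")]


def inherited_lto_flags():
    """LTO options recorded in this interpreter's CFLAGS and LDFLAGS.

    Assumption: extension builds pick these values up from the
    interpreter's build configuration.
    """
    return {
        name: lto_flags(sysconfig.get_config_var(name))
        for name in ("CFLAGS", "LDFLAGS")
    }


if __name__ == "__main__":
    print(inherited_lto_flags())
```

If both lists are empty, extensions are not being pushed into LTO by the interpreter's recorded flags, which is the outcome the _NODIST variables are meant to give.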
msg258632 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-01-19 23:03
> Unfortunately, 2.7 doesn't have even CFLAGS_NODIST; I suspect it may be time to backport that.

I don't think now is a good time to introduce instability in the 2.7 branch.
msg258642 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2016-01-20 05:28
Unless I'm just missing something, I don't see how introducing CFLAGS_NODIST and LDFLAGS_NODIST to 2.7 would introduce instability.  It should be a fairly non-invasive change, restricted to configure and the Makefile; both vars should usually be empty and thus builds should be entirely unaffected unless options like --with-lto are chosen.


On a separate note about the patch: as mentioned in msg251305, it's probably better to restrict adding the LTO flags to just the profile-opt targets, even with the --with-lto check.
msg258651 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-01-20 08:37
Any non-bugfix introduction can introduce instability. That has always been our position with respect to adding features to bugfix branches. I don't see how adding a LTO option this late in the 2.7 release cycle can be considered important enough to break that rule.

Let me add that downstream distributors already customize compilation options (Ubuntu's Python is compiled with both PGO and LTO enabled, AFAIR), so this change may only really affect the tiny subset of non-Windows users that compile Python themselves.

But well, perhaps Python development has become boring to the point of deliberately introducing uncertainty and risk to make things a bit more fun? ;-)
msg258653 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-20 08:41
I suggest only modifying the default branch and working with downstream (like
Linux vendors) to compile Python with the best compiler options.

I'm talking about the default compilation mode. Maybe we can add a
configure option to 2.7 and 3.5, disabled by default, to use the best options.

Sorry I didn't read the whole discussion.
msg258655 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-01-20 09:27
Thank you for your feedback, I've updated the patches and now LTO flags are used only when building with PGO (v04). CFLAGS/LDFLAGS remain untouched, as Antoine and Victor suggested is better.
msg258658 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-01-20 10:24
On 20.01.2016 09:37, Antoine Pitrou wrote:
> Let me add that downstream distributors already customize compilation options (Ubuntu's Python is compiled with both PGO and LTO enabled, AFAIR), so this change may only really affect the tiny subset of non-Windows users that compile Python themselves.

Are the Windows installers on python.org compiled with PGO and
LTO enabled ?

If not, then the patch would also affect the not-so-tiny fraction
of Python users on Windows ;-)

BTW: It may make sense to start collecting the various performance
related optional patches to Python 2.7 on a wiki page for interested
parties to use.
msg258660 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-01-20 10:38
> If not, then the patch would also affect the not-so-tiny fraction
> of Python users on Windows ;-)

I don't see how enabling LTO for gcc and clang could ever affect our Windows users ;-)
msg258663 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-01-20 11:03
On 20.01.2016 11:38, Antoine Pitrou wrote:
>> If not, then the patch would also affect the not-so-tiny fraction
>> of Python users on Windows ;-)
> 
> I don't see how enabling LTO for gcc and clang could ever affect our Windows users ;-)

You have a point there, but perhaps we could start offering
an ICC compiled version for Windows ;-)
msg258682 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-01-20 13:57
My understanding is that we (starting with Guido) have made a blanket exception for 2.7 for useful performance and build-system-only related patches.  That doesn't mean *anything* can go in (the usual rules about "is this worth it/backward compatible/won't break things" still apply) but it is a lower bar than is true for other maintenance only releases.  Perhaps my understanding is in error, though.  

I believe Intel is committed to supporting this, so if there do turn out to be any maintenance issues they can handle them.  (Which IIUC is Nick's argument: if someone wants to support 2.7 with stuff we are willing to let in, we should let them as long as they credibly commit to supporting it.)  I'm currently part of that Intel support, though, so someone else should rule on this.
msg258697 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2016-01-20 17:38
To help answer MAL's question about Windows: I know the python.org installers are **not** built with PGO, but I don't know about LTO.
msg258703 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-01-20 18:02
MSVC has had Link-Time Code Generation for many releases, and it should have been used for all 2.7 releases (definitely used in 3.5+) to optimize references between object files. I assume this is equivalent to LTO.

We currently don't use PGO in the official Windows builds, but it is a supported build configuration.
msg259019 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-01-27 12:58
As Steve mentioned, the Microsoft compiler uses LTO (they call it Link-Time Code Generation) and the flags are used when compiling CPython on Windows systems. Thus our proposal to enable it on GCC and CLANG also.
msg261150 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2016-03-03 08:46
Can we use LTO without PGO?
PGO makes the build several times slower.
msg261154 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-03-03 09:48
Yes, you can use LTO without PGO, but the proposed way is more efficient and makes more sense for CPython builds.
msg261155 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2016-03-03 10:15
Sorry for my poor English.
I meant: does `./configure --with-lto && make` use LTO?
msg261158 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-03-03 11:09
I understand your question now. LTO is not enabled when running just `make`, only with `make profile-opt`.
msg261181 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2016-03-04 01:32
I've tried LTO without PGO in Debian Jessie.

$ LTOFLAGS='-flto -fuse-linker-plugin -ffat-lto-objects -flto-partition=none'
$ CFLAGS=$LTOFLAGS LDFLAGS=$LTOFLAGS ./configure --prefix=...
$ make -j32

Results are here (compared with a build using neither LTO nor PGO):

Test                             minimum run-time        average  run-time
                                 this    other   diff    this    other   diff
-------------------------------------------------------------------------------
          BuiltinFunctionCalls:    47ms    50ms   -6.6%    48ms    51ms   -6.0%
           BuiltinMethodLookup:    29ms    29ms   -1.3%    29ms    29ms   -0.1%
                 CompareFloats:    32ms    33ms   -2.8%    34ms    34ms   -0.5%
         CompareFloatsIntegers:    67ms    70ms   -3.9%    69ms    71ms   -3.1%
               CompareIntegers:    48ms    46ms   +5.1%    49ms    47ms   +5.8%
        CompareInternedStrings:    30ms    31ms   -1.9%    31ms    31ms   -1.6%
                  CompareLongs:    28ms    26ms   +8.0%    29ms    27ms   +8.5%
                CompareStrings:    26ms    26ms   -0.9%    27ms    26ms   +1.5%
    ComplexPythonFunctionCalls:    47ms    51ms   -8.9%    48ms    52ms   -7.8%
                 ConcatStrings:    32ms    33ms   -3.2%    33ms    34ms   -2.2%
               CreateInstances:    51ms    52ms   -2.5%    52ms    53ms   -3.5%
            CreateNewInstances:    38ms    40ms   -4.5%    39ms    41ms   -4.4%
       CreateStringsWithConcat:    68ms    69ms   -1.4%    70ms    71ms   -0.4%
                  DictCreation:    53ms    51ms   +5.2%    55ms    52ms   +6.7%
             DictWithFloatKeys:    41ms    42ms   -2.2%    43ms    43ms   -0.0%
           DictWithIntegerKeys:    34ms    34ms   +0.1%    35ms    35ms   +0.5%
            DictWithStringKeys:    31ms    32ms   -1.3%    32ms    32ms   -1.6%
                      ForLoops:    26ms    30ms  -12.1%    28ms    30ms   -8.7%
                    IfThenElse:    42ms    41ms   +2.6%    43ms    41ms   +5.0%
                   ListSlicing:    40ms    40ms   -0.8%    41ms    41ms   -0.4%
                NestedForLoops:    42ms    42ms   -0.3%    43ms    43ms   +0.6%
      NestedListComprehensions:    42ms    47ms  -11.9%    45ms    50ms  -10.5%
          NormalClassAttribute:    89ms    96ms   -7.9%    92ms    98ms   -5.9%
       NormalInstanceAttribute:    47ms    45ms   +4.8%    48ms    45ms   +4.9%
           PythonFunctionCalls:    41ms    44ms   -7.5%    41ms    45ms   -7.4%
             PythonMethodCalls:    53ms    59ms   -9.4%    55ms    60ms   -8.5%
                     Recursion:    69ms    73ms   -5.1%    71ms    74ms   -4.2%
                  SecondImport:    36ms    41ms  -12.0%    38ms    42ms   -9.9%
           SecondPackageImport:    45ms    42ms   +6.5%    46ms    43ms   +7.0%
         SecondSubmoduleImport:   115ms   107ms   +7.9%   117ms   108ms   +7.9%
       SimpleComplexArithmetic:    27ms    29ms   -6.5%    28ms    30ms   -4.5%
        SimpleDictManipulation:    60ms    65ms   -7.8%    61ms    66ms   -7.0%
         SimpleFloatArithmetic:    33ms    30ms   +7.4%    34ms    31ms   +8.3%
      SimpleIntFloatArithmetic:    36ms    38ms   -3.3%    37ms    38ms   -4.0%
       SimpleIntegerArithmetic:    36ms    38ms   -5.2%    37ms    38ms   -4.1%
      SimpleListComprehensions:    36ms    37ms   -3.2%    38ms    41ms   -7.5%
        SimpleListManipulation:    34ms    34ms   -1.3%    35ms    38ms   -6.8%
          SimpleLongArithmetic:    26ms    26ms   +0.3%    27ms    30ms   -7.5%
                    SmallLists:    45ms    47ms   -4.1%    46ms    56ms  -17.2%
                   SmallTuples:    51ms    54ms   -6.3%    53ms    62ms  -14.8%
         SpecialClassAttribute:    92ms    97ms   -5.0%    95ms    99ms   -4.8%
      SpecialInstanceAttribute:    46ms    45ms   +2.5%    48ms    46ms   +3.9%
                StringMappings:    71ms   100ms  -29.0%    73ms   101ms  -27.8%
              StringPredicates:    49ms    59ms  -17.8%    50ms    60ms  -16.5%
                 StringSlicing:    48ms    47ms   +3.3%    79ms    47ms  +66.2%
                     TryExcept:    24ms    29ms  -16.9%    25ms    30ms  -15.8%
                    TryFinally:    35ms    37ms   -6.0%    36ms    38ms   -4.6%
                TryRaiseExcept:    12ms    13ms   -7.5%    13ms    14ms   -7.2%
                  TupleSlicing:    48ms    50ms   -2.9%    49ms    51ms   -2.7%
                   WithFinally:    52ms    57ms   -8.4%    53ms    58ms   -8.2%
               WithRaiseExcept:    42ms    46ms   -8.8%    43ms    47ms   -9.1%
-------------------------------------------------------------------------------
Totals:                          2291ms  2398ms   -4.5%  2390ms  2470ms   -3.2%

(this=lto.pybench, other=default.pybench)
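
The "diff" columns are consistent with a relative difference of "this" against "other", so negative values mean the LTO build is faster. The sketch below recomputes the Totals row under that assumption; `percent_diff` is a hypothetical helper, not pybench's own code.

```python
def percent_diff(this_ms, other_ms):
    """Relative difference of `this` against `other`, in percent."""
    return (this_ms - other_ms) / other_ms * 100.0


if __name__ == "__main__":
    # Totals row from the table above (this = LTO build, other = default build).
    total_min = percent_diff(2291, 2398)
    total_avg = percent_diff(2390, 2470)
    print("min: %+.1f%%, avg: %+.1f%%" % (total_min, total_avg))
    # prints "min: -4.5%, avg: -3.2%"
```

Individual rows cannot always be reproduced this way from the table, because the per-test timings are shown rounded to whole milliseconds.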
msg261183 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-03-04 07:18
In our experience, pybench alone is not a representative benchmark. If you would like to measure performance close to real workloads, you can run the Grand Unified Python Benchmark suite instead, which is more complete.

Also, you need to take the hardware and software environment into consideration. For this, see the "Hardware and OS Configuration" section of the initial comment on this issue for the approach we use here at Intel.
msg261189 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2016-03-04 15:34
The machine is Google Compute Engine n1-highcpu-32 (Intel Ivy Bridge)

Linux bench 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux

cpuinfo:
processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU @ 2.50GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 2500.000
cache size      : 30720 KB


command:
$ python perf.py -r -b default ../Python-3.5.1/python-default ../Python-3.5.1/python-lto

output:
Report on Linux bench 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 
Total CPU cores: 32

### 2to3 ###
Min: 8.692000 -> 8.160000: 1.07x faster
Avg: 8.816800 -> 8.253600: 1.07x faster
Significant (t=8.07)
Stddev: 0.12726 -> 0.09027: 1.4098x smaller

### chameleon_v2 ###
Min: 6.756928 -> 6.414046: 1.05x faster
Avg: 6.849192 -> 6.666536: 1.03x faster
Significant (t=20.88)
Stddev: 0.04413 -> 0.07555: 1.7120x larger

### fastpickle ###
Min: 0.540906 -> 0.564253: 1.04x slower
Avg: 0.549624 -> 0.579263: 1.05x slower
Significant (t=-34.29)
Stddev: 0.00427 -> 0.00752: 1.7622x larger

### nbody ###
Min: 0.260169 -> 0.273837: 1.05x slower
Avg: 0.267334 -> 0.280441: 1.05x slower
Significant (t=-34.05)
Stddev: 0.00257 -> 0.00286: 1.1125x larger

### regex_v8 ###
Min: 0.047335 -> 0.044750: 1.06x faster
Avg: 0.049424 -> 0.046788: 1.06x faster
Significant (t=10.46)
Stddev: 0.00174 -> 0.00182: 1.0469x larger

The following not significant results are hidden, use -v to show them:
django_v3, fastunpickle, json_dump_v2, json_load, tornado_http.
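
The "N.NNx faster/slower" figures in the report match the ratio of the larger timing to the smaller one. The following sketch reproduces them from the Min lines above; `compare` is a hypothetical helper written for illustration and is not taken from perf.py.

```python
def compare(old_seconds, new_seconds):
    """Describe a timing change as "N.NNx faster" or "N.NNx slower"."""
    if new_seconds < old_seconds:
        return "%.2fx faster" % (old_seconds / new_seconds)
    if new_seconds > old_seconds:
        return "%.2fx slower" % (new_seconds / old_seconds)
    return "no change"


if __name__ == "__main__":
    # Min timings copied from the report above (default build -> LTO build).
    print("2to3:      ", compare(8.692000, 8.160000))
    print("fastpickle:", compare(0.540906, 0.564253))
    print("nbody:     ", compare(0.260169, 0.273837))
```

Whether a difference counts as significant is a separate question; the report's t values are not recomputed here.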
msg261190 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-03-04 16:02
You are doing measurements on a virtual machine... For sure you are not the only user that has active workloads on the physical machine while you do benchmarks :)

On the other hand, building with LTO alone is nice for experiments, but it is not feasible for real-world usage. Using it in conjunction with PGO is the way to get the best Python interpreter, and I strongly recommend that you use the v04 versions of the patches.
msg261205 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2016-03-05 00:22
> For sure you are not the only user that has active workloads on the physical machine while you do benchmarks :)

I think the largest machine type, which I chose (32 cores), avoids sharing the physical machine with other users.

> On the other hand, building with LTO alone is nice for experiments, but it is not feasible for real-world usage. Using it in conjunction with PGO is the way to get the best Python interpreter, and I strongly recommend that you use the v04 versions of the patches.

I agree PGO+LTO is the best.  But I want "only LTO" because:

1) It is a pitfall that `./configure --with-lto && make` doesn't use LTO.
2) PGO makes the build too slow.  For casual use, I can wait for LTO but not for PGO.
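
The pitfall in point 1 can be stated as a small truth table. The sketch below models the v04 behaviour as it is described in this thread, where LTO flags are applied only if configure was run with --with-lto and the profile-opt target is built; `lto_active` is a hypothetical name and the function does not inspect a real build.

```python
def lto_active(configure_args, make_target):
    """Model of the v04 patches as described in this thread.

    LTO is used only when --with-lto was passed to configure AND the
    build goes through the profile-opt target.
    """
    return "--with-lto" in configure_args and make_target == "profile-opt"


if __name__ == "__main__":
    for args, target in (
        (["--with-lto"], "all"),          # ./configure --with-lto && make
        (["--with-lto"], "profile-opt"),  # ./configure --with-lto && make profile-opt
        ([], "profile-opt"),              # ./configure && make profile-opt
    ):
        print(args, target, "->", lto_active(args, target))
```

Only the second combination enables LTO, which is why the first one is easy to get wrong.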
msg261208 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2016-03-05 00:44
Piping up from the peanut gallery here:

If your use case is not doing release builds for production use, i.e. "casual use", don't bother with either PGO or LTO.  It won't matter.

Your final build that you QA and ship should absolutely use those. (nobody's going to disagree with that :)

I would not reject changes that allow --with-lto to work in the absence of PGO, but I don't think it should be anyone's priority.
msg263354 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-13 19:24
+  --with-lto              Enable Link Time Optimization in PGO builds.
+                          Disabled by default.

I don't understand why it's disabled by default. IMHO we must enable all the best optimizer options *by default*.

But I expect all optimizations to be disabled by --with-debug.
msg263355 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-04-13 19:50
LTO is not stable on all platforms (according to doko), and people don't
want to wait for PGO to build when they just run ./configure && make.

--with-pgo and --with-lto is fine.
msg263356 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-13 19:53
> LTO is not stable on all platforms (according to doko), and people don't want to wait for PGO to build when they just run ./configure && make.

Can we have a whitelist of arch known to support PGO and/or LTO? Or maybe a blacklist?

Ubuntu already has this knowledge in their package, no?
msg263357 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-04-13 19:55
On 13.04.2016 21:50, Stefan Krah wrote:
> 
> LTO is not stable on all platforms (according to doko), and people don't
> want to wait for PGO to build when they just run ./configure && make.
> 
> --with-pgo and --with-lto is fine.

Agreed. Let's not make compilation take longer than necessary.

When doing production builds, people can still enable these
optimizations as necessary.
msg263358 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-04-13 19:57
On 13/04/2016 21:55, Marc-Andre Lemburg wrote:
>>
>> LTO is not stable on all platforms (according to doko), and people don't
>> want to wait for PGO to build when they just run ./configure && make.
>>
>> --with-pgo and --with-lto is fine.
> 
> Agreed. Let's not make compilation take longer than necessary.
> 
> When doing production builds, people can still enable these
> optimizations as necessary.

Agreed as well. It's enough to make these options sufficiently accessible.
msg263383 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-04-14 08:39
@Stefan and @Marc, you say that people don't want to wait for PGO to build when running ./configure && make, but why? Even though many developers use it, this mode is not intended for development; it is production level and should be run once (or at least a limited number of times), when the developers are sure that everything is fine in the debug mode. As Victor previously said, we should have all the *good* stuff (PGO, LTO, etc.) enabled by default, regardless of the time needed to do it.

@Victor, indeed, LTO is not yet good enough to use it stand-alone in CPython. That is the reason why it is enabled only with PGO, because applied over it, we obtain further speedups than PGO alone. Also Ubuntu uses PGO and LTO in their releases.

But in the end maybe `./configure --with-lto && make profile-opt` will have to do for everybody.
msg263385 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-04-14 08:55
On Thu, Apr 14, 2016 at 08:39:20AM +0000, Alecsandru Patrascu wrote:
> @Stefan and @Marc, you say that people don't want to wait for PGO to build when running ./configure && make, but why? Even though many developers use it, this mode is not intended for development; it is production level and should be run once (or at least a limited number of times), when the developers are sure that everything is fine in the debug mode. As Victor previously said, we should have all the *good* stuff (PGO, LTO, etc.) enabled by default, regardless of the time needed to do it.

I use it all the time in development:

  - For running math tests that would be too slow otherwise.

  - To diagnose invalid accesses that only occur with -O2.

  - To speed up Valgrind runs.
msg263386 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-04-14 09:07
On 14.04.2016 10:39, Alecsandru Patrascu wrote:
> 
> @Stefan and @Marc, you say that people don't want to wait for PGO to build when running ./configure && make, but why? Even though many developers use it, this mode is not intended for development; it is production level and should be run once (or at least a limited number of times), when the developers are sure that everything is fine in the debug mode. As Victor previously said, we should have all the *good* stuff (PGO, LTO, etc.) enabled by default, regardless of the time needed to do it.

You need to compile Python a lot during Python development and
here the compile speed matters, the performance of the resulting
binary is secondary (as long as it is consistent).

For production, it's easily possible to add those options to configure,
plus it's not 100% clear whether all optimizations really do create
correct code. We've had lots of issues with optimization errors in
compilers in the past and have generally been rather conservative with
the default optimization settings. It's better to have a stable running
Python, than a Python that is fast at failing or creating wrong
results ;-)

I think having these extra options readily accessible and
working is great, and people who know what they are doing can
then use them for the benefit of getting an even faster Python.

Distributors will know what they are doing, so many Python
users will still be able to benefit from them.
msg263387 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-04-14 09:13
On Thu, Apr 14, 2016 at 08:55:25AM +0000, Stefan Krah wrote:
> I use it all the time in development:

... where "it" refers to "./configure && make", not to PGO.
msg263395 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2016-04-14 10:17
Maybe a workflow like the one proposed in issue #26359 can be helpful in these development phases.
msg263532 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-04-15 23:59
New changeset f16ec63055ad by Gregory P. Smith in branch '3.5':
Issue #25702: A --with-lto configure option has been added that will
https://hg.python.org/cpython/rev/f16ec63055ad

New changeset 3103af76f4c4 by Gregory P. Smith in branch 'default':
Issue #25702: A --with-lto configure option has been added that will
https://hg.python.org/cpython/rev/3103af76f4c4
msg263534 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2016-04-16 00:16
What I committed for 3.5 and 3.6 matches lto-cpython3-v04.patch, which just adds --with-lto support.  2.7 still needs to be patched.

For reference: using Ubuntu's gcc 5.2.1, I was seeing a 2-3% performance increase in the resulting LTO binary vs a plain profile-opt PGO build.  That'll vary based on arch and compiler toolchain.
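
As a rough illustration (not part of the committed patch), one way to check whether a given interpreter was configured with the new option is to inspect the recorded configure arguments. This is only a sketch: the helper name built_with_lto is hypothetical, CONFIG_ARGS is generally only available on POSIX builds, and the handling of "--with-lto=no" and "--without-lto" assumes the usual autoconf conventions. It reports what was requested at configure time, not what the toolchain actually produced.

```python
import shlex
import sysconfig


def built_with_lto(config_args):
    """Return True if a configure argument string requests --with-lto.

    Hypothetical helper for illustration; the last matching option wins.
    """
    if not config_args:
        # CONFIG_ARGS can be missing (for example on Windows builds).
        return False
    enabled = False
    for arg in shlex.split(config_args):
        if arg == "--with-lto":
            enabled = True
        elif arg.startswith("--with-lto="):
            # Assumed autoconf convention: "--with-lto=no" disables the option.
            enabled = arg.split("=", 1)[1] != "no"
        elif arg == "--without-lto":
            enabled = False
    return enabled


if __name__ == "__main__":
    args = sysconfig.get_config_var("CONFIG_ARGS")
    print("configure args:", args)
    print("LTO requested at configure time:", built_with_lto(args))
```

Run against an interpreter built with "./configure --with-lto", the last line should report True; on a default build it should report False.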
msg266993 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-06-02 23:44
New changeset f710dac07312 by Gregory P. Smith in branch '2.7':
Issue #25702: A --with-lto configure option has been added that will
https://hg.python.org/cpython/rev/f710dac07312
msg267007 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2016-06-03 00:27
The main part of this issue is done, but it can't be closed until the dependencies listed are also dealt with.  Un-assigning myself.
History
Date User Action Args
2016-06-03 00:27:51 gregory.p.smith set assignee: gregory.p.smith ->
resolution: fixed
messages: + msg267007
2016-06-02 23:44:58 python-dev set messages: + msg266993
2016-04-17 06:30:12 gregory.p.smith set dependencies: + test_gdb fails all tests on a profile-opt build configured --with-lto
2016-04-17 06:21:50 gregory.p.smith set dependencies: + test_distutils fails when configured --with-lto
2016-04-16 00:16:25 gregory.p.smith set assignee: gregory.p.smith
messages: + msg263534
2016-04-15 23:59:18 python-dev set nosy: + python-dev
messages: + msg263532
2016-04-14 10:17:43 alecsandru.patrascu set messages: + msg263395
2016-04-14 09:13:39 skrah set messages: + msg263387
2016-04-14 09:07:32 lemburg set messages: + msg263386
2016-04-14 08:55:25 skrah set messages: + msg263385
2016-04-14 08:39:20 alecsandru.patrascu set messages: + msg263383
2016-04-13 19:57:45 pitrou set messages: + msg263358
2016-04-13 19:55:45 lemburg set messages: + msg263357
2016-04-13 19:53:38 vstinner set messages: + msg263356
2016-04-13 19:50:26 skrah set messages: + msg263355
2016-04-13 19:24:24 vstinner set nosy: + vstinner
messages: + msg263354
2016-03-05 00:44:43 gregory.p.smith set messages: + msg261208
2016-03-05 00:22:05 inada.naoki set messages: + msg261205
2016-03-04 16:02:07 alecsandru.patrascu set messages: + msg261190
2016-03-04 15:34:47 inada.naoki set messages: + msg261189
2016-03-04 07:18:31 alecsandru.patrascu set messages: + msg261183
2016-03-04 01:32:42 inada.naoki set messages: + msg261181
2016-03-03 11:09:03 alecsandru.patrascu set messages: + msg261158
2016-03-03 10:15:10 inada.naoki set messages: + msg261155
2016-03-03 09:48:43 alecsandru.patrascu set messages: + msg261154
2016-03-03 08:46:28 inada.naoki set nosy: + inada.naoki
messages: + msg261150
2016-01-27 12:58:16 alecsandru.patrascu set messages: + msg259019
2016-01-20 18:02:23 steve.dower set messages: + msg258703
2016-01-20 17:38:52 brett.cannon set nosy: + steve.dower
messages: + msg258697
2016-01-20 13:57:16 r.david.murray set messages: + msg258682
2016-01-20 13:03:19 vstinner set nosy: - vstinner
2016-01-20 11:03:26 lemburg set messages: + msg258663
2016-01-20 10:38:15 pitrou set messages: + msg258660
2016-01-20 10:24:18 lemburg set nosy: + lemburg
messages: + msg258658
2016-01-20 09:28:00 alecsandru.patrascu set files: + lto-cpython3-v04.patch
2016-01-20 09:27:49 alecsandru.patrascu set files: + lto-cpython2-v04.patch
messages: + msg258655
2016-01-20 08:41:55 vstinner set messages: + msg258653
2016-01-20 08:37:58 pitrou set messages: + msg258651
2016-01-20 05:28:16 zach.ware set messages: + msg258642
2016-01-19 23:03:46 pitrou set messages: + msg258632
2016-01-19 22:54:42 r.david.murray set nosy: + r.david.murray
2016-01-19 22:22:53 zach.ware set nosy: + zach.ware
messages: + msg258627
2016-01-04 15:19:46 alecsandru.patrascu set files: + lto-cpython3-v03.patch
2016-01-04 15:19:39 alecsandru.patrascu set files: + lto-cpython2-v03.patch
messages: + msg257463
2015-12-26 17:46:36 alecsandru.patrascu set nosy: + brett.cannon, gregory.p.smith, scoder, vstinner, skrah
messages: + msg257042
2015-11-23 14:47:04 pitrou set messages: + msg255168
2015-11-23 14:46:20 pitrou set messages: + msg255167
2015-11-23 14:44:27 alecsandru.patrascu set files: + lto-cpython3-v02.patch
2015-11-23 14:44:20 alecsandru.patrascu set files: + lto-cpython2-v02.patch
2015-11-23 14:44:11 alecsandru.patrascu set messages: + msg255165
2015-11-23 12:18:48 alecsandru.patrascu set messages: + msg255150
2015-11-23 11:36:30 pitrou set nosy: + pitrou
messages: + msg255148
stage: patch review
2015-11-23 09:00:11 alecsandru.patrascu set files: + lto-cpython3-v01.patch
2015-11-23 09:00:03 alecsandru.patrascu set files: + lto-cpython2-v01.patch
keywords: + patch
2015-11-23 08:59:40 alecsandru.patrascu create