Title: Significant performance problems with Python 2.7 built with clang 3.x or 4.x
Type: performance Stage:
Components: Interpreter Core Versions: Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, benjamin.peterson, inada.naoki, ned.deily, ronaldoussoren, tdsmith, zmwangx
Priority: normal Keywords: patch

Created on 2018-01-22 04:34 by zmwangx, last changed 2018-04-15 05:11 by ned.deily.

Pull Requests
URL Status Linked Edit
PR 5574 merged inada.naoki, 2018-02-07 02:27
Messages (18)
msg310395 - (view) Author: Zhiming Wang (zmwangx) * Date: 2018-01-22 04:34
Python 2.7 could be significantly slower (5x in some cases) when compiled with clang 3.x or 4.x, compared to clang 5.x. This is quite a problem on macOS, since the latest clang from Apple (which comes with Xcode 9.2) is based on LLVM 4.x. This issue was first noticed by Bart Skowron and reported to the Homebrew project.[1]

I ran some preliminary benchmarks (here[2] are the exact setup scripts) with just a simple loop:

    import time

    def f(n):
        while n > 0:
            n -= 1

    start = time.time()
    f(50000000)  # the call is missing from this copy; the iteration count is an assumption
    stop = time.time()
    print('%.6f' % (stop - start))

and here are my results:

- macOS 10.13.2 on a MacBook Pro:

    2.082144	/usr/bin/python2.7
    7.964049	/usr/local/bin/python2.7
    8.750652	dist/python27-apple-clang-900/bin/python2.7
    8.476405	dist/python27-clang-3.9/bin/python2.7
    8.625660	dist/python27-clang-4.0/bin/python2.7
    1.760096	dist/python27-clang-5.0/bin/python2.7
    3.254814	/usr/local/bin/python3.6
    2.864716	dist/python-master-apple-clang-900/bin/python3
    3.071757	dist/python-master-clang-3.9/bin/python3
    2.925192	dist/python-master-clang-4.0/bin/python3
    2.908782	dist/python-master-clang-5.0/bin/python3

- Ubuntu 17.10 in VirtualBox:

    1.475095	/usr/bin/python2.7
    8.576817	dist/python27-clang-3.9/bin/python2.7
    8.165588	dist/python27-clang-4.0/bin/python2.7
    1.779193	dist/python27-clang-5.0/bin/python2.7
    1.728321	dist/python27-gcc-5/bin/python2.7
    1.570040	dist/python27-gcc-6/bin/python2.7
    1.604617	dist/python27-gcc-7/bin/python2.7
    2.323037	/usr/bin/python3.6
    2.964338	dist/python-master-clang-3.9/bin/python3
    3.054277	dist/python-master-clang-4.0/bin/python3
    2.734908	dist/python-master-clang-5.0/bin/python3
    2.490278	dist/python-master-gcc-5/bin/python3
    2.494691	dist/python-master-gcc-6/bin/python3
    2.642277	dist/python-master-gcc-7/bin/python3

I haven't got time to run more rigorous benchmark suites (e.g., the performance[3] package). I did try the floating point benchmark from performance, and again saw a 2x difference in performance.
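A slightly more repeatable variant of the loop benchmark could look like the sketch below. It is not the script used for the numbers above: the helper name `best_time` and the iteration count are hypothetical, and it simply takes the minimum over a few repetitions via the standard `timeit` module to reduce noise.

```python
import timeit


def f(n):
    # Same countdown loop as the benchmark above.
    while n > 0:
        n -= 1


def best_time(n, repeat=5):
    # timeit.repeat returns one total per repetition; the minimum is
    # the least noisy estimate of how fast the loop can run.
    timings = timeit.repeat(lambda: f(n), number=1, repeat=repeat)
    return min(timings)


if __name__ == "__main__":
    # 1000000 is an arbitrary illustrative count, not a value from this
    # thread; raise it for a longer, more stable measurement.
    print("%.6f" % best_time(1000000))
```

Running the same file with each interpreter build (for example the `dist/.../bin/python2.7` binaries listed above) would give directly comparable numbers.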

msg310423 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2018-01-22 15:36
Has anyone done the same analysis with Python 3.6 or 3.7?
msg310424 - (view) Author: Zhiming Wang (zmwangx) * Date: 2018-01-22 15:37
My benchmarks above do contain py37 (master) stats.
msg311597 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-02-04 11:52
Is there anything we (the CPython developers) can do about this? 

If I read the issue correctly clang 5.x generates faster binaries than clang 3.x and 4.x.  If that is indeed the issue there's probably not much we can do about this. 

BTW. I'm -1 on building the installer with anything but the compiler included in Xcode (and it would be nice to build with a recent version of Xcode to use an up-to-date compiler and SDK)
msg311651 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-05 08:21
It seems clang 4 fails to allocate registers efficiently.
FYI, the --without-computed-gotos configure option makes the penalty smaller.

clang 5 (without CGs): 2.653426
clang 5 (with CGs): 1.997584
clang 4 (without CGs): 3.330879
clang 4 (with CGs): 8.585673
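To make the effect of computed gotos (CGs) explicit, the four timings above can be turned into ratios. This is only arithmetic on the numbers already posted; the helper name `computed_goto_ratio` is made up for illustration.

```python
# Timings in seconds, copied from the figures above.
# Keys are (compiler, computed gotos enabled).
timings = {
    ("clang 5", False): 2.653426,
    ("clang 5", True): 1.997584,
    ("clang 4", False): 3.330879,
    ("clang 4", True): 8.585673,
}


def computed_goto_ratio(compiler):
    # Ratio of the "with CGs" time to the "without CGs" time.
    # A value above 1 means computed gotos made that build slower.
    return timings[(compiler, True)] / timings[(compiler, False)]


for compiler in ("clang 4", "clang 5"):
    print("%s: with/without CGs = %.2fx" % (compiler, computed_goto_ratio(compiler)))
```

With clang 4 the ratio is roughly 2.6 (computed gotos hurt), while with clang 5 it is roughly 0.75 (computed gotos help).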
msg311661 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-05 11:17
This is the assembly code for FAST_DISPATCH().

It seems there are many redundant spills, but I don't know how to remove them.
Is there a clang expert here?
msg311723 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-06 11:31
Bad news: --enable-optimizations doesn't solve it.
I hope the same thing doesn't happen for Python 3.
Has anyone tried Xcode 9.3?  What version of LLVM does Apple use?

Anyway, I think we need the help of an LLVM expert.
msg311747 - (view) Author: Zhiming Wang (zmwangx) * Date: 2018-02-06 19:52
Turns out python 2.7.10 doesn't suffer from the performance issue even when compiled with stock clang 4.x, and upon further investigation, I tracked down the commit that introduced the regression:

    commit 2c992a0788536087bfd78da8f2c62b30a461d7e2
    Author: Benjamin Peterson <>
    Date:   Thu May 28 12:45:31 2015 -0500
        backport computed gotos (#4753)

So Naoki was right that computed gotos are (solely) to blame here.
msg311751 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-02-06 21:35
Quick test: there doesn't seem to be a similar regression when building 3.6 with the current clang provided by Xcode 9.2, just with 2.7.  And both 2.7 and 3.6 configure HAVE_COMPUTED_GOTOS on.  Benjamin?

(FWIW, the 2.7.x binaries provided by the installers do not suffer from this performance regression as they are not built with clang.)
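For anyone who wants to check a particular build, the value recorded by configure can be read back from the interpreter itself. A minimal sketch, assuming the build exposes its pyconfig.h values through `sysconfig` (the helper name `have_computed_gotos` is hypothetical):

```python
import sysconfig


def have_computed_gotos():
    # HAVE_COMPUTED_GOTOS is the pyconfig.h macro mentioned above.
    # get_config_var returns None when the macro is undefined or when
    # the platform does not expose pyconfig.h values this way.
    return sysconfig.get_config_var("HAVE_COMPUTED_GOTOS")


print("HAVE_COMPUTED_GOTOS = %r" % (have_computed_gotos(),))
```

Note that this only reports whether configure found the compiler capable of computed gotos; whether the eval loop actually uses them also depends on the --with-computed-gotos / --without-computed-gotos choice at configure time.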
msg311780 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-07 10:09
New changeset 2942b909d9a428e6683d90b3436cfa4a81bd5d8a by INADA Naoki in branch '2.7':
bpo-32616: Disable computed gotos by default for clang < 5 (GH-5574)
msg311839 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-08 17:32
I'm sorry, my patch doesn't work on Xcode (Apple LLVM).
computed-gotos is still enabled by default.

Apple doesn't expose LLVM version.  It's really annoying.

$ cat x.c
#include <stdio.h>

int main()
{
    printf("__clang__ : %d\n", __clang__);
    printf("__llvm__ : %d\n", __llvm__);
    printf("__VERSION__ : %s\n", __VERSION__);
    printf("__clang_version__ : %s\n", __clang_version__);
    printf("__clang_major__   : %d\n", __clang_major__);
    printf("__clang_minor__   : %d\n", __clang_minor__);
    printf("__clang_patchlevel__ : %d\n", __clang_patchlevel__);
    return 0;
}

$ cc x.c && ./a.out
__clang__ : 1
__llvm__ : 1
__VERSION__ : 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)
__clang_version__ : 9.0.0 (clang-900.0.39.2)
__clang_major__   : 9
__clang_minor__   : 0
__clang_patchlevel__ : 0
msg311842 - (view) Author: Zhiming Wang (zmwangx) * Date: 2018-02-08 18:59
Yeah, Apple LLVM versions are a major headache. I resorted to feature detection, using C++ coroutines support as the clang 5 distinguisher[1]:

$ cat /tmp/test/
#include <experimental/coroutine>

int main() {
    return 0;
}

$ /Applications/ -v
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/

$ /Applications/ -o stub -fcoroutines-ts -stdlib=libc++
fatal error: 'experimental/coroutine' file not found
#include <experimental/coroutine>
1 error generated.

$ /Applications/ -v
Apple LLVM version 9.1.0 (clang-902.0.31)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/

$ /Applications/ -o stub -fcoroutines-ts -stdlib=libc++

Here the first compiler (Apple LLVM 9.0.0) is from Xcode 9.2 and the second (Apple LLVM 9.1.0) is from Xcode 9.3 beta 2.

The conclusion here seems to be that Apple LLVM 9.0.0 is based on LLVM 4, while Apple LLVM 9.1.0 is based on LLVM 5.
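That conclusion can be written down as a small lookup on the `-v` banner. This is only a sketch: the function name `upstream_llvm_major` is hypothetical, the Apple mapping contains just the two releases established in this thread, and anything else deliberately yields None rather than a guess.

```python
import re

# Apple LLVM (major, minor) -> upstream LLVM major version.
# Only the two pairs concluded above are listed.
APPLE_TO_UPSTREAM_MAJOR = {(9, 0): 4, (9, 1): 5}


def upstream_llvm_major(banner):
    """Return the upstream LLVM major version for a `-v` banner, or None."""
    apple = re.search(r"Apple LLVM version (\d+)\.(\d+)", banner)
    if apple:
        key = (int(apple.group(1)), int(apple.group(2)))
        return APPLE_TO_UPSTREAM_MAJOR.get(key)
    # Upstream clang reports its real version, e.g. "clang version 5.0.0".
    upstream = re.search(r"clang version (\d+)", banner)
    if upstream:
        return int(upstream.group(1))
    return None


print(upstream_llvm_major("Apple LLVM version 9.0.0 (clang-900.0.39.2)"))
print(upstream_llvm_major("Apple LLVM version 9.1.0 (clang-902.0.31)"))
```

A table like this has to be maintained by hand for every Xcode release, which is the main drawback compared with feature detection.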

msg311862 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-09 02:45
How can we easily distinguish Apple LLVM from upstream LLVM?
Or should we disable computed gotos by default on LLVM?

It's only for Python 2.
A 5x slowdown is too large a price for a 10% speedup.
msg311863 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-02-09 03:36
Can anyone explain what the difference is between 2.7 and 3.6, i.e. why there is the performance regression for 2.7 but not for 3.6 using the same compiler instance?  It would be better to understand and solve that problem rather than trying to special case compiler versions.
msg311865 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-09 03:54
I don't know exactly.  But as far as I can see, Python 3's eval loop has fewer function-wide local variables.

For example, ROT_THREE uses only block-local variables.

On the other hand, Python 2 has more function-wide local variables, and some of them are actually used across `case`s.

I suspect that's why LLVM 4 fails to optimize Python 2 but succeeds in optimizing Python 3.
msg311868 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-02-09 08:12
A different question w.r.t. detection of the clang/llvm version on Apple's system compiler: Is it worthwhile to do so?

If the compiler included in the Xcode 9.3 beta (and hence likely the one in Xcode 9.3 final) fixes the performance issue a very large subset of people building Python for themselves will get a fixed compiler fairly soon. It would then be enough to warn about this issue in a readme file for other users.
msg311992 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-02-11 09:55
This should remind everyone that backporting performance improvements is not a no-brainer.
msg315308 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-04-15 05:11
A followup - Ronald asked:
> w.r.t. detection of the clang/llvm version on Apple's system compiler: Is it worthwhile to do so?

Now that Xcode 9.3 (for macOS 10.13+) is officially released, I ran a quick series of tests on it and on the most recent Xcode versions for the last several macOS release families: 10.12, 10.11, and 10.9 (I didn't have a 10.10 system available at the moment).  Only looking at the most recent supported Xcode/compiler version for each major release is reasonable since I think most people follow Apple's strong encouragement to keep software updated and only use the most recent releases.  And I think most people follow Apple's lead in using their build tools, via Xcode or the command line utilities, rather than a third-party compiler.

By that measure, it seems clear that (1) there is only one current version that exhibits the performance degradation, that is the Xcode 9.2 version labeled Apple LLVM 9.0.0 (clang-900.0.39.2) and (2) that is now only an issue for macOS 10.12 (Sierra) where Xcode 9.2 is (and will likely remain) the most recent version.  For macOS 10.13 (High Sierra), the compiler in the newly released Xcode 9.3 does not exhibit the problem.  And the most recent versions of Xcode for the tested earlier macOS releases do not either.
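Given that only one compiler build is implicated, a user could check whether their own interpreter was built with it. The sketch below assumes that the compiler string reported by `platform.python_compiler()` has the same form as the bracketed strings in the results that follow; the function name `built_with_affected_compiler` is hypothetical.

```python
import platform
import sys

# The single compiler build identified above as showing the slowdown.
AFFECTED_COMPILER = "Apple LLVM 9.0.0 (clang-900.0.39.2)"


def built_with_affected_compiler(compiler=None, version_info=None):
    # Default to the running interpreter; arguments allow checking
    # strings taken from another build.
    if compiler is None:
        compiler = platform.python_compiler()
    if version_info is None:
        version_info = sys.version_info
    # The regression was only reported for Python 2.7 builds.
    is_py27 = tuple(version_info[:2]) == (2, 7)
    return is_py27 and AFFECTED_COMPILER in compiler


print(built_with_affected_compiler())
```

A True result only says the build matches the implicated compiler and Python line; it does not measure the slowdown itself.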

BTW, the MacPorts project maintains a handy webpage listing Xcode releases and compiler versions by macOS release:

Here are the results.  The methodology was to download and build the just released Python 2.7.15rc1 from source using the default configure options, i.e. just ./configure, and then run the test program 3 times with it and then three times with the Apple-provided system /usr/bin/python2.7 as a baseline.

ProductName:	Mac OS X
ProductVersion:	10.13.4
BuildVersion:	17E199
2.7.15rc1 (default, Apr 15 2018, 00:22:29)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)]
2.7.10 (default, Oct  6 2017, 22:29:07)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)]

ProductName:	Mac OS X
ProductVersion:	10.12.6
BuildVersion:	16G1314
2.7.15rc1 (default, Apr 15 2018, 00:31:39)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]
2.7.10 (default, Feb  7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]

ProductName:	Mac OS X
ProductVersion:	10.11.6
BuildVersion:	15G20015
2.7.15rc1 (default, Apr 15 2018, 00:38:12)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]
2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)]

ProductName:	Mac OS X
ProductVersion:	10.9.5
BuildVersion:	13F1911
2.7.15rc1 (default, Apr 15 2018, 00:42:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
2.7.5 (default, Mar  9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]
History
Date                 User            Action  Args
2018-04-15 05:11:23  ned.deily       set     messages: + msg315308
2018-02-11 09:55:28  pitrou          set     nosy: - pitrou
2018-02-11 09:55:16  pitrou          set     nosy: + pitrou; messages: + msg311992
2018-02-09 08:12:32  ronaldoussoren  set     messages: + msg311868
2018-02-09 03:54:25  inada.naoki     set     messages: + msg311865
2018-02-09 03:36:28  ned.deily       set     messages: + msg311863
2018-02-09 02:45:11  inada.naoki     set     messages: + msg311862
2018-02-08 18:59:15  zmwangx         set     status: pending -> open; messages: + msg311842
2018-02-08 17:32:33  inada.naoki     set     status: closed -> pending; resolution: fixed -> (none); messages: + msg311839; stage: resolved -> (none)
2018-02-08 07:26:40  inada.naoki     set     status: open -> closed; resolution: fixed; stage: patch review -> resolved
2018-02-07 10:09:40  inada.naoki     set     messages: + msg311780
2018-02-07 02:27:37  inada.naoki     set     keywords: + patch; stage: patch review; pull_requests: + pull_request5392
2018-02-06 21:35:15  ned.deily       set     nosy: + benjamin.peterson; messages: + msg311751; components: - macOS
2018-02-06 19:52:37  zmwangx         set     messages: + msg311747
2018-02-06 11:31:25  inada.naoki     set     messages: + msg311723
2018-02-05 11:17:25  inada.naoki     set     messages: + msg311661
2018-02-05 08:21:30  inada.naoki     set     messages: + msg311651
2018-02-04 11:52:03  ronaldoussoren  set     messages: + msg311597
2018-01-26 23:50:58  terry.reedy     set     nosy: + ned.deily, ronaldoussoren; components: + macOS
2018-01-23 01:15:34  tdsmith         set     nosy: + tdsmith
2018-01-22 15:37:47  zmwangx         set     messages: + msg310424
2018-01-22 15:36:30  barry           set     nosy: + barry; messages: + msg310423
2018-01-22 14:19:28  pablogsal       set     type: performance; components: + Interpreter Core
2018-01-22 05:10:04  inada.naoki     set     nosy: + inada.naoki
2018-01-22 04:34:14  zmwangx         create