classification
Title: magiccube2x2 permutations 28% slower with Python 3.8.0rc1 vs 3.7
Type: performance Stage: resolved
Components: Versions: Python 3.8
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Louis Huemiller, gregory.p.smith
Priority: normal Keywords:

Created on 2019-10-15 01:42 by Louis Huemiller, last changed 2019-10-17 06:33 by Louis Huemiller. This issue is now closed.

Files
File name Uploaded Description Edit
20191005b_configuration Louis Huemiller, 2019-10-15 01:42
20191011b_chart.svg Louis Huemiller, 2019-10-16 04:20
Messages (9)
msg354682 - (view) Author: Louis Huemiller (Louis Huemiller) Date: 2019-10-15 01:42
Have created a program that by brute force determines all reachable states of a 2-by-2 MagicCube, sometimes referred to as a 2x2 Rubick's cube. On one of my servers, which is described in the attached file 20191011b_configuration, this program takes 19520.03 seconds to execute under Python2.7.15+, while under 3.8.0rc1 it takes 21887.53 seconds. This program is taking 12.13% longer to execute with 3.8.0rc1 versus 2.7.15+. The exact versions of Python as reported by sys.version are:

  2.7.15+ (default, Oct  7 2019, 17:39:04)
  [GCC 7.4.0]

  3.8.0rc1 (default, Oct  1 2019, 21:48:24)
  [GCC 7.4.0]

Used the following script to execute this program under Python 2.7, 3.6, 3.7, and 3.8:

  mkdir -p ./results
  echo "======== git log ========" > ./results/20191011b_configuration
  git log -n4 >> ./results/20191011b_configuration

  echo ""  >> ./results/20191011b_configuration
  echo "======== uname ========" >> ./results/20191011b_configuration
  uname -a >> ./results/20191011b_configuration

  echo ""  >> ./results/20191011b_configuration
  echo "======== lsb_release ========" >> 
  ./results/20191011b_configuration
  lsb_release -a >> ./results/20191011b_configuration

  echo ""  >> ./results/20191011b_configuration
  echo "======== lshw ========" >> ./results/20191011b_configuration
  sudo lshw >> ./results/20191011b_configuration

  make clean
  make
  ./target/permutations2x2 > \
    ./results/20191011b_magiccube2x2_permutations_cxx_unordered_map 2>&1
  python3.8 ./permutations2x2  > \
    ./results/20191011b_magiccube2x2_permutations_python3.8.0rc1 2>&1
  python3.7 ./permutations2x2  > \
    ./results/20191011b_magiccube2x2_permutations_python3.7 2>&1
  python3.6 ./permutations2x2  > \
    ./results/20191011b_magiccube2x2_permutations_python3.6 2>&1
  python2.7 ./permutations2x2  > 
  ./results/20191011b_magiccube2x2_permutations_python2.7 2>&1

Doing egrep "^# Total_Time:" on each of the output files, shows that the various executions took the following amount of time:

  cxx_unordered_map:# Total_Time: 1098.25
  python2.7:# Total_Time: 19520.03
  python3.6:# Total_Time: 19562.28
  python3.7:# Total_Time: 17012.67
  python3.8.0rc1:# Total_Time: 21887.53

Under Python2.7 and Python3.6 the program ran in approximately the same amount of time, while under Python3.7 it ran 12.8% faster than execution with Python2.7. Finally, the python3.8.0rc1 execution is noticeably slower than all the other executions, at 12.1% lower than Python2.7 and 28.6% slower than Python3.7.

The source for the magic_cube2x2 program is available under gist.github.com at:

  https://gist.github.com/lhuemill/f1273291e5f5e85e4b42e5c7614e60ef

This program creates a fairly large dictionary of 88.179840M entries. Each entry has a key of a string of length 29 and a value string of with a maximum length of 22 characters. A quick back-of-the-envelope calculation shows that this dictionary would use approximately 9GB of memory, but for some reason, all of the Python executions use approximately 22.6GB of memory. Even the C++ version is using 20.2GB when the permutations are stored in an unordered_map<string, string> and 20.55GB when stored in a map<string, string>. Quite surprising that the C++ is using over twice the expected amount of memory.

  CAUTION: When executing this program with less than the needed
           amount of physical memory, be careful that swap is not
           backed by a flash device. Doing so will likely cause
           a significant amount of I/O to the flash device and
           likely quickly wear it out.
msg354767 - (view) Author: Louis Huemiller (Louis Huemiller) Date: 2019-10-16 04:20
Attached is a chart, which shows the results from the runs mentioned in the initial post.  This chart was produced through the use of:

  ../permutations2x2_chart_results \
    -f ./20191011b_chart.svg \ 
 C++unordered_map:./20191011b_magiccube2x2_permutations_cxx_unordered_map \
        Python2.7:./20191011b_magiccube2x2_permutations_python2.7 \
        Python3.6:./20191011b_magiccube2x2_permutations_python3.6 \
        Python3.7:./20191011b_magiccube2x2_permutations_python3.7 \
       
  Python3.8.0rc1:./20191011b_magiccube2x2_permutations_python3.8.0rc1

The permutations2x2_chart_results program is present in the gist.github repository mentioned in the initial post.
msg354768 - (view) Author: Louis Huemiller (Louis Huemiller) Date: 2019-10-16 04:32
Each of the Python runs mentioned in the initial post took around 4 hours to execute. Although not as accurate, the issue can be demonstrated in less than 5 minutes through the use of:

  # If needed use the following to install Python3.8.0rc1
  $ apt-get update
  $ sudo apt-get upgrade
  $ sudo add-apt-repository ppa:deadsnakes/ppa
  $ sudo apt update
  $ sudo apt install python3.8

  # Use Python3.7 to produce baseline.
  $ python3.7 permutations2x2 --max_depth 6 | egrep "^# Total_Time"

  # Perform Python3.8.0rc1 execution
  python3.8 permutations2x2 --max_depth 6 | egrep "^# Total_Time"

When I did this on my server the Python3.7 execution produced:

  # Total_Time: 62.46

While the Python3.8.0rc1 execution produced:

  # Total_Time: 79.72

By running the program with "--max_depth 6" specified, the program only traversed through permutations that are up to 6 moves from the initial position. Although much less than running through an unlimited number of moves, the above run still showed the Python3.8.0rc1 run as 26.7% ((79.72 - 62.46) / 62.46) slower than the Python3.7 execution. Which is close to the 28.6% reduction in performance seen from runs with unlimited depth of moves.
msg354772 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-10-16 06:34
I can not confirm performance regression in 3.8.0.

$ time ~/pyenv/versions/3.7.3/bin/python permutations2x2 --max_depth 6 | egrep "^# Total_Time" ; time ~/pyenv/versions/3.8.0/bin/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 50.24

real    0m50.665s                                                                                                                                                                                                                                                             user    0m49.729s
sys     0m4.354s
# Total_Time: 47.97

real    0m48.502s
user    0m47.380s
sys     0m4.506s

I suspect that this is a binary specific issue.

Would you try "perf record python ..."?
msg354773 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-10-16 06:35
Ah, my bad.  I meant `perf stat python ...`.
msg354774 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-10-16 06:47
Deadsneak Python 3.8:

>>> sysconfig.get_config_var('CFLAGS')
'-Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g   -fstack-protector-strong -Wformat -Werror=format-security'

Disco /usr/bin/python3.7:

>>> sysconfig.get_config_var('CFLAGS')
'-Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g   -fstack-protector-strong -Wformat -Werror=format-security  -g -flto -fuse-linker-plugin -ffat-lto-objects'

So the performance difference comes from optimization options.
msg354777 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-10-16 07:07
If I compile my own 3.7.5 and 3.8.0 both with --enable-optimizations, I am unable to reproduce this.

greg@zoonaut:~/sandbox/python/cpython/lh$ ../b37/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 75.92
greg@zoonaut:~/sandbox/python/cpython/lh$ ../b38/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 84.53
greg@zoonaut:~/sandbox/python/cpython/lh$ ../b37/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 79.90
greg@zoonaut:~/sandbox/python/cpython/lh$ ../b38/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 84.19
greg@zoonaut:~/sandbox/python/cpython/lh$ ../b37/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 86.49
greg@zoonaut:~/sandbox/python/cpython/lh$ ../b37/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 83.95
greg@zoonaut:~/sandbox/python/cpython/lh$ ../b38/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 78.85
greg@zoonaut:~/sandbox/python/cpython/lh$ ../b38/python permutations2x2 --max_depth 6 | egrep "^# Total_Time"
# Total_Time: 78.49

which if you add up the 3.7 and 3.8 times over their four runs each, they both come out to ~81 seconds average.  So I don't see a regression.

I'd say that benchmark is unstable.  Also, making sure you're comparing identically prepared builds of cpython rather than something made by different distro package owners.
msg354828 - (view) Author: Louis Huemiller (Louis Huemiller) Date: 2019-10-17 02:29
Thank you for looking into the potential issue. Instead of getting Python3.8 from ppa:deadsnakes/ppa, I downloaded the source from:

  https://www.python.org/ftp/python/3.8.0/Python-3.8.0rc1.tgz

Then did the following to build and install it:

  $ cd /usr/src
  $ sudo tar xzf ~/Python-3.8.0rc1.tgz
  $ cd Python-3.8.0rc1/
  $ sudo ./configure --enable-optimizations
  $ sudo apt-get install zlib1g-dev
  $ sudo -H make altinstall

Then re-running the magiccube2x2 permutations with a max_depth of 6 I obtained:

  $ python3.7 permutations2x2 --max_depth 6 | egrep "^# Total_Time"
  # Total_Time: 61.90

  $ python3.8 permutations2x2 --max_depth 6 | egrep "^# Total_Time"
  # Total_Time: 82.45

Unfortunately, this still shows the Python3.8 execution as significantly slower, in this case 33.2% slower. Not sure of the cause, but perhaps I wasn't completely successful at removing the prior installation from ppa:deadsnakes/ppa. I removed it through the use of:

  $ sudo apt remove python3.8
  $ sudo add-apt-repository -r ppa:deadsnakes/ppa
  $ shutdown -r now

Next step I will completly re-install this server and then only install Python3.8 from https://www.python.org/ftp/python/3.8.0/Python-3.8.0rc1.tgz.
msg354832 - (view) Author: Louis Huemiller (Louis Huemiller) Date: 2019-10-17 06:33
Reinstalled the server I'd been using for this issue, with Ubuntu 18.04.3 LTS Server amd64. This installation already came with Python3.6.8.  Downloaded and installed Python3.7.3 and 3.8.0 from:

  https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tgz
  https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz

The installation process was somewhat complicated, with several to be discovered prerequisites.  Used "apt install" to install the following needed dependencies:

  build-essential checkinstall libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev zlib1g-dev libffi-dev

Then was able to do the following to build and install the downloaded Python3.7.3 and 3.8.0:

  $ sudo ./configure --enable-optimizations
  $ sudo -H make altinstall

Then downloaded a clone of my magiccube2x2 repository and obtained the following results:

  $ python3.6 ./permutations2x2 --max_depth 6 | egrep "^# Total_Time:"
  # Total_Time: 69.02
  $ python3.7 ./permutations2x2 --max_depth 6 | egrep "^# Total_Time:"
  # Total_Time: 84.52
  $ python3.8 ./permutations2x2 --max_depth 6 | egrep "^# Total_Time:"
  # Total_Time: 83.14

In this case the runtimes under both of the Python versions that I downloaded the source for and built are significantly slower than the run with python3.6, which came already as part of Ubuntu 18.04.3.

Instead of using the version of python3.7.3 that I built, was able to do:

  $ sudo apt install python3.7

to get one that had already been built for Ubuntu.  Doing this allowed the following result:

  $ /usr/bin/python3.7 ./permutations2x2 --max_depth 6 \
    | egrep "^# Total_Time:"
  # Total_Time: 62.39

Which is better than even the Python3.6 result.  Unfortunately, there is not currently a pre-built Ubuntu package for Python3.8 and the one from ppa:deadsnakes/ppa, I already know has the performance issue.

I did some searching on wiki.python.org and I was unable to find any instructions for how to build python from the source.  Perhaps the instructions are already there but I was unable to find them.

Perhaps this issue should be re-opened as a documentation error. I could really use instructions on how to built a performance version of Python and I suspect several others could also use this. For example, I'd like to pass that information to the maintainers of the deadsnakes repository.
History
Date User Action Args
2019-10-17 06:33:30Louis Huemillersetmessages: + msg354832
2019-10-17 02:29:26Louis Huemillersetmessages: + msg354828
2019-10-16 07:14:08gregory.p.smithsetstatus: open -> closed
resolution: not a bug
stage: resolved
2019-10-16 07:07:25gregory.p.smithsetstatus: closed -> open
nosy: + gregory.p.smith, - methane
messages: + msg354777

resolution: not a bug -> (no value)
stage: resolved -> (no value)
2019-10-16 06:47:31methanesetstatus: open -> closed
resolution: not a bug
messages: + msg354774

stage: resolved
2019-10-16 06:35:21methanesetmessages: + msg354773
2019-10-16 06:34:40methanesetmessages: + msg354772
2019-10-16 05:47:03methanesetnosy: + methane
2019-10-16 04:32:13Louis Huemillersetmessages: + msg354768
2019-10-16 04:20:06Louis Huemillersetfiles: + 20191011b_chart.svg

messages: + msg354767
2019-10-15 01:42:50Louis Huemillercreate