This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Hash computation speedup for {buffer, string, unicode}object
Type: performance Stage:
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: JelleZijlstra, Rosuav, alecsandru.patrascu, benjamin.peterson, brett.cannon, fijall, gregory.p.smith, serhiy.storchaka, skip.montanaro, vstinner
Priority: normal Keywords: patch

Created on 2015-09-14 07:29 by alecsandru.patrascu, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
hash8-v01.patch alecsandru.patrascu, 2015-09-14 07:29 Initial patch for Python 2.7 review
hash8-v02.patch alecsandru.patrascu, 2015-09-14 10:55 review
hash8-v03-gps01.patch gregory.p.smith, 2015-09-14 21:53 review
Messages (21)
msg250639 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 07:29
Hi All,

This is Alecsandru from Server Scripting Languages Optimization team at Intel Corporation.

I would like to submit a patch that improves the performance of the hash computation code in stringobject, bufferobject and unicodeobject. As can be seen from the attached sample performance results from the Grand Unified Python Benchmark, speedups of up to 40% were observed. Furthermore, we see a 5-7% performance improvement on OpenStack/Swift, where most of the code is in Python 2.7.

Attached is the patch that modifies Object/stringobject.c, Object/bufferobject.c and Object/unicodeobject.c files. We built and tested this patch for Python 2.7 on our Linux machines (CentOS 7/Ubuntu Server 14.04, Intel Xeon Haswell/Broadwell with 18/8 cores).

Steps to apply the patch:
1.  hg clone https://hg.python.org/cpython cpython 
2.  cd cpython 
3.  hg update 2.7
4.  Copy hash8.patch to the current directory 
5.  hg import --no-commit hash8.patch
6.  ./configure 
7.  make



In the following, please find our sample performance results measured on an Intel Xeon Haswell machine.

Hardware (HW):      Intel XEON (Haswell) 18 Cores

BIOS settings:      Intel Turbo Boost Technology: false
                    Hyper-Threading: false

Operating System:   Ubuntu 14.04.3 LTS trusty

OS configuration:   CPU freq set at fixed: 2.0GHz by
                        echo 2000000 > /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
                        echo 2000000 > /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
                    Address Space Layout Randomization (ASLR) disabled (to reduce run to run variation) by
                        echo 0 > /proc/sys/kernel/randomize_va_space
                
GCC version:        gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)

Benchmark:          Grand Unified Python Benchmark (GUPB)
                    GUPB Source: https://hg.python.org/benchmarks/                    

Python2.7 results:
    Python source: hg clone https://hg.python.org/cpython cpython
    Python Source: hg update 2.7

        Benchmarks          Speedup(%)
        unpack_sequence     40.32733766
        chaos               24.84002537
        chameleon           23.01392651
        silent_logging      22.27202911
        django              20.83842317
        etree_process       20.46968294
        nqueens             20.34234985
        pathlib             19.63445919
        pidigits            19.34722148
        etree_generate      19.25836634
        pybench             19.06895825
        django_v2           18.06073108
        etree_iterparse     17.3797149
        fannkuch            17.08120879
        pickle_list         16.60363602
        raytrace            16.0316265
        slowpickle          15.86611184
        pickle_dict         15.30447114
        call_simple         14.42909032
        richards            14.2949594
        simple_logging      13.6522626
        etree_parse         13.38113097
        json_dump_v2        12.26588885
        float               11.88164311
        mako                11.20606516
        spectral_norm       11.04356684
        hg_startup          10.57686164
        mako_v2             10.37912648
        slowunpickle        10.24030714
        go                  10.03567319
        meteor_contest      9.956231435
        normal_startup      9.607401586
        formatted_logging   9.601244811
        html5lib            9.082603748
        2to3                8.741557816
        html5lib_warmup     8.268150981
        nbody               7.507012306
        regex_compile       7.153922724
        bzr_startup         7.140244739
        telco               6.869411927
        slowspitfire        5.746323922
        tornado_http        5.24360121
        rietveld            3.865704876
        regex_v8            3.777622219
        hexiom2             3.586305282
        json_dump           3.477551682
        spambayes           3.183991854
        fastunpickle        2.971645347
        fastpickle          0.673086656
        regex_effbot        0.127946837
        json_load           0.023727176

Thank you,
Alecsandru
msg250640 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-09-14 08:06
Python 2.7 is closed for adding new features. Python 3.4+ uses different code for calculating hashes.
msg250642 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-09-14 08:23
> Python 2.7 is closed for adding new features. Python 3.4+ uses different code for calculating hashes.

The code doesn't modify the hash function; it's a common loop-unrolling optimization technique.

Since the optimization can be seen on a real-world benchmark, I think it's worth taking.
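To illustrate the technique being discussed: below is a hedged sketch of a 2.7-style multiply-xor string hash, written once as a plain loop and once unrolled eight bytes per iteration. The constants mirror the classic CPython 2.7 string hash, but the function names are illustrative only; the actual change is in the attached hash8 patches.

```c
#include <stddef.h>

/* Plain multiply-xor hash in the style of CPython 2.7's string hash.
 * Illustrative only; not the patch itself. */
unsigned long hash_plain(const unsigned char *p, size_t len)
{
    unsigned long x;
    size_t i;

    if (len == 0)
        return 0;
    x = (unsigned long)*p << 7;
    for (i = 0; i < len; i++)
        x = (1000003UL * x) ^ p[i];
    return x ^ (unsigned long)len;
}

/* Same hash with the inner loop unrolled 8x: fewer loop-condition
 * checks, more work per iteration, bit-identical results. */
unsigned long hash_unrolled(const unsigned char *p, size_t len)
{
    unsigned long x;
    size_t i = 0;

    if (len == 0)
        return 0;
    x = (unsigned long)*p << 7;
    while (i + 8 <= len) {        /* main loop: 8 bytes per trip */
        x = (1000003UL * x) ^ p[i];
        x = (1000003UL * x) ^ p[i + 1];
        x = (1000003UL * x) ^ p[i + 2];
        x = (1000003UL * x) ^ p[i + 3];
        x = (1000003UL * x) ^ p[i + 4];
        x = (1000003UL * x) ^ p[i + 5];
        x = (1000003UL * x) ^ p[i + 6];
        x = (1000003UL * x) ^ p[i + 7];
        i += 8;
    }
    for (; i < len; i++)          /* leftover 0-7 bytes */
        x = (1000003UL * x) ^ p[i];
    return x ^ (unsigned long)len;
}
```

Because the unrolled body performs exactly the same operations in the same order, the two functions must return identical values for every input, which is what keeps the optimization semantics-preserving.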
msg250643 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-09-14 08:41
Hmm. Is Duff's Device a valid construct for CPython? It'd shorten this a lot...
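For readers unfamiliar with the construct: Duff's device interleaves the unroll with the remainder handling by jumping into the middle of an unrolled do/while, so no separate cleanup loop is needed. A hedged sketch on an illustrative 2.7-style multiply-xor hash (names and reference function are mine, not from any patch):

```c
#include <stddef.h>

/* Reference: the plain multiply-xor loop (illustrative 2.7-style hash). */
unsigned long hash_ref(const unsigned char *p, size_t len)
{
    unsigned long x;
    size_t i;

    if (len == 0)
        return 0;
    x = (unsigned long)*p << 7;
    for (i = 0; i < len; i++)
        x = (1000003UL * x) ^ p[i];
    return x ^ (unsigned long)len;
}

/* Duff's device: the switch jumps into the middle of the 8x-unrolled
 * do/while, handling the odd leading bytes without a cleanup loop. */
unsigned long hash_duff(const unsigned char *p, size_t len)
{
    unsigned long x;
    size_t n;

    if (len == 0)
        return 0;
    x = (unsigned long)*p << 7;
    n = (len + 7) / 8;                       /* total loop trips */
    switch (len % 8) {
    case 0: do { x = (1000003UL * x) ^ *p++;
    case 7:      x = (1000003UL * x) ^ *p++;
    case 6:      x = (1000003UL * x) ^ *p++;
    case 5:      x = (1000003UL * x) ^ *p++;
    case 4:      x = (1000003UL * x) ^ *p++;
    case 3:      x = (1000003UL * x) ^ *p++;
    case 2:      x = (1000003UL * x) ^ *p++;
    case 1:      x = (1000003UL * x) ^ *p++;
            } while (--n > 0);
    }
    return x ^ (unsigned long)len;
}
```

The construct is legal ISO C (case labels may appear inside a nested statement), though many projects avoid it for readability, which is part of what this thread goes on to discuss.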
msg250644 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-09-14 08:46
Or at the very least, can fallthrough be used in the switch block, so there aren't 7+6+5+4+3+2+1 copies of the same line?

-- Not a C performance expert --
msg250645 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-09-14 09:00
> Or at very least, can fallthrough be used in the switch block, so there
> aren't 7+6+5+4+3+2+1 copies of the same line?

It is how _Py_HashBytes is implemented in 3.4+.
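The pattern Serhiy refers to (an unrolled main loop, with the 0-7 leftover bytes handled by a switch whose cases deliberately fall through) can be sketched as follows. The multiply-xor constants here are illustrative 2.7-style values, not the actual _Py_HashBytes code, and the reference function is added only so the sketch can be checked:

```c
#include <stddef.h>

/* Reference: the plain multiply-xor loop (illustrative 2.7-style hash). */
unsigned long hash_ref(const unsigned char *p, size_t len)
{
    unsigned long x;
    size_t i;

    if (len == 0)
        return 0;
    x = (unsigned long)*p << 7;
    for (i = 0; i < len; i++)
        x = (1000003UL * x) ^ p[i];
    return x ^ (unsigned long)len;
}

/* Remainder handling via deliberate switch fallthrough, in the style
 * of the unrolled byte-hash loop in CPython 3.x. Each case adds one
 * more leftover byte, so only one copy of each line is needed. */
unsigned long hash_fallthrough(const unsigned char *p, size_t len)
{
    unsigned long x;
    size_t i;

    if (len == 0)
        return 0;
    x = (unsigned long)*p << 7;
    for (i = 0; i + 8 <= len; i += 8) {   /* 8 bytes per trip */
        x = (1000003UL * x) ^ p[i];
        x = (1000003UL * x) ^ p[i + 1];
        x = (1000003UL * x) ^ p[i + 2];
        x = (1000003UL * x) ^ p[i + 3];
        x = (1000003UL * x) ^ p[i + 4];
        x = (1000003UL * x) ^ p[i + 5];
        x = (1000003UL * x) ^ p[i + 6];
        x = (1000003UL * x) ^ p[i + 7];
    }
    p += i;
    switch (len - i) {                    /* 0-7 leftover bytes */
    case 7: x = (1000003UL * x) ^ *p++;   /* fallthrough */
    case 6: x = (1000003UL * x) ^ *p++;   /* fallthrough */
    case 5: x = (1000003UL * x) ^ *p++;   /* fallthrough */
    case 4: x = (1000003UL * x) ^ *p++;   /* fallthrough */
    case 3: x = (1000003UL * x) ^ *p++;   /* fallthrough */
    case 2: x = (1000003UL * x) ^ *p++;   /* fallthrough */
    case 1: x = (1000003UL * x) ^ *p++;   /* fallthrough */
    case 0: break;
    }
    return x ^ (unsigned long)len;
}
```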
msg250646 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 09:16
Yes, it can certainly be implemented in less space, as Serhiy and Chris well pointed out. I submitted the v01 version that way just to make clear that I am not re-implementing the hash function, only unrolling the loop in the existing one. I will submit a new version based on these observations.
msg250654 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 10:55
I've submitted a more compact version of the previous code (hash8-v02.patch).
msg250685 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2015-09-14 17:05
Is this applicable to Python 3 at all? From Serhiy's comment I think this might be a backport of a technique already used there, but it's not entirely clear as Alecsandru didn't say why there wasn't a 3.6 patch.
msg250687 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 17:25
This patch is not applicable to Python 3, as it already has a better hash function and has the same unrolling technique applied. 

As Brett well observed, it is a backport of an optimization technique used in Python 3, applied to the existing hash function in 2.7.
msg250689 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2015-09-14 17:30
If this is applying something we already did in Python 3 and it doesn't break compatibility then this should fall under the same sort of backport exemption that Guido granted for the eval loop performance enhancements that Intel did a couple months back.
msg250691 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 17:39
Yes, it doesn't break compatibility with existing applications. To make sure, we tested it under various scenarios and workloads.
msg250692 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-09-14 17:43
+1 for anything that makes Python faster with provably no semantic changes.
msg250695 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-09-14 18:01
The code for str and buffer can be shared as _Py_HashBytes() (as in 3.4+). Also please add comments in the switch block as in 3.4+.

Should we ask Guido for granting backport exemption for this optimization?
msg250698 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2015-09-14 18:14
Assuming Benjamin doesn't object, I don't think we need to explicitly ask Guido. He made his feelings quite clear about allowing backports of performance improvements that already exist in Python 3 and have been battle-tested.

Having said that, I hope Intel is also looking at Python 3 improvements and not solely backporting stuff to Python 2.7. =)
msg250704 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-09-14 21:53
attaching an updated patch.

fwiw, in my own quick tests i'm not seeing a performance difference on the benchmark suite. but i am not trying to run it in such an isolated, quiet, frequency-locked machine environment.

it is still a good idea regardless. _some_ compilers really benefit from the manual unrolling. any that already unrolled things themselves should not care.
msg250724 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-09-15 06:03
Testing this on a host with a fixed frequency and most background tasks disabled, I cannot reproduce the speedups you list. I see no significant change on most of them, a 6% slowdown on json_load, and a 1-4% slowdown on regex_v8 (not sure if those are within the noise).

I suspect your comparison was including other patches or was not built the same.
msg250834 - (view) Author: Maciej Fijalkowski (fijall) * (Python committer) Date: 2015-09-16 10:01
I find these numbers really hard to believe. It would mean that over 40% of the django template benchmark's time is spent on string hashing (assuming a 2x speedup), which really sounds unbelievable.

In fact, in PyPy I never found string hashing to be significant, even though I would expect string hashing to be more of a bottleneck there, since it's almost never optimized away.

What made you think string hashing is a good target for optimizations?
msg250858 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-16 17:38
Please hold this patch for a while, while I analyze the received feedback.
msg255428 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-26 17:29
Ping.
msg255431 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-11-26 17:50
Reopen this if you find a reproducible way for this patch to prove useful.
History
Date User Action Args
2022-04-11 14:58:21adminsetgithub: 69293
2015-11-26 17:50:52gregory.p.smithsetstatus: open -> closed
resolution: rejected
messages: + msg255431
2015-11-26 17:29:05serhiy.storchakasetstatus: pending -> open

messages: + msg255428
2015-09-19 23:04:48brett.cannonsetstatus: open -> pending
2015-09-19 22:15:20JelleZijlstrasetstatus: pending -> open
nosy: + JelleZijlstra
2015-09-16 17:38:49brett.cannonsetstatus: open -> pending
2015-09-16 17:38:26alecsandru.patrascusetmessages: + msg250858
2015-09-16 17:02:40skip.montanarosetnosy: + skip.montanaro
2015-09-16 10:01:56fijallsetnosy: + fijall
messages: + msg250834
2015-09-15 06:03:20gregory.p.smithsetmessages: + msg250724
2015-09-14 21:53:50gregory.p.smithsetfiles: + hash8-v03-gps01.patch

messages: + msg250704
2015-09-14 18:14:17brett.cannonsetmessages: + msg250698
2015-09-14 18:01:02serhiy.storchakasetnosy: + benjamin.peterson
messages: + msg250695
2015-09-14 17:43:57Rosuavsetmessages: + msg250692
2015-09-14 17:39:10alecsandru.patrascusetmessages: + msg250691
2015-09-14 17:30:19brett.cannonsetmessages: + msg250689
2015-09-14 17:25:48alecsandru.patrascusetmessages: + msg250687
2015-09-14 17:05:07brett.cannonsetnosy: + brett.cannon
messages: + msg250685
2015-09-14 15:12:23gregory.p.smithsetnosy: + gregory.p.smith
2015-09-14 10:55:54alecsandru.patrascusetmessages: + msg250654
2015-09-14 10:55:04alecsandru.patrascusetfiles: + hash8-v02.patch
2015-09-14 09:16:55alecsandru.patrascusetmessages: + msg250646
2015-09-14 09:00:43serhiy.storchakasetmessages: + msg250645
title: Hash computation speedup for {buffer,string,unicode}object -> Hash computation speedup for {buffer, string, unicode}object
2015-09-14 08:46:07Rosuavsetmessages: + msg250644
2015-09-14 08:41:13Rosuavsetnosy: + Rosuav
messages: + msg250643
2015-09-14 08:23:19vstinnersettype: enhancement -> performance

messages: + msg250642
nosy: + vstinner
2015-09-14 08:06:57serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg250640
2015-09-14 07:29:28alecsandru.patrascucreate