Issue 16286: Use hash if available to optimize a==b and a!=b for bytes and str

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/60490

classification

Title:	Use hash if available to optimize a==b and a!=b for bytes and str
Type:	performance	Stage:	patch review
Components:		Versions:	Python 3.4

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	christian.heimes, djc, ezio.melotti, gregory.p.smith, meador.inge, pitrou, python-dev, rhettinger, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2012-10-19 12:36 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
compare_hash.patch	vstinner, 2012-10-19 12:36		review
compare_hash-2.patch	vstinner, 2013-11-01 01:57		review
compare_hash-3.patch	vstinner, 2013-11-04 10:30		review

Messages (23)
msg173332 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-19 12:36
Attached patch optimize a==b and a!=b operators for bytes and str types of Python 3.4. For str, memcmp() is now always used, instead of a loop using PyUnicode_READ() (which is slow) for kind different than 1. For bytes, compare the first but also the last byte before calling memcmp(), instead of just comparing the first byte. Similar optimization was implemented in Py_UNICODE_MATCH(): changeset: 38242:0de9a789de39 branch: legacy-trunk user: Fredrik Lundh <fredrik@pythonware.com> date: Tue May 23 10:10:57 2006 +0000 files: Include/unicodeobject.h description: needforspeed: check first and last character before doing a full memcmp Initially I only wrote the patch to check the hash values before comparing content of the strings. -- I done some statistics tests. For a fresh Python interpreter, the hash values are only known in 7% cases (but when hashes are compared, they are quite always different, so the optimization is useful). When running "./python -m test test_os", hashes are known and different in 41.4%. After running 70 tests, hashes are known and different in 80%.
msg173338 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-19 13:13
Good. I would like to see similar statistics tests for any real application.
msg173371 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2012-10-20 06:05
something to include in your statistics is the lengths of the already hashed data being compared. i expect there to be a minimum length before this optimization is useful.
msg173576 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2012-10-23 01:56
Rather than see statistics, I'm curious about what circumstances where the optimization would kick in. Interned strings are pre-hashed but they already benefit from an identity-implies-equality check. Dicts and sets already incorporate a check-hash-before-equality check. That raises the question of what strings ever have had their hash already computed if the string hasn't been interned or has been used in a dict or set? P.S. I rather like the optimization and don't want to discourage it. I'm just curious about what the current optimizations are missing.
msg173588 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-23 10:11
Oh, I forgot this issue when I did the following commit: -- changeset: 79902:b68be1025c42 user: Victor Stinner <victor.stinner@gmail.com> date: Tue Oct 23 02:48:49 2012 +0200 files: Objects/unicodeobject.c description: Optimize PyUnicode_RichCompare() for Py_EQ and Py_NE: always use memcmp() -- I will benchmark the overhead of memcmp() on short strings. We may check the first and last characters before calling memcmp() to limit the overhead of calling a function. I also read that GCC uses its builtin memcmp() which is slower than the memcmp() of the GNU libc.
msg178903 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-01-03 02:24
> P.S. I rather like the optimization and don't want to discourage it. I'm just curious about what the current optimizations are missing. I'm too lazy to produce more statistics or run other benchmarks. I just saw an interesting optimization oportunity. I don't understand why it was not done before. If you consider that compare_hash.patch might slow down Python, so ok, I will just close the issue as rejected.
msg186460 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-04-09 22:17
"I will benchmark the overhead of memcmp() on short strings. We may check the first and last characters before calling memcmp() to limit the overhead of calling a function." I created the issue #17628 for this point.
msg201570 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-10-28 19:06
> That raises the question of what strings ever have had their hash > already computed if the string hasn't been interned or has been used > in a dict or set? Currently, none, I think.
msg201571 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-10-28 19:17
>> That raises the question of what strings ever have had their hash >> already computed if the string hasn't been interned or has been used >> in a dict or set? > > Currently, none, I think. Strings are used (and compared) outside dict and set.
msg201573 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-10-28 20:18
Let's try to identify some use cases in the Python test suite using gdb: (gdb) b unicode_compare_eq (gdb) condition 1 ((PyASCIIObject)str1)->hash != -1 && ((PyASCIIObject)str2)->hash != -1 && ((PyASCIIObject)str1)->hash != ((PyASCIIObject)str2)->hash (gdb) run I didn't dig to understand why hash of these strings are computed. Tell me if you need more examples. Random examples: (1) compare "constant" strings (strings from co_consts of code objects) importlib._bootstrap: _setup(): os_details = ('posix', ['/']), ('nt', ['\\', '/']) for builtin_os, path_separators in os_details: ... ... if builtin_os == 'nt': <== HERE ... (2) importlib._bootstrap: _LoaderBasics.is_package() def is_package(self, fullname): filename = _path_split(self.get_filename(fullname))[1] filename_base = filename.rsplit('.', 1)[0] tail_name = fullname.rpartition('.')[2] return filename_base == '__init__' and ... <== HERE It's surprising that filename_base has its hash computed. I suppose that all these functions (_path_split, .rsplit, .rpartition) return the string unmodified. (3) importlib._bootstrap: PathFinder._path_importer_cache(): @classmethod def _path_importer_cache(cls, path): ... if path == '': <== HERE path is an entry of sys.path. (4) str in __all__ (list of str): os.py: if "putenv" not in __all__: __all__.append("putenv") __all__ is a list of strings. (5) site.py: if __name__ == '__main__': <== HERE __name__ is 'site'. (6) Python/ceval.py: PyEval_EvalCodeEx() called with arbitrary keyword for (i = 0; i < kwcount; i++) { PyObject *co_varnames; PyObject keyword = kws[2i]; PyObject value = kws[2i + 1]; int j; ... / Speed hack: do raw pointer compares. As names are normally interned this should almost always hit. / co_varnames = ((PyTupleObject )(co->co_varnames))->ob_item; for (j = 0; j < total_args; j++) { PyObject nm = co_varnames[j]; if (nm == keyword) goto kw_found; } / Slow fallback, just in case / for (j = 0; j < total_args; j++) { PyObject nm = co_varnames[j]; int cmp = PyObject_RichCompareBool( <== HERE keyword, nm, Py_EQ); if (cmp > 0) goto kw_found; else if (cmp < 0) goto fail; } It looks like the "just in case" path is taken. (gdb) where #0 unicode_compare_eq (str1='isTest', str2='func') at Objects/unicodeobject.c:10532 #1 0x000000000052dd41 in PyUnicode_RichCompare (left='isTest', right='func', op=2) at Objects/unicodeobject.c:10609 #2 0x00000000004be4db in do_richcompare (v='isTest', w='func', op=2) at Objects/object.c:647 #3 0x00000000004be790 in PyObject_RichCompare (v='isTest', w='func', op=2) at Objects/object.c:696 #4 0x00000000004be832 in PyObject_RichCompareBool (v='isTest', w='func', op=2) at Objects/object.c:718 #5 0x00000000005a0f68 in PyEval_EvalCodeEx (...) at Python/ceval.c:3450 ... Traceback (most recent call first): File "/home/haypo/prog/python/default/Lib/test/test_xml_etree.py", line 1669, in test_get_keyword_args e1 = ET.Element('foo' , x=1, y=2, z=3) ElementTree.Element() accepts arbitary keywords. (7) letter==letter singletons: xml.etree.ElementPath: iterfind() def iterfind(elem, path, namespaces=None): ... if path[-1:] == "/": <== HERE Traceback (most recent call first): File "/home/haypo/prog/python/default/Lib/xml/etree/ElementPath.py", line 254, in iterfind if path[-1:] == "/": path is ".//grandchild", path[-1] is 'd' which is a singleton, Python already computed the hash of 'd'. Similar example in the same file: def xpath_tokenizer(pattern, namespaces=None): for token in xpath_tokenizer_re.findall(pattern): tag = token[1] if tag and tag[0] != "{" and ":" in tag: <== HERE ... tag[0] != "{" <= tag is 'grandchild', tag[0] is a singleton. Another example: Traceback (most recent call first): File "/home/haypo/prog/python/default/Lib/sre_parse.py", line 194, in __next if char == "\\": (8) str not in (list of str), test_descr.py: test_dir(): File "/home/haypo/prog/python/default/Lib/test/test_descr.py", line 2255, in <listcomp> names = [x for x in dir(minstance) if x not in default_attributes] minstance = M("m") minstance.b = 2 minstance.a = 1 default_attributes = ['__name__', '__doc__', '__package__', '__loader__'] names = [x for x in dir(minstance) if x not in default_attributes]
msg201578 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-10-28 20:47
""" (4) str in __all__ (list of str): os.py: if "putenv" not in __all__: __all__.append("putenv") """ For this example: "putenv" is probably interned by "def putenv(...)". "putenv" string and the name of the function are the same constant. When a function is defined, it is stored in the locals dictionary, so the key ("putenv") is interned by PyDict_SetItem(). So dict is not used explicitly, but implicitly by the namespace.
msg201871 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-01 01:57
Updated patch.
msg202109 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-11-04 10:11
New changeset 536a7c09c7fd by Victor Stinner in branch 'default': Issue #16286: write a new subfunction bytes_compare_eq() http://hg.python.org/cpython/rev/536a7c09c7fd
msg202111 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-11-04 10:24
New changeset 5fa291435740 by Victor Stinner in branch 'default': Issue #16286: optimize PyUnicode_RichCompare() for identical strings (same http://hg.python.org/cpython/rev/5fa291435740
msg202114 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-11-04 10:29
New changeset da9c6e4ef301 by Victor Stinner in branch 'default': Issue #16286: remove duplicated identity check from unicode_compare() http://hg.python.org/cpython/rev/da9c6e4ef301
msg202115 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-04 10:30
I applied changes unrelated to the hash.
msg202116 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-04 10:56
Results of benchmarks using compare_hash-3.patch: $ time ../benchmarks/perf.py -r -b default ./pythonorig ./pythonhash INFO:root:Skipping benchmark slowspitfire; not compatible with Python 3.4 INFO:root:Skipping benchmark slowpickle; not compatible with Python 3.4 INFO:root:Skipping benchmark slowunpickle; not compatible with Python 3.4 INFO:root:Skipping benchmark spambayes; not compatible with Python 3.4 Running 2to3... Running django_v2... Report on Linux smithers 3.9.4-200.fc18.x86_64 #1 SMP Fri May 24 20:10:49 UTC 2013 x86_64 x86_64 Total CPU cores: 8 ### 2to3 ### Min: 6.358000 -> 6.055000: 1.05x faster Avg: 6.407600 -> 6.179800: 1.04x faster Significant (t=3.53) Stddev: 0.04311 -> 0.13785: 3.1979x larger ### nbody ### Min: 0.219029 -> 0.212477: 1.03x faster Avg: 0.224940 -> 0.219248: 1.03x faster Significant (t=10.13) Stddev: 0.00373 -> 0.00420: 1.1288x larger The following not significant results are hidden, use -v to show them: django_v2. At least, Python is not slower with the patch :-) I'm surprised that the benchmark shows a difference. nbody benchmark is focused on float numbers. I checked with gdb, nbody benchmark does not call any Unicode comparison function.
msg202322 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-07 10:17
I added recently a new _PyUnicode_CompareWithId() function: changeset 77bebcf5c4cf (issue #19512). This function can be used instead of PyUnicode_CompareWithASCIIString() when the right parameter is a common string. It is interesting when the right string is probably present in a dictionary. For example, "path" is always present as "sys.path". So interning the string doesn't eat more memory. _PyUnicode_CompareWithId() would be more efficient with compare_hash-3.patch. The function is not used yet in critical path. It is now used in type_new() for example.
msg202323 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-07 10:30
Serhiy, Gregory, Raymond, Antoine: so what is your feeling on this issue? Is it worth it?
msg202702 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-12 16:10
I ran pybench with the patch. I don't understand this result (10% slower with the patch): DictWithStringKeys: 28ms 25ms +10.7% 28ms 26ms +10.5% This test doesn't use unicode_compare_eq() from Objects/unicodeobject.c but unicode_eq() from Objects/stringlib/eq.h which is not modified by the patch.
msg202703 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-12 16:13
(oops, I didn't want to close the issue, it's a mistake)
msg206387 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-16 23:10
If it's hard to see a real speedup, it's probably not interesting to use the hash in string comparison.
msg216654 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2014-04-17 04:58
> Serhiy, Gregory, Raymond, Antoine: so what is your feeling on > this issue? Is it worth it? I don't think it is worth it. There may be some cases that benefit, but it adds extra branching code to the common cases (sets and dicts) that already have the identity check.

History
Date	User	Action	Args
2022-04-11 14:57:37	admin	set	github: 60490
2014-04-17 04:58:58	rhettinger	set	messages: + msg216654
2013-12-16 23:10:58	vstinner	set	status: open -> closed resolution: not a bug messages: + msg206387
2013-11-12 16:13:05	vstinner	set	status: closed -> open resolution: not a bug -> (no value) messages: + msg202703
2013-11-12 16:10:48	vstinner	set	status: open -> closed resolution: not a bug messages: + msg202702
2013-11-07 10:30:08	vstinner	set	messages: + msg202323
2013-11-07 10:17:18	vstinner	set	messages: + msg202322
2013-11-04 10:56:50	vstinner	set	messages: + msg202116
2013-11-04 10:30:24	vstinner	set	files: + compare_hash-3.patch messages: + msg202115
2013-11-04 10:29:10	python-dev	set	messages: + msg202114
2013-11-04 10:24:52	python-dev	set	messages: + msg202111
2013-11-04 10:11:11	python-dev	set	nosy: + python-dev messages: + msg202109
2013-11-01 01:57:36	vstinner	set	files: + compare_hash-2.patch messages: + msg201871
2013-10-28 20:47:20	vstinner	set	messages: + msg201578
2013-10-28 20:18:37	vstinner	set	messages: + msg201573
2013-10-28 19:17:49	vstinner	set	messages: + msg201571
2013-10-28 19:06:52	pitrou	set	nosy: + pitrou messages: + msg201570
2013-04-09 22:17:40	vstinner	set	messages: + msg186460
2013-04-07 23:12:20	vstinner	set	title: Optimize a==b and a!=b for bytes and str -> Use hash if available to optimize a==b and a!=b for bytes and str
2013-01-03 02:24:49	vstinner	set	messages: + msg178903
2012-12-29 04:51:41	meador.inge	set	nosy: + meador.inge
2012-10-25 17:29:43	ezio.melotti	set	nosy: + ezio.melotti stage: patch review
2012-10-23 10:11:15	vstinner	set	messages: + msg173588
2012-10-23 01:56:04	rhettinger	set	nosy: + rhettinger messages: + msg173576
2012-10-20 06:05:20	gregory.p.smith	set	nosy: + gregory.p.smith messages: + msg173371
2012-10-19 17:02:54	christian.heimes	set	nosy: + christian.heimes
2012-10-19 13:43:56	djc	set	nosy: + djc
2012-10-19 13:13:08	serhiy.storchaka	set	messages: + msg173338
2012-10-19 12:36:06	vstinner	create