msg233128 - Author: Raymond Hettinger (rhettinger) - Date: 2014-12-27 10:50
This tracker item is to record experiments with removing the unicode specialization code from set objects and to run timings that determine the performance benefit or loss from those specializations.
* Removes the set_lookkey_unicode() function and the attendant so->lookup indirections. That saves 60 lines of code. On each lookup, it saves one indirection for the lookup dispatch, but in the case of unicode-only tables, it costs an additional indirection through the abstract API for PyObject_RichCompareBool.
* Removes the specialization code in the add, discard, and contains functions that checks for a unicode key with an already-computed hash value (see the sketch below). This saves a type check (cheap), a hash-field check, and nine lines of code. In the case where the hash value has already been computed, it costs a call to PyObject_Hash (which has an indirection, but otherwise does the same field test that we are doing). The working hypothesis is that this specialization saves only a little in the cases where it applies and adds a little to all the cases where it does not. (Note, the use cases for sets are less likely than dicts to be looking up strings whose hash value has already been computed.)
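For reference, the specialization pattern being removed looks roughly like this (a sketch of the idiom with simplified signatures, not the exact patch):

    static int
    set_add_key(PySetObject *so, PyObject *key)
    {
        Py_hash_t hash;

        /* Reuse the cached hash of an exact unicode key when available;
           otherwise fall back to PyObject_Hash(). */
        if (!PyUnicode_CheckExact(key) ||
            (hash = ((PyASCIIObject *) key)->hash) == -1) {
            hash = PyObject_Hash(key);
            if (hash == -1)
                return -1;
        }
        return set_add_entry(so, key, hash);
    }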
----------------------
Here are some initial timings for the first patch. It seems to show that intersection benefits slightly and that set creation time is unaffected.
$ ./time_suite.sh
100000 loops, best of 3: 14.9 usec per loop
100000 loops, best of 3: 15.3 usec per loop
1000000 loops, best of 3: 1.17 usec per loop
1000000 loops, best of 3: 1.13 usec per loop
10000 loops, best of 3: 24.9 usec per loop
10000 loops, best of 3: 24.2 usec per loop
$ ./time_suite.sh
100000 loops, best of 3: 14.7 usec per loop
100000 loops, best of 3: 14.6 usec per loop
1000000 loops, best of 3: 1.16 usec per loop
1000000 loops, best of 3: 1.07 usec per loop
10000 loops, best of 3: 23.1 usec per loop
10000 loops, best of 3: 23.4 usec per loop
$ ./time_suite.sh
100000 loops, best of 3: 14.5 usec per loop
100000 loops, best of 3: 14.5 usec per loop
1000000 loops, best of 3: 1.16 usec per loop
1000000 loops, best of 3: 1.17 usec per loop
10000 loops, best of 3: 22.5 usec per loop
10000 loops, best of 3: 22 usec per loop

msg233132 - Author: Antoine Pitrou (pitrou) - Date: 2014-12-27 13:59
+1 for this. In my experience sets of unicode keys are not as common as dicts with unicode keys, and the posted numbers make the simplification a no-brainer.

msg233164 - Author: Raymond Hettinger (rhettinger) - Date: 2014-12-29 08:15
Attaching an alternative patch that handles the unicode-specific case with far less code and less overhead. It seems to speed up all the timings I've tried.
It keeps the unicode_eq() specific path, which bypasses several unneeded steps (see the sketch after this list):
* an incref/decref of the startkey
* an incref/decref of Py_True or Py_False
* Py_EnterRecursiveCall/Py_LeaveRecursiveCall overhead
* a NotImplemented check
* various indirections, NULL checks, an xor of the bool result, and several nested function calls that save and restore registers
* a test for an error result
* a test for a changed table or entry
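Inside the probe loop, the retained fast path has roughly this shape (a sketch, not the exact patch):

    if (entry->hash == hash) {
        PyObject *startkey = entry->key;
        int cmp;
        if (startkey == key)
            goto found_active;
        /* Exact unicode keys compare directly, skipping all of the
           steps listed above. */
        if (PyUnicode_CheckExact(startkey)
            && PyUnicode_CheckExact(key)
            && unicode_eq(startkey, key))
            goto found_active;
        /* Everything else falls back to the general protocol (the real
           code also re-checks that the table and entry did not mutate
           during the comparison). */
        Py_INCREF(startkey);
        cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
        Py_DECREF(startkey);
        if (cmp < 0)
            goto comparison_error;
        if (cmp > 0)
            goto found_active;
    }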

msg233181 - Author: Raymond Hettinger (rhettinger) - Date: 2014-12-29 18:49
Timings for no_special_hash.diff:
$ ~/cpython/python.exe -m timeit -r7 -s 's={"html"}' '"html" in s'
10000000 loops, best of 7: 0.0315 usec per loop
$ ~/nounicode/python.exe -m timeit -r7 -s 's={"html"}' '"html" in s'
10000000 loops, best of 7: 0.0336 usec per loop
$ ~/cpython/python.exe -m timeit -r7 -s 's={"html"}' '"html" in s'
10000000 loops, best of 7: 0.0316 usec per loop
$ ~/nounicode/python.exe -m timeit -r7 -s 's={"html"}' '"html" in s'
10000000 loops, best of 7: 0.0336 usec per loop
$ ~/cpython/python.exe -m timeit -r7 -s 's={1}' '1 in s'
10000000 loops, best of 7: 0.0331 usec per loop
$ ~/nounicode/python.exe -m timeit -r7 -s 's={1}' '1 in s'
10000000 loops, best of 7: 0.0328 usec per loop
$ ~/cpython/python.exe -m timeit -r7 -s 's={1}' '1 in s'
10000000 loops, best of 7: 0.0332 usec per loop
$ ~/nounicode/python.exe -m timeit -r7 -s 's={1}' '1 in s'
10000000 loops, best of 7: 0.0328 usec per loop
The unicode specialization in set_add, set_discard, and set_contains
benefits unicode lookups by 7% but impairs other lookups by 1%.

msg233658 - Author: Serhiy Storchaka (serhiy.storchaka) - Date: 2015-01-08 14:01
+1 for removing the unicode specialization. Dictionaries with string keys are a part of the language, but sets of strings are not special.

msg233660 - Author: Marc-Andre Lemburg (lemburg) - Date: 2015-01-08 14:35
I'm not sure I follow:
Sets of strings are very common when trying to create a unique set of strings or optimizing "name in set_of_names" lookups.
Regarding your benchmark numbers: I have a hard time following how they work. A simple "word in set_of_one_word" check certainly doesn't make a good benchmark for this sort of test :-)
You should at least test with sets of various sizes and include non-matching lookups as well.

msg233664 - Author: Serhiy Storchaka (serhiy.storchaka) - Date: 2015-01-08 14:46
> Sets of strings are very common when trying to create a unique set of strings or optimizing "name in set_of_names" lookups.
This is not nearly as common as attribute or globals access, or passing keyword arguments.

msg233667 - Author: Marc-Andre Lemburg (lemburg) - Date: 2015-01-08 15:02
On 08.01.2015 15:46, Serhiy Storchaka wrote:
>
>> Sets of strings are very common when trying to create a unique set of strings or optimizing "name in set_of_names" lookups.
>
> This is not nearly as common as attribute or globals access, or passing keyword arguments.
Not at the interpreter level, but unlike dicts, sets are not used
much to implement interpreter logic. Their use cases live mostly
in application code, where you basically find two major ones:
1. check for flag in set of numeric codes
2. check for string in unique set of strings
(and their associated set operations, i.e. checking for subsets,
superset, etc.)
This is why I don't follow Raymond's reasoning. Of course, if
he comes up with a patch that makes both cases faster, I'd be
+1 ;-)

msg233669 - Author: STINNER Victor (vstinner) - Date: 2015-01-08 16:03
@Raymond: Please disable the git format for patches, because Rietveld doesn't support such patches, and so we don't get the "review" button.

msg233670 - Author: Ezio Melotti (ezio.melotti) - Date: 2015-01-08 16:13
Without changeset information (not included in the git format), Rietveld will try to apply the patch on default, and if it applies cleanly it will work, so creating the patch against an up-to-date py3 clone should work even with the git format.

msg233730 - Author: Raymond Hettinger (rhettinger) - Date: 2015-01-09 08:33
I'm withdrawing this one. After more work trying many timings on multiple compilers and various sizes and kinds of datasets, it appears that the unicode specialization is still worth it.
The cost of the lookup indirection appears to be completely insignificant (i.e. it doesn't harm the non-unicode case), while the unicode-specialized lookup has measurable benefits in the use case of deduping an iterable of strings.

msg233739 - Author: Marc-Andre Lemburg (lemburg) - Date: 2015-01-09 09:04
On 09.01.2015 09:33, Raymond Hettinger wrote:
>
> I'm withdrawing this one. After more work trying many timings on multiple compilers and various sizes and kinds of datasets, it appears that the unicode specialization is still worth it.
>
> The cost of the lookup indirection appears to be completely insignificant (i.e. it doesn't harm the non-unicode case), while the unicode-specialized lookup has measurable benefits in the use case of deduping an iterable of strings.
Thanks, Raymond, for the additional testing :-)
I did a grep over the Python C source code and it seems that sets are
only used by Python/symtable.c for anything mildly performance-relevant
(which IIRC is used by the bytecode compiler) -
and those sets have Unicode strings as members.
The stdlib uses sets with both Unicode strings and integers
as members. From looking at the grep hits, it seems that Unicode
strings are more commonly used than integers in the stdlib
as set members, e.g. for method names, module names and character
sets.

msg234655 - Author: Raymond Hettinger (rhettinger) - Date: 2015-01-25 06:05
Thanks, Marc-Andre. I appreciate the effort looking at use cases.
I'm reopening this one because I have an alternative patch that simplifies the code but keeps the unicode specialization. It replaces the lookkey indirection with a fast and predictable inline test for PyUnicode_CheckExact (sketched below). The timings show no measurable difference, but the code is simpler and the setobject size is reduced by one pointer. This simplifies future maintenance and development, as well as making the code more pleasant.
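In outline, the change looks something like this (a sketch of the approach with fields abbreviated, not the committed diff):

    typedef struct _setobject PySetObject;
    struct _setobject {
        PyObject_HEAD
        Py_ssize_t fill;            /* # of active + dummy entries */
        Py_ssize_t used;            /* # of active entries */
        Py_ssize_t mask;
        setentry *table;
        /* dropped: the per-set lookup slot, saving one pointer per set
        setentry *(*lookup)(PySetObject *so, PyObject *key, Py_hash_t hash);
        */
        Py_hash_t hash;             /* only used by frozenset */
        /* ... remaining fields ... */
    };

    /* before: an indirect call through the slot */
    entry = (so->lookup)(so, key, hash);

    /* after: one lookup function with an inline, well-predicted
       PyUnicode_CheckExact() branch */
    entry = set_lookkey(so, key, hash);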

msg234658 - Author: Serhiy Storchaka (serhiy.storchaka) - Date: 2015-01-25 07:55
Maybe add the unicode specialization right in PyObject_RichCompareBool?
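That would look something like this (a hypothetical sketch, not a patch):

    int
    PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
    {
        /* existing identity shortcut */
        if (v == w) {
            if (op == Py_EQ)
                return 1;
            if (op == Py_NE)
                return 0;
        }
        /* hypothetical addition: compare exact unicode objects directly */
        if (op == Py_EQ && PyUnicode_CheckExact(v) && PyUnicode_CheckExact(w))
            return unicode_eq(v, w);
        /* ... otherwise fall through to the general machinery ... */
    }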

msg234708 - Author: Raymond Hettinger (rhettinger) - Date: 2015-01-25 23:49
> Maybe add the unicode specialization right in PyObject_RichCompareBool?
That's a possibility, but it has much broader ramifications and might or might not be the right thing to do. I'll leave that for someone else to pursue and keep my sights on sets for a while. (If you do modify PyObject_RichCompareBool, take a look at moving the surrounding incref/decref pair as well.)

msg234709 - Author: Roundup Robot (python-dev) - Date: 2015-01-26 00:13
New changeset 20f54cdf351d by Raymond Hettinger in branch 'default':
Issue #23119: Simplify setobject by inlining the special case for unicode equality testing.
https://hg.python.org/cpython/rev/20f54cdf351d

History
Date | User | Action | Args
2022-04-11 14:58:11 | admin | set | github: 67308
2015-01-26 00:13:20 | rhettinger | set | status: open -> closed; resolution: fixed
2015-01-26 00:13:07 | python-dev | set | nosy: + python-dev; messages: + msg234709
2015-01-25 23:49:19 | rhettinger | set | messages: + msg234708
2015-01-25 07:55:25 | serhiy.storchaka | set | messages: + msg234658
2015-01-25 06:05:07 | rhettinger | set | status: closed -> open; files: + inline_unicode_specialization.diff; messages: + msg234655; resolution: rejected -> (no value); stage: patch review
2015-01-09 09:04:45 | lemburg | set | messages: + msg233739
2015-01-09 08:33:20 | rhettinger | set | status: open -> closed; resolution: rejected; messages: + msg233730
2015-01-08 16:13:37 | ezio.melotti | set | messages: + msg233670
2015-01-08 16:03:27 | vstinner | set | nosy: + vstinner; messages: + msg233669
2015-01-08 15:10:35 | ezio.melotti | set | nosy: + ezio.melotti
2015-01-08 15:02:07 | lemburg | set | messages: + msg233667
2015-01-08 14:46:29 | serhiy.storchaka | set | messages: + msg233664
2015-01-08 14:35:43 | lemburg | set | nosy: + lemburg; messages: + msg233660
2015-01-08 14:01:08 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg233658
2014-12-30 03:08:13 | rhettinger | set | files: + build_set_timings.txt
2014-12-30 03:07:23 | rhettinger | set | files: + measure_build_set.py
2014-12-29 18:49:51 | rhettinger | set | messages: + msg233181
2014-12-29 08:15:41 | rhettinger | set | files: + move_unicode_inside.diff; messages: + msg233164
2014-12-27 13:59:59 | pitrou | set | nosy: + pitrou; messages: + msg233132
2014-12-27 10:51:56 | rhettinger | set | files: + time_suite.sh
2014-12-27 10:51:10 | rhettinger | set | files: + no_special_hash.diff
2014-12-27 10:50:11 | rhettinger | create |