This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Reduce memset in dict creation
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: methane Nosy List: methane, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2016-11-29 12:56 by methane, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
reduce-memset.patch methane, 2016-11-29 12:56 review
default.json.gz methane, 2016-11-30 09:34
patched.json.gz methane, 2016-11-30 09:37
Messages (7)
msg281989 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-29 12:56
This patch delays the initialization of dk_entries.
Entries are initialized on first use (when dk_nentries is incremented).

The minimum dk_entries size on a 64-bit arch is 5 entries * 3 fields * 8 bytes = 120 bytes.
So it can save 4 cache lines!

I'm running benchmark for now.
msg282005 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-29 16:09
$ ./python-default -m perf compare_to default.json patched.json  -G
Slower (2):
- xml_etree_iterparse: 328 ms +- 26 ms -> 350 ms +- 29 ms: 1.07x slower
- fannkuch: 1.58 sec +- 0.09 sec -> 1.65 sec +- 0.09 sec: 1.05x slower

Faster (9):
- scimark_sor: 870 ms +- 59 ms -> 800 ms +- 54 ms: 1.09x faster
- deltablue: 28.0 ms +- 2.1 ms -> 26.4 ms +- 1.6 ms: 1.06x faster
- regex_dna: 423 ms +- 26 ms -> 399 ms +- 26 ms: 1.06x faster
- scimark_sparse_mat_mult: 13.9 ms +- 1.0 ms -> 13.2 ms +- 0.9 ms: 1.06x faster
- raytrace: 2.16 sec +- 0.12 sec -> 2.07 sec +- 0.11 sec: 1.05x faster
- unpickle_pure_python: 1.36 ms +- 0.11 ms -> 1.31 ms +- 0.07 ms: 1.04x faster
- pickle_dict: 103 us +- 11 us -> 99.7 us +- 7.7 us: 1.03x faster
- scimark_monte_carlo: 408 ms +- 35 ms -> 397 ms +- 25 ms: 1.03x faster
- scimark_lu: 765 ms +- 49 ms -> 746 ms +- 55 ms: 1.03x faster

Benchmark hidden because not significant (53): 2to3, call_method, call_method_slots, call_method_unknown, call_simple, chameleon, chaos, crypto_pyaes, django_template, dulwich_log, float, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, regex_compile, regex_effbot, regex_v8, richards, scimark_fft, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, xml_etree_generate, xml_etree_parse, xml_etree_process
msg282010 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-29 16:31
Can you please attach the two JSON files, compressed with gzip (.gz files)? perf is also able to read gzipped JSON.
msg282014 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-29 16:44
I think that clearing 120 bytes at a time is faster than clearing them later entry-by-entry.

Your patch also removes some asserts, which does not look good.

Could you provide microbenchmarks that show the largest speedup and the largest slowdown? Then we would see what type of code benefits.
msg282018 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-29 17:19
You might experiment with PyMem_Calloc().

I'm not sure that this allocator is well optimized (especially the pymalloc allocator) :-/ The last time I played with it, it was slower, especially for allocations smaller than 1 MB.
msg282068 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-30 09:37
I ran pyperformance again on a larger machine (Azure D2 -> D4) for a more robust result.

$ ./python-default -m perf compare_to default.json.gz patched.json.gz -G
Slower (4):
- xml_etree_generate: 425 ms +- 18 ms -> 442 ms +- 19 ms: 1.04x slower
- call_method: 25.7 ms +- 0.9 ms -> 26.1 ms +- 1.3 ms: 1.02x slower
- xml_etree_iterparse: 345 ms +- 17 ms -> 349 ms +- 20 ms: 1.01x slower
- chameleon: 44.9 ms +- 1.9 ms -> 45.5 ms +- 2.1 ms: 1.01x slower

Faster (39):
- scimark_sor: 874 ms +- 32 ms -> 811 ms +- 33 ms: 1.08x faster
- unpickle_pure_python: 1.41 ms +- 0.07 ms -> 1.31 ms +- 0.08 ms: 1.08x faster
- raytrace: 2.19 sec +- 0.07 sec -> 2.05 sec +- 0.04 sec: 1.07x faster
- sympy_integrate: 78.3 ms +- 9.2 ms -> 73.1 ms +- 4.8 ms: 1.07x faster
- pickle_dict: 105 us +- 11 us -> 98.5 us +- 4.4 us: 1.06x faster
- html5lib: 385 ms +- 21 ms -> 362 ms +- 16 ms: 1.06x faster
- pathlib: 80.5 ms +- 5.1 ms -> 76.1 ms +- 3.4 ms: 1.06x faster
- pickle: 38.1 us +- 2.7 us -> 36.0 us +- 2.9 us: 1.06x faster
- sqlite_synth: 15.7 us +- 0.8 us -> 14.8 us +- 0.9 us: 1.06x faster
- meteor_contest: 318 ms +- 18 ms -> 301 ms +- 12 ms: 1.06x faster
- hexiom: 37.4 ms +- 1.9 ms -> 35.5 ms +- 1.8 ms: 1.05x faster
- telco: 34.6 ms +- 2.1 ms -> 32.9 ms +- 1.4 ms: 1.05x faster
- go: 957 ms +- 35 ms -> 914 ms +- 36 ms: 1.05x faster
- regex_compile: 702 ms +- 24 ms -> 671 ms +- 23 ms: 1.05x faster
- regex_dna: 421 ms +- 14 ms -> 403 ms +- 15 ms: 1.04x faster
- genshi_xml: 314 ms +- 9 ms -> 301 ms +- 13 ms: 1.04x faster
- tornado_http: 664 ms +- 29 ms -> 636 ms +- 27 ms: 1.04x faster
- sqlalchemy_imperative: 103 ms +- 9 ms -> 98.6 ms +- 7.8 ms: 1.04x faster
- dulwich_log: 265 ms +- 16 ms -> 254 ms +- 10 ms: 1.04x faster
- richards: 291 ms +- 15 ms -> 279 ms +- 14 ms: 1.04x faster
- json_dumps: 45.2 ms +- 2.5 ms -> 43.5 ms +- 1.8 ms: 1.04x faster
- scimark_lu: 768 ms +- 40 ms -> 738 ms +- 32 ms: 1.04x faster
- mako: 65.9 ms +- 3.9 ms -> 63.4 ms +- 2.1 ms: 1.04x faster
- logging_simple: 52.9 us +- 3.5 us -> 51.0 us +- 2.0 us: 1.04x faster
- regex_effbot: 7.64 ms +- 0.48 ms -> 7.36 ms +- 0.43 ms: 1.04x faster
- pickle_pure_python: 2.08 ms +- 0.13 ms -> 2.01 ms +- 0.09 ms: 1.04x faster
- scimark_monte_carlo: 414 ms +- 26 ms -> 399 ms +- 20 ms: 1.04x faster
- unpack_sequence: 196 ns +- 11 ns -> 189 ns +- 6 ns: 1.04x faster
- logging_silent: 1.17 us +- 0.07 us -> 1.13 us +- 0.05 us: 1.04x faster
- sympy_expand: 1.68 sec +- 0.06 sec -> 1.62 sec +- 0.06 sec: 1.03x faster
- call_simple: 20.9 ms +- 1.0 ms -> 20.3 ms +- 0.9 ms: 1.03x faster
- nqueens: 411 ms +- 19 ms -> 401 ms +- 19 ms: 1.03x faster
- sympy_str: 762 ms +- 29 ms -> 743 ms +- 23 ms: 1.03x faster
- pidigits: 488 ms +- 18 ms -> 476 ms +- 12 ms: 1.02x faster
- logging_format: 62.2 us +- 3.2 us -> 60.8 us +- 3.0 us: 1.02x faster
- sympy_sum: 351 ms +- 18 ms -> 343 ms +- 16 ms: 1.02x faster
- spectral_norm: 413 ms +- 21 ms -> 405 ms +- 18 ms: 1.02x faster
- float: 464 ms +- 18 ms -> 456 ms +- 19 ms: 1.02x faster
- python_startup_no_site: 15.2 ms +- 0.8 ms -> 15.0 ms +- 0.7 ms: 1.01x faster

Benchmark hidden because not significant (21): 2to3, call_method_slots, call_method_unknown, chaos, crypto_pyaes, deltablue, django_template, fannkuch, genshi_text, json_loads, nbody, pickle_list, python_startup, regex_v8, scimark_fft, scimark_sparse_mat_mult, sqlalchemy_declarative, unpickle, unpickle_list, xml_etree_parse, xml_etree_process
msg282071 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-30 10:08
> I think that clearing 120 bytes at a time is faster than clearing them later entry-by-entry.

Ah, my wording was wrong. This patch skips the zero-clear entirely.
In pseudo-code:

  // When allocating a PyDictKeysObject:
- memset(dk_entries, 0, sizeof(dk_entries));

  // When inserting a new item:
  n = dk_nentries++;
  e = &dk_entries[n];
  e->me_hash = hash;
  e->me_key = key;
  if (split_table) {
+     e->me_value = NULL;
      ma_values[n] = value;
  } else {
      e->me_value = value;
  }


> Your patch also removes some asserts, which does not look good.

This patch fills dk_entries with 0xCC when Py_DEBUG is enabled.
That can catch unintentional accesses to values left over in reused memory.

I'll look for more places where I can insert effective asserts.


> Could you provide microbenchmarks that show the largest speedup and the largest slowdown?

Avoiding cache pollution is more important than avoiding the 120-byte memset in this case.
It's difficult to write a simple microbenchmark that shows the effects of cache pollution...

$ ./python-patched -m perf timeit --rigorous --compare-to `pwd`/python-default --duplicate 8 -- '{}'
python-default: ......................................... 44.6 ns +- 2.4 ns
python-patched: ......................................... 44.1 ns +- 1.8 ns
Median +- std dev: [python-default] 44.6 ns +- 2.4 ns -> [python-patched] 44.1 ns +- 1.8 ns: 1.01x faster
History
Date User Action Args
2022-04-11 14:58:40	admin	set	github: 73018
2018-04-10 13:07:31	methane	set	status: open -> closed
resolution: wont fix
stage: patch review -> resolved
2016-11-30 10:08:01	methane	set	messages: + msg282071
2016-11-30 09:37:37	methane	set	files: + patched.json.gz
messages: + msg282068
2016-11-30 09:34:57	methane	set	files: + default.json.gz
2016-11-29 17:19:37	vstinner	set	messages: + msg282018
2016-11-29 16:44:18	serhiy.storchaka	set	nosy: + serhiy.storchaka
messages: + msg282014
2016-11-29 16:31:36	vstinner	set	messages: + msg282010
2016-11-29 16:09:18	methane	set	nosy: + vstinner
messages: + msg282005
2016-11-29 12:56:55	methane	create