This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Reduce memset in dict creation
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: methane Nosy List: methane, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2016-11-29 12:56 by methane, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
reduce-memset.patch methane, 2016-11-29 12:56 review
default.json.gz methane, 2016-11-30 09:34
patched.json.gz methane, 2016-11-30 09:37
Messages (7)
msg281989 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-29 12:56
This patch delays the initialization of dk_entries.
Entries are initialized on first use (when dk_nentries is incremented).

The minimum dk_entries size on a 64-bit arch is 5 entries * 3 fields * 8 bytes = 120 bytes.
So it can save 4 cache lines!

I'm running benchmark for now.
msg282005 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-29 16:09
$ ./python-default -m perf compare_to default.json patched.json  -G
Slower (2):
- xml_etree_iterparse: 328 ms +- 26 ms -> 350 ms +- 29 ms: 1.07x slower
- fannkuch: 1.58 sec +- 0.09 sec -> 1.65 sec +- 0.09 sec: 1.05x slower

Faster (9):
- scimark_sor: 870 ms +- 59 ms -> 800 ms +- 54 ms: 1.09x faster
- deltablue: 28.0 ms +- 2.1 ms -> 26.4 ms +- 1.6 ms: 1.06x faster
- regex_dna: 423 ms +- 26 ms -> 399 ms +- 26 ms: 1.06x faster
- scimark_sparse_mat_mult: 13.9 ms +- 1.0 ms -> 13.2 ms +- 0.9 ms: 1.06x faster
- raytrace: 2.16 sec +- 0.12 sec -> 2.07 sec +- 0.11 sec: 1.05x faster
- unpickle_pure_python: 1.36 ms +- 0.11 ms -> 1.31 ms +- 0.07 ms: 1.04x faster
- pickle_dict: 103 us +- 11 us -> 99.7 us +- 7.7 us: 1.03x faster
- scimark_monte_carlo: 408 ms +- 35 ms -> 397 ms +- 25 ms: 1.03x faster
- scimark_lu: 765 ms +- 49 ms -> 746 ms +- 55 ms: 1.03x faster

Benchmark hidden because not significant (53): 2to3, call_method, call_method_slots, call_method_unknown, call_simple, chameleon, chaos, crypto_pyaes, django_template, dulwich_log, float, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, regex_compile, regex_effbot, regex_v8, richards, scimark_fft, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, xml_etree_generate, xml_etree_parse, xml_etree_process
msg282010 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-29 16:31
Can you please attach the two JSON files, compressed with gzip (.gz files)? perf is also able to read gzipped JSON.
msg282014 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-29 16:44
I think that clearing 120 bytes at a time is faster than clearing them later entry-by-entry.

Your patch also removes some asserts, which does not look good.

Could you provide microbenchmarks that show the largest speedup and the largest slowdown? Then we would see what type of code benefits.
msg282018 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-29 17:19
You might experiment with PyMem_Calloc().

I'm not sure that this allocator is well optimized (especially the pymalloc allocator) :-/ The last time I played with it, it was slower, especially for allocations smaller than 1 MB.
msg282068 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-30 09:37
I ran pyperformance again on a larger machine (Azure D2 -> D4) for a more robust result.

$ ./python-default -m perf compare_to default.json.gz patched.json.gz -G
Slower (4):
- xml_etree_generate: 425 ms +- 18 ms -> 442 ms +- 19 ms: 1.04x slower
- call_method: 25.7 ms +- 0.9 ms -> 26.1 ms +- 1.3 ms: 1.02x slower
- xml_etree_iterparse: 345 ms +- 17 ms -> 349 ms +- 20 ms: 1.01x slower
- chameleon: 44.9 ms +- 1.9 ms -> 45.5 ms +- 2.1 ms: 1.01x slower

Faster (39):
- scimark_sor: 874 ms +- 32 ms -> 811 ms +- 33 ms: 1.08x faster
- unpickle_pure_python: 1.41 ms +- 0.07 ms -> 1.31 ms +- 0.08 ms: 1.08x faster
- raytrace: 2.19 sec +- 0.07 sec -> 2.05 sec +- 0.04 sec: 1.07x faster
- sympy_integrate: 78.3 ms +- 9.2 ms -> 73.1 ms +- 4.8 ms: 1.07x faster
- pickle_dict: 105 us +- 11 us -> 98.5 us +- 4.4 us: 1.06x faster
- html5lib: 385 ms +- 21 ms -> 362 ms +- 16 ms: 1.06x faster
- pathlib: 80.5 ms +- 5.1 ms -> 76.1 ms +- 3.4 ms: 1.06x faster
- pickle: 38.1 us +- 2.7 us -> 36.0 us +- 2.9 us: 1.06x faster
- sqlite_synth: 15.7 us +- 0.8 us -> 14.8 us +- 0.9 us: 1.06x faster
- meteor_contest: 318 ms +- 18 ms -> 301 ms +- 12 ms: 1.06x faster
- hexiom: 37.4 ms +- 1.9 ms -> 35.5 ms +- 1.8 ms: 1.05x faster
- telco: 34.6 ms +- 2.1 ms -> 32.9 ms +- 1.4 ms: 1.05x faster
- go: 957 ms +- 35 ms -> 914 ms +- 36 ms: 1.05x faster
- regex_compile: 702 ms +- 24 ms -> 671 ms +- 23 ms: 1.05x faster
- regex_dna: 421 ms +- 14 ms -> 403 ms +- 15 ms: 1.04x faster
- genshi_xml: 314 ms +- 9 ms -> 301 ms +- 13 ms: 1.04x faster
- tornado_http: 664 ms +- 29 ms -> 636 ms +- 27 ms: 1.04x faster
- sqlalchemy_imperative: 103 ms +- 9 ms -> 98.6 ms +- 7.8 ms: 1.04x faster
- dulwich_log: 265 ms +- 16 ms -> 254 ms +- 10 ms: 1.04x faster
- richards: 291 ms +- 15 ms -> 279 ms +- 14 ms: 1.04x faster
- json_dumps: 45.2 ms +- 2.5 ms -> 43.5 ms +- 1.8 ms: 1.04x faster
- scimark_lu: 768 ms +- 40 ms -> 738 ms +- 32 ms: 1.04x faster
- mako: 65.9 ms +- 3.9 ms -> 63.4 ms +- 2.1 ms: 1.04x faster
- logging_simple: 52.9 us +- 3.5 us -> 51.0 us +- 2.0 us: 1.04x faster
- regex_effbot: 7.64 ms +- 0.48 ms -> 7.36 ms +- 0.43 ms: 1.04x faster
- pickle_pure_python: 2.08 ms +- 0.13 ms -> 2.01 ms +- 0.09 ms: 1.04x faster
- scimark_monte_carlo: 414 ms +- 26 ms -> 399 ms +- 20 ms: 1.04x faster
- unpack_sequence: 196 ns +- 11 ns -> 189 ns +- 6 ns: 1.04x faster
- logging_silent: 1.17 us +- 0.07 us -> 1.13 us +- 0.05 us: 1.04x faster
- sympy_expand: 1.68 sec +- 0.06 sec -> 1.62 sec +- 0.06 sec: 1.03x faster
- call_simple: 20.9 ms +- 1.0 ms -> 20.3 ms +- 0.9 ms: 1.03x faster
- nqueens: 411 ms +- 19 ms -> 401 ms +- 19 ms: 1.03x faster
- sympy_str: 762 ms +- 29 ms -> 743 ms +- 23 ms: 1.03x faster
- pidigits: 488 ms +- 18 ms -> 476 ms +- 12 ms: 1.02x faster
- logging_format: 62.2 us +- 3.2 us -> 60.8 us +- 3.0 us: 1.02x faster
- sympy_sum: 351 ms +- 18 ms -> 343 ms +- 16 ms: 1.02x faster
- spectral_norm: 413 ms +- 21 ms -> 405 ms +- 18 ms: 1.02x faster
- float: 464 ms +- 18 ms -> 456 ms +- 19 ms: 1.02x faster
- python_startup_no_site: 15.2 ms +- 0.8 ms -> 15.0 ms +- 0.7 ms: 1.01x faster

Benchmark hidden because not significant (21): 2to3, call_method_slots, call_method_unknown, chaos, crypto_pyaes, deltablue, django_template, fannkuch, genshi_text, json_loads, nbody, pickle_list, python_startup, regex_v8, scimark_fft, scimark_sparse_mat_mult, sqlalchemy_declarative, unpickle, unpickle_list, xml_etree_parse, xml_etree_process
msg282071 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-30 10:08
> I think that clearing 120 bytes at a time is faster than clearing them later entry-by-entry.

Ah, my wording was wrong. This patch skips the zero-clear entirely.
In pseudo-code:

  // When allocating a PyDictKeysObject:
- memset(dk_entries, 0, sizeof(dk_entries));

  // When inserting a new item:
  n = dk_nentries++;
  e = &dk_entries[n];
  e->me_hash = hash;
  e->me_key = key;
  if (split_table) {
+     e->me_value = NULL;
      ma_values[n] = value;
  } else {
      e->me_value = value;
  }


> Your patch also removes some asserts, which does not look good.

This patch fills dk_entries with 0xCC when Py_DEBUG is enabled.
That can catch unintentional accesses to values left over in reused memory.

I'll look for more places where I can insert effective asserts.


> Could you provide microbenchmarks that show the largest speedup and the largest slowdown?

Avoiding cache pollution is more important than avoiding the 120-byte memset in this case.
It's difficult to write a simple microbenchmark that shows the effects of cache pollution...

$ ./python-patched -m perf timeit --rigorous --compare-to `pwd`/python-default --duplicate 8 -- '{}'
python-default: ......................................... 44.6 ns +- 2.4 ns
python-patched: ......................................... 44.1 ns +- 1.8 ns
Median +- std dev: [python-default] 44.6 ns +- 2.4 ns -> [python-patched] 44.1 ns +- 1.8 ns: 1.01x faster
History
Date User Action Args
2022-04-11 14:58:40	admin	set	github: 73018
2018-04-10 13:07:31	methane	set	status: open -> closed
resolution: wont fix
stage: patch review -> resolved
2016-11-30 10:08:01	methane	set	messages: + msg282071
2016-11-30 09:37:37	methane	set	files: + patched.json.gz
messages: + msg282068
2016-11-30 09:34:57	methane	set	files: + default.json.gz
2016-11-29 17:19:37	vstinner	set	messages: + msg282018
2016-11-29 16:44:18	serhiy.storchaka	set	nosy: + serhiy.storchaka
messages: + msg282014
2016-11-29 16:31:36	vstinner	set	messages: + msg282010
2016-11-29 16:09:18	methane	set	nosy: + vstinner
messages: + msg282005
2016-11-29 12:56:55	methane	create