Issue 32940: IDLE: pyparse - simplify StringTranslatePseudoMapping

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77121

classification

Title:	IDLE: pyparse - simplify StringTranslatePseudoMapping
Type:	enhancement	Stage:	patch review
Components:	IDLE	Versions:	Python 3.8, Python 3.7, Python 3.6

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	terry.reedy	Nosy List:	cheryl.sabella, miss-islington, serhiy.storchaka, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2018-02-24 17:38 by cheryl.sabella, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 5854	closed	cheryl.sabella, 2018-02-24 17:44
PR 5862	merged	cheryl.sabella, 2018-02-24 23:26
PR 5940	merged	miss-islington, 2018-02-28 22:24
PR 5941	merged	miss-islington, 2018-02-28 22:25

Messages (15)
msg312734 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-24 17:38
Based on timing test on msg312733, StringTranslatePseudoMappingTest and a defaultdict have about the same performance, so this replaces the custom class with the stdlib functionality.
msg312752 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-24 20:07
Tal had written this on the original issue21765: ---- Finally, since the defaultdict is kept around as long as IDLE is running, I decided to avoid having it grow continually and consume memory unnecessarily. So I wrote a simple Mapping class, which wraps a normal dict and uses a custom default value instead of None, ord('x') in this case. Works like a charm :) ----- So maybe I misunderstood and this shouldn't be a defaultdict?
msg312764 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-24 22:53
I forget about this defaultdict behavior: "this value is inserted in the dictionary for the key, and returned." Reason: when default_factory returns a mutable, d[key] must return the same possibly mutated object with each call. I agree that defaultdict is not the right replacement. We need to pass to str.translate a dict that can be used by subscripting, newchar = d[char]. So partial(non-defaults.get, default_value) will not work. Instead, we need a __getitem__ that returns the same. In msg312444 I suggested simplifying STPM (including the name) because it has unneeded complexity. Remove the buggy .get override. Combine the _get stuff in __init__ (also removed) with current __getitem__ and simplify and we get what we actually need (untested, at yet). def __getitem__ return self._non_defaults.get(self._default_value) Actually, we could hard-code the default value as 'X' as we never need anything else. How about ParseMap for the name?
msg312768 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-25 00:01
New PR submitted. Really glad I worked on this today. I learned a lot of things that were new to me. :-)
msg312770 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-25 02:56
I simplified ParseMap to a dict subclass with one override -- __getitem__ and then the tests. They run faster. I suspect translate is faster.
msg312778 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-02-25 07:35
For efficiency I suggest to initialize the mapping with dict.fromkeys(range(128), 'x') rather of an empty dict. It is also possible to use regular expressions: _trans = re.compile(r'''[^(){}\[]"'\\\n#]+''') code = _trans.sub('x', code) code = code.replace('{', '(') code = code.replace('}', ')') code = code.replace('[', '(') code = code.replace(']', '(') code = code.replace('\nx', '\n') I didn't check what way is more efficient.
msg312796 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-25 10:18
To me, it is plausible but not slam-dunk obvious that preloading ascii to 'x' mappings will make ascii lookup faster. On #21765, where the pyparse special translation was a side-issue, Tal Einat claimed that the unpublished regex he tried was 100x slower.
msg312799 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-25 10:33
A similar regular expression version was mentioned on issue21765 and I had run some tests on it yesterday to verify. On my system, it ran at a factor of 10x slower, so if the translate finished in 0.003, the regex took 0.03. This was consistent for me, regardless of how big I made the document. The reason for not using a defauldict was to keep the 'x' mappings out of the dictionary so that it wouldn't grow and take up space. Although, I did realize yesterday that it wasn't really boundless because most values in source code would be ASCII. Running both the version the doesn't add the 'x' mappings and the `fromkeys`, there doesn't seem to be much of a difference in time when processing the doc.
msg313046 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-28 04:57
I settled on the following to compare ParseMap implementations. from idlelib.pyparse import Parser import timeit class ParseGet(dict): def __getitem__(self, key): return self.get(key, ord('x')) class ParseMis(dict): def __missing__(self, key): return ord('x') for P in (ParseGet, ParseMis): print(P.__name__, 'hit', 'miss') p = p=P({i:i for i in (10, 34, 35, 39, 40, 41, 91, 92, 93, 123, 125)}) print(timeit.timeit( "p[10],p[34],p[35],p[39],p[40],p[41],p[91],p[92],p[93],p[125]", number=100000, globals = globals())) print(timeit.timeit( "p[11],p[33],p[36],p[45],p[50],p[61],p[71],p[82],p[99],p[125]", number=100000, globals = globals())) ParseGet hit miss 1.104342376 1.112531999 ParseMis hit miss 0.3530207070000002 1.2165967760000003 ParseGet hit miss 1.185322191 1.1915449519999999 ParseMis hit miss 0.3477272720000002 1.317010653 Avoiding custom code for all ascii chars will be a win. I am sure that calling __missing__ for non-ascii will be at least as fast as it is presently. I will commit a revision tomorrow. I may then compare to Serhiy's sub/replace suggestion. My experiments with 'code.translate(tran)' indicate that time grows sub-linearly up to 1000 or 10000 chars. This suggests that there are significant constant or log-like terms.
msg313048 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-02-28 05:47
Don't use ord('x'). Use just 'x' or b'x'[0].
msg313068 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-28 20:23
Replacing an expression with a less clear equivalent expression makes no sense to me. Anyway, having __missing__ return 120 reduces the benchmark miss time from 1.2-1.3 to .93, making ParseMis always faster than ParseGet and reducing the penalty for non-ascii chars. re.sub + str.replace is slower than translate import re import timeit class ParseMap(dict): def __missing__(self, key): return 120 # ord('x') trans = ParseMap((i,120) for i in range(128)) trans.update((ord(c), ord('(')) for c in "({[") trans.update((ord(c), ord(')')) for c in ")}]") trans.update((ord(c), ord(c)) for c in "\"'\\\n#") trans_re = re.compile(r'''[^(){}\[]"'\\\n#]+''') code='\t a([{b}])b"c\'d\n'*1000 # n = 1, 10, 100, 1000 print(timeit.timeit( 'code.translate(trans)', number=10000, globals = globals())) print(timeit.timeit( "code1 = trans_re.sub('x', code)\n" "code2 = code1.replace('{', '(')\n" "code3 = code2.replace('}', ')')\n" "code4 = code3.replace('[', '(')\n" "code5 = code4.replace(']', '(')\n" r"code6 = code5.replace('\nx', '\n')", number=10000, globals = globals())) n trans re 1 .06 .09 10 .08 .17 100 .28 1.00 1000 2.2 8.9 Multiply by 100 to get microseconds or seconds for 1000000.
msg313071 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-28 21:49
The mapping passed to str.translate must map ints representing codepoints to either either ints or strings. Translate can extract binary codepoints for the new string from either. Ints are slightly faster, so I am inclined not to switch. import timeit class ParseMapN(dict): def __missing__(self, key): return 120 class ParseMapS(dict): def __missing__(self, key): return 'x' trans1 = ParseMapN.fromkeys(range(128), 120) trans1.update((ord(c), ord('(')) for c in "({[") trans1.update((ord(c), ord(')')) for c in ")}]") trans1.update((ord(c), ord(c)) for c in "\"'\\\n#") trans2 = ParseMapN.fromkeys(range(128), 'x') trans2.update((ord(c), '(') for c in "({[") trans2.update((ord(c), ')') for c in ")}]") trans2.update((ord(c), c) for c in "\"'\\\n#") for i in (1, 10, 100, 1000): code = '\t a([{b}])b"c\'d\n' * i print('N ', i) print(timeit.repeat( 'code.translate(trans1)', number=10000, globals = globals())) print('S ', i) print(timeit.repeat( 'code.translate(trans2)', number=10000, globals = globals())) N 1 [0.056562504, 0.056747570, 0.05654651, 0.056460749, 0.056428776] S 1 [0.060795346, 0.062304155, 0.061043432, 0.062349345, 0.061191301] N 10 [0.076474600, 0.076463227, 0.076560984, 0.076581582, 0.076010091] S 10 [0.080739106, 0.080798745, 0.08034192, 0.080987501, 0.080617369] N 100 [0.28529922, 0.28383868, 0.283949046, 0.284461512, 0.284291203] S 100 [0.289629521, 0.288535418, 0.289154560, 0.28811548, 0.28862180] N1000 [2.23882157, 2.2383192, 2.2384120, 2.2377972, 2.23854982] S1000 [2.24242237, 2.2426845, 2.2424623, 2.2420030, 2.24254871] The pattern of all S repeats being greater than all corresponding N repeats was true for 2 other runs.
msg313072 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-28 22:24
New changeset f0daa880a405c8de6743e44fa46006754aa145c9 by Terry Jan Reedy (Cheryl Sabella) in branch 'master': bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862) https://github.com/python/cpython/commit/f0daa880a405c8de6743e44fa46006754aa145c9
msg313073 - (view)	Author: miss-islington (miss-islington)	Date: 2018-02-28 23:08
New changeset 7e5763469e2fc9d08a3f6b6205f87f20a1bdd465 by Miss Islington (bot) in branch '3.7': bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862) https://github.com/python/cpython/commit/7e5763469e2fc9d08a3f6b6205f87f20a1bdd465
msg313074 - (view)	Author: miss-islington (miss-islington)	Date: 2018-02-28 23:12
New changeset 32f5392f64f004382e26a988b1145d2dc96c4978 by Miss Islington (bot) in branch '3.6': bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862) https://github.com/python/cpython/commit/32f5392f64f004382e26a988b1145d2dc96c4978

History
Date	User	Action	Args
2022-04-11 14:58:58	admin	set	github: 77121
2018-02-28 23:12:17	miss-islington	set	messages: + msg313074
2018-02-28 23:08:24	miss-islington	set	nosy: + miss-islington messages: + msg313073
2018-02-28 22:25:13	miss-islington	set	pull_requests: + pull_request5711
2018-02-28 22:24:12	miss-islington	set	stage: needs patch -> patch review pull_requests: + pull_request5710
2018-02-28 22:24:00	terry.reedy	set	messages: + msg313072
2018-02-28 21:49:18	terry.reedy	set	messages: + msg313071
2018-02-28 20:23:43	terry.reedy	set	messages: + msg313068
2018-02-28 05:47:01	serhiy.storchaka	set	messages: + msg313048
2018-02-28 04:57:35	terry.reedy	set	messages: + msg313046
2018-02-25 10:33:20	cheryl.sabella	set	messages: + msg312799
2018-02-25 10:18:39	terry.reedy	set	messages: + msg312796
2018-02-25 07:35:04	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg312778
2018-02-25 02:56:09	terry.reedy	set	messages: + msg312770 stage: patch review -> needs patch
2018-02-25 00:01:24	cheryl.sabella	set	messages: + msg312768
2018-02-24 23:26:59	cheryl.sabella	set	stage: needs patch -> patch review pull_requests: + pull_request5637
2018-02-24 22:53:33	terry.reedy	set	title: IDLE: pyparse - replace StringTranslatePseudoMapping with defaultdict -> IDLE: pyparse - simplify StringTranslatePseudoMapping messages: + msg312764 stage: patch review -> needs patch
2018-02-24 20:07:12	cheryl.sabella	set	messages: + msg312752
2018-02-24 18:29:08	cheryl.sabella	set	title: IDLE: pyparse - replace StringTranslatePseudoMappingTest with defaultdict -> IDLE: pyparse - replace StringTranslatePseudoMapping with defaultdict
2018-02-24 17:45:13	cheryl.sabella	link	issue32880 dependencies
2018-02-24 17:44:26	cheryl.sabella	set	keywords: + patch stage: patch review pull_requests: + pull_request5629
2018-02-24 17:38:40	cheryl.sabella	create