msg312734 - (view) |
Author: Cheryl Sabella (cheryl.sabella) *  |
Date: 2018-02-24 17:38 |
Based on timing test on msg312733, StringTranslatePseudoMappingTest and a defaultdict have about the same performance, so this replaces the custom class with the stdlib functionality.
|
msg312752 - (view) |
Author: Cheryl Sabella (cheryl.sabella) *  |
Date: 2018-02-24 20:07 |
Tal had written this on the original issue21765:
----
Finally, since the defaultdict is kept around as long as IDLE is running, I decided to avoid having it grow continually and consume memory unnecessarily. So I wrote a simple Mapping class, which wraps a normal dict and uses a custom default value instead of None, ord('x') in this case. Works like a charm :)
-----
So maybe I misunderstood and this shouldn't be a defaultdict?
|
msg312764 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2018-02-24 22:53 |
I forget about this defaultdict behavior: "this value is inserted in the dictionary for the key, and returned." Reason: when default_factory returns a mutable, d[key] must return the same possibly mutated object with each call. I agree that defaultdict is not the right replacement.
We need to pass to str.translate a dict that can be used by subscripting, newchar = d[char]. So partial(non-defaults.get, default_value) will not work. Instead, we need a __getitem__ that returns the same.
In msg312444 I suggested simplifying STPM (including the name) because it has unneeded complexity. Remove the buggy .get override. Combine the _get stuff in __init__ (also removed) with current __getitem__ and simplify and we get what we actually need (untested, at yet).
def __getitem__
return self._non_defaults.get(self._default_value)
Actually, we could hard-code the default value as 'X' as we never need anything else.
How about ParseMap for the name?
|
msg312768 - (view) |
Author: Cheryl Sabella (cheryl.sabella) *  |
Date: 2018-02-25 00:01 |
New PR submitted. Really glad I worked on this today. I learned a lot of things that were new to me. :-)
|
msg312770 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2018-02-25 02:56 |
I simplified ParseMap to a dict subclass with one override -- __getitem__ and then the tests. They run faster. I suspect translate is faster.
|
msg312778 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2018-02-25 07:35 |
For efficiency I suggest to initialize the mapping with dict.fromkeys(range(128), 'x') rather of an empty dict.
It is also possible to use regular expressions:
_trans = re.compile(r'''[^(){}\[]"'\\\n#]+''')
code = _trans.sub('x', code)
code = code.replace('{', '(')
code = code.replace('}', ')')
code = code.replace('[', '(')
code = code.replace(']', '(')
code = code.replace('\nx', '\n')
I didn't check what way is more efficient.
|
msg312796 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2018-02-25 10:18 |
To me, it is plausible but not slam-dunk obvious that preloading ascii to 'x' mappings will make ascii lookup faster.
On #21765, where the pyparse special translation was a side-issue, Tal Einat claimed that the unpublished regex he tried was 100x slower.
|
msg312799 - (view) |
Author: Cheryl Sabella (cheryl.sabella) *  |
Date: 2018-02-25 10:33 |
A similar regular expression version was mentioned on issue21765 and I had run some tests on it yesterday to verify. On my system, it ran at a factor of 10x slower, so if the translate finished in 0.003, the regex took 0.03. This was consistent for me, regardless of how big I made the document.
The reason for not using a defauldict was to keep the 'x' mappings out of the dictionary so that it wouldn't grow and take up space. Although, I did realize yesterday that it wasn't really boundless because most values in source code would be ASCII. Running both the version the doesn't add the 'x' mappings and the `fromkeys`, there doesn't seem to be much of a difference in time when processing the doc.
|
msg313046 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2018-02-28 04:57 |
I settled on the following to compare ParseMap implementations.
from idlelib.pyparse import Parser
import timeit
class ParseGet(dict):
def __getitem__(self, key): return self.get(key, ord('x'))
class ParseMis(dict):
def __missing__(self, key): return ord('x')
for P in (ParseGet, ParseMis):
print(P.__name__, 'hit', 'miss')
p = p=P({i:i for i in (10, 34, 35, 39, 40, 41, 91, 92, 93, 123, 125)})
print(timeit.timeit(
"p[10],p[34],p[35],p[39],p[40],p[41],p[91],p[92],p[93],p[125]",
number=100000, globals = globals()))
print(timeit.timeit(
"p[11],p[33],p[36],p[45],p[50],p[61],p[71],p[82],p[99],p[125]",
number=100000, globals = globals()))
ParseGet hit miss
1.104342376
1.112531999
ParseMis hit miss
0.3530207070000002
1.2165967760000003
ParseGet hit miss
1.185322191
1.1915449519999999
ParseMis hit miss
0.3477272720000002
1.317010653
Avoiding custom code for all ascii chars will be a win. I am sure that calling __missing__ for non-ascii will be at least as fast as it is presently. I will commit a revision tomorrow.
I may then compare to Serhiy's sub/replace suggestion. My experiments with 'code.translate(tran)' indicate that time grows sub-linearly up to 1000 or 10000 chars. This suggests that there are significant constant or log-like terms.
|
msg313048 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2018-02-28 05:47 |
Don't use ord('x'). Use just 'x' or b'x'[0].
|
msg313068 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2018-02-28 20:23 |
Replacing an expression with a less clear equivalent expression makes no sense to me. Anyway, having __missing__ return 120 reduces the benchmark miss time from 1.2-1.3 to .93, making ParseMis always faster than ParseGet and reducing the penalty for non-ascii chars.
re.sub + str.replace is slower than translate
import re
import timeit
class ParseMap(dict):
def __missing__(self, key): return 120 # ord('x')
trans = ParseMap((i,120) for i in range(128))
trans.update((ord(c), ord('(')) for c in "({[")
trans.update((ord(c), ord(')')) for c in ")}]")
trans.update((ord(c), ord(c)) for c in "\"'\\\n#")
trans_re = re.compile(r'''[^(){}\[]"'\\\n#]+''')
code='\t a([{b}])b"c\'d\n'*1000 # n = 1, 10, 100, 1000
print(timeit.timeit(
'code.translate(trans)',
number=10000, globals = globals()))
print(timeit.timeit(
"code1 = trans_re.sub('x', code)\n"
"code2 = code1.replace('{', '(')\n"
"code3 = code2.replace('}', ')')\n"
"code4 = code3.replace('[', '(')\n"
"code5 = code4.replace(']', '(')\n"
r"code6 = code5.replace('\nx', '\n')",
number=10000, globals = globals()))
n trans re
1 .06 .09
10 .08 .17
100 .28 1.00
1000 2.2 8.9
Multiply by 100 to get microseconds or seconds for 1000000.
|
msg313071 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2018-02-28 21:49 |
The mapping passed to str.translate must map ints representing codepoints to either either ints or strings. Translate can extract binary codepoints for the new string from either. Ints are slightly faster, so I am inclined not to switch.
import timeit
class ParseMapN(dict):
def __missing__(self, key): return 120
class ParseMapS(dict):
def __missing__(self, key): return 'x'
trans1 = ParseMapN.fromkeys(range(128), 120)
trans1.update((ord(c), ord('(')) for c in "({[")
trans1.update((ord(c), ord(')')) for c in ")}]")
trans1.update((ord(c), ord(c)) for c in "\"'\\\n#")
trans2 = ParseMapN.fromkeys(range(128), 'x')
trans2.update((ord(c), '(') for c in "({[")
trans2.update((ord(c), ')') for c in ")}]")
trans2.update((ord(c), c) for c in "\"'\\\n#")
for i in (1, 10, 100, 1000):
code = '\t a([{b}])b"c\'d\n' * i
print('N ', i)
print(timeit.repeat(
'code.translate(trans1)',
number=10000, globals = globals()))
print('S ', i)
print(timeit.repeat(
'code.translate(trans2)',
number=10000, globals = globals()))
N 1 [0.056562504, 0.056747570, 0.05654651, 0.056460749, 0.056428776]
S 1 [0.060795346, 0.062304155, 0.061043432, 0.062349345, 0.061191301]
N 10 [0.076474600, 0.076463227, 0.076560984, 0.076581582, 0.076010091]
S 10 [0.080739106, 0.080798745, 0.08034192, 0.080987501, 0.080617369]
N 100 [0.28529922, 0.28383868, 0.283949046, 0.284461512, 0.284291203]
S 100 [0.289629521, 0.288535418, 0.289154560, 0.28811548, 0.28862180]
N1000 [2.23882157, 2.2383192, 2.2384120, 2.2377972, 2.23854982]
S1000 [2.24242237, 2.2426845, 2.2424623, 2.2420030, 2.24254871]
The pattern of all S repeats being greater than all corresponding N repeats was true for 2 other runs.
|
msg313072 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2018-02-28 22:24 |
New changeset f0daa880a405c8de6743e44fa46006754aa145c9 by Terry Jan Reedy (Cheryl Sabella) in branch 'master':
bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
https://github.com/python/cpython/commit/f0daa880a405c8de6743e44fa46006754aa145c9
|
msg313073 - (view) |
Author: miss-islington (miss-islington) |
Date: 2018-02-28 23:08 |
New changeset 7e5763469e2fc9d08a3f6b6205f87f20a1bdd465 by Miss Islington (bot) in branch '3.7':
bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
https://github.com/python/cpython/commit/7e5763469e2fc9d08a3f6b6205f87f20a1bdd465
|
msg313074 - (view) |
Author: miss-islington (miss-islington) |
Date: 2018-02-28 23:12 |
New changeset 32f5392f64f004382e26a988b1145d2dc96c4978 by Miss Islington (bot) in branch '3.6':
bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
https://github.com/python/cpython/commit/32f5392f64f004382e26a988b1145d2dc96c4978
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:58 | admin | set | github: 77121 |
2018-02-28 23:12:17 | miss-islington | set | messages:
+ msg313074 |
2018-02-28 23:08:24 | miss-islington | set | nosy:
+ miss-islington messages:
+ msg313073
|
2018-02-28 22:25:13 | miss-islington | set | pull_requests:
+ pull_request5711 |
2018-02-28 22:24:12 | miss-islington | set | stage: needs patch -> patch review pull_requests:
+ pull_request5710 |
2018-02-28 22:24:00 | terry.reedy | set | messages:
+ msg313072 |
2018-02-28 21:49:18 | terry.reedy | set | messages:
+ msg313071 |
2018-02-28 20:23:43 | terry.reedy | set | messages:
+ msg313068 |
2018-02-28 05:47:01 | serhiy.storchaka | set | messages:
+ msg313048 |
2018-02-28 04:57:35 | terry.reedy | set | messages:
+ msg313046 |
2018-02-25 10:33:20 | cheryl.sabella | set | messages:
+ msg312799 |
2018-02-25 10:18:39 | terry.reedy | set | messages:
+ msg312796 |
2018-02-25 07:35:04 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages:
+ msg312778
|
2018-02-25 02:56:09 | terry.reedy | set | messages:
+ msg312770 stage: patch review -> needs patch |
2018-02-25 00:01:24 | cheryl.sabella | set | messages:
+ msg312768 |
2018-02-24 23:26:59 | cheryl.sabella | set | stage: needs patch -> patch review pull_requests:
+ pull_request5637 |
2018-02-24 22:53:33 | terry.reedy | set | title: IDLE: pyparse - replace StringTranslatePseudoMapping with defaultdict -> IDLE: pyparse - simplify StringTranslatePseudoMapping messages:
+ msg312764 stage: patch review -> needs patch |
2018-02-24 20:07:12 | cheryl.sabella | set | messages:
+ msg312752 |
2018-02-24 18:29:08 | cheryl.sabella | set | title: IDLE: pyparse - replace StringTranslatePseudoMappingTest with defaultdict -> IDLE: pyparse - replace StringTranslatePseudoMapping with defaultdict |
2018-02-24 17:45:13 | cheryl.sabella | link | issue32880 dependencies |
2018-02-24 17:44:26 | cheryl.sabella | set | keywords:
+ patch stage: patch review pull_requests:
+ pull_request5629 |
2018-02-24 17:38:40 | cheryl.sabella | create | |