Title: IDLE: pyparse - simplify StringTranslatePseudoMapping
Type: enhancement Stage: patch review
Components: IDLE Versions: Python 3.8, Python 3.7, Python 3.6
Status: open Resolution:
Dependencies: Superseder:
Assigned To: terry.reedy Nosy List: cheryl.sabella, miss-islington, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2018-02-24 17:38 by cheryl.sabella, last changed 2018-02-28 23:12 by miss-islington.

Pull Requests
URL Status Linked Edit
PR 5854 closed cheryl.sabella, 2018-02-24 17:44
PR 5862 merged cheryl.sabella, 2018-02-24 23:26
PR 5940 merged miss-islington, 2018-02-28 22:24
PR 5941 merged miss-islington, 2018-02-28 22:25
Messages (15)
msg312734 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2018-02-24 17:38
Based on timing test on msg312733, StringTranslatePseudoMappingTest and a defaultdict have about the same performance, so this replaces the custom class with the stdlib functionality.
msg312752 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2018-02-24 20:07
Tal had written this on the original issue21765:
Finally, since the defaultdict is kept around as long as IDLE is running, I decided to avoid having it grow continually and consume memory unnecessarily. So I wrote a simple Mapping class, which wraps a normal dict and uses a custom default value instead of None, ord('x') in this case. Works like a charm :)
So maybe I misunderstood and this shouldn't be a defaultdict?
msg312764 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-24 22:53
I forget about this defaultdict behavior: "this value is inserted in the dictionary for the key, and returned."  Reason: when default_factory returns a mutable, d[key] must return the same possibly mutated object with each call.  I agree that defaultdict is not the right replacement.

We need to pass to str.translate a dict that can be used by subscripting, newchar = d[char].  So partial(non-defaults.get, default_value) will not work.  Instead, we need a __getitem__ that returns the same.

In msg312444 I suggested simplifying STPM (including the name) because it has unneeded complexity.  Remove the buggy .get override.  Combine the _get stuff in __init__ (also removed) with current __getitem__ and simplify and we get what we actually need (untested, at yet).

def __getitem__
    return self._non_defaults.get(self._default_value)

Actually, we could hard-code the default value as 'X' as we never need anything else.  

How about ParseMap for the name?
msg312768 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2018-02-25 00:01
New PR submitted.  Really glad I worked on this today.  I learned a lot of things that were new to me.  :-)
msg312770 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-25 02:56
I simplified ParseMap to a dict subclass with one override -- __getitem__ and then the tests.  They run faster.  I suspect translate is faster.
msg312778 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-25 07:35
For efficiency I suggest to initialize the mapping with dict.fromkeys(range(128), 'x') rather of an empty dict.

It is also possible to use regular expressions:

_trans = re.compile(r'''[^(){}\[]"'\\\n#]+''')
code = _trans.sub('x', code)
code = code.replace('{', '(')
code = code.replace('}', ')')
code = code.replace('[', '(')
code = code.replace(']', '(')
code = code.replace('\nx', '\n')

I didn't check what way is more efficient.
msg312796 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-25 10:18
To me, it is plausible but not slam-dunk obvious that preloading ascii to 'x' mappings will make ascii lookup faster.

On #21765, where the pyparse special translation was a side-issue, Tal Einat claimed that the unpublished regex he tried was 100x slower.
msg312799 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2018-02-25 10:33
A similar regular expression version was mentioned on issue21765 and I had run some tests on it yesterday to verify.  On my system, it ran at a factor of 10x slower, so if the translate finished in 0.003, the regex took 0.03.  This was consistent for me, regardless of how big I made the document.

The reason for not using a defauldict was to keep the 'x' mappings out of the dictionary so that it wouldn't grow and take up space.  Although, I did realize yesterday that it wasn't really boundless because most values in source code would be ASCII.  Running both the version the doesn't add the 'x' mappings and the `fromkeys`, there doesn't seem to be much of a difference in time when processing the doc.
msg313046 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-28 04:57
I settled on the following to compare ParseMap implementations.

from idlelib.pyparse import Parser
import timeit

class ParseGet(dict):
    def __getitem__(self, key): return self.get(key, ord('x'))
class ParseMis(dict):
    def __missing__(self, key): return ord('x')

for P in (ParseGet, ParseMis):
    print(P.__name__, 'hit', 'miss')
    p = p=P({i:i for i in (10, 34, 35, 39, 40, 41, 91, 92, 93, 123, 125)})
        number=100000, globals = globals()))
        number=100000, globals = globals()))

ParseGet hit miss
ParseMis hit miss

ParseGet hit miss
ParseMis hit miss

Avoiding custom code for all ascii chars will be a win.  I am sure that calling __missing__ for non-ascii will be at least as fast as it is presently.  I will commit a revision tomorrow.  

I may then compare to Serhiy's sub/replace suggestion.  My experiments with 'code.translate(tran)' indicate that time grows sub-linearly up to 1000 or 10000 chars.  This suggests that there are significant constant or log-like terms.
msg313048 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-28 05:47
Don't use ord('x'). Use just 'x' or b'x'[0].
msg313068 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-28 20:23
Replacing an expression with a less clear equivalent expression makes no sense to me.  Anyway, having __missing__ return 120 reduces the benchmark miss time from 1.2-1.3 to .93, making ParseMis always faster than ParseGet and reducing the penalty for non-ascii chars.

re.sub + str.replace is slower than translate

import re
import timeit

class ParseMap(dict):
    def __missing__(self, key): return 120  # ord('x')

trans = ParseMap((i,120) for i in range(128))
trans.update((ord(c), ord('(')) for c in "({[")
trans.update((ord(c), ord(')')) for c in ")}]")
trans.update((ord(c), ord(c)) for c in "\"'\\\n#")

trans_re = re.compile(r'''[^(){}\[]"'\\\n#]+''')
code='\t a([{b}])b"c\'d\n'*1000  # n = 1, 10, 100, 1000

    number=10000, globals = globals()))
    "code1 = trans_re.sub('x', code)\n"
    "code2 = code1.replace('{', '(')\n"
    "code3 = code2.replace('}', ')')\n"
    "code4 = code3.replace('[', '(')\n"
    "code5 = code4.replace(']', '(')\n"
    r"code6 = code5.replace('\nx', '\n')",
    number=10000, globals = globals()))

n     trans   re
1      .06    .09 
10     .08    .17
100    .28   1.00
1000  2.2    8.9

Multiply by 100 to get microseconds or seconds for 1000000.
msg313071 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-28 21:49
The mapping passed to str.translate must map ints representing codepoints to either either ints or strings.  Translate can extract binary codepoints for the new string from either. Ints are slightly faster, so I am inclined not to switch.

import timeit

class ParseMapN(dict):
    def __missing__(self, key): return 120
class ParseMapS(dict):
    def __missing__(self, key): return 'x'

trans1 = ParseMapN.fromkeys(range(128), 120)
trans1.update((ord(c), ord('(')) for c in "({[")
trans1.update((ord(c), ord(')')) for c in ")}]")
trans1.update((ord(c), ord(c)) for c in "\"'\\\n#")

trans2 = ParseMapN.fromkeys(range(128), 'x')
trans2.update((ord(c), '(') for c in "({[")
trans2.update((ord(c), ')') for c in ")}]")
trans2.update((ord(c), c) for c in "\"'\\\n#")

for i in (1, 10, 100, 1000):
    code = '\t a([{b}])b"c\'d\n' * i
    print('N ', i)
        number=10000, globals = globals()))
    print('S ', i)
        number=10000, globals = globals()))
N   1 [0.056562504, 0.056747570, 0.05654651,  0.056460749, 0.056428776]
S   1 [0.060795346, 0.062304155, 0.061043432, 0.062349345, 0.061191301]

N  10 [0.076474600, 0.076463227, 0.076560984, 0.076581582, 0.076010091]
S  10 [0.080739106, 0.080798745, 0.08034192,  0.080987501, 0.080617369]

N 100 [0.28529922,  0.28383868,  0.283949046, 0.284461512, 0.284291203]
S 100 [0.289629521, 0.288535418, 0.289154560, 0.28811548,  0.28862180]

N1000 [2.23882157,  2.2383192,   2.2384120,   2.2377972,   2.23854982]
S1000 [2.24242237,  2.2426845,   2.2424623,   2.2420030,   2.24254871]

The pattern of all S repeats being greater than all corresponding N repeats was true for 2 other runs.
msg313072 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-28 22:24
New changeset f0daa880a405c8de6743e44fa46006754aa145c9 by Terry Jan Reedy (Cheryl Sabella) in branch 'master':
bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
msg313073 - (view) Author: miss-islington (miss-islington) Date: 2018-02-28 23:08
New changeset 7e5763469e2fc9d08a3f6b6205f87f20a1bdd465 by Miss Islington (bot) in branch '3.7':
bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
msg313074 - (view) Author: miss-islington (miss-islington) Date: 2018-02-28 23:12
New changeset 32f5392f64f004382e26a988b1145d2dc96c4978 by Miss Islington (bot) in branch '3.6':
bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
Date User Action Args
2018-02-28 23:12:17miss-islingtonsetmessages: + msg313074
2018-02-28 23:08:24miss-islingtonsetnosy: + miss-islington
messages: + msg313073
2018-02-28 22:25:13miss-islingtonsetpull_requests: + pull_request5711
2018-02-28 22:24:12miss-islingtonsetstage: needs patch -> patch review
pull_requests: + pull_request5710
2018-02-28 22:24:00terry.reedysetmessages: + msg313072
2018-02-28 21:49:18terry.reedysetmessages: + msg313071
2018-02-28 20:23:43terry.reedysetmessages: + msg313068
2018-02-28 05:47:01serhiy.storchakasetmessages: + msg313048
2018-02-28 04:57:35terry.reedysetmessages: + msg313046
2018-02-25 10:33:20cheryl.sabellasetmessages: + msg312799
2018-02-25 10:18:39terry.reedysetmessages: + msg312796
2018-02-25 07:35:04serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg312778
2018-02-25 02:56:09terry.reedysetmessages: + msg312770
stage: patch review -> needs patch
2018-02-25 00:01:24cheryl.sabellasetmessages: + msg312768
2018-02-24 23:26:59cheryl.sabellasetstage: needs patch -> patch review
pull_requests: + pull_request5637
2018-02-24 22:53:33terry.reedysettitle: IDLE: pyparse - replace StringTranslatePseudoMapping with defaultdict -> IDLE: pyparse - simplify StringTranslatePseudoMapping
messages: + msg312764
stage: patch review -> needs patch
2018-02-24 20:07:12cheryl.sabellasetmessages: + msg312752
2018-02-24 18:29:08cheryl.sabellasettitle: IDLE: pyparse - replace StringTranslatePseudoMappingTest with defaultdict -> IDLE: pyparse - replace StringTranslatePseudoMapping with defaultdict
2018-02-24 17:45:13cheryl.sabellalinkissue32880 dependencies
2018-02-24 17:44:26cheryl.sabellasetkeywords: + patch
stage: patch review
pull_requests: + pull_request5629
2018-02-24 17:38:40cheryl.sabellacreate