Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDLE: pyparse - simplify StringTranslatePseudoMapping #77121

Closed
csabella opened this issue Feb 24, 2018 · 17 comments
Closed

IDLE: pyparse - simplify StringTranslatePseudoMapping #77121

csabella opened this issue Feb 24, 2018 · 17 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes topic-IDLE type-feature A feature request or enhancement

Comments

@csabella
Copy link
Contributor

BPO 32940
Nosy @terryjreedy, @serhiy-storchaka, @csabella, @miss-islington
PRs
  • bpo-32940: IDLE: in pyparse, replace mapping class with defaultdict #5854
  • bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse #5862
  • [3.7] bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862) #5940
  • [3.6] bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862) #5941
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/terryjreedy'
    closed_at = None
    created_at = <Date 2018-02-24.17:38:40.168>
    labels = ['3.8', 'expert-IDLE', 'type-feature', '3.7']
    title = 'IDLE: pyparse - simplify StringTranslatePseudoMapping'
    updated_at = <Date 2018-02-28.23:12:17.496>
    user = 'https://github.com/csabella'

    bugs.python.org fields:

    activity = <Date 2018-02-28.23:12:17.496>
    actor = 'miss-islington'
    assignee = 'terry.reedy'
    closed = False
    closed_date = None
    closer = None
    components = ['IDLE']
    creation = <Date 2018-02-24.17:38:40.168>
    creator = 'cheryl.sabella'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32940
    keywords = ['patch']
    message_count = 15.0
    messages = ['312734', '312752', '312764', '312768', '312770', '312778', '312796', '312799', '313046', '313048', '313068', '313071', '313072', '313073', '313074']
    nosy_count = 4.0
    nosy_names = ['terry.reedy', 'serhiy.storchaka', 'cheryl.sabella', 'miss-islington']
    pr_nums = ['5854', '5862', '5940', '5941']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue32940'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @csabella
    Copy link
    Contributor Author

    Based on timing test on msg312733, StringTranslatePseudoMappingTest and a defaultdict have about the same performance, so this replaces the custom class with the stdlib functionality.

    @csabella csabella added 3.7 (EOL) end of life 3.8 only security fixes labels Feb 24, 2018
    @csabella csabella added topic-IDLE type-feature A feature request or enhancement labels Feb 24, 2018
    @csabella csabella changed the title IDLE: pyparse - replace StringTranslatePseudoMappingTest with defaultdict IDLE: pyparse - replace StringTranslatePseudoMapping with defaultdict Feb 24, 2018
    @csabella
    Copy link
    Contributor Author

    Tal had written this on the original bpo-21765:
    ----
    Finally, since the defaultdict is kept around as long as IDLE is running, I decided to avoid having it grow continually and consume memory unnecessarily. So I wrote a simple Mapping class, which wraps a normal dict and uses a custom default value instead of None, ord('x') in this case. Works like a charm :)
    -----
    So maybe I misunderstood and this shouldn't be a defaultdict?

    @terryjreedy
    Copy link
    Member

    I forget about this defaultdict behavior: "this value is inserted in the dictionary for the key, and returned." Reason: when default_factory returns a mutable, d[key] must return the same possibly mutated object with each call. I agree that defaultdict is not the right replacement.

    We need to pass to str.translate a dict that can be used by subscripting, newchar = d[char]. So partial(non-defaults.get, default_value) will not work. Instead, we need a __getitem__ that returns the same.

    In msg312444 I suggested simplifying STPM (including the name) because it has unneeded complexity. Remove the buggy .get override. Combine the _get stuff in __init__ (also removed) with current __getitem__ and simplify and we get what we actually need (untested, at yet).

    def __getitem__
    return self._non_defaults.get(self._default_value)

    Actually, we could hard-code the default value as 'X' as we never need anything else.

    How about ParseMap for the name?

    @terryjreedy terryjreedy changed the title IDLE: pyparse - replace StringTranslatePseudoMapping with defaultdict IDLE: pyparse - simplify StringTranslatePseudoMapping Feb 24, 2018
    @csabella
    Copy link
    Contributor Author

    New PR submitted. Really glad I worked on this today. I learned a lot of things that were new to me. :-)

    @terryjreedy
    Copy link
    Member

    I simplified ParseMap to a dict subclass with one override -- __getitem__ and then the tests. They run faster. I suspect translate is faster.

    @serhiy-storchaka
    Copy link
    Member

    For efficiency I suggest to initialize the mapping with dict.fromkeys(range(128), 'x') rather of an empty dict.

    It is also possible to use regular expressions:

    _trans = re.compile(r'''[^(){}\[]"'\\\n#]+''')
    code = _trans.sub('x', code)
    code = code.replace('{', '(')
    code = code.replace('}', ')')
    code = code.replace('[', '(')
    code = code.replace(']', '(')
    code = code.replace('\nx', '\n')

    I didn't check what way is more efficient.

    @terryjreedy
    Copy link
    Member

    To me, it is plausible but not slam-dunk obvious that preloading ascii to 'x' mappings will make ascii lookup faster.

    On bpo-21765, where the pyparse special translation was a side-issue, Tal Einat claimed that the unpublished regex he tried was 100x slower.

    @csabella
    Copy link
    Contributor Author

    A similar regular expression version was mentioned on bpo-21765 and I had run some tests on it yesterday to verify. On my system, it ran at a factor of 10x slower, so if the translate finished in 0.003, the regex took 0.03. This was consistent for me, regardless of how big I made the document.

    The reason for not using a defauldict was to keep the 'x' mappings out of the dictionary so that it wouldn't grow and take up space. Although, I did realize yesterday that it wasn't really boundless because most values in source code would be ASCII. Running both the version the doesn't add the 'x' mappings and the fromkeys, there doesn't seem to be much of a difference in time when processing the doc.

    @terryjreedy
    Copy link
    Member

    I settled on the following to compare ParseMap implementations.

    from idlelib.pyparse import Parser
    import timeit
    
    class ParseGet(dict):
        def __getitem__(self, key): return self.get(key, ord('x'))
    class ParseMis(dict):
        def __missing__(self, key): return ord('x')
    
    for P in (ParseGet, ParseMis):
        print(P.__name__, 'hit', 'miss')
        p = p=P({i:i for i in (10, 34, 35, 39, 40, 41, 91, 92, 93, 123, 125)})
        print(timeit.timeit(
            "p[10],p[34],p[35],p[39],p[40],p[41],p[91],p[92],p[93],p[125]",
            number=100000, globals = globals()))
        print(timeit.timeit(
            "p[11],p[33],p[36],p[45],p[50],p[61],p[71],p[82],p[99],p[125]",
            number=100000, globals = globals()))

    ParseGet hit miss
    1.104342376
    1.112531999
    ParseMis hit miss
    0.3530207070000002
    1.2165967760000003

    ParseGet hit miss
    1.185322191
    1.1915449519999999
    ParseMis hit miss
    0.3477272720000002
    1.317010653

    Avoiding custom code for all ascii chars will be a win. I am sure that calling __missing__ for non-ascii will be at least as fast as it is presently. I will commit a revision tomorrow.

    I may then compare to Serhiy's sub/replace suggestion. My experiments with 'code.translate(tran)' indicate that time grows sub-linearly up to 1000 or 10000 chars. This suggests that there are significant constant or log-like terms.

    @serhiy-storchaka
    Copy link
    Member

    Don't use ord('x'). Use just 'x' or b'x'[0].

    @terryjreedy
    Copy link
    Member

    Replacing an expression with a less clear equivalent expression makes no sense to me. Anyway, having __missing__ return 120 reduces the benchmark miss time from 1.2-1.3 to .93, making ParseMis always faster than ParseGet and reducing the penalty for non-ascii chars.

    re.sub + str.replace is slower than translate

    import re
    import timeit
    
    class ParseMap(dict):
        def __missing__(self, key): return 120  # ord('x')
    
    trans = ParseMap((i,120) for i in range(128))
    trans.update((ord(c), ord('(')) for c in "({[")
    trans.update((ord(c), ord(')')) for c in ")}]")
    trans.update((ord(c), ord(c)) for c in "\"'\\\n#")
    
    trans_re = re.compile(r'''[^(){}\[]"'\\\n#]+''')
    code='\t a([{b}])b"c\'d\n'*1000  # n = 1, 10, 100, 1000

    print(timeit.timeit(
    'code.translate(trans)',
    number=10000, globals = globals()))
    print(timeit.timeit(
    "code1 = trans_re.sub('x', code)\n"
    "code2 = code1.replace('{', '(')\n"
    "code3 = code2.replace('}', ')')\n"
    "code4 = code3.replace('[', '(')\n"
    "code5 = code4.replace(']', '(')\n"
    r"code6 = code5.replace('\nx', '\n')",
    number=10000, globals = globals()))

    n trans re
    1 .06 .09
    10 .08 .17
    100 .28 1.00
    1000 2.2 8.9

    Multiply by 100 to get microseconds or seconds for 1000000.

    @terryjreedy
    Copy link
    Member

    The mapping passed to str.translate must map ints representing codepoints to either either ints or strings. Translate can extract binary codepoints for the new string from either. Ints are slightly faster, so I am inclined not to switch.

    import timeit
    
    class ParseMapN(dict):
        def __missing__(self, key): return 120
    class ParseMapS(dict):
        def __missing__(self, key): return 'x'
    
    trans1 = ParseMapN.fromkeys(range(128), 120)
    trans1.update((ord(c), ord('(')) for c in "({[")
    trans1.update((ord(c), ord(')')) for c in ")}]")
    trans1.update((ord(c), ord(c)) for c in "\"'\\\n#")
    
    trans2 = ParseMapN.fromkeys(range(128), 'x')
    trans2.update((ord(c), '(') for c in "({[")
    trans2.update((ord(c), ')') for c in ")}]")
    trans2.update((ord(c), c) for c in "\"'\\\n#")
    
    for i in (1, 10, 100, 1000):
        code = '\t a([{b}])b"c\'d\n' * i
        print('N ', i)
        print(timeit.repeat(
            'code.translate(trans1)',
            number=10000, globals = globals()))
        print('S ', i)
        print(timeit.repeat(
            'code.translate(trans2)',
            number=10000, globals = globals()))
     
    N   1 [0.056562504, 0.056747570, 0.05654651,  0.056460749, 0.056428776]
    S   1 [0.060795346, 0.062304155, 0.061043432, 0.062349345, 0.061191301]

    N 10 [0.076474600, 0.076463227, 0.076560984, 0.076581582, 0.076010091]
    S 10 [0.080739106, 0.080798745, 0.08034192, 0.080987501, 0.080617369]

    N 100 [0.28529922, 0.28383868, 0.283949046, 0.284461512, 0.284291203]
    S 100 [0.289629521, 0.288535418, 0.289154560, 0.28811548, 0.28862180]

    N1000 [2.23882157, 2.2383192, 2.2384120, 2.2377972, 2.23854982]
    S1000 [2.24242237, 2.2426845, 2.2424623, 2.2420030, 2.24254871]

    The pattern of all S repeats being greater than all corresponding N repeats was true for 2 other runs.

    @terryjreedy
    Copy link
    Member

    New changeset f0daa88 by Terry Jan Reedy (Cheryl Sabella) in branch 'master':
    bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
    f0daa88

    @miss-islington
    Copy link
    Contributor

    New changeset 7e57634 by Miss Islington (bot) in branch '3.7':
    bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
    7e57634

    @miss-islington
    Copy link
    Contributor

    New changeset 32f5392 by Miss Islington (bot) in branch '3.6':
    bpo-32940: IDLE: Simplify StringTranslatePseudoMapping in pyparse (GH-5862)
    32f5392

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @erlend-aasland
    Copy link
    Contributor

    Are there more work to be done here, or can we close this issue?

    @terryjreedy
    Copy link
    Member

    As near as I can tell now, I tested all proposed alternatives before committing. So all done.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes topic-IDLE type-feature A feature request or enhancement
    Projects
    Status: Done
    Development

    No branches or pull requests

    5 participants