
IDLE: Fix and update and cleanup pyparse #77061

Open
terryjreedy opened this issue Feb 20, 2018 · 21 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes topic-IDLE type-bug An unexpected behavior, bug, or error

Comments

@terryjreedy
Member

BPO 32880
Nosy @terryjreedy, @serhiy-storchaka, @csabella
Dependencies
  • bpo-32905: IDLE: remove unused code in pyparse module
  • bpo-32916: IDLE: change 'str' to 'code' in idlelib.pyparse.PyParse and users
  • bpo-32940: IDLE: pyparse - simplify StringTranslatePseudoMapping
  • bpo-32989: IDLE: Fix pyparse.find_good_parse_start
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/terryjreedy'
    closed_at = None
    created_at = <Date 2018-02-20.02:19:14.840>
    labels = ['3.8', 'expert-IDLE', 'type-bug', '3.7']
    title = 'IDLE: Fix and update and cleanup pyparse'
    updated_at = <Date 2018-03-03.00:44:56.033>
    user = 'https://github.com/terryjreedy'

    bugs.python.org fields:

    activity = <Date 2018-03-03.00:44:56.033>
    actor = 'cheryl.sabella'
    assignee = 'terry.reedy'
    closed = False
    closed_date = None
    closer = None
    components = ['IDLE']
    creation = <Date 2018-02-20.02:19:14.840>
    creator = 'terry.reedy'
    dependencies = ['32905', '32916', '32940', '32989']
    files = []
    hgrepos = []
    issue_num = 32880
    keywords = []
    message_count = 21.0
    messages = ['312394', '312395', '312398', '312444', '312464', '312474', '312477', '312525', '312526', '312544', '312549', '312568', '312607', '312726', '312737', '312741', '312744', '312830', '312875', '312876', '312877']
    nosy_count = 3.0
    nosy_names = ['terry.reedy', 'serhiy.storchaka', 'cheryl.sabella']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'test needed'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32880'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @terryjreedy
    Member Author

    Pyparse was mostly written in the early 2000s, with only one code change since 2007 (in 2014). bpo-32874 will add tests for pyparse 'as is' (though with some docstring and comment changes). This issue will change pyparse code, and change or add tests as needed.

    Here are two items to fix. More will be added. Some fixes might be separate PRs or spun off into separate issues.

    def dump: remove this; print does the same without hardcoding sys.__stdout__, which might be None.

    _synchre: only used in find_good_parse_start. It is missing 'if' (mentioned in the function docstring as 'popular'), 'for', and newer keywords like 'with'. 'async' and 'await' are not generally popular, but is not any statement beginning a good parse start?
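    A sketch of what an extended _synchre could look like. The keyword list and layout here are illustrative, not the actual pyparse pattern; the point is only that adding 'if', 'for', and 'with' to the alternation makes those statement starts recognizable:

    ```python
    import re

    # Hypothetical extension of _synchre: the familiar "popular statement"
    # keywords plus the proposed additions 'if', 'for', 'with'.  The real
    # pattern in pyparse may differ in keyword list and layout.
    _synchre = re.compile(r"""
        ^
        [ \t]*
        (?: while | else | def | return | assert | break | class
        |   continue | elif | try | except | raise | import | yield
        |   if | for | with              # proposed additions
        )
        \b
    """, re.VERBOSE | re.MULTILINE).search

    print(bool(_synchre("x = 1\nif x:\n    pass\n")))       # matches with 'if' added
    print(bool(_synchre("x = 1\nfor i in y:\n    pass\n")))  # matches with 'for' added
    ```
    
    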

    @terryjreedy terryjreedy added 3.7 (EOL) end of life 3.8 only security fixes labels Feb 20, 2018
    @terryjreedy terryjreedy self-assigned this Feb 20, 2018
    @terryjreedy terryjreedy added topic-IDLE type-bug An unexpected behavior, bug, or error labels Feb 20, 2018
    @terryjreedy
    Member Author

    Let's consider the todo questions at the end of the class Parser code.
    ---

        # XXX - is this used?
        lastopenbracketpos = None

        def get_last_open_bracket_pos(self):
            "Return index of last open bracket or None."
            self._study2()
            return self.lastopenbracketpos

        # XXX - is this used?
        stmt_bracketing = None

        def get_last_stmt_bracketing(self):
            """Return a tuple of the structure of the bracketing of the last
            interesting statement.

            Tuple is in the format defined in _study2().
            """
            self._study2()
            return self.stmt_bracketing


    ---

    get_last_open_bracket_pos is not called anywhere. TODO remove it.
    get_last_stmt_bracketing is called once in hyperparser and could be replaced there by 'parser._study2(); parser.stmt_bracketing'. TODO?

    Integer lastopenbracketpos is only conditionally set as an instance attribute, which masks the class attribute, in _study2 with

        if stack:  # of opener indices
            self.lastopenbracketpos = stack[-1]

    However, self.lastopenbracketpos is only accessed (as 'j') in compute_bracket_indent, which is only called when a statement is continued to the next line because of an open bracket. Indeed, the call is guarded by

        assert self.continuation == C_BRACKET  # TODO: move up

    So the class binding is never used because it is always masked before being accessed. Its value None would raise if it were used. (That may have been intentional.) TODO remove it.

    Stopping here would break the current tests. We could make _study2 return the index and change the tests accordingly. Or we can make _study2 always set the attribute with (TODO this)

        self.lastopenbracketpos = stack[-1] if stack else None

    and remove the workaround test lines:

        p.lastopenbracketpos = None

    Since _study2 unconditionally sets stmt_bracketing as an instance attribute with

        self.stmt_bracketing = tuple(bracketing)

    and self.stmt_bracketing is only accessed, in the get function above (or possibly directly in the future), after such a call, its class setting is also useless. TODO remove it. No test change needed.
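    The proposed change can be sketched with a stand-in class. DummyParser and study2_tail are illustrative names; in pyparse the real change would go at the end of Parser._study2:

    ```python
    # Stand-in for pyparse.Parser, showing only the attribute-binding tail
    # of _study2 as proposed above: always bind the instance attributes so
    # the class-level defaults (and the test workaround lines) can go away.
    class DummyParser:
        def study2_tail(self, stack, bracketing):
            self.lastopenbracketpos = stack[-1] if stack else None
            self.stmt_bracketing = tuple(bracketing)

    p = DummyParser()
    p.study2_tail([], [(0, 0)])
    print(p.lastopenbracketpos)        # None: no unclosed bracket
    p.study2_tail([3, 7], [(0, 0), (3, 1)])
    print(p.lastopenbracketpos)        # 7: index of the last opener
    ```
    
    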

    @terryjreedy
    Member Author

    set_str sets self.str and self.study_level. Until the first call, attempts to access unset instance parse attributes from other modules will raise AttributeError. Removing the unneeded class setting will just add 2 more names to the list of things that should not be accessed directly. The get_xyz functions (get_continuation_type is another) avoid AttributeError (unless self.str is not set) and ensure validity.

    If set_str is called more than once, instance parse attributes will initially be set but invalid. So it seems they should be either deleted or set to None. In the latter case, they should also be set to None initially. I will look more at how the module is used.

    @terryjreedy
    Member Author

    StringTranslatePseudoMapping (a Mapping subclass) is an awkward and not very descriptive name. It is similar to collections.defaultdict except that it uses a default value rather than a default factory. The latter exists so that defaults can be mutable. Would a subclass of, or something similar to, defaultdict be faster?

    Unlike defaultdict, the initialized default is used in get as well as __getitem__. This means that STPM.get ignores a passed-in default, contrary to the definition of Mapping.get. The method override is not needed.

    The internal self._get is only needed because it is called in both __getitem__ and get. Without the get override, its body (with 'self.' added) can become the body of __getitem__. I suspect that this would make __getitem__ faster. Test this.
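    A minimal sketch of the defaultdict alternative. The key/value choices mirror pyparse's _tran table as quoted later in this thread; whether it is actually faster still needs measuring:

    ```python
    from collections import defaultdict

    # str.translate only needs __getitem__, which defaultdict supplies.
    # A lookup miss inserts the default into the dict, which is harmless
    # here: source code only contains a small, fixed set of ordinals.
    _tran = defaultdict(lambda: ord('x'))
    _tran.update((ord(c), ord('(')) for c in "({[")
    _tran.update((ord(c), ord(')')) for c in ")}]")
    _tran.update((ord(c), ord(c)) for c in "\"'\\\n#")

    print("a(b[c]{d}".translate(_tran))   # openers/closers kept, rest -> 'x'
    ```
    
    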

    @terryjreedy
    Member Author

    In def find_good_parse_start:

        str, pos = self.str, None

    Using str as an attribute is ok, but using the bare built-in class name for an instance thereof is annoying. 's' would be better, or even 'sample' (to be analyzed).

    @terryjreedy
    Member Author

    Is

        str = str.replace('xxxxxxxx', 'x')
        str = str.replace('xxxx', 'x')
        str = str.replace('xx', 'x')
        str = str.replace('xx', 'x')

    really faster than

        _squashex = re.compile('x+').sub  # top of file
        _squashex('x', str)

    And are the name bindings faster than regex.func(...)?
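    Speed is what the question asks about, but the two approaches can at least be checked for equivalence on ordinary runs:

    ```python
    import re

    # Chained replaces as in _study1: 8 -> 4 -> 2 -> 2.  The repeated
    # 'xx' pass is deliberate: an odd leftover like 'xxx' becomes 'xx'
    # after the first pass and needs the second to reach 'x'.
    def squash_replace(s):
        for pat in ('xxxxxxxx', 'xxxx', 'xx', 'xx'):
            s = s.replace(pat, 'x')
        return s

    _squashex = re.compile('x+').sub   # the proposed regex alternative

    s = 'a' + 'x' * 37 + 'b' + 'xxx' + 'c'
    print(squash_replace(s) == _squashex('x', s))   # agree for runs this size
    ```
    
    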

    @terryjreedy
    Member Author

    In _study2, let q instead of p be end of last good line and then set p back to beginning. Setting p to end, q to p, and p to start is confusing.

    @terryjreedy
    Member Author

    As noted in the test for find_good_parse_start and PR5755 discussion, a single line header on a line by itself in a multiline comment before a multiline header can prevent recognition of the latter.
     
    >>> P.set_str("'somethn=ig'\ndef g(a,\nb\n")
    >>> P.find_good_parse_start(lambda i: False)
    13
    >>> P.set_str("'''\ndef f():\n pass'''\ndef g(a,\nb\n")
    >>> P.find_good_parse_start(lambda i: i < 15)
    >>>

    One alternative to the current algorithm would be to search the beginning of every line for a compound statement keyword, not just lines ending with ':'. I believe the concern is that this would require uselessly checking more lines within strings. I believe that the same concern is why 'if' and 'for' are missing from the keyword list.

    When the window is an editor rather than shell, find_good_parse_start is called in EditorWindow.newline_and_indent_event and Hyperparser.__init__. The call-specific in-string function is returned by EW._build_char_in_string_func. It calls EW.is_char_in_string, which returns False only if the char in the text widget has been examined by the colorizer and not tagged with STRING.

    The call to find_good_parse_start is always followed by a call to set_lo and and then _study1 (via a call to another function). _study1 replaces runs of non-essential chars with 'x', which means that string literals within the code string are mostly reduced to a single x per line. (It would be fine if they were emptied except for newlines.) This suggests starting find_good_parse_start with a partial reduction, of just string literals, saved for further reduction by _study, so that keywords would never occur within the reduced literal.

    The problem is that one cannot tell for sure whether ''' or """ is the beginning or end of a multiline literal without parsing from the beginning of the code (which colorizer does). An alternate way to reuse the colorizer work might be to use splitlines on the code and then get all string tag ranges.

    The code-context option picks out compound-statement header lines. When enabled, I believe that its last line may be the desired good parse start line.

    Any proposed speedup should be tested by parsing multiple chunks of multiple stdlib modules.

    @terryjreedy
    Member Author

    One sub-issue is to expand the current short doc section (2.1. Automatic indentation) to describe the intended smart indenting. There is nothing about open parameter lists, nor about multiline tuple/list/dict displays. It should say (IN CAPS?) that an indent that seems wrong likely indicates a syntax error that has misled the indenter.

    @serhiy-storchaka
    Member

    Could this code be used more in IDLE? For example, highlighting the closing/opening brace when the cursor stays on the opening/closing brace, or shortcuts for "go to opening/closing brace"? Many editors have such functionality.

    @terryjreedy
    Member Author

    The ()[]{} match manager is in parenmatch.py. It imports Hyperparser, which uses pyparse.Parser. When one types a closer, the corresponding opener is also highlighted and, depending on the configured style, everything in between. Very handy for showing when enough closers have been typed. Edit => Show surrounding parens, ^0 on Windows, which is essentially ^) on US keyboards, highlights out to the nearest pair. One can select one end of a pair by putting the cursor just inside a fence char. (A tk Text cursor is between characters, not on one.)

    I don't think I would like automatic flashes when merely moving the cursor, or maybe I don't understand what you propose.

    On the other hand, moving to the nearest opener to the left or closer to the right could be useful. ^] is already used for indent region, but ^{ and ^} (control-shift-bracket) are available and mnemonic. What shortcuts do you know of?

    @csabella
    Contributor

    Terry,

    In order to not duplicate effort, are you going to be working on these or is it ok if I look at them? Should all the changes happen in one PR or do you want to break them down?

    Thanks!

    @terryjreedy
    Member Author

    Please go ahead, but coordinate. This issue contains miscellaneous pyparse issues that I thought of while reviewing the tests, plus Serhiy's idea.

    There should be multiple PRs on dependency issues (which might have more than one closely related PR). bpo-32905 and now bpo-32916 are examples. If you open a spinoff, I will assume that you are working on a PR.

    @csabella
    Contributor

    New issue:

    The find_good_parse_start() call in editor.py has the wrong signature. There are actually a few things going on here. Here's the section in editor:

                if not self.context_use_ps1:
                    for context in self.num_context_lines:
                        startat = max(lno - context, 1)
                        startatindex = repr(startat) + ".0"
                        rawtext = text.get(startatindex, "insert")
                        y.set_code(rawtext)
                        bod = y.find_good_parse_start(
                                  self.context_use_ps1, self._build_char_in_string_func(startatindex))
                        if bod is not None or startat == 1:
                            break
                    y.set_lo(bod or 0)
                else:
                    r = text.tag_prevrange("console", "insert")
                    if r:
                        startatindex = r[1]
                    else:
                        startatindex = "1.0"
                    rawtext = text.get(startatindex, "insert")
                    y.set_code(rawtext)
                    y.set_lo(0)
    
    1. self.context_use_ps1 is always False. There is nowhere in IDLE that it can be set to True. I'll open another issue for this since it's really not pyparse.

    2. The call to find_good_parse_start:

    bod = y.find_good_parse_start(
                                  self.context_use_ps1, self._build_char_in_string_func(startatindex))
    

    sends 3 parameters. And in pyparse, the signature allows 3. However, the signature is:

        def find_good_parse_start(self, is_char_in_string=None,
                                  _synchre=_synchre):
    

    This means that the False self.context_use_ps1 is the first value instead of the function, so pyparse is always executing:

            if not is_char_in_string:
                # no clue -- make the caller pass everything
                return None
    

    In the test, I had assumed this was for the default of None. Bad assumption. :-(

    Here's the commit that changed the signature:
    b175445
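    The argument shift can be shown with a minimal, self-contained stand-in (names are placeholders; the real find_good_parse_start runs a regex search instead of returning 0):

    ```python
    # Stand-in for pyparse.Parser.find_good_parse_start, reduced to the
    # guard in question.
    def find_good_parse_start(is_char_in_string=None, _synchre=None):
        if not is_char_in_string:
            # no clue -- make the caller pass everything
            return None
        return 0   # stand-in for the real keyword search

    in_string = lambda i: False   # stand-in for the in-string callback

    print(find_good_parse_start(in_string))         # 0: the intended call
    print(find_good_parse_start(False, in_string))  # None: editor.py's call --
                                                    # False lands in the
                                                    # is_char_in_string slot
    ```
    
    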

    @csabella
    Contributor

    For msg312444 on StringTranslatePseudoMapping:

    I ran a comparison of the current class vs defaultdict.

    This was my test:

    import time  # needed for the timings below

    with open('/home/cheryl/cpython/Lib/idlelib/pyparse.py') as fd:
        code = fd.read()
    copies = 20000
    code *= copies
    for i in range(10):
        start = time.time()
        trans_code = code.translate(_tran)
        end = time.time()
        print(f'copies: {copies}   translate time: {end - start}')
    

    I ran the current _tran first and saved the results.

    from idlelib.pyparse import StringTranslatePseudoMapping

    _tran = {}
    _tran.update((ord(c), ord('(')) for c in "({[")
    _tran.update((ord(c), ord(')')) for c in ")}]")
    _tran.update((ord(c), ord(c)) for c in "\"'\\\n#")
    _tran = StringTranslatePseudoMapping(_tran, default_value=ord('x'))
    

    I know time.time isn't perfect, but thought it might be good enough. The results:
    copies: 20000 translate time: 0.8162932395935059
    copies: 20000 translate time: 0.610985517501831
    copies: 20000 translate time: 0.8164870738983154
    copies: 20000 translate time: 0.6125986576080322
    copies: 20000 translate time: 0.8143167495727539
    copies: 20000 translate time: 0.612929105758667
    copies: 20000 translate time: 0.8299245834350586
    copies: 20000 translate time: 0.6127865314483643
    copies: 20000 translate time: 0.812185525894165
    copies: 20000 translate time: 0.6151354312896729

    Then I changed it to a defaultdict:

    from collections import defaultdict

    _tran = defaultdict(lambda: 'x')
    # _tran = {}
    _tran.update((ord(c), ord('(')) for c in "({[")
    _tran.update((ord(c), ord(')')) for c in ")}]")
    _tran.update((ord(c), ord(c)) for c in "\"'\\\n#")
    # _tran = StringTranslatePseudoMapping(_tran, default_value=ord('x'))
    

    I compared the results to make sure the defaultdict produced the same output as the mapping, which it did.

    The results:
    copies: 20000 translate time: 0.8172969818115234
    copies: 20000 translate time: 0.6214878559112549
    copies: 20000 translate time: 0.8143007755279541
    copies: 20000 translate time: 0.6127951145172119
    copies: 20000 translate time: 0.8154017925262451
    copies: 20000 translate time: 0.6144123077392578
    copies: 20000 translate time: 0.8128812313079834
    copies: 20000 translate time: 0.6167266368865967
    copies: 20000 translate time: 0.8143749237060547
    copies: 20000 translate time: 0.6116495132446289

    So, the results are similar, down to the alternating of .61ish and .81ish times.

    @csabella
    Contributor

    I wish I could delete my last message (and the duplicate posting of the one before). On the last one, I didn't move my timings. I feel like an idiot. Anyway, there is a difference.

    import time
    from collections import defaultdict

    start = time.time()
    for i in range(1000000):
        _tran = defaultdict(lambda: 'x')
        _tran.update((ord(c), ord('(')) for c in "({[")
        _tran.update((ord(c), ord(')')) for c in ")}]")
        _tran.update((ord(c), ord(c)) for c in "\"'\\\n#")
    end = time.time()
    print(f'translate time: {end - start}')
    

    translate time: 7.443669319152832

    start = time.time()
    for i in range(1000000):
        _tran = defaultdict(lambda: 'x')
        _tran.update({40: 40,    # ord('(')
                      91: 40,    # ord('[')
                      123: 40,   # ord('{')
                      41: 41,    # ord(')')
                      93: 41,    # ord(']')
                      125: 41,   # ord('}')
                      34: 34,    # ord('"')
                      39: 39,    # ord("'")
                      92: 92,    # ord("\\")
                      10: 10,    # ord("\n")
                      35: 35,    # ord("#")
                      })
    end = time.time()
    print(f'translate time: {end - start}')
    

    translate time: 1.7251780033111572

    It's still probably negligible since it's only done once and not a million times. :-)

    @csabella
    Contributor

    For msg312474 regarding replace vs re.

    Running the 20,000 copy version of the translated text (only translating once) gives the following timings:
    copies: 20000 translate time: 5.591020822525024
    copies: 20000 translate time: 5.5614333152771
    copies: 20000 translate time: 5.561311483383179
    copies: 20000 translate time: 5.558183670043945
    copies: 20000 translate time: 5.580726385116577
    copies: 20000 translate time: 5.588990688323975
    copies: 20000 translate time: 5.570690155029297
    copies: 20000 translate time: 5.601408004760742
    copies: 20000 translate time: 5.76714825630188
    copies: 20000 translate time: 5.697475910186768

    And the re version gives:
    copies: 20000 translate time: 5.935032844543457
    copies: 20000 translate time: 5.939348220825195
    copies: 20000 translate time: 5.933218240737915
    copies: 20000 translate time: 6.070481538772583
    copies: 20000 translate time: 6.319685935974121
    copies: 20000 translate time: 6.4209065437316895
    copies: 20000 translate time: 6.476579666137695
    copies: 20000 translate time: 6.520790100097656
    copies: 20000 translate time: 6.541554927825928
    copies: 20000 translate time: 6.620612859725952

    So, it's a little slower on a big string. It also gives slightly different results because of the last replace:
    code.replace('\nx', '\n') isn't in the re.

    On a more practical size document, they are about the same:
    replace:
    copies: 20 translate time: 0.0058782100677490234
    copies: 20 translate time: 0.006024599075317383
    copies: 20 translate time: 0.0056345462799072266
    copies: 20 translate time: 0.005848884582519531
    copies: 20 translate time: 0.005696296691894531
    copies: 20 translate time: 0.00574946403503418
    copies: 20 translate time: 0.005642890930175781
    copies: 20 translate time: 0.005755901336669922
    copies: 20 translate time: 0.0058023929595947266
    copies: 20 translate time: 0.005713939666748047

    re:
    copies: 20 translate time: 0.005833148956298828
    copies: 20 translate time: 0.005682229995727539
    copies: 20 translate time: 0.00565028190612793
    copies: 20 translate time: 0.005823850631713867
    copies: 20 translate time: 0.0057680606842041016
    copies: 20 translate time: 0.0058100223541259766
    copies: 20 translate time: 0.005717277526855469
    copies: 20 translate time: 0.005885601043701172
    copies: 20 translate time: 0.005852460861206055
    copies: 20 translate time: 0.005867958068847656

    It appears the time for the replace is linear and the re is just a little more than linear. Maybe it's just my computer.

    @csabella
    Contributor

    Looking at the creation of the instances of pyparse.PyParse() and hyperparser.HyperParser(), I was a little surprised that they (the instances) are local variables in the methods and aren't instance variables. Since they are created fairly often, wouldn't it be more efficient to use an instance variable (for example, self.hp in calltips instead of hp) and update the attributes when something like open_calltips() is executed? Maybe the overhead of creating an instance is negligible compared to the storage cost of not destroying it every time?

    @terryjreedy
    Member Author

    If you [view] a message, there is an unlink button at the bottom for the author and (I believe) tracker gardeners. I did this for the duplicate and incomplete messages.

    In the future, let's put timing data for an idea on a separate issue for the idea. This puts everything about the idea in one place. It is OK if we close the separate issue as rejected because the data show the idea is not worthwhile.

    Unfortunately, there is no way to just move a message to another issue. But it can be copied to another and unlinked from the first.

    @terryjreedy
    Member Author

    As I mentioned before, Parser.set_code strongly suggests that Parser instances were intended to be reused. But you discovered the bug that prevented this, at least for one code result. With that fixed, it should be possible to have one Parser instance per editor.

    Since hyperparser is used to match each ), ], and } we could look at that too.

    @terryjreedy
    Member Author

        from idlelib.pyparse import Parser
        import timeit
        code = 'def f():\n'
        print(timeit.timeit("statement",  # for example
                            "p = Parser(4, 4)", globals=globals()))

    statement            microseconds
    Parser                .1    # test timeit code (baseline)
    Parser()             2.2
    p.set_code('')       1.4
    p.set_code(code)     1.8
    '('.translate(map1)  1.5 or 2.2, depending on ParseMap

    Translate time is longer for real code, and Parser() instance creation time must be well less than 1/10, maybe 1/100, of any answer time. If we don't reuse instances, though, set_code should be part of __init__ (and tests changed). Either change is low priority.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    3 participants