
IDLE: Fix and update and cleanup pyparse #77061

Open
terryjreedy opened this issue Feb 20, 2018 · 21 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes topic-IDLE type-bug An unexpected behavior, bug, or error

Comments

@terryjreedy
Member

BPO 32880
Nosy @terryjreedy, @serhiy-storchaka, @csabella
Dependencies
  • bpo-32905: IDLE: remove unused code in pyparse module
  • bpo-32916: IDLE: change 'str' to 'code' in idlelib.pyparse.PyParse and users
  • bpo-32940: IDLE: pyparse - simplify StringTranslatePseudoMapping
  • bpo-32989: IDLE: Fix pyparse.find_good_parse_start
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/terryjreedy'
    closed_at = None
    created_at = <Date 2018-02-20.02:19:14.840>
    labels = ['3.8', 'expert-IDLE', 'type-bug', '3.7']
    title = 'IDLE: Fix and update and cleanup pyparse'
    updated_at = <Date 2018-03-03.00:44:56.033>
    user = 'https://github.com/terryjreedy'

    bugs.python.org fields:

    activity = <Date 2018-03-03.00:44:56.033>
    actor = 'cheryl.sabella'
    assignee = 'terry.reedy'
    closed = False
    closed_date = None
    closer = None
    components = ['IDLE']
    creation = <Date 2018-02-20.02:19:14.840>
    creator = 'terry.reedy'
    dependencies = ['32905', '32916', '32940', '32989']
    files = []
    hgrepos = []
    issue_num = 32880
    keywords = []
    message_count = 21.0
    messages = ['312394', '312395', '312398', '312444', '312464', '312474', '312477', '312525', '312526', '312544', '312549', '312568', '312607', '312726', '312737', '312741', '312744', '312830', '312875', '312876', '312877']
    nosy_count = 3.0
    nosy_names = ['terry.reedy', 'serhiy.storchaka', 'cheryl.sabella']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'test needed'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32880'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @terryjreedy
    Member Author

    Pyparse was mostly written in the early 2000s, with only one code change since 2007 (in 2014). bpo-32874 will add tests for pyparse 'as is' (though with some docstring and comment changes). This issue will change pyparse code, and change or add tests as needed.

    Here are two items to fix. More will be added. Some fixes might be separate PRs or spun off into separate issues.

    def dump: remove this; print does the same without hardcoding sys.__stdout__, which might be None.

    _synchre: only used in find_good_parse_start. It is missing 'if' (mentioned in the function docstring as 'popular'), 'for', and newer keywords like 'with'. 'async' and 'await' are not generally popular, but is not any statement beginning a good parse start?
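    A sketch of what an extended _synchre could look like. The keyword list and layout here are illustrative, not the actual pyparse pattern; the point is only that adding 'if', 'for', and 'with' to the alternation makes those statement starts recognizable:

    ```python
    import re

    # Hypothetical extension of _synchre: the familiar "popular statement"
    # keywords plus the proposed additions 'if', 'for', 'with'.  The real
    # pattern in pyparse may differ in keyword list and layout.
    _synchre = re.compile(r"""
        ^
        [ \t]*
        (?: while | else | def | return | assert | break | class
        |   continue | elif | try | except | raise | import | yield
        |   if | for | with              # proposed additions
        )
        \b
    """, re.VERBOSE | re.MULTILINE).search

    print(bool(_synchre("x = 1\nif x:\n    pass\n")))       # matches with 'if' added
    print(bool(_synchre("x = 1\nfor i in y:\n    pass\n")))  # matches with 'for' added
    ```
    
    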

    @terryjreedy terryjreedy added 3.7 (EOL) end of life 3.8 only security fixes labels Feb 20, 2018
    @terryjreedy terryjreedy self-assigned this Feb 20, 2018
    @terryjreedy terryjreedy added topic-IDLE type-bug An unexpected behavior, bug, or error labels Feb 20, 2018
    @terryjreedy
    Member Author

    Let's consider the todo questions at the end of the class Parser code.
    ---

        # XXX - is this used?
        lastopenbracketpos = None

        def get_last_open_bracket_pos(self):
            "Return index of last open bracket or None."
            self._study2()
            return self.lastopenbracketpos

        # XXX - is this used?
        stmt_bracketing = None

        def get_last_stmt_bracketing(self):
            """Return a tuple of the structure of the bracketing of the last
            interesting statement.

            Tuple is in the format defined in _study2().
            """
            self._study2()
            return self.stmt_bracketing


    ---

    get_last_open_bracket_pos is not called anywhere. TODO remove it.
    get_last_stmt_bracketing is called once in hyperparser and could be replaced there by 'parser._study2(); parser.stmt_bracketing'. TODO?

    Integer lastopenbracketpos is only conditionally set as an instance attribute, which masks the class attribute, in _study2 with

        if stack:  # of opener indices
            self.lastopenbracketpos = stack[-1]

    However, self.lastopenbracketpos is only accessed (as 'j') in compute_bracket_indent, which is only called when a statement is continued to the next line because of an open bracket. Indeed, the call is guarded by

        assert self.continuation == C_BRACKET  # TODO: move up

    So the class binding is never used because it is always masked before being accessed. Its value None would raise if it were used. (That may have been intentional.) TODO remove it.

    Stopping here would break the current tests. We could make _study2 return the index and change the tests accordingly. Or we can make _study2 always set the attribute with (TODO this)

        self.lastopenbracketpos = stack[-1] if stack else None

    and remove the workaround test lines:

        p.lastopenbracketpos = None

    Since _study2 unconditionally sets stmt_bracketing as an instance attribute with

        self.stmt_bracketing = tuple(bracketing)

    and self.stmt_bracketing is only accessed, in the get function above (or possibly directly in the future), after such a call, its class setting is also useless. TODO remove it. No test change needed.
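    The proposed change can be sketched with a stand-in class. DummyParser and study2_tail are illustrative names; in pyparse the real change would go at the end of Parser._study2:

    ```python
    # Stand-in for pyparse.Parser, showing only the attribute-binding tail
    # of _study2 as proposed above: always bind the instance attributes so
    # the class-level defaults (and the test workaround lines) can go away.
    class DummyParser:
        def study2_tail(self, stack, bracketing):
            self.lastopenbracketpos = stack[-1] if stack else None
            self.stmt_bracketing = tuple(bracketing)

    p = DummyParser()
    p.study2_tail([], [(0, 0)])
    print(p.lastopenbracketpos)        # None: no unclosed bracket
    p.study2_tail([3, 7], [(0, 0), (3, 1)])
    print(p.lastopenbracketpos)        # 7: index of the last opener
    ```
    
    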

    @terryjreedy
    Member Author

    set_str sets self.str and self.study_level. Until the first call, attempts to access unset instance parse attributes from other modules will raise AttributeError. Removing the unneeded class setting will just add 2 more names to the list of things that should not be accessed directly. The get_xyz functions (get_continuation_type is another) avoid AttributeError (unless self.str is not set) and ensure validity.

    If set_str is called more than once, instance parse attributes will initially be set but invalid. So it seems they should be either deleted or set to None. In the latter case, they should also be set to None initially. I will look more at how the module is used.

    @terryjreedy
    Member Author

    StringTranslatePseudoMapping (a Mapping subclass) is an awkward and not very descriptive name. It is similar to collections.defaultdict except that it uses a default value rather than a default factory. The latter exists so that defaults can be mutable. Would a subclass of, or something similar to, defaultdict be faster?

    Unlike defaultdict, the initialized default is used in get as well as __getitem__. This means that STPM.get ignores a passed-in default, contrary to the definition of Mapping.get. The method override is not needed.

    The internal self._get is only needed because it is called in both __getitem__ and get. Without the get override, its body (with 'self.' added) can become the body of __getitem__. I suspect that this would make __getitem__ faster. Test this.
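    A minimal sketch of the defaultdict alternative. The key/value choices mirror pyparse's _tran table as quoted later in this thread; whether it is actually faster still needs measuring:

    ```python
    from collections import defaultdict

    # str.translate only needs __getitem__, which defaultdict supplies.
    # A lookup miss inserts the default into the dict, which is harmless
    # here: source code only contains a small, fixed set of ordinals.
    _tran = defaultdict(lambda: ord('x'))
    _tran.update((ord(c), ord('(')) for c in "({[")
    _tran.update((ord(c), ord(')')) for c in ")}]")
    _tran.update((ord(c), ord(c)) for c in "\"'\\\n#")

    print("a(b[c]{d}".translate(_tran))   # openers/closers kept, rest -> 'x'
    ```
    
    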

    @terryjreedy
    Member Author

    In def find_good_parse_start:

        str, pos = self.str, None

    Using str as an attribute is ok, but using the bare built-in class name for an instance thereof is annoying. 's' would be better, or even 'sample' (to be analyzed).

    @terryjreedy
    Member Author

    Is

        str = str.replace('xxxxxxxx', 'x')
        str = str.replace('xxxx', 'x')
        str = str.replace('xx', 'x')
        str = str.replace('xx', 'x')

    really faster than

        _squashex = re.compile('x+').sub  # top of file
        _squashex('x', str)

    And are the name bindings faster than regex.func(...)?
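    Speed is what the question asks about, but the two approaches can at least be checked for equivalence on ordinary runs:

    ```python
    import re

    # Chained replaces as in _study1: 8 -> 4 -> 2 -> 2.  The repeated
    # 'xx' pass is deliberate: an odd leftover like 'xxx' becomes 'xx'
    # after the first pass and needs the second to reach 'x'.
    def squash_replace(s):
        for pat in ('xxxxxxxx', 'xxxx', 'xx', 'xx'):
            s = s.replace(pat, 'x')
        return s

    _squashex = re.compile('x+').sub   # the proposed regex alternative

    s = 'a' + 'x' * 37 + 'b' + 'xxx' + 'c'
    print(squash_replace(s) == _squashex('x', s))   # agree for runs this size
    ```
    
    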

    @terryjreedy
    Member Author

    In _study2, let q instead of p be end of last good line and then set p back to beginning. Setting p to end, q to p, and p to start is confusing.

    @terryjreedy
    Member Author

    As noted in the test for find_good_parse_start and PR5755 discussion, a single line header on a line by itself in a multiline comment before a multiline header can prevent recognition of the latter.
     
    >>> P.set_str("'somethn=ig'\ndef g(a,\nb\n")
    >>> P.find_good_parse_start(lambda i: False)
    13
    >>> P.set_str("'''\ndef f():\n pass'''\ndef g(a,\nb\n")
    >>> P.find_good_parse_start(lambda i: i < 15)
    >>>

    One alternative to the current algorithm would be to search the beginning of every line for a compound statement keyword, not just lines ending with ':'. I believe the concern is that this would require uselessly checking more lines within strings. I believe that the same concern is why 'if' and 'for' are missing from the keyword list.

    When the window is an editor rather than shell, find_good_parse_start is called in EditorWindow.newline_and_indent_event and Hyperparser.__init__. The call-specific in-string function is returned by EW._build_char_in_string_func. It calls EW.is_char_in_string, which returns False only if the char in the text widget has been examined by the colorizer and not tagged with STRING.

    The call to find_good_parse_start is always followed by a call to set_lo and and then _study1 (via a call to another function). _study1 replaces runs of non-essential chars with 'x', which means that string literals within the code string are mostly reduced to a single x per line. (It would be fine if they were emptied except for newlines.) This suggests starting find_good_parse_start with a partial reduction, of just string literals, saved for further reduction by _study, so that keywords would never occur within the reduced literal.

    The problem is that one cannot tell for sure whether ''' or """ is the beginning or end of a multiline literal without parsing from the beginning of the code (which colorizer does). An alternate way to reuse the colorizer work might be to use splitlines on the code and then get all string tag ranges.

    The code-context option picks out compound-statement header lines. When enabled, I believe that its last line may be the desired good parse start line.

    Any proposed speedup should be tested by parsing multiple chunks of multiple stdlib modules.

    @terryjreedy
    Member Author

    One sub-issue is to expand the current short doc section (2.1. Automatic indentation) to describe the intended smart indenting. There is nothing about open parameter lists, nor about multiline tuple/list/dict displays. It should say (IN CAPS?) that an indent that seems wrong likely indicates a syntax error that has misled the indenter.

    @serhiy-storchaka
    Member

    Could this code be used more in IDLE? For example, highlighting the closing/opening brace when the cursor stays on the opening/closing brace, or shortcuts for "go to opening/closing brace"? Many editors have such functionality.

    @terryjreedy
    Member Author

    The ()[]{} match manager is in parenmatch.py. It imports Hyperparser, which uses pyparse.Parser. When one types a closer, the corresponding opener is also highlighted and, depending on the configured style, everything in between. Very handy for showing when enough closers have been typed. Edit => Show surrounding parens, ^0 on Windows, which is essentially ^) on US keyboards, highlights out to the nearest pair. One can select one end of a pair by putting the cursor just inside a fence char. (A tk Text cursor is between characters, not on one.)

    I don't think I would like automatic flashes when merely moving the cursor, or maybe I don't understand what you propose.

    On the other hand, moving to the nearest opener to the left or closer to the right could be useful. ^] is already used for indent region, but ^{ and ^} (control-shift-bracket) are available and mnemonic. What shortcuts do you know of?

    @csabella
    Contributor

    Terry,

    In order to not duplicate effort, are you going to be working on these or is it ok if I look at them? Should all the changes happen in one PR or do you want to break them down?

    Thanks!

    @terryjreedy
    Member Author

    Please go ahead, but coordinate. This issue contains miscellaneous pyparse issues that I thought of while reviewing the tests, plus Serhiy's idea.

    There should be multiple PRs on dependency issues (which might have more than one closely related PR). bpo-32905 and now bpo-32916 are examples. If you open a spinoff, I will assume that you are working on a PR.

    @csabella
    Contributor

    New issue:

    The find_good_parse_start() call in editor.py has the wrong signature. There are actually a few things going on here. Here's the section in editor:

                if not self.context_use_ps1:
                    for context in self.num_context_lines:
                        startat = max(lno - context, 1)
                        startatindex = repr(startat) + ".0"
                        rawtext = text.get(startatindex, "insert")
                        y.set_code(rawtext)
                        bod = y.find_good_parse_start(
                                  self.context_use_ps1, self._build_char_in_string_func(startatindex))
                        if bod is not None or startat == 1:
                            break
                    y.set_lo(bod or 0)
                else:
                    r = text.tag_prevrange("console", "insert")
                    if r:
                        startatindex = r[1]
                    else:
                        startatindex = "1.0"
                    rawtext = text.get(startatindex, "insert")
                    y.set_code(rawtext)
                    y.set_lo(0)
    
    1. self.context_use_ps1 is always False. There is nowhere in IDLE that it can be set to True. I'll open another issue for this since it's really not pyparse.

    2. The call to find_good_parse_start:

    bod = y.find_good_parse_start(
                                  self.context_use_ps1, self._build_char_in_string_func(startatindex))
    

    sends 3 parameters. And in pyparse, the signature allows 3. However, the signature is:

        def find_good_parse_start(self, is_char_in_string=None,
                                  _synchre=_synchre):
    

    This means that the False self.context_use_ps1 is the first value instead of the function, so pyparse is always executing:

            if not is_char_in_string:
                # no clue -- make the caller pass everything
                return None
    

    In the test, I had assumed this was for the default of None. Bad assumption. :-(

    Here's the commit that changed the signature:
    b175445
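    The argument shift can be shown with a minimal, self-contained stand-in (names are placeholders; the real find_good_parse_start runs a regex search instead of returning 0):

    ```python
    # Stand-in for pyparse.Parser.find_good_parse_start, reduced to the
    # guard in question.
    def find_good_parse_start(is_char_in_string=None, _synchre=None):
        if not is_char_in_string:
            # no clue -- make the caller pass everything
            return None
        return 0   # stand-in for the real keyword search

    in_string = lambda i: False   # stand-in for the in-string callback

    print(find_good_parse_start(in_string))         # 0: the intended call
    print(find_good_parse_start(False, in_string))  # None: editor.py's call --
                                                    # False lands in the
                                                    # is_char_in_string slot
    ```
    
    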

    @csabella
    Contributor

    For msg312444 on StringTranslatePseudoMapping:

    I ran a comparison of the current class vs defaultdict.

    This was my test:

    import time  # needed for the timings below

    with open('/home/cheryl/cpython/Lib/idlelib/pyparse.py') as fd:
        code = fd.read()
    copies = 20000
    code *= copies
    for i in range(10):
        start = time.time()
        trans_code = code.translate(_tran)
        end = time.time()
        print(f'copies: {copies}   translate time: {end - start}')
    

    I ran the current _tran first and saved the results.

    from idlelib.pyparse import StringTranslatePseudoMapping

    _tran = {}
    _tran.update((ord(c), ord('(')) for c in "({[")
    _tran.update((ord(c), ord(')')) for c in ")}]")
    _tran.update((ord(c), ord(c)) for c in "\"'\\\n#")
    _tran = StringTranslatePseudoMapping(_tran, default_value=ord('x'))
    

    I know time.time isn't perfect, but thought it might be good enough. The results:
    copies: 20000 translate time: 0.8162932395935059
    copies: 20000 translate time: 0.610985517501831
    copies: 20000 translate time: 0.8164870738983154
    copies: 20000 translate time: 0.6125986576080322
    copies: 20000 translate time: 0.8143167495727539
    copies: 20000 translate time: 0.612929105758667
    copies: 20000 translate time: 0.8299245834350586
    copies: 20000 translate time: 0.6127865314483643
    copies: 20000 translate time: 0.812185525894165
    copies: 20000 translate time: 0.6151354312896729

    Then I changed it to a defaultdict:

    from collections import defaultdict

    _tran = defaultdict(lambda: 'x')
    # _tran = {}
    _tran.update((ord(c), ord('(')) for c in "({[")
    _tran.update((ord(c), ord(')')) for c in ")}]")
    _tran.update((ord(c), ord(c)) for c in "\"'\\\n#")
    # _tran = StringTranslatePseudoMapping(_tran, default_value=ord('x'))
    

    I compared the results to make sure the defaultdict produced the same output as the mapping, which it did.

    The results:
    copies: 20000 translate time: 0.8172969818115234
    copies: 20000 translate time: 0.6214878559112549
    copies: 20000 translate time: 0.8143007755279541
    copies: 20000 translate time: 0.6127951145172119
    copies: 20000 translate time: 0.8154017925262451
    copies: 20000 translate time: 0.6144123077392578
    copies: 20000 translate time: 0.8128812313079834
    copies: 20000 translate time: 0.6167266368865967
    copies: 20000 translate time: 0.8143749237060547
    copies: 20000 translate time: 0.6116495132446289

    So, the results are similar, down to the alternating of .61ish and .81ish times.

    @csabella
    Contributor

    I wish I could delete my last message (and the duplicate posting of the one before). On the last one, I didn't move my timings. I feel like an idiot. Anyway, there is a difference.

    import time
    from collections import defaultdict

    start = time.time()
    for i in range(1000000):
        _tran = defaultdict(lambda: 'x')
        _tran.update((ord(c), ord('(')) for c in "({[")
        _tran.update((ord(c), ord(')')) for c in ")}]")
        _tran.update((ord(c), ord(c)) for c in "\"'\\\n#")
    end = time.time()
    print(f'translate time: {end - start}')
    

    translate time: 7.443669319152832

    start = time.time()
    for i in range(1000000):
        _tran = defaultdict(lambda: 'x')
        _tran.update({40: 40,    # ord('(')
                      91: 40,    # ord('[')
                      123: 40,   # ord('{')
                      41: 41,    # ord(')')
                      93: 41,    # ord(']')
                      125: 41,   # ord('}')
                      34: 34,    # ord('"')
                      39: 39,    # ord("'")
                      92: 92,    # ord("\\")
                      10: 10,    # ord("\n")
                      35: 35,    # ord("#")
                      })
    end = time.time()
    print(f'translate time: {end - start}')
    

    translate time: 1.7251780033111572

    It's still probably negligible since it's only done once and not a million times. :-)

    @csabella
    Contributor

    For msg312474 regarding replace vs re.

    Running the 20,000 copy version of the translated text (only translating once) gives the following timings:
    copies: 20000 translate time: 5.591020822525024
    copies: 20000 translate time: 5.5614333152771
    copies: 20000 translate time: 5.561311483383179
    copies: 20000 translate time: 5.558183670043945
    copies: 20000 translate time: 5.580726385116577
    copies: 20000 translate time: 5.588990688323975
    copies: 20000 translate time: 5.570690155029297
    copies: 20000 translate time: 5.601408004760742
    copies: 20000 translate time: 5.76714825630188
    copies: 20000 translate time: 5.697475910186768

    And the re version gives:
    copies: 20000 translate time: 5.935032844543457
    copies: 20000 translate time: 5.939348220825195
    copies: 20000 translate time: 5.933218240737915
    copies: 20000 translate time: 6.070481538772583
    copies: 20000 translate time: 6.319685935974121
    copies: 20000 translate time: 6.4209065437316895
    copies: 20000 translate time: 6.476579666137695
    copies: 20000 translate time: 6.520790100097656
    copies: 20000 translate time: 6.541554927825928
    copies: 20000 translate time: 6.620612859725952

    So, it's a little slower on a big string. It also gives slightly different results because of the last replace:
    code.replace('\nx', '\n') isn't in the re.

    On a more practical size document, they are about the same:
    replace:
    copies: 20 translate time: 0.0058782100677490234
    copies: 20 translate time: 0.006024599075317383
    copies: 20 translate time: 0.0056345462799072266
    copies: 20 translate time: 0.005848884582519531
    copies: 20 translate time: 0.005696296691894531
    copies: 20 translate time: 0.00574946403503418
    copies: 20 translate time: 0.005642890930175781
    copies: 20 translate time: 0.005755901336669922
    copies: 20 translate time: 0.0058023929595947266
    copies: 20 translate time: 0.005713939666748047

    re:
    copies: 20 translate time: 0.005833148956298828
    copies: 20 translate time: 0.005682229995727539
    copies: 20 translate time: 0.00565028190612793
    copies: 20 translate time: 0.005823850631713867
    copies: 20 translate time: 0.0057680606842041016
    copies: 20 translate time: 0.0058100223541259766
    copies: 20 translate time: 0.005717277526855469
    copies: 20 translate time: 0.005885601043701172
    copies: 20 translate time: 0.005852460861206055
    copies: 20 translate time: 0.005867958068847656

    It appears the time for the replace is linear and the re is just a little more than linear. Maybe it's just my computer.

    @csabella
    Contributor

    Looking at the creation of the instances of pyparse.PyParse() and hyperparser.HyperParser(), I was a little surprised that they (the instances) are local variables in the methods and aren't instance variables. Since they are created fairly often, wouldn't it be more efficient to use an instance variable (for example, self.hp in calltips instead of hp) and update the attributes when something like open_calltips() is executed? Maybe the overhead of creating an instance is negligible compared to the storage cost of not destroying it every time?

    @terryjreedy
    Member Author

    If you [view] a message, there is an unlink button at the bottom for the author and (I believe) tracker gardeners. I did this for the duplicate and incomplete messages.

    In the future, let's put timing data for an idea on a separate issue for the idea. This puts everything about the idea in one place. It is OK if we close the separate issue as rejected because the data show the idea is not worthwhile.

    Unfortunately, there is no way to just move a message to another issue. But it can be copied to another and unlinked from the first.

    @terryjreedy
    Member Author

    As I mentioned before, Parser.set_code strongly suggests that Parser instances were intended to be reused. But you discovered the bug that prevented this, at least for one code result. With that fixed, it should be possible to have one Parser instance per editor.

    Since hyperparser is used to match each ), ], and } we could look at that too.

    @terryjreedy
    Member Author

        from idlelib.pyparse import Parser
        import timeit
        code = 'def f():\n'
        print(timeit.timeit("statement",  # for example
                            "p = Parser(4, 4)", globals=globals()))

    statement            microseconds
    Parser                .1    # test timeit code (baseline)
    Parser()             2.2
    p.set_code('')       1.4
    p.set_code(code)     1.8
    '('.translate(map1)  1.5 or 2.2, depending on ParseMap

    Translate time is longer for real code, and Parser() instance creation time must be well less than 1/10, maybe 1/100, of any answer time. If we don't reuse instances, though, set_code should be part of __init__ (and tests changed). Either change is low priority.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    3 participants