Issue 32880: IDLE: Fix and update and cleanup pyparse

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77061

classification

Title:	IDLE: Fix and update and cleanup pyparse
Type:	behavior	Stage:	test needed
Components:	IDLE	Versions:	Python 3.8, Python 3.7, Python 3.6

process

Status:	open	Resolution:
Dependencies:	32905 32916 32940 32989	Superseder:
Assigned To:	terry.reedy	Nosy List:	cheryl.sabella, serhiy.storchaka, terry.reedy
Priority:	normal	Keywords:

Created on 2018-02-20 02:19 by terry.reedy, last changed 2022-04-11 14:58 by admin.

Messages (21)
msg312394 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-20 02:19
Pyparse was mostly written in the early 2000s, with the only one code change since 2007 in 2014. #32874 will add tests for pyparse 'as is' (though with some docstring and comment changes). This issue will change pyparse code, and change or add test as needed. Here are two items to fix. More will be added. Some fixes might be separate PRs or spun off into separate issues. def dump: dump this; print does the same, without hardcoding sys.__stdout__, which might be None. _synchre: Only used in find_good_parse_start. Missing 'if' (mentioned in function docstring as 'popular' and 'for' and new things like 'with'. 'async' and 'await' are not generally popular, but is not any statement beginning a good parse start?
msg312395 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-20 03:29
Let's consider the todo questions at the end of the class Parser code. --- # XXX - is this used? lastopenbracketpos = None def get_last_open_bracket_pos(self): "Return index of last open bracket or None." self._study2() return self.lastopenbracketpos # XXX - is this used? stmt_bracketing = None def get_last_stmt_bracketing(self): """Return a tuple of the structure of the bracketing of the last interesting statement. Tuple is in the format defined in _study2(). """ self._study2() return self.stmt_bracketing --- get_last_open_bracket_pos is not called anywhere. TODO remove it. get_last_stmt_bracketing is called once in hyperparser and could be replaced there by 'parser._study2(); parser.stmt_bracketing'. TODO? Integer lastopenbracketpos is only conditionally set as an instance attribute, which masks the class attribute, in _study2 with if stack: # of opener indices self.lastopenbracketpos = stack[-1] However, self.lastopenbracketpos in only accessed (as 'j') in compute_bracket_indent, which is only called when a statement is continued to the next line because of an open bracket. Indeed, the call is guarded by assert self.continuation == C_BRACKET # TODO: move up So the class binding is never used because it is always masked bebore being accessed. Its value None would raise if it were. (That may have been intentional.) TODO remove it. Stopping here would break the current tests. We could make _study2 return the index and change test according. Or we can make _study2 always set the attribute with (TODO this) self.lastopenbracketpos = stack[-1] if stack else None and remove the workaround test lines: p.lastopenbracketpos = None Since _study2 unconditionally sets stmt_bracketing as an instance attribute with self.stmt_bracketing = tuple(bracketing) and self.stmt_bracketing is only accessed, in the get function above (or possibly directly in the future), after such a call, its class setting is also useless. TODO remove it. No test change needed.
msg312398 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-20 04:16
set_str sets self.str and self.study_level. After the first call, attempts to access unset instance parse attributes from other modules will raise AttributeError. Removing the unneeded class setting will just add 2 more names to the list of things that should not be called directly. The get_xyz functions (get_continuation_type is another) avoid AttributeError (unless self.str is not set) and ensure validity. If set_str is called more than once, instance parse attributes will initially be set but invalid. So it seems they should be either deleted or set to None. In the latter case, they should also be set to None initially. I will look more at how the module is used.
msg312444 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-20 23:33
StringTranslatePseudoMapping (Mapping) is an awkward and not very descriptive name. It is similar to collections.defaultdict except that it uses a default value rather than a default factory. The latter is so defaults can be mutables. Would a subclass of similar to defaultdict be faster? Unlike, defaultdict, the initialized default is used in get as well as __getitem__. This means that STPM.get ignores a passed-in default, contrary to the definition of Mapping.get. The method override is not needed. The internal self._get is only needed because it is called in both __getitem__ and get. Without the get override, it's body (with 'self.' added, can become the body of __getitem__. I suspect that this would make __getitem__ faster. Test this.
msg312464 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-21 05:52
In def find_good_parse_start: str, pos = self.str, None Using str as an attribute is ok, but using the bare built-in class name for an instance thereof is annoying. 's' would be better, or even 'sample' (to be analyzed).
msg312474 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-21 06:51
Is str = str.replace('xxxxxxxx', 'x') str = str.replace('xxxx', 'x') str = str.replace('xx', 'x') str = str.replace('xx', 'x') really faster than _squashex = re.compile('x+').sub # top of file _squashex('x', str) And are the name bindings faster that regex.func(...)?
msg312477 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-21 08:23
In _study2, let q instead of p be end of last good line and then set p back to beginning. Setting p to end, q to p, and p to start is confusing.
msg312525 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-22 01:31
As noted in the test for find_good_parse_start and PR5755 discussion, a single line header on a line by itself in a multiline comment before a multiline header can prevent recognition of the latter. >>> P.set_str("'somethn=ig'\ndef g(a,\nb\n") >>> P.find_good_parse_start(lambda i: False) 13 >>> P.set_str("'''\ndef f():\n pass'''\ndef g(a,\nb\n") >>> P.find_good_parse_start(lambda i:i < 15) >>> One alternative to the current algorithm would be to search the beginning of every line for a compound statement keyword, not just lines ending with ':'. I believe the concern is that this would require uselessly checking more lines within strings. I believe that the same concern is why 'if' and 'for' are missing from the keyword list. When the window is an editor rather than shell, find_good_parse_start is called in EditorWindow.newline_and_indent_event and Hyperparser.__init__. The call-specific in-string function is returned by EW._build_char_in_string_func. It calls EW.is_char_in_string, which returns False only if the char in the text widget has been examined by the colorizer and not tagged with STRING. The call to find_good_parse_start is always followed by a call to set_lo and and then _study1 (via a call to another function). _study1 replaces runs of non-essential chars with 'x', which means that string literals within the code string are mostly reduced to a single x per line. (It would be fine if they were emptied except for newlines.) This suggests starting find_good_parse_start with a partial reduction, of just string literals, saved for further reduction by _study, so that keywords would never occur within the reduced literal. The problem is that one cannot tell for sure whether ''' or """ is the beginning or end of a multiline literal without parsing from the beginning of the code (which colorizer does). An alternate way to reuse the colorizer work might be to use splitlines on the code and then get all string tag ranges. The code-context option picks out compound-statement header lines. When enabled, I believe that its last line may be the desired good parse start line. Any proposed speedup should be tested by parsing multiple chunks of multiple stdlib modules.
msg312526 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-22 01:42
One sub-issue is to expand the current short doc section (2.1. Automatic indentation) the intended smart indenting. There is nothing about open parameter lists, not about multiline tuple/list/dict displays. It should say (IN CAPS?) that an indent that seems wrong likely indicates a syntax error that has mislead the indenter.
msg312544 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-02-22 08:56
Could this code be used in IDLE? For example highlighting the closed/open brace when cursor stay on open/closed brace or shortcuts for "go to open/closed brace"? Many editors have such functionality.
msg312549 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-22 10:29
The ()[]{} match manager is in parenmatch.py. It imports Hyperparser, which uses pyparse.Parser. When one types a closer, the corresponding opener is also highlighted and, depending on the configured style, everything in between. Very handy for showing when enough closers have been typed. Edit => Show surrounding parens, ^0 on Windows, which is essentially ^) on US keyboards, highlights out to the nearest pair. One can select one end of a pair by putting the cursor just inside a fence char. (A tk Text cursor is between characters, not on.) I don't think I would like automatic flashes when merely moving the cursor, or maybe I don't understand what you propose. On the hand, moving to the nearest opener to the left or closer to the right could be useful. ^] is already used for indent region, but ^{ and ^} (control-shift-bracket) are available and mnemonic. What shortcuts do you know of?
msg312568 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-22 14:30
Terry, In order to not duplicate effort, are you going to be working on these or is it ok if I look at them? Should all the changes happen in one PR or do you want to break them down? Thanks!
msg312607 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-23 00:59
Please go ahead, but coordinate. This issue contains miscellaneous pyparse issues that I thought of while reviewing the tests, plus Serhiy's idea. There should be multiple PRs on dependency issues (which might have more than one closely related PR). #32905 and now #32916 are examples. If you open a spinoff, I will assume that you are working on a PR.
msg312726 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-24 15:38
New issue: find_good_parse_start() call in editor.py has the wrong signature. There's actually a few things going on here. Here's the section in editor: ``` if not self.context_use_ps1: for context in self.num_context_lines: startat = max(lno - context, 1) startatindex = repr(startat) + ".0" rawtext = text.get(startatindex, "insert") y.set_code(rawtext) bod = y.find_good_parse_start( self.context_use_ps1, self._build_char_in_string_func(startatindex)) if bod is not None or startat == 1: break y.set_lo(bod or 0) else: r = text.tag_prevrange("console", "insert") if r: startatindex = r[1] else: startatindex = "1.0" rawtext = text.get(startatindex, "insert") y.set_code(rawtext) y.set_lo(0) ``` 1. self.context_use_ps1 is always False. There's no where in IDLE that it can be set to True. I'll open an another issue for this since it's really not pyparse. 2. The call to find_good_parse_start: ``` bod = y.find_good_parse_start( self.context_use_ps1, self._build_char_in_string_func(startatindex)) ``` sends 3 parameters. And in pyparse, the signature allows 3. However, the signature is: ``` def find_good_parse_start(self, is_char_in_string=None, _synchre=_synchre): ``` This means that the False self.use_context_ps1 is the first value instead of the function, so pyparse is always executing: ``` if not is_char_in_string: # no clue -- make the caller pass everything return None ``` In the test, I had assumed this was for the default of None. Bad assumption. :-( Here's the commit that changed the signature: https://github.com/python/cpython/commit/b17544551fc8dfd1304d5679c6e444cad4d34d97
msg312737 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-24 17:45
For msg312444 on StringTranslatePseudoMapping: I ran a comparison of the current class vs defaultdict. This was my test: ``` with open('/home/cheryl/cpython/Lib/idlelib/pyparse.py') as fd: code = fd.read() copies = 20000 code *= copies for i in range(10): start = time.time() trans_code = code.translate(_tran) end = time.time() print(f'copies: {copies} translate time: {end - start}') ``` I ran the current `_tran` first and saved the results. ``` _tran = {} _tran.update((ord(c), ord('(')) for c in "({[") _tran.update((ord(c), ord(')')) for c in ")}]") _tran.update((ord(c), ord(c)) for c in "\"'\\\n#") _tran = StringTranslatePseudoMapping(_tran, default_value=ord('x')) ``` I know time.time isn't perfect, but thought it might be good enough. The results: copies: 20000 translate time: 0.8162932395935059 copies: 20000 translate time: 0.610985517501831 copies: 20000 translate time: 0.8164870738983154 copies: 20000 translate time: 0.6125986576080322 copies: 20000 translate time: 0.8143167495727539 copies: 20000 translate time: 0.612929105758667 copies: 20000 translate time: 0.8299245834350586 copies: 20000 translate time: 0.6127865314483643 copies: 20000 translate time: 0.812185525894165 copies: 20000 translate time: 0.6151354312896729 Then I changed it to a defaultdict: ``` _tran = defaultdict(lambda: 'x') # _tran = {} _tran.update((ord(c), ord('(')) for c in "({[") _tran.update((ord(c), ord(')')) for c in ")}]") _tran.update((ord(c), ord(c)) for c in "\"'\\\n#") # _tran = StringTranslatePseudoMapping(_tran, default_value=ord('x')) ``` I compared the results to make sure the defaultdict produced the same output as the mapping, which it did. The results: copies: 20000 translate time: 0.8172969818115234 copies: 20000 translate time: 0.6214878559112549 copies: 20000 translate time: 0.8143007755279541 copies: 20000 translate time: 0.6127951145172119 copies: 20000 translate time: 0.8154017925262451 copies: 20000 translate time: 0.6144123077392578 copies: 20000 translate time: 0.8128812313079834 copies: 20000 translate time: 0.6167266368865967 copies: 20000 translate time: 0.8143749237060547 copies: 20000 translate time: 0.6116495132446289 So, the results are simliar, down to the alternating of .61ish and .81ish times.
msg312741 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-24 18:56
I wish I could delete my last message (and duplicate posting of the one before. On the last one, I didn't move my timings. I feel like an idiot. Anyway, there is a difference. ``` start = time.time() for i in range(1000000): _tran = defaultdict(lambda: 'x') _tran.update((ord(c), ord('(')) for c in "({[") _tran.update((ord(c), ord(')')) for c in ")}]") _tran.update((ord(c), ord(c)) for c in "\"'\\\n#") end = time.time() print(f'translate time: {end - start}') ``` translate time: 7.443669319152832 ``` start = time.time() for i in range(1000000): _tran = defaultdict(lambda: 'x') _tran.update({40: 40, # ord('(') 91: 40, # ord('[') 123: 40, # ord('{') 41: 41, # ord(')') 93: 41, # ord(']') 125: 41, # ord('}') 34: 34, # ord('"') 39: 39, # ord("'") 92: 92, # ord("\\") 10: 10, # ord("\n") 35: 35, # ord("#") }) end = time.time() print(f'translate time: {end - start}') ``` translate time: 1.7251780033111572 It's still probably negligible since it's only done once and not a million times. :-)
msg312744 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-24 19:06
For msg312474 regarding replace vs re. Running the 20,000 copy version of the translated text (only translating once gives the following timings: copies: 20000 translate time: 5.591020822525024 copies: 20000 translate time: 5.5614333152771 copies: 20000 translate time: 5.561311483383179 copies: 20000 translate time: 5.558183670043945 copies: 20000 translate time: 5.580726385116577 copies: 20000 translate time: 5.588990688323975 copies: 20000 translate time: 5.570690155029297 copies: 20000 translate time: 5.601408004760742 copies: 20000 translate time: 5.76714825630188 copies: 20000 translate time: 5.697475910186768 And the re version gives: copies: 20000 translate time: 5.935032844543457 copies: 20000 translate time: 5.939348220825195 copies: 20000 translate time: 5.933218240737915 copies: 20000 translate time: 6.070481538772583 copies: 20000 translate time: 6.319685935974121 copies: 20000 translate time: 6.4209065437316895 copies: 20000 translate time: 6.476579666137695 copies: 20000 translate time: 6.520790100097656 copies: 20000 translate time: 6.541554927825928 copies: 20000 translate time: 6.620612859725952 So, it's a little slower on a big string. It also gives slightly different results because of the last replace: code.replace('\nx', '\n') isn't in the re. On a more practical size document, they are about the same: replace: copies: 20 translate time: 0.0058782100677490234 copies: 20 translate time: 0.006024599075317383 copies: 20 translate time: 0.0056345462799072266 copies: 20 translate time: 0.005848884582519531 copies: 20 translate time: 0.005696296691894531 copies: 20 translate time: 0.00574946403503418 copies: 20 translate time: 0.005642890930175781 copies: 20 translate time: 0.005755901336669922 copies: 20 translate time: 0.0058023929595947266 copies: 20 translate time: 0.005713939666748047 re: copies: 20 translate time: 0.005833148956298828 copies: 20 translate time: 0.005682229995727539 copies: 20 translate time: 0.00565028190612793 copies: 20 translate time: 0.005823850631713867 copies: 20 translate time: 0.0057680606842041016 copies: 20 translate time: 0.0058100223541259766 copies: 20 translate time: 0.005717277526855469 copies: 20 translate time: 0.005885601043701172 copies: 20 translate time: 0.005852460861206055 copies: 20 translate time: 0.005867958068847656 It appears the time for the replace is linear and the re is just a little more than linear. Maybe it's just my computer.
msg312830 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2018-02-25 18:30
Looking at the creation of the instances of pyparse.PyParse() and hyperparser.HyperParser(), I was a little surprised that they (the instances) are local variables to the methods and aren't instance variables. Since they are called fairly often, wouldn't it be more efficient to use an instance variable (for example self.hp in calltips instead of hp) and update the attributes when something like open_calltips() is executed? Maybe the overhead of creating a class is neglible compared to the storage of not destroying it every time?
msg312875 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-26 04:07
If you [view] a message, there is an unlink button at the bottom for the author (and believe) and tracker gardeners. I did this for the duplicate and incomplete messages. In the future, lets put timing data for an idea on a separate issue for the idea. This puts everything about the idea in a one place. It is OK if we close the separate issue as rejected because the data show the idea not worthwhile. Unfortunately, there is no way to just move a message to another issue. But it can be copied to another and unlinked from the first.
msg312876 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-26 04:16
As I mentioned before, Parser.set_code strongly suggests that Parser instances were intended to be reused. But you discovered the bug that prevented this, at least for one code result. With that fixed, it should be possible to have one Parser instance per editor. Since hyperparser is used to match each ), ], and } we could look at that too.
msg312877 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-26 05:44
from idlelib.pyparse import Parser import timeit code='def f():\n' print(timeit.timeit("statement", # for example "p=Parser(4,4)", globals = globals())) statement microseconds Parser .1 # test timeit code Parser() 2.2 p.set_code('') 1.4 p.set_code(code) 1.8 '('.translate(map1) 1.5 or 2.2 depending on ParseMap Translate time is longer for real code and Parser() instance creation time must be well less than 1/10, maybe 1/100 of any answer time. If we don't reuse instances, though, set_code should be part of __init__ (and tests changed). Either change is low priority.

History
Date	User	Action	Args
2022-04-11 14:58:58	admin	set	github: 77061
2018-03-03 00:44:56	cheryl.sabella	set	dependencies: + IDLE: Fix pyparse.find_good_parse_start
2018-02-26 05:44:51	terry.reedy	set	messages: + msg312877
2018-02-26 04:16:29	terry.reedy	set	messages: + msg312876
2018-02-26 04:07:52	terry.reedy	set	messages: + msg312875
2018-02-25 18:30:16	cheryl.sabella	set	messages: + msg312830
2018-02-24 21:22:09	terry.reedy	set	messages: - msg312739
2018-02-24 21:21:33	terry.reedy	set	messages: - msg312733
2018-02-24 19:06:56	cheryl.sabella	set	messages: + msg312744
2018-02-24 18:56:45	cheryl.sabella	set	messages: + msg312741
2018-02-24 18:23:23	cheryl.sabella	set	messages: + msg312739
2018-02-24 17:45:13	cheryl.sabella	set	dependencies: + IDLE: pyparse - simplify StringTranslatePseudoMapping messages: + msg312737
2018-02-24 17:30:55	cheryl.sabella	set	messages: + msg312733
2018-02-24 15:38:34	cheryl.sabella	set	messages: + msg312726
2018-02-23 00:59:58	terry.reedy	set	dependencies: + IDLE: change 'str' to 'code' in idlelib.pyparse.PyParse and users messages: + msg312607
2018-02-22 14:30:59	cheryl.sabella	set	messages: + msg312568
2018-02-22 10:29:10	terry.reedy	set	messages: + msg312549
2018-02-22 08:56:56	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg312544
2018-02-22 05:06:47	terry.reedy	set	dependencies: + IDLE: remove unused code in pyparse module
2018-02-22 01:42:03	terry.reedy	set	messages: + msg312526
2018-02-22 01:31:08	terry.reedy	set	messages: + msg312525
2018-02-21 08:23:17	terry.reedy	set	messages: + msg312477
2018-02-21 06:51:11	terry.reedy	set	messages: + msg312474
2018-02-21 05:52:02	terry.reedy	set	messages: + msg312464
2018-02-20 23:33:06	terry.reedy	set	messages: + msg312444
2018-02-20 04:16:22	terry.reedy	set	messages: + msg312398
2018-02-20 03:29:31	terry.reedy	set	messages: + msg312395
2018-02-20 02:19:14	terry.reedy	create