Message 312525 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author	terry.reedy
Recipients	cheryl.sabella, terry.reedy
Date	2018-02-22.01:31:06
Message-id	<1519263068.47.0.467229070634.issue32880@psf.upfronthosting.co.za>
In-reply-to

Content

As noted in the test for find_good_parse_start and the PR5755 discussion, a single-line header on a line by itself in a multiline string (used as a comment) before a multiline header can prevent recognition of the latter.
 
>>> P.set_str("'somethn=ig'\ndef g(a,\nb\n")
>>> P.find_good_parse_start(lambda i: False)
13
>>> P.set_str("'''\ndef f():\n pass'''\ndef g(a,\nb\n")
>>> P.find_good_parse_start(lambda i: i < 15)
>>>

One alternative to the current algorithm would be to search the beginning of every line for a compound-statement keyword, not just lines ending with ':'.  I believe the concern is that this would require uselessly checking more lines within strings.  I believe that the same concern is why 'if' and 'for' are missing from the keyword list.
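
A minimal sketch of that alternative, for illustration only.  It is not the pyparse code: the function name, the keyword tuple (which includes 'if' and 'for'), and the forward scan are all assumptions made for the example.  The in-string predicate has the same shape as the one passed to find_good_parse_start.

```python
import re

# Hypothetical keyword list for this sketch only.  Unlike the list
# discussed above, it includes 'if' and 'for'.
KEYWORDS = ("class", "def", "elif", "else", "except", "finally",
            "for", "if", "try", "while", "with")

# Match a keyword at the start of any line, after optional indentation,
# whether or not the line ends with ':'.
_header_start = re.compile(
    r"^[ \t]*(?:%s)\b" % "|".join(KEYWORDS), re.MULTILINE)


def find_last_header_start(code, is_char_in_string):
    """Return the offset of the last line that starts with a keyword and
    is not reported as being inside a string, or None if there is none."""
    pos = None
    for m in _header_start.finditer(code):
        if not is_char_in_string(m.start()):
            pos = m.start()
    return pos


# The two cases from the message above.
print(find_last_header_start("'somethn=ig'\ndef g(a,\nb\n",
                             lambda i: False))        # 13
print(find_last_header_start("'''\ndef f():\n pass'''\ndef g(a,\nb\n",
                             lambda i: i < 15))       # 22
```

The sketch calls the in-string predicate once for every keyword line, including those inside the docstring, which is exactly the extra work the concern above is about.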

When the window is an editor rather than a shell, find_good_parse_start is called in EditorWindow.newline_and_indent_event and HyperParser.__init__.  The call-specific in-string function is returned by EditorWindow._build_char_in_string_func.  The returned function calls EditorWindow.is_char_in_string, which returns False only if the char in the text widget has been examined by the colorizer and not tagged with STRING.

The call to find_good_parse_start is always followed by a call to set_lo and then _study1 (via a call to another function).  _study1 replaces runs of non-essential chars with 'x', which means that string literals within the code string are mostly reduced to a single x per line.  (It would be fine if they were emptied except for newlines.)  This suggests starting find_good_parse_start with a partial reduction, of just string literals, saved for further reduction by _study1, so that keywords would never occur within the reduced literal.
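
A rough sketch of such a partial reduction, under stated assumptions.  The function name is hypothetical, the scanner is hand-written for the example rather than taken from pyparse, and it assumes the text starts outside any string literal.  Instead of collapsing a literal to a single 'x' per line, it replaces every character except newlines with 'x', so that offsets into the reduced text are still valid offsets into the original.

```python
def blank_string_literals(code):
    """Return code with the contents of string literals replaced by 'x'.

    Newlines and quote characters are kept, so the result has the same
    length and line structure as the input.  Comments are copied as-is
    so that quotes inside them do not start a literal.
    """
    out = []
    i, n = 0, len(code)
    while i < n:
        ch = code[i]
        if ch == "#":
            # Copy the comment up to, but not including, the newline.
            j = code.find("\n", i)
            if j == -1:
                j = n
            out.append(code[i:j])
            i = j
        elif ch in "'\"":
            triple = code[i:i + 3]
            quote = triple if triple in ("'''", '"""') else ch
            j = i + len(quote)
            while j < n and not code.startswith(quote, j):
                if code[j] == "\\":
                    j += 1          # skip the escaped character
                elif code[j] == "\n" and len(quote) == 1:
                    break           # unterminated single-line literal
                j += 1
            j = min(j, n)
            body = code[i + len(quote):j]
            out.append(quote)
            out.append("".join(c if c == "\n" else "x" for c in body))
            if code.startswith(quote, j):
                out.append(quote)
                j += len(quote)
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


reduced = blank_string_literals("'''\ndef f():\n pass'''\ndef g(a,\nb\n")
print(repr(reduced))   # "'''\nxxxxxxxx\nxxxxx'''\ndef g(a,\nb\n"
```

After this reduction, 'def f():' no longer appears in the text, so a keyword search cannot stop inside the literal.  The sketch does not handle nested quotes inside f-string replacement fields.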

The problem is that one cannot tell for sure whether ''' or """ is the beginning or end of a multiline literal without parsing from the beginning of the code (which the colorizer does).  An alternate way to reuse the colorizer work might be to use splitlines on the code and then get all string tag ranges.
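
The splitlines idea could look roughly like the following.  This is a sketch under assumptions, not IDLE code: the function name is hypothetical, and the ranges are passed in as plain Tk-style 'line.col' string pairs (lines 1-based, columns 0-based, end exclusive) so that the example runs without a text widget.  In IDLE they would presumably come from the widget, for example from the tag ranges of the STRING tag.  Newlines are assumed to be '\n', as in a Tk text widget.

```python
def mask_tagged_ranges(code, ranges):
    """Replace the characters covered by ranges with 'x', line by line.

    ranges is an iterable of (start, end) pairs of 'line.col' strings.
    Newlines are never replaced, so the length of the text is unchanged.
    """
    lines = code.splitlines(keepends=True)

    def parse(index):
        line, col = index.split(".")
        return int(line), int(col)

    for start, end in ranges:
        (l1, c1), (l2, c2) = parse(start), parse(end)
        for lineno in range(l1, min(l2, len(lines)) + 1):
            text = lines[lineno - 1]
            body = text[:-1] if text.endswith("\n") else text
            lo = c1 if lineno == l1 else 0
            hi = min(c2, len(body)) if lineno == l2 else len(body)
            hi = max(hi, lo)
            lines[lineno - 1] = (body[:lo] + "x" * (hi - lo)
                                 + body[hi:] + text[len(body):])
    return "".join(lines)


code = "'''\ndef f():\n pass'''\ndef g(a,\nb\n"
# Made-up range covering the whole triple-quoted literal, quotes included.
masked = mask_tagged_ranges(code, [("1.0", "3.8")])
print(repr(masked))   # 'xxx\nxxxxxxxx\nxxxxxxxx\ndef g(a,\nb\n'
```

Because the ranges come from work already done, no quote matching is needed here; the cost is one pass over the tagged lines.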

The code-context option picks out compound-statement header lines.  When it is enabled, I believe that the last line it displays may be the desired good parse start line.

Any proposed speedup should be tested by parsing multiple chunks of multiple stdlib modules.
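
One possible shape for such a test, as a sketch only.  The helper name, the choice of modules, and the chunk sizes are assumptions for illustration; find_start stands for whatever candidate function is being measured and takes the code string as its only argument.

```python
import importlib
import inspect
import time


def time_on_stdlib(find_start, module_names, chunk_lines=(50, 200, 800)):
    """Time find_start(code) on leading chunks of stdlib module sources.

    Returns a dict mapping (module_name, number_of_lines) to elapsed
    seconds.  Chunks longer than the module source are skipped.
    """
    results = {}
    for name in module_names:
        source = inspect.getsource(importlib.import_module(name))
        lines = source.splitlines(keepends=True)
        for n in chunk_lines:
            if n > len(lines):
                continue
            code = "".join(lines[:n])
            t0 = time.perf_counter()
            find_start(code)
            results[name, n] = time.perf_counter() - t0
    return results


# Example run with a do-nothing candidate; real use would pass the
# current and the proposed implementations and compare the timings.
for key, seconds in time_on_stdlib(lambda code: None,
                                   ["textwrap", "argparse"]).items():
    print(key, seconds)
```

Single timings like these are noisy; repeating each call and taking the minimum would give steadier numbers.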

History
Date	User	Action	Args
2018-02-22 01:31:08	terry.reedy	set	recipients: + terry.reedy, cheryl.sabella
2018-02-22 01:31:08	terry.reedy	set	messageid: <1519263068.47.0.467229070634.issue32880@psf.upfronthosting.co.za>
2018-02-22 01:31:08	terry.reedy	link	issue32880 messages
2018-02-22 01:31:06	terry.reedy	create