Issue712900
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2003-03-31 19:54 by glchapman, last changed 2022-04-10 16:07 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
srepatch.txt | glchapman, 2003-03-31 19:55 |
Messages (12) | |||
---|---|---|---|
msg43246 - (view) | Author: Greg Chapman (glchapman) | Date: 2003-03-31 19:54 | |
The attached patch fixes two bugs in _sre.c; it also does a bit of reorganization. First the bugs. 672491 points out that lastindex is calculated differently in 2.3 than in previous versions. This patch restores the previous behavior. Since lastindex cannot be restored (when backtracking) from lastmark alone, it is now saved and restored independently (by the LASTMARK_SAVE and RESTORE macros). The second bug appears when minimizing repeats are combined with assertions: >>> re.match('([ab]*?)(?=(b)?)c', 'abc').groups() ('ab', 'b') The second group should be None, since the 'b' is consumed by the first group. To fix this, it is necessary to save lastmark before attempting to match the tail in OP_MIN_UNTIL and to restore it if the tail fails to match. The reorganization has to do with the handling of the SRE_STATE's lastmark and mark array. The mark array tracks the start and end of capturing groups; lastmark is the highest index in the array so far encountered. Previously, whenever lastmark was restored back to a lower value (in 2.3a2 this is done in the lastmark_restore function), the tail of the mark array was NULLed out (using memset). This patch adopts the rule that all indexes greater than lastmark are invalid, so restoring lastmark does not also require clearing the tail. To ensure that indexes <= lastmark have valid pointers, OP_MARK checks if lastmark is being increased by more than one; if so, it NULLs out the intervening pointers. This rule also required changes to the GROUPREF opcodes and the state_getslice function to ensure that they do not access indexes greater than lastmark. For consistency, lastmark is now initialized to –1, to indicate that no entries in the mark array are valid. Needless to say, the reorganization is not necessary to fix the bugs; it may be a bad idea. It seems to be marginally faster than a version that fixes the bugs but is similar to the current code (including a memset inside the LASTMARK_RESTORE macro). One other thing. I have removed a test for string == Py_None from state_getslice, since I can’t find any way for string to be Py_None at that point (string is always the object providing the text to be searched; if it were Py_None, an exception should be raised by the getstring function called by state_init). Perhaps I missed something? |
|||
msg43247 - (view) | Author: Gustavo Niemeyer (niemeyer) * ![]() |
Date: 2003-04-19 23:53 | |
Logged In: YES user_id=7887 Greg, this patch doesn't seem to compile in the latest CVS. Are you able to update it? |
|||
msg43248 - (view) | Author: Gustavo Niemeyer (niemeyer) * ![]() |
Date: 2003-04-20 01:51 | |
Logged In: YES user_id=7887 Greg, thanks for mentioning that. I have fixed the first bug by myself, by backing out some of the changes I made in _sre.c:2.84. Can you please check it out and verify if it has the behavior you'd expect now? |
|||
msg43249 - (view) | Author: Gustavo Niemeyer (niemeyer) * ![]() |
Date: 2003-04-20 02:35 | |
Logged In: YES user_id=7887 Backing out the changes was not enough. I'll evaluate your solution. |
|||
msg43250 - (view) | Author: Andrew I MacIntyre (aimacintyre) * ![]() |
Date: 2003-04-20 05:08 | |
Logged In: YES user_id=250749 Note that Guido checked in patch #720991, which incorporates some of Greg's work (at least that's what patch author Gary Herron notes) on April 14. That checkin makes _sre.c (more) sensitive to optimisation settings for me (see my comment against that patch). |
|||
msg43251 - (view) | Author: Gustavo Niemeyer (niemeyer) * ![]() |
Date: 2003-04-20 07:47 | |
Logged In: YES user_id=7887 Greg, thank you very much for the patch. I've included the fixes without introducing the reorganization mentioned, for the sake of stability. Also, the second fix mentioned in your report don't fix the mentioned problem anymore, because of the change introduced by patch #720991. The new fix wasn't complicated though, and is included as well. About the reorganization, I don't have a strong opinion about it, and I'm also not sure about the real impact the new "mark reset" code will cause in the module. IMO we need to test it further. Andrew, that explains why the second fix greg mentions doesn't work anymore. Thanks for mentioning it. The changes were applied as: Modules/_sre.c: 2.91 Lib/test/re_tests.py: 1.35 Lib/test/test_sre.py: 1.41 |
|||
msg43252 - (view) | Author: Greg Chapman (glchapman) | Date: 2003-04-20 12:33 | |
Logged In: YES user_id=86307 Gustavo, thanks for reworking and applying the patch. And thanks for catching the bug in the MIN_REPEAT_ONE where I was not calling lastmark_restore in the right place; it does need to be inside the loop (though I think it could be at the bottom -- it's not necessary if SRE_COUNT returns 0). I note in your checkin message you are concerned about whether all calls to SRE_MATCH need to be protected by LASTMATCH_SAVE/RESTORE. These should only be necessary where SRE_MATCH returns 0 but nevertheless matching continues (i.e., a backtrack point or an alternative). So they are clearly not necessary for ASSERT and REPEAT. ASSERT_NOT is an interesting case. Right now, a capturing group (inside the negative assertion) which matches before the subpattern fails is reported as having matched; this is the way Perl works. You could argue that it would be more consistent for these groups always to report as unmatched (Jeffrey Friedl implies as much in the first edition of his book), in which case the ASSERT_NOT should be surrounded by LASTMARK_SAVE/RESTORE. Anyway, it should not be necessary to protect the first part of the MAX_UNTIL and MIN_UNTIL opcodes (the part which assures the repeat has matched the minimum number of times); if this part fails, it causes this invocation of SRE_MATCH to return 0. I note that you have protected this part of MAX_UNTIL but not MIN_UNTIL. We should probably be consistent here and remove the protection from MAX_UNTIL. |
|||
msg43253 - (view) | Author: Gustavo Niemeyer (niemeyer) * ![]() |
Date: 2003-04-22 15:23 | |
Logged In: YES user_id=7887 Greg, what I meant is that I *think* that every SRE_MATCH() recursion with *failing* possibilities should be protected by SAVE/RESTORE(), as it can potentially change lastmark/lastindex before returning, depending on what is being executed. Do you belive the first part of MAX_UNTIL shouldn't be protected, even though it could match a subpattern including marks and fail? If so, can you explain? OTOH, I agree with you that MAX_UNTIL and MIN_UNTIL are not coherent. I'll fix that in the safe side for now, including mark protection in both. If we end up discovering it was not necessary, we simply drop it. |
|||
msg43254 - (view) | Author: Greg Chapman (glchapman) | Date: 2003-04-22 19:21 | |
Logged In: YES user_id=86307 Gustavo, in my understanding, it doesn't matter if the first part of MAX_UNTIL is protected because if its call to SRE_MATCH returns 0, it (the first part of MAX_UNTIL) simply returns 0 to its caller. If that caller is a backtrack point (a point which may resume matching even though a call to SRE_MATCH returned 0), it should have already saved lastmark, etc. and so should be ready to restore them to its saved values. If that caller is not a backtrack point, it is either the original invocation of SRE_MATCH (in which case it doesn't matter what is in the state since the match has failed) or it is someplace which is going to return 0 to its caller. FYI, I have uploaded a couple of more sre bug reports/patches in the last couple of days; I'd appreciate it if you look at them. http://www.python.org/sf/725106 http://www.python.org/sf/725149 In particular, the second patch changes ASSERT_NOT so that (I hope) it no longer violates the assumptions above. |
|||
msg43255 - (view) | Author: Gustavo Niemeyer (niemeyer) * ![]() |
Date: 2003-04-22 20:05 | |
Logged In: YES user_id=7887 Hummm.. interesting. Indeed, we could use two different strategies. We could either restore marks whenever failling, so that backtrackings don't have to worry about that (it knows it was restored if the recursion failed), or use your approach and protect only when *not* failling. While I was thinking about the first one, the later (your suggestion) seems more logic, and fast. Either way, we clearly have a mix of both today (for example, we have no mark protection in OP_REPEAT), and that's not very smart. I'll study that code further, to make sure that we're really on the right track, and also check your bugs. Thank you very much for explaining and discussing. |
|||
msg43256 - (view) | Author: Gustavo Niemeyer (niemeyer) * ![]() |
Date: 2003-04-22 20:11 | |
Logged In: YES user_id=7887 Nevermind about OP_REPEAT. It won't backtrack so it doesn't matter. |
|||
msg43257 - (view) | Author: Neal Norwitz (nnorwitz) * ![]() |
Date: 2004-10-21 03:04 | |
Logged In: YES user_id=33168 It seems as both problems addressed in this patch have been fixed. Closing. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-10 16:07:58 | admin | set | github: 38246 |
2003-03-31 19:54:33 | glchapman | create |