classification
Title: The new parser segfaults when parsing invalid input
Type: crash Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Zac Hatfield-Dodds, gvanrossum, lys.nikolaou, pablogsal, vstinner
Priority: deferred blocker Keywords: patch

Created on 2020-05-17 16:01 by pablogsal, last changed 2020-05-18 17:32 by pablogsal. This issue is now closed.

Files
File name Uploaded Description Edit
reproducer.py pablogsal, 2020-05-17 16:01
Pull Requests
URL Status Linked Edit
PR 20159 closed gvanrossum, 2020-05-17 18:14
PR 20165 merged lys.nikolaou, 2020-05-17 23:11
Messages (13)
msg369132 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-17 16:01
The new peg parser segfaults when parsing the attached reproducer.

this is due to the fact that the exception set by `tokenizer_error` is not properly propagated.
msg369133 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-17 16:04
I think we may need to test for the error indicator (and maybe PyErr_Ocurred for safety) before every alternative. Something like:

diff --git a/Tools/peg_generator/pegen/c_generator.py b/Tools/peg_generator/pegen/c_generator.py
index 8f9972bb41..61cb694628 100644
--- a/Tools/peg_generator/pegen/c_generator.py
+++ b/Tools/peg_generator/pegen/c_generator.py
@@ -468,10 +468,6 @@ class CParserGenerator(ParserGenerator, GrammarVisitor):
         memoize = self._should_memoize(node)

         with self.indent():
-            self.print("if (p->error_indicator) {")
-            with self.indent():
-                self.print("return NULL;")
-            self.print("}")
             self.print(f"{result_type} _res = NULL;")
             if memoize:
                 self.print(f"if (_PyPegen_is_memoized(p, {node.name}_type, &_res))")
@@ -685,6 +681,12 @@ class CParserGenerator(ParserGenerator, GrammarVisitor):
     def visit_Alt(
         self, node: Alt, is_loop: bool, is_gather: bool, rulename: Optional[str]
     ) -> None:
+        self.print("if (p->error_indicator == 1 || PyErr_Occurred()) {")
+        with self.indent():
+            self.print("p->error_indicator = 1;")
+            self.print("return NULL;")
+        self.print("}")
+
         self.print(f"{{ // {node}")
         with self.indent():
             # Prepare variable declarations for the alternative
msg369134 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-17 16:05
Indeed, that diff solves the problem
msg369136 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-17 16:19
How costly is PyErr_Occurred()? That worries me most, otherwise I’d accept this right away.
msg369142 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-17 18:00
A quick benchmark using xxl.py:

Base time (master):
Time: 9.386 seconds on an average of 20 runs

With the patch in this issue:
Time: 9.498 seconds on an average of 20 runs

Sadly I could not test with PGO/LTO and without CPU isolation, so it would be great if someone could double-check these numbers.

Also, I will be unable to do a PR until this night/tomorrow morning (London time) :(
msg369144 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-17 18:11
I see almost no time difference for 'make time_stdlib': before 3.471, after 3.451. But I see a serious difference for 'make time_compile': before 3.474, after 4.996. That's over 40% slower (on the extreme case, xxl.py).

I'll prepare a PR just in case.
msg369164 - (view) Author: Zac Hatfield-Dodds (Zac Hatfield-Dodds) * Date: 2020-05-18 02:25
I understand from Paul Ganssle that this bug was found using Hypothesmith in my stdlib property tests (reported at https://github.com/Zac-HD/stdlib-property-tests/issues/14).

As discussed in https://github.com/we-like-parsers/cpython/issues/91 and https://pyfound.blogspot.com/2020/05/property-based-testing-for-python.html I'm keen to help out how I can, so if there's anything more specific than "write tools, write test, and wait" please let me know!

Best,
Zac
msg369176 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-18 04:45
Zac: The reproducer here apparently uses a long string of weird accented characters. I'm not sure how to generalize from that to other things that Hyothes* could find. But maybe this helps: https://github.com/python/cpython/pull/20106#issuecomment-629742075
msg369181 - (view) Author: Zac Hatfield-Dodds (Zac Hatfield-Dodds) * Date: 2020-05-18 06:04
I know what else it might find either, but I still think it's worth running property-based tests in CI to find out!  The demo I wrote for my language summit talk doesn't have any parser tests, but still would have caught this bug in the pull request that introduced it.


The specific reproducer here is odd, because it's reported as an internal error in Hypothesmith - I use the `compile()` builtin to check that strings valid against an approximate grammar are actually valid.

It's structurally less complex than typical outputs because it's only a fragment of the tree being generated, but because shrinking doesn't run for generation-time errors it's also much harder to interpret than usual.
msg369235 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-18 14:42
Unfortunately, I do not understand enough about how Hypothes* works to be helpful here, other than by offering encouragement. (And please don't try to educate me. I have too many other things. Sorry.)
msg369249 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-18 15:34
I don't think that such bug should block Python 3.9 beta1 (I'm talking about the "release blocker" priority). There is still time before 3.9 final to fix it.
msg369263 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-18 16:17
Okay, deferring the blocker. But I'll still land Lysandros' fix ASAP.
msg369280 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-05-18 17:32
New changeset 7b7a21bc4fd063b26a2d1882fddc458861497812 by Lysandros Nikolaou in branch 'master':
bpo-40661: Fix segfault when parsing invalid input (GH-20165)
https://github.com/python/cpython/commit/7b7a21bc4fd063b26a2d1882fddc458861497812
History
Date User Action Args
2020-05-18 17:32:26pablogsalsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2020-05-18 17:32:09pablogsalsetmessages: + msg369280
2020-05-18 16:17:56gvanrossumsetpriority: release blocker -> deferred blocker

messages: + msg369263
2020-05-18 15:34:19vstinnersetnosy: + vstinner
messages: + msg369249
2020-05-18 14:42:58gvanrossumsetmessages: + msg369235
2020-05-18 06:04:35Zac Hatfield-Doddssetmessages: + msg369181
2020-05-18 04:45:36gvanrossumsetmessages: + msg369176
2020-05-18 02:25:25Zac Hatfield-Doddssetnosy: + Zac Hatfield-Dodds
messages: + msg369164
2020-05-17 23:11:07lys.nikolaousetpull_requests: + pull_request19464
2020-05-17 18:14:46gvanrossumsetkeywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request19462
2020-05-17 18:11:42gvanrossumsetmessages: + msg369144
2020-05-17 18:00:48pablogsalsetmessages: + msg369142
2020-05-17 16:19:13gvanrossumsetmessages: + msg369136
2020-05-17 16:05:09pablogsalsetmessages: + msg369134
2020-05-17 16:04:49pablogsalsetmessages: + msg369133
2020-05-17 16:01:10pablogsalcreate