Issue 9319: imp.find_module('test/badsyntax_pep3120') causes segfault

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53565

classification

Title:	imp.find_module('test/badsyntax_pep3120') causes segfault
Type:	crash	Stage:
Components:	Library (Lib)	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Trundle, benjamin.peterson, ezio.melotti, jcea, lukasz.langa, pitrou, python-dev, r.david.murray, ron_adam, skrah, vstinner
Priority:	high	Keywords:	patch

Created on 2010-07-21 06:29 by ron_adam, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
p3k_i9313.diff	ron_adam, 2010-08-04 19:33	Test and Fix
tokenizer_encoding_filename.patch	vstinner, 2010-12-23 17:26		review

Messages (24)
msg111009 - (view)	Author: Ron Adam (ron_adam) *	Date: 2010-07-21 06:29
help('modules spam') causes segfault. When pydoc tries goes though the files it does the following in the ModuleScanner class. (minimal example) >>> for importer, modname, ispkg in pkgutil.walk_packages(): ... if modname == 'test.badsyntax_pep3120': ... loader = importer.find_module(modname) ... Segmentation fault Adding: if modname=='test.badsyntax_pep3120': continue At the top of the for loop will suppress the segfault in pydoc by skipping that file. A bit further probing narrowed it down to this... >>> loader = pkgutil.get_loader('test.badsyntax_pep3120') Segmentation fault I'm not familiar with the pkgutil module so I didn't go further.
msg111032 - (view)	Author: Stefan Krah (skrah) *	Date: 2010-07-21 10:08
Looks like tok->filename isn't set in PyTokenizer_FromFile: Index: Parser/tokenizer.c =================================================================== --- Parser/tokenizer.c (revision 82984) +++ Parser/tokenizer.c (working copy) @@ -818,6 +818,7 @@ tok->cur = tok->inp = tok->buf; tok->end = tok->buf + BUFSIZ; tok->fp = fp; + tok->filename = "XXX"; tok->prompt = ps1; tok->nextprompt = ps2; if (enc != NULL) {
msg111052 - (view)	Author: Stefan Krah (skrah) *	Date: 2010-07-21 13:26
To be a little clearer: Since tok->filename is used as a flag in other places, I'm not sure where to set it without breaking other things. This is the location of the segfault and the following would "fix" it (using a placeholder for the name): Index: Parser/tokenizer.c =================================================================== --- Parser/tokenizer.c (revision 83019) +++ Parser/tokenizer.c (working copy) @@ -582,6 +582,8 @@ if (badchar) { /* Need to add 1 to the line number, since this line has not been counted, yet. */ + if (tok->filename == NULL) + tok->filename = "<file>"; PyErr_Format(PyExc_SyntaxError, "Non-UTF-8 code starting with '\\x%.2x' " "in file %.200s on line %i, "
msg111518 - (view)	Author: Ron Adam (ron_adam) *	Date: 2010-07-25 01:56
Trying to look at this a bit further... The PyErr_Format function then calls PyUnicode_FromFormatV in Objects/unicodeobject.c to do the actual formating. It looks like there have been a number of issues of this type. Search the tracker for unicodeobject.c and PyUnicode_FromFormatV. This may be related to http://bugs.python.org/issue7330.
msg111952 - (view)	Author: Ron Adam (ron_adam) *	Date: 2010-07-29 15:37
The error happens when Null is passed to strlen in (unicodeobject.c, line 860) Passing NULL to a string format function is probably in the category of don't do that. Stefans solution of checking for NULL before calling PyErr_Format looks to me to be correct. As he notes, the same technique is used in another place. It might be good to add a note somewhere not to pass NULL to C string format functions.
msg112870 - (view)	Author: Ron Adam (ron_adam) *	Date: 2010-08-04 19:33
This is by far the simplest fix for this. See patch file. This patch is what Stefan Krah suggested and I agree unless someone a lot more familiar with the import process can take a look at this and re-factor things so the filename is passed along with the file descriptor so the error can include it. I tried and each change required other changes in order to get the correct failure and error messages. I gave up when the tests for non existent file started failing with ']' on the first line. I can't commit this, so could someone who can take a look at the patch and move this along.
msg112882 - (view)	Author: Stefan Krah (skrah) *	Date: 2010-08-04 20:20
The diff which I posted was not really meant as the ultimate fix. It's a workaround. I think the real filename should be passed down to PyTokenizer_FromFile or PyTokenizer_FindEncoding.
msg112922 - (view)	Author: Ron Adam (ron_adam) *	Date: 2010-08-04 23:34
I added you to this Victor because it looks like what your doing to rewrite the imports to work with Unicode (issue:9425) overlaps this. See the test in the patch. Your rewrite may fix this as the segfault has to do with getting the file encoding. My apologies if this isn't the case. Ron
msg119114 - (view)	Author: Ron Adam (ron_adam) *	Date: 2010-10-19 03:53
The test in the patch isn't quite right. The following still fails. Python 3.2a3+ (py3k:85719, Oct 18 2010, 22:32:47) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import imp >>> imp.find_module('test/badsyntax_pep3120') Segmentation fault
msg124018 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-12-15 13:53
help no longer segfaults, but the find_module call still does; updating title. The patch does cure the segfault, but as Stefan says it isn't the best fix since having '<file>' in the error message instead of the real file name isn't very useful.
msg124033 - (view)	Author: Ron Adam (ron_adam) *	Date: 2010-12-15 17:27
Pydoc skips the badsysntax_pep3120 file for now. When this gets fixed that workaround should be removed. The work around is commented and refers to this issue #.
msg124558 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-12-23 17:26
p3k_i9313.diff is just a workaround, not the correct fix. The problem is that PyTokenizer_FindEncoding() doesn't get the filename. I wrote tokenizer_encoding_filename.patch which add PyTokenizer_FindEncodingFilename() and patch import.c and traceback.c to pass the filename. Hum, I'm not sure that my patch works if the locale encoding is not UTF-8: import.c manipulates path in the filesystem encoding, whereas PyTokenizer_FindEncodingFilename() expects UTF-8 filename. See this patch as a draft, I will try to fix the encoding issue.
msg124559 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-12-23 17:28
See also #9738 (Document the encoding of functions bytes arguments of the C API) to check which encoding is expected :-p
msg130644 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-03-11 23:39
I'm working on #3080 which changes the import machinery to use Unicode instead of byte strings, and so it will be possible to fix this issue correctly.
msg132989 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-04-04 23:48
New changeset 7b8d625eb6e4 by Victor Stinner in branch 'default': Issue #9319: Include the filename in "Non-UTF8 code ..." syntax error. http://hg.python.org/cpython/rev/7b8d625eb6e4
msg132992 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-04-04 23:56
> Hum, I'm not sure that my patch works if the locale encoding is not > UTF-8: import.c manipulates path in the filesystem encoding, whereas > PyTokenizer_FindEncodingFilename() expects UTF-8 filename. Thanks to my work on #3080, the import machinery (and other functions using Python modules and filenames) manipulates filenames as Unicode, and so I don't have to care about the filename encoding anymore. 6e9dc970ac0e (of issue #10785) fixed the crash, 7b8d625eb6e4 added the filename into the SyntaxError.
msg133984 - (view)	Author: Andreas Stührk (Trundle) *	Date: 2011-04-18 17:19
How about applying the workaround patch to Python 3.2? An unprecise error message is way better than a segfault.
msg134281 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-04-22 22:42
New changeset fa5e348889c2 by Victor Stinner in branch '3.2': Issue #9319: Fix a crash on parsing a Python source code without encoding http://hg.python.org/cpython/rev/fa5e348889c2
msg134282 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-04-22 22:44
Fixed in 3.2 (fa5e348889c2) and 3.3 (7b8d625eb6e4). The bug is a regression introduced in Python 3.2, so Python 3.1 doesn't need to be fixed.
msg134283 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-04-22 23:27
New changeset 701069f9888c by Victor Stinner in branch '3.2': Issue #9319: Fix the unit test http://hg.python.org/cpython/rev/701069f9888c
msg134365 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-04-25 01:47
New changeset bb62908896fe by Jesus Cea in branch 'default': Correctly merging #9319 into 3.3? http://hg.python.org/cpython/rev/bb62908896fe
msg134381 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-04-25 13:55
> Correctly merging #9319 into 3.3? No, this issue was fixed differently in Python 3.3. I see that you reverted (bb62908896fe) this merge (except the test, which is a good idea).
msg144430 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-09-22 23:01
Can the workaround be removed from Lib/pydoc.py:2001 ? I tried to remove it from 3.2 and help('modules spam') seems to work fine.
msg165522 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-07-15 12:17
New changeset ce0687a8383b by Nick Coghlan in branch 'default': Issue #9319: Remove the workaround for this since fixed problem from pydoc http://hg.python.org/cpython/rev/ce0687a8383b

History
Date	User	Action	Args
2022-04-11 14:57:04	admin	set	github: 53565
2012-07-15 12:17:17	python-dev	set	messages: + msg165522
2011-09-22 23:01:15	ezio.melotti	set	nosy: + ezio.melotti messages: + msg144430
2011-04-25 13:55:44	vstinner	set	messages: + msg134381
2011-04-25 01:49:42	jcea	set	nosy: + jcea
2011-04-25 01:47:49	python-dev	set	messages: + msg134365
2011-04-22 23:27:45	python-dev	set	messages: + msg134283
2011-04-22 22:44:08	vstinner	set	status: open -> closed messages: + msg134282
2011-04-22 22:42:27	python-dev	set	messages: + msg134281
2011-04-22 20:25:06	r.david.murray	set	status: closed -> open
2011-04-18 17:19:12	Trundle	set	messages: + msg133984
2011-04-04 23:56:16	vstinner	set	status: open -> closed resolution: accepted -> fixed messages: + msg132992
2011-04-04 23:48:20	python-dev	set	nosy: + python-dev messages: + msg132989
2011-03-11 23:39:52	vstinner	set	nosy: pitrou, vstinner, ron_adam, benjamin.peterson, r.david.murray, Trundle, skrah, lukasz.langa messages: + msg130644
2011-03-11 11:25:35	lukasz.langa	set	nosy: + lukasz.langa
2010-12-23 17:28:11	vstinner	set	nosy: pitrou, vstinner, ron_adam, benjamin.peterson, r.david.murray, Trundle, skrah messages: + msg124559
2010-12-23 17:26:42	vstinner	set	files: + tokenizer_encoding_filename.patch nosy: pitrou, vstinner, ron_adam, benjamin.peterson, r.david.murray, Trundle, skrah messages: + msg124558
2010-12-15 17:52:28	Trundle	set	nosy: + Trundle
2010-12-15 17:27:33	ron_adam	set	nosy: pitrou, vstinner, ron_adam, benjamin.peterson, r.david.murray, skrah messages: + msg124033
2010-12-15 13:53:41	r.david.murray	set	nosy: + r.david.murray messages: + msg124018 title: segfault when searching modules with help() -> imp.find_module('test/badsyntax_pep3120') causes segfault
2010-10-19 03:53:55	ron_adam	set	messages: + msg119114
2010-08-04 23:34:27	ron_adam	set	nosy: + vstinner messages: + msg112922
2010-08-04 20:20:24	skrah	set	messages: + msg112882
2010-08-04 19:33:06	ron_adam	set	files: + p3k_i9313.diff keywords: + patch messages: + msg112870
2010-07-29 15:37:00	ron_adam	set	messages: + msg111952
2010-07-25 01:56:19	ron_adam	set	messages: + msg111518
2010-07-21 13:26:34	skrah	set	messages: + msg111052
2010-07-21 13:03:07	r.david.murray	set	nosy: + benjamin.peterson
2010-07-21 10:08:19	skrah	set	priority: normal -> high nosy: + skrah, pitrou messages: + msg111032 resolution: accepted
2010-07-21 06:29:24	ron_adam	create