Issue 6922: Interpreter hangs up while trying to decode invalid utf32 stream.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51171

classification

Title:	Interpreter hangs up while trying to decode invalid utf32 stream.
Type:	crash	Stage:	patch review
Components:	Interpreter Core, Library (Lib), Unicode, Windows	Versions:	Python 3.0, Python 3.1, Python 3.2, Python 2.7, Python 2.6

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	georg.brandl	Nosy List:	barry, benjamin.peterson, georg.brandl, jgsack, lemburg, mwizard
Priority:	release blocker	Keywords:	patch

Created on 2009-09-16 17:38 by mwizard, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
fix-utf32-errorhandling.diff	georg.brandl, 2009-09-17 10:04

Messages (7)
msg92704 - (view)	Author: Alex (mwizard)	Date: 2009-09-16 17:38
* Prerequisites:
  Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit (Intel)] on win32

* Description:
  'utf_32_le' and 'utf_32_be' codecs are overconsuming memory when input data are damaged and kwarg 'errors' to str.decode is other than 'strict'.

* Steps:
  1. Start interpreter
  2. Type:
       '\x01'.decode('utf_32_le', 'replace')
     or
       '\x01'.decode('utf32', 'ignore')
     or
       ('something'.encode('utf32') + '\x00').decode('utf32', 'ignore')
  3. Execute

* Notes:
  1. seems like any stream raising UnicodeDecodeError in 'strict' mode causes hangup in 'ignore' or 'replace'.

* Expected result:
  1. AssertionError on "assert errors == 'strict'" raised, just as bz2_codec does, if utf32 cannot be partially decoded at all.
  2. Behaviour that 'utf8' and 'utf16' implement for such cases.

* Received result:
  1. Interpreter hangs, uses up to 100% of CPU kernel and starts to consume RAM.
  2. Grows large enough to consume all the RAM it could get (takes up to several minutes on my machine).
  3. Produces following traceback:
       Traceback (most recent call last):
         File "<stdin>", line 1, in <module>
         File "C:\Python26\lib\encodings\utf_32_be.py", line 11, in decode
           return codecs.utf_32_be_decode(input, errors, True)
       MemoryError
  4. Sometimes traceback is printed, but text "MemoryError" is not, just leaving blank line in the place.
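For readers who want to try the three inputs from the report on a fixed interpreter, the following is a minimal sketch. It assumes Python 3, so the byte strings carry a b prefix, whereas the report used Python 2 str.decode. The helper name decode_damaged is hypothetical and only wraps bytes.decode. On an affected build these calls would be expected to hang rather than return.

```python
# Python 3 sketch of the reproduction steps; decode_damaged is a
# hypothetical helper, not part of the codecs API.

def decode_damaged(data, encoding, errors):
    """Decode possibly damaged bytes with the given error handler."""
    return data.decode(encoding, errors)


# The three inputs from the report, rewritten as bytes literals.
samples = [
    (b"\x01", "utf_32_le", "replace"),
    (b"\x01", "utf32", "ignore"),
    ("something".encode("utf32") + b"\x00", "utf32", "ignore"),
]

for data, encoding, errors in samples:
    # On a fixed interpreter each call returns promptly.
    print(encoding, errors, repr(decode_damaged(data, encoding, errors)))

# The same damaged input is still rejected in 'strict' mode.
try:
    decode_damaged(b"\x01", "utf_32_le", "strict")
except UnicodeDecodeError as exc:
    print("strict mode raised:", exc.reason)
```

With 'replace' the truncated trailing bytes become a single U+FFFD replacement character, and with 'ignore' they are dropped, which matches the behaviour the reporter asked for by analogy with 'utf8' and 'utf16'.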
msg92752 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2009-09-17 10:04
This patch fixes it (seems like a refactoring oversight; I used the UTF16 decoder for reference, where it works fine) and adds a test. Assigning to MAL for review. Marking as a release blocker so that 2.6.3 won't get released without a fix.
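Since the UTF16 decoder served as the reference, one way to picture a regression check is to compare the two decoders on the same truncated input. The sketch below is an illustration under that assumption and is not the test from fix-utf32-errorhandling.diff. It is written for Python 3, and decode_truncated and handlers_agree are hypothetical names.

```python
import codecs

# Error handlers that triggered the hang in the report.
HANDLERS = ("replace", "ignore")


def decode_truncated(decoder, errors):
    """Feed one stray byte to a low-level decoder with final=True.

    A single byte is shorter than one code unit in both UTF-16 and
    UTF-32, so both decoders see truncated data.
    """
    return decoder(b"\x01", errors, True)


def handlers_agree():
    """Return True if UTF-32 matches the UTF-16 reference for every handler."""
    return all(
        decode_truncated(codecs.utf_32_decode, errors)
        == decode_truncated(codecs.utf_16_decode, errors)
        for errors in HANDLERS
    )


if __name__ == "__main__":
    for errors in HANDLERS:
        print(errors, decode_truncated(codecs.utf_32_decode, errors))
    print("UTF-32 agrees with UTF-16:", handlers_agree())
```

Each decoder returns a (text, bytes consumed) pair, so the comparison also checks that the stray byte is reported as consumed instead of being retried indefinitely.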
msg92754 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-09-17 10:23
The patch looks good. Thanks. Aside: This is what you get when using too many single character variable names in a function... The function should really do just one cast to (unsigned char *) at the very top and then work with that variable all along.
msg92756 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2009-09-17 11:33
I'm leaving a refactoring to someone with more time :) Committed in r74869, backported to 2.6 in r74870.
msg96468 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2009-12-15 22:06
It seems that on my Fedora 11 AMD X86_64, the problem still exists.

In test_codecs.UTF32Test, test_handlers() seems to run forever, gobbling memory to 99+% and then activating swap until it fills up swap.

tested by
  svn up -r 74869
  rm /tmp/pynexttest
  .python -Ebb -Lib/tests/regrtest.py -vvs test_codecs

disabling test_handlers() in UTF32Test allows the test to pass and it completes very fast.

It is puzzling that UTF16Test test_handlers works with what looks like similar code in unicodeobject.c

~jim
msg96470 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2009-12-15 22:19
Clarification of my last message (msg96468): The test_handler in the current revision (76850) also exhibits the same memory-gobbling behavior. I only referred to -r 74869 because that's where the test was introduced, ostensibly to verify the patch to unicodeobject.c. ~jim
msg96479 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2009-12-16 07:14
IMPORTANT Correction: Please disregard msg 96468 & 96470. I was forgetting to do ./configure and make, and evidently getting bogus failures. test_codecs works fine now, ..sorry for the false alarm. ~jim

History
Date	User	Action	Args
2022-04-11 14:56:53	admin	set	nosy: + barry, benjamin.peterson github: 51171
2009-12-16 07:14:18	jgsack	set	messages: + msg96479
2009-12-15 22:19:14	jgsack	set	messages: + msg96470
2009-12-15 22:06:43	jgsack	set	nosy: + jgsack messages: + msg96468
2009-09-17 11:33:38	georg.brandl	set	status: open -> closed messages: + msg92756
2009-09-17 10:23:34	lemburg	set	assignee: lemburg -> georg.brandl messages: + msg92754
2009-09-17 10:04:53	georg.brandl	set	files: + fix-utf32-errorhandling.diff resolution: accepted assignee: lemburg keywords: + patch stage: patch review versions: + Python 3.0, Python 3.1, Python 2.7, Python 3.2 nosy: + georg.brandl, lemburg messages: + msg92752 priority: release blocker type: crash
2009-09-16 17:38:15	mwizard	create