
classification
Title: Interpreter hangs up while trying to decode invalid utf32 stream.
Type: crash
Stage: patch review
Components: Interpreter Core, Library (Lib), Unicode, Windows
Versions: Python 3.0, Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: georg.brandl
Nosy List: barry, benjamin.peterson, georg.brandl, jgsack, lemburg, mwizard
Priority: release blocker
Keywords: patch

Created on 2009-09-16 17:38 by mwizard, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name                     Uploaded
fix-utf32-errorhandling.diff  georg.brandl, 2009-09-17 10:04
Messages (7)
msg92704 - Author: Alex (mwizard) Date: 2009-09-16 17:38
*** Prerequisites:
Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit
(Intel)] on win32

*** Description:
The 'utf_32_le' and 'utf_32_be' codecs consume excessive memory when the
input data is damaged and the 'errors' argument to str.decode is anything
other than 'strict'.

*** Steps:
1. Start interpreter
2. Type:
   '\x01'.decode('utf_32_le', 'replace')
or
   '\x01'.decode('utf32', 'ignore')
or
   ('something'.encode('utf32') + '\x00').decode('utf32', 'ignore')
3. Execute

*** Notes:
1. It seems that any stream that raises UnicodeDecodeError in 'strict' mode
causes a hang in 'ignore' or 'replace' mode.
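
For readers on Python 3, where str.decode no longer exists, the steps above
can be restated with bytes literals. This is only a sketch: it assumes an
interpreter that already contains the fix for this issue (an affected build
would hang on the non-strict calls), and the helper name decode_damaged is
made up for illustration.

```python
# Python 3 spelling of the reproduction steps. Run this only on an
# interpreter that includes the fix; an affected build would hang on
# the non-strict calls.

def decode_damaged(data, encoding, errors):
    """Decode `data`; under 'strict', return the exception class name instead."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError as exc:
        return type(exc).__name__

truncated = b"\x01"                                # shorter than one 4-byte unit
trailing = "something".encode("utf-32") + b"\x00"  # valid text plus one stray byte

print(decode_damaged(truncated, "utf_32_le", "strict"))
print(repr(decode_damaged(truncated, "utf_32_le", "replace")))
print(repr(decode_damaged(truncated, "utf-32", "ignore")))
print(repr(decode_damaged(trailing, "utf-32", "ignore")))
```

On a fixed interpreter all four calls return immediately: 'strict' raises
UnicodeDecodeError, 'replace' substitutes U+FFFD, and 'ignore' drops the
stray byte.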

*** Expected result:
1. AssertionError on "assert errors == 'strict'" raised, just as
bz2_codec does, if utf32 cannot be partially decoded at all.
2. Behaviour that 'utf8' and 'utf16' implement for such cases.
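
As a point of comparison for item 2, the following sketch shows how the
'utf8' and 'utf16' codecs treat damaged input under the non-strict handlers.
It is written for Python 3, and the sample byte strings are illustrative
choices, not data from this report.

```python
# How the UTF-8 and UTF-16 codecs handle damaged input in non-strict modes.
damaged_utf8 = b"abc\xff"                            # 0xFF never occurs in UTF-8
damaged_utf16 = "abc".encode("utf-16-le") + b"\x01"  # half of a 2-byte code unit

results = {}
for name, data in (("utf-8", damaged_utf8), ("utf-16-le", damaged_utf16)):
    replaced = data.decode(name, "replace")  # bad bytes become U+FFFD
    ignored = data.decode(name, "ignore")    # bad bytes are dropped
    results[name] = (replaced, ignored)
    print(name, repr(replaced), repr(ignored))
```

Both codecs return promptly, keeping the valid prefix and either replacing
or skipping the damaged tail, which is the behaviour requested for utf32.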

*** Received result:
1. The interpreter hangs, uses up to 100% of a CPU core, and starts to
consume RAM.
2. It grows until it has consumed all the RAM it can get (this takes up to
several minutes on my machine).
3. It produces the following traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\encodings\utf_32_be.py", line 11, in decode
    return codecs.utf_32_be_decode(input, errors, True)
MemoryError
4. Sometimes the traceback is printed but the text "MemoryError" is not,
leaving a blank line in its place.
msg92752 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-09-17 10:04
This patch fixes it (it seems to be a refactoring oversight; I used the
UTF16 decoder for reference, where it works fine) and adds a test.
Assigning to MAL for review.

Marking as a release blocker so that 2.6.3 won't get released without a fix.
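
The contents of fix-utf32-errorhandling.diff are not reproduced in this
issue text, so the following is only a sketch of the kind of regression
check such a patch could add, written for Python 3. It calls the low-level
codecs functions named in the traceback above; the function name
check_utf32_error_handlers is hypothetical.

```python
import codecs

def check_utf32_error_handlers():
    """Check that truncated UTF-32 input is handled promptly in every mode."""
    decoders = (codecs.utf_32_decode,
                codecs.utf_32_le_decode,
                codecs.utf_32_be_decode)
    for decode in decoders:
        # Each call returns (decoded_text, bytes_consumed); final=True means
        # no more input will follow, so a lone byte is an error.
        assert decode(b"\x01", "replace", True) == ("\ufffd", 1)
        assert decode(b"\x01", "ignore", True) == ("", 1)
        try:
            decode(b"\x01", "strict", True)
        except UnicodeDecodeError:
            pass
        else:
            raise AssertionError("strict mode should reject truncated input")
    return True

check_utf32_error_handlers()
```

On an affected build the 'replace' and 'ignore' calls would never return,
so a check like this would hang rather than fail.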
msg92754 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-09-17 10:23
The patch looks good. Thanks.

Aside: This is what you get when using too many single character
variable names in a function...

The function should really do just one cast to (unsigned char *) at the
very top and then work with that variable all along.
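
The C code itself is not shown in this issue, so the exact oversight cannot
be read off the text. As a hedged illustration of the general failure mode
(a decoder loop whose error path does not advance its input position), here
is a toy pure-Python model. It is not a transcription of unicodeobject.c,
and toy_utf32_le_decode is a made-up name.

```python
import struct

def toy_utf32_le_decode(data, errors="strict"):
    """Toy UTF-32-LE decoder modelling the loop structure only."""
    out = []
    pos = 0
    while pos < len(data):
        unit = data[pos:pos + 4]
        if len(unit) == 4:
            (code_point,) = struct.unpack("<I", unit)
            is_surrogate = 0xD800 <= code_point <= 0xDFFF
            if code_point <= 0x10FFFF and not is_surrogate:
                out.append(chr(code_point))
                pos += 4
                continue
        # Error path: truncated unit or out-of-range code point.
        if errors == "strict":
            raise UnicodeDecodeError("utf-32-le", data, pos,
                                     pos + len(unit),
                                     "truncated or invalid data")
        if errors == "replace":
            out.append("\ufffd")
        elif errors != "ignore":
            raise LookupError("unknown error handler name %r" % errors)
        # Crucial step: move past the bad bytes. If this line were missing,
        # the loop would revisit the same bytes forever, and under 'replace'
        # the output would grow without bound.
        pos += len(unit)
    return "".join(out)

print(repr(toy_utf32_le_decode(b"\x01", "replace")))
```

Keeping a single position variable, as in this model, is in the spirit of
the suggestion to do one cast at the top and work with that variable
throughout.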
msg92756 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-09-17 11:33
I'm leaving a refactoring to someone with more time :)

Committed in r74869, backported to 2.6 in r74870.
msg96468 - Author: James G. sack (jim) (jgsack) Date: 2009-12-15 22:06
It seems that on my Fedora 11 AMD X86_64, the problem still exists. In 
test_codecs.UTF32Test, test_handlers() seems to run forever, gobbling 
memory to 99+% and then activating swap until it fills up swap.

tested by 
  svn up -r 74869
  rm /tmp/pynexttest
  ./python -Ebb Lib/test/regrtest.py -vvs test_codecs

Disabling test_handlers() in UTF32Test allows the test to pass, and it
completes very fast. It is puzzling that UTF16Test's test_handlers() works
with what looks like similar code in unicodeobject.c.

~jim
msg96470 - Author: James G. sack (jim) (jgsack) Date: 2009-12-15 22:19
Clarification of my last message (msg96468):

The test_handlers() test in the current revision (76850) also exhibits the
same memory-gobbling behavior. I only referred to -r 74869 because that's
where the test was introduced, ostensibly to verify the patch to
unicodeobject.c.

~jim
msg96479 - Author: James G. sack (jim) (jgsack) Date: 2009-12-16 07:14
IMPORTANT Correction: Please disregard msg96468 & msg96470.

I forgot to run ./configure and make, and was evidently getting bogus
failures.

test_codecs works fine now. Sorry for the false alarm.

~jim
History
Date                 User          Action  Args
2022-04-11 14:56:53  admin         set     nosy: + barry, benjamin.peterson
                                           github: 51171
2009-12-16 07:14:18  jgsack        set     messages: + msg96479
2009-12-15 22:19:14  jgsack        set     messages: + msg96470
2009-12-15 22:06:43  jgsack        set     nosy: + jgsack
                                           messages: + msg96468
2009-09-17 11:33:38  georg.brandl  set     status: open -> closed
                                           messages: + msg92756
2009-09-17 10:23:34  lemburg       set     assignee: lemburg -> georg.brandl
                                           messages: + msg92754
2009-09-17 10:04:53  georg.brandl  set     files: + fix-utf32-errorhandling.diff
                                           resolution: accepted
                                           assignee: lemburg
                                           keywords: + patch
                                           stage: patch review
                                           versions: + Python 3.0, Python 3.1, Python 2.7, Python 3.2
                                           nosy: + georg.brandl, lemburg
                                           messages: + msg92752
                                           priority: release blocker
                                           type: crash
2009-09-16 17:38:15  mwizard       create