MemoryError with custom error handlers and multibyte codecs #67404

alexer · 2015-01-10T03:33:04Z

BPO	23215
Nosy	@malemburg, @loewis, @vstinner, @bitdancer, @alexer, @serhiy-storchaka
Files	python_codec_crasher.py python_codec_crash_fix.patch python_codec_crash_fix_2.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2015-02-20.23:28:25.263>
created_at = <Date 2015-01-10.03:33:03.633>
labels = ['interpreter-core', 'performance']
title = 'MemoryError with custom error handlers and multibyte codecs'
updated_at = <Date 2015-02-20.23:28:25.262>
user = 'https://github.com/alexer'

bugs.python.org fields:

activity = <Date 2015-02-20.23:28:25.262>
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2015-02-20.23:28:25.263>
closer = 'serhiy.storchaka'
components = ['Interpreter Core']
creation = <Date 2015-01-10.03:33:03.633>
creator = 'alexer'
dependencies = []
files = ['37659', '37660', '38150']
hgrepos = []
issue_num = 23215
keywords = ['patch']
message_count = 5.0
messages = ['233800', '234687', '234689', '236054', '236343']
nosy_count = 7.0
nosy_names = ['lemburg', 'loewis', 'vstinner', 'r.david.murray', 'alexer', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'resource usage'
url = 'https://bugs.python.org/issue23215'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

alexer · 2015-01-10T03:33:02Z

Using a multibyte codec and a custom error handler that ignores errors to encode a string that contains characters not representable in said encoding causes exponential growth of the output buffer, raising MemoryError.
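For illustration, a minimal reproduction sketch might look like the following. The handler name `example.ignore`, the choice of `shift_jis`, and the Thai test character are assumptions made for this example and are not taken from the attached python_codec_crasher.py. On an interpreter that includes the fix, the unencodable characters are simply dropped; on an affected interpreter the same call may raise MemoryError.

```python
import codecs


def ignore_encode_errors(exc):
    """Custom handler: drop unencodable characters and resume after them."""
    if isinstance(exc, UnicodeEncodeError):
        # Empty replacement string, continue encoding at exc.end.
        return ("", exc.end)
    raise exc


# "example.ignore" is a hypothetical handler name chosen for this sketch.
codecs.register_error("example.ignore", ignore_encode_errors)

# U+0E01 (a Thai letter) is assumed not to be representable in Shift JIS,
# so every second character goes through the custom error handler.
text = "a\u0e01" * 100
encoded = text.encode("shift_jis", errors="example.ignore")
print(len(encoded))
```

Registering the handler under a custom name matters here: the report concerns custom error handlers, not the built-in "ignore" handler.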

The problem is in multibytecodec_encerror() and REQUIRE_ENCODEBUFFER() in Modules/cjkcodecs/multibytecodec.c. multibytecodec_encerror() always uses REQUIRE_ENCODEBUFFER() to ensure there's enough space for the replacement string, and if more space is needed, REQUIRE_ENCODEBUFFER() calls expand_encodebuffer(), which in turn always grows the buffer by at least 50%. However, if size < 1, REQUIRE_ENCODEBUFFER() doesn't check whether more space is actually needed and expands the buffer unconditionally, so an empty replacement string (size 0) grows the buffer on every ignored character. (The macro is used with negative values in other places.)

I have no idea why the condition was originally size < 1 instead of size < 0, but changing it seems to fix this. The replacement string case is also the only use of the macro that may use 0 as the argument.
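The effect of the size < 1 condition can be sketched with a small Python model. This is not the C code from multibytecodec.c: the function names, the initial buffer size, and the exact growth rule (grow by the larger of the request and half the current size) are assumptions chosen only to match the "at least 50%" behaviour described above.

```python
def expand(size, requested):
    """Grow the buffer by at least 50% (assumed growth rule, not the real C code)."""
    return size + max(requested, size // 2, 1)


def final_buffer_size(n_errors, initial_size, buggy):
    """Model the output buffer size after n_errors ignored characters."""
    size = initial_size
    used = 0  # nothing is written: every character is ignored
    for _ in range(n_errors):
        requested = 0  # the error handler returned an empty replacement
        if buggy:
            # Models the reported condition: size < 1 forces an expansion.
            needs_expand = requested < 1 or used + requested > size
        else:
            # Models the guarded variant: only expand when space is needed.
            needs_expand = requested > 0 and used + requested > size
        if needs_expand:
            size = expand(size, requested)
    return size


buggy_size = final_buffer_size(50, 16, buggy=True)
fixed_size = final_buffer_size(50, 16, buggy=False)
print(buggy_size, fixed_size)
```

In this model the buggy variant multiplies the buffer size by roughly 1.5 for each ignored character, while the guarded variant leaves it unchanged.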

In the patch, I've instead wrapped the REQUIRE_ENCODEBUFFER() call (and the memcpy) in an if (size > 0), since that's what the corresponding part of multibytecodec_decerror() did in the past:
https://hg.python.org/cpython/file/1c3f8d044589/Modules/cjkcodecs/multibytecodec.c#l438

Not sure which one makes more sense.

As for the tests, I'm not sure 1) whether all of the affected encodings should be tested or only one (or even all encodings, affected or not?) and 2) whether it should be a new test or whether I should just add it to test_longstrings in Lib/test/test_codeccallbacks.py. (Structurally it's a perfect fit, but it really isn't a "long string" test, as it can happen with fewer than 50 characters.) At the moment, the patch tests the affected encodings in a separate test.

Is the test philosophy "as thorough as possible" or "as fast as possible"?

python-dev · 2015-01-25T20:49:58Z

New changeset 3a9b1e5fe179 by R David Murray in branch '3.4':
bpo-23215: reflow paragraph.
https://hg.python.org/cpython/rev/3a9b1e5fe179

New changeset 52a06812d5da by R David Murray in branch 'default':
Merge: bpo-23215: note that time.sleep affects the current thread only.
https://hg.python.org/cpython/rev/52a06812d5da

bitdancer · 2015-01-25T20:51:59Z

Oops, typoed the issue number. That should have been 23251.

serhiy-storchaka · 2015-02-15T17:45:42Z

Thank you for your patch, Aleksi. It LGTM in general. The updated patch just moves the test to Lib/test/test_multibytecodec.py, where it can reuse ALL_CJKENCODINGS, and fixes a few other minor bugs in multibyte codecs.

python-dev · 2015-02-20T23:23:33Z

New changeset af8089217cc6 by Serhiy Storchaka in branch '2.7':
Issue bpo-23215: Multibyte codecs with custom error handlers that ignores errors
https://hg.python.org/cpython/rev/af8089217cc6

New changeset 4dc8b7ed8973 by Serhiy Storchaka in branch '3.4':
Issue bpo-23215: Multibyte codecs with custom error handlers that ignores errors
https://hg.python.org/cpython/rev/4dc8b7ed8973

New changeset 5620691ce26b by Serhiy Storchaka in branch 'default':
Issue bpo-23215: Multibyte codecs with custom error handlers that ignores errors
https://hg.python.org/cpython/rev/5620691ce26b

alexer mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) and performance (Performance or resource usage) labels Jan 10, 2015

serhiy-storchaka self-assigned this Feb 15, 2015

serhiy-storchaka closed this as completed Feb 20, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022
