Crash during encoding using UTF-16/32 and custom error handler #81000
Comments
The CPython interpreter writes out of bounds of allocated memory in certain edge cases in the UTF-16 and UTF-32 encoders. The attached script registers two error handlers that write either one ASCII character or two bytes, and tell the encoder to start again from the start of the encoding error. The script then tries to encode an invalid codepoint in either UTF-16 or UTF-32. Each of the calls to encode independently causes a segfault. Since the encoder starts over and keeps appending the result of the error handler, the lack of proper reallocations leads to a buffer overflow and corrupts the stack.
Easily reproduced on master, thanks:
(lldb) run encode_crash.py
Process 14743 stopped
Reproduced on 3.11.
I am working on it; since it is a more complex issue, PR 13134 does not solve it.
We could just forbid error handlers from returning a position outside the range (start, end], but that could break some code, so it is better to do this only in a new release.
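A handler that stays within the proposed (start, end] range always makes progress. A minimal sketch (the registered name 'replace-one' is made up for illustration):

```python
import codecs

def replace_one(exc):
    # Return a replacement plus a resume position inside (start, end],
    # so the encoder is guaranteed to advance past the error.
    return ('?', exc.end)

codecs.register_error('replace-one', replace_one)

assert '\udbc0'.encode('utf-16-le', 'replace-one') == b'?\x00'
assert '\udbc0'.encode('utf-32-le', 'replace-one') == b'?\x00\x00\x00'
```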
Yeah, that sounds like a reasonable solution. I don't see the point of returning a position outside this range; what would be the use case? For me, the only corner case is the "ignore" error handler, which returns an empty string, but it returns a position in this range, no?
Implementing custom error handlers is a rare use case, so it should only affect a minority of users. Moreover, IMO returning a position outside the valid range is a bug. It's common that security fixes change behavior, like rejecting values which were previously accepted, to prevent a Python crash.
Looking at the specs in PEP-293 (https://www.python.org/dev/peps/pep-0293/), it is certainly possible for the error handler to return a newpos outside the range start..end, meaning in most cases a value >= end. There's a good reason for this: the codec may not be able to correctly determine the end of the invalid sequence, so the end value presented by the codec is not necessarily a valid position to continue encoding/decoding from. The error handler can, for example, choose to skip more input characters by trying to find the next valid sequence.

In the example script, the handler returns start, so the value is within the range; a limit would not solve the problem. It seems that the reallocation logic of the codecs is the main problem here.
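A handler can indeed legitimately return a newpos beyond end to skip a whole run of bad input, as described above. A sketch of that idea (the handler and its registered name 'skip-run' are hypothetical, not from the original report):

```python
import codecs

def skip_run(exc):
    # Skip past the entire run of unencodable lone surrogates, not just
    # the slice [exc.start:exc.end] that the codec reported.
    pos = exc.end
    while pos < len(exc.object) and '\ud800' <= exc.object[pos] <= '\udfff':
        pos += 1
    return ('?', pos)

codecs.register_error('skip-run', skip_run)

# Three consecutive lone surrogates collapse into a single replacement.
assert 'a\udbc0\udbc0\udbc0b'.encode('utf-16-le', 'skip-run') == b'a\x00?\x00b\x00'
```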
Restricting the returned position to be strictly larger than start would solve the problems with infinite loops and OOM. But this is a different issue.
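That restriction could be prototyped in pure Python as a wrapper that rejects non-advancing positions. This is an illustrative sketch, not CPython's actual fix, and the names here are made up:

```python
import codecs

def checked(handler):
    # Reject handlers that do not advance past the error's start
    # position, which is what allows the infinite-loop / OOM behavior.
    def wrapper(exc):
        repl, newpos = handler(exc)
        if newpos < 0:
            newpos += len(exc.object)
        if not exc.start < newpos <= len(exc.object):
            raise ValueError(
                'error handler returned position %d, which does not '
                'advance past %d' % (newpos, exc.start))
        return repl, newpos
    return wrapper

def bad(exc):
    # Pathological: restarts at the same position forever.
    return ('x', exc.start)

codecs.register_error('checked-bad', checked(bad))
```

With this wrapper, `'\udbc0'.encode('utf-16-le', 'checked-bad')` raises ValueError instead of looping and growing the output without bound.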
On 29.09.2021 10:41, Serhiy Storchaka wrote:
Yes, this would make sense, since having the codec process …
The original specification (PEP-293) required that an error handler called for encoding *must* return a replacement string (not bytes). This returned string must then be encoded again; only if this fails must an exception be raised. Returning bytes from the encoding error handler is an extension specified by PEP-383:
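PEP-383's surrogateescape handler is the canonical example of that bytes extension; a quick round-trip sketch:

```python
# surrogateescape maps each undecodable byte to a lone surrogate on
# decoding, and its encode error handler returns the original byte
# back (a bytes replacement, the PEP-383 extension to the protocol).
raw = b'ok\xff\xfe'
text = raw.decode('utf-8', 'surrogateescape')
assert text == 'ok\udcff\udcfe'
assert text.encode('utf-8', 'surrogateescape') == raw
```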
So for 3. in Serhiy's problem list I get:

$ python
Python 3.9.7 (default, Sep 3 2021, 12:37:55)
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def bad(exc):
... return ('\udbc0', exc.start)
...
>>> import codecs
>>> codecs.register_error('bad', bad)
>>> '\udbc0'.encode('utf-16', 'bad')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError

I would have expected an exception message that basically looks like the one I'd get if I had used the strict error handler. But otherwise, returning a replacement that is unencodable is allowed and should raise an exception (which happens here, but with a missing exception message). Returning something unencodable might make sense when the error handler is able to create replacement characters for some unencodable input but not for other input; of course, the error handler can always raise an exception directly.

Returning invalid bytes is not an issue; they simply get written to the output. That's exactly the use case of PEP-383: the bytes couldn't be decoded in the specified encoding, so they are "invalid", but the surrogateescape error handler encodes them back to the same "invalid" bytes. So the error handler is allowed to output bytes that can't be decoded again with the same encoding.

Returning a restart position outside the valid range of the length of the original string should raise an IndexError according to PEP-293:
Of course we could retroactively reinterpret "out of bounds" as outside of the valid restart range. However, it would probably be OK to reject pathological error handlers, i.e. those that don't advance (that don't return a position of at least start + 1).
Any questions about this issue? If not, I think it is better to close it.
The original issue has been fixed, but we had a discussion about a deeper issue. I opened #96872 for this. |