Crash during encoding using UTF-16/32 and custom error handler #81000
Comments
The CPython interpreter writes out of bounds of allocated memory in certain edge cases in the UTF-16 and UTF-32 encoders. The attached script registers two error handlers that write either one ASCII character or two bytes, and tell the encoder to start again from the start of the encoding error. The script then tries to encode an invalid codepoint in either UTF-16 or UTF-32. Each of the calls to encode independently causes a segfault. Since the encoder starts over and keeps appending the result of the error handler, the lack of proper reallocations leads to a buffer overflow and corrupts the stack.
Easily reproduced on master, thanks:
(lldb) run encode_crash.py
Process 14743 stopped
Reproduced on 3.11.
I am working on it; since it is a more complex issue, PR 13134 does not solve it.
We could just forbid error handlers from returning a position outside the range (start, end], but that could break some code, so it is better to do this only in a new release.
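A handler that stays within the proposed (start, end] range always makes progress. A minimal sketch (the registered name 'replace-one' is made up for illustration):

```python
import codecs

def replace_one(exc):
    # Return a replacement plus a resume position inside (start, end],
    # so the encoder is guaranteed to advance past the error.
    return ('?', exc.end)

codecs.register_error('replace-one', replace_one)

assert '\udbc0'.encode('utf-16-le', 'replace-one') == b'?\x00'
assert '\udbc0'.encode('utf-32-le', 'replace-one') == b'?\x00\x00\x00'
```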
Yeah, that sounds like a reasonable solution. I don't see the point of returning a position outside this range; what would be the use case? For me, the only corner case is the "ignore" error handler, which returns an empty string, but it returns a position in this range, no?
Implementing custom error handlers is a rare use case, so it should only affect a minority of users. Moreover, IMO returning a position outside the valid range is a bug. It's common that security fixes change behavior, like rejecting values which were previously accepted, to prevent a Python crash.
Looking at the specs in PEP-293 (https://www.python.org/dev/peps/pep-0293/), it is certainly possible for the error handler to return a newpos outside the range start..end, meaning in most cases a value >= end. There's a good reason for this: the codec may not be able to correctly determine the end of the invalid sequence, so the end value presented by the codec is not necessarily a valid position to continue encoding/decoding from. The error handler can, for example, choose to skip more input characters by trying to find the next valid sequence.

In the example script, the handler returns start, so the value is within the range; a limit would not solve the problem. It seems that the reallocation logic of the codecs is the main problem here.
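A handler can indeed legitimately return a newpos beyond end to skip a whole run of bad input, as described above. A sketch of that idea (the handler and its registered name 'skip-run' are hypothetical, not from the original report):

```python
import codecs

def skip_run(exc):
    # Skip past the entire run of unencodable lone surrogates, not just
    # the slice [exc.start:exc.end] that the codec reported.
    pos = exc.end
    while pos < len(exc.object) and '\ud800' <= exc.object[pos] <= '\udfff':
        pos += 1
    return ('?', pos)

codecs.register_error('skip-run', skip_run)

# Three consecutive lone surrogates collapse into a single replacement.
assert 'a\udbc0\udbc0\udbc0b'.encode('utf-16-le', 'skip-run') == b'a\x00?\x00b\x00'
```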
Restricting the returned position to be strictly larger than start would solve the problems with infinite loops and OOM. But this is a different issue.
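That restriction could be prototyped in pure Python as a wrapper that rejects non-advancing positions. This is an illustrative sketch, not CPython's actual fix, and the names here are made up:

```python
import codecs

def checked(handler):
    # Reject handlers that do not advance past the error's start
    # position, which is what allows the infinite-loop / OOM behavior.
    def wrapper(exc):
        repl, newpos = handler(exc)
        if newpos < 0:
            newpos += len(exc.object)
        if not exc.start < newpos <= len(exc.object):
            raise ValueError(
                'error handler returned position %d, which does not '
                'advance past %d' % (newpos, exc.start))
        return repl, newpos
    return wrapper

def bad(exc):
    # Pathological: restarts at the same position forever.
    return ('x', exc.start)

codecs.register_error('checked-bad', checked(bad))
```

With this wrapper, `'\udbc0'.encode('utf-16-le', 'checked-bad')` raises ValueError instead of looping and growing the output without bound.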
On 29.09.2021 10:41, Serhiy Storchaka wrote:
Yes, this would make sense, since having the codec process …
The original specification (PEP-293) required that an error handler called for encoding *must* return a replacement string (not bytes). This returned string must then be encoded again; only if this fails must an exception be raised. Returning bytes from the encoding error handler is an extension specified by PEP-383:
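PEP-383's surrogateescape handler is the canonical example of that bytes extension; a quick round-trip sketch:

```python
# surrogateescape maps each undecodable byte to a lone surrogate on
# decoding, and its encode error handler returns the original byte
# back (a bytes replacement, the PEP-383 extension to the protocol).
raw = b'ok\xff\xfe'
text = raw.decode('utf-8', 'surrogateescape')
assert text == 'ok\udcff\udcfe'
assert text.encode('utf-8', 'surrogateescape') == raw
```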
So for 3. in Serhiy's problem list I get:

$ python
Python 3.9.7 (default, Sep 3 2021, 12:37:55)
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def bad(exc):
... return ('\udbc0', exc.start)
...
>>> import codecs
>>> codecs.register_error('bad', bad)
>>> '\udbc0'.encode('utf-16', 'bad')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError

I would have expected an exception message that basically looks like the one I'd get if I had used the strict error handler. But otherwise, returning a replacement that is unencodable is allowed and should raise an exception (which happens here, but with a missing exception message). Returning something unencodable might make sense when the error handler is able to create replacement characters for some unencodable input but not for other input; of course, the error handler can always raise an exception directly.

Returning invalid bytes is not an issue; they simply get written to the output. That's exactly the use case of PEP-383: the bytes couldn't be decoded in the specified encoding, so they are "invalid", but the surrogateescape error handler encodes them back to the same "invalid" bytes. So the error handler is allowed to output bytes that can't be decoded again with the same encoding.

Returning a restart position outside the valid range of the length of the original string should raise an IndexError according to PEP-293:
Of course we could retroactively reinterpret "out of bounds" as outside of the valid restart range. However, it would probably be OK to reject pathological error handlers, i.e. those that don't advance (that don't return a position of at least start + 1).
Any questions about this issue? If not, I think it is better to close it.
The original issue has been fixed, but we had a discussion about a deeper issue. I opened #96872 for this. |