This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: py3k: out of bounds read in PyUnicode_DecodeUnicodeEscape
Type: Stage:
Components: Interpreter Core Versions: Python 3.0
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: gvanrossum Nosy List: amaury.forgeotdarc, gvanrossum
Priority: normal Keywords:

Created on 2007-10-29 21:21 by amaury.forgeotdarc, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
unicodeEscape.diff amaury.forgeotdarc, 2007-10-29 21:21
Messages (5)
msg56933 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-10-29 21:21
A correction for the problem found by GvR in change 58692:

> There's one mystery: if I remove ob_sstate from the PyStringObject struct,
> some (unicode) string literals are mutilated, e.g. ('\\1', '\1') prints
> ('\\1', '\t').  This must be an out of bounds write or something that I
> can't track down.  (It doesn't help that it doesn't occur in debug mode.
> And no, make clean + recompilation doesn't help either.)
> 
> So, in the mean time, I just keep the field, renamed to 'ob_placeholder'.

I think I found the problem. It reproduces on Windows, with a slightly
different input
    >>> ('\\2','\1')
    ('\\2', '\n')
(the win32 release build is of the kind "optimized with debug info", so
using the debugger is possible)

The problem is in unicodeobject.c::PyUnicode_DecodeUnicodeEscape:
- the input buffer is not null-terminated
- when decoding octal escape, we increment s without checking if it is
still in the limits.
In my case, the "\1" was followed by a "2" in memory, hence the bogus
chr(0o12) == '\n'.

Also corrected a potential problem when the string ends with a \:
PyUnicode_DecodeUnicodeEscape("\\t", 1) should return an error.
msg56935 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-29 21:45
Thanks!!

The patch as given doesn't quite work -- it gives an error message on
string literals like '\s'.  By reverting some of your code and only
keeping the bounds checks for the octal escape code I got it to work
though.  See my checkin.

Committed revision 58707.

This needs to be backported too; it seems a pure coincidence it didn't
trigger before!
msg56937 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-29 21:58
(Hold on, the assert I added triggers in debug mode.)
msg56938 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-10-29 22:01
I don't like this assert either: this function is part of the public
API, and should not crash the interpreter just because of a trailing \.

To test easily:

import ctypes
decode = ctypes.pythonapi.PyUnicodeUCS2_DecodeUnicodeEscape
decode.restype = ctypes.py_object
decode(b'\\1', 1, None)

This should gently raise a UnicodeDecodeError IMO.
msg56939 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-29 22:08
Better fix:
Committed revision 58708.
History
Date User Action Args
2022-04-11 14:56:27adminsetgithub: 45700
2007-10-30 00:29:54gvanrossumsetstatus: open -> closed
2007-10-29 22:08:51gvanrossumsetmessages: + msg56939
2007-10-29 22:01:58amaury.forgeotdarcsetmessages: + msg56938
2007-10-29 21:58:39gvanrossumsetstatus: closed -> open
messages: + msg56937
2007-10-29 21:45:50gvanrossumsetstatus: open -> closed
assignee: gvanrossum
resolution: accepted
messages: + msg56935
2007-10-29 21:21:29amaury.forgeotdarccreate