Title: PyUnicode_FromEncodedObject() uses PyObject_AsCharBuffer()
Type: crash Stage: resolved
Components: Interpreter Core Versions: Python 3.1, Python 3.2
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, lemburg, pitrou, scoder, vstinner
Priority: normal Keywords: patch

Created on 2009-12-01 06:19 by scoder, last changed 2010-09-01 15:17 by pitrou. This issue is now closed.

File name Uploaded Description Edit
unicodeobject-PyUnicode_FromEncodedObject-buffer.patch scoder, 2010-08-20 16:43 Patch to use the new buffer protocol
unicodeobject-PyUnicode_FromEncodedObject-buffer2.patch scoder, 2010-08-20 20:04 Updated patch that additionally removes the PyByteArray special case
unicodeobject-PyUnicode_FromEncodedObject-buffer-refactored.patch scoder, 2010-08-20 20:45 Refactored patch that splits all special cases into separate code blocks
Messages (12)
msg95847 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2009-12-01 06:19
PyUnicode_FromEncodedObject() currently calls PyObject_AsCharBuffer() to
get the buffer pointer and length of a buffer supporting object. It
should be changed to support the buffer protocol correctly instead.

I filed this as a crash bug as the buffer protocol allows a buffer
supporting object to discard its buffer when the release function is
called. The decode function uses the buffer only *after* releasing it,
thus provoking a crash for objects that implement the buffer protocol
correctly in that they do not allow access to the buffer after the release.
msg112895 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-04 21:07
@Stefan can you provide a patch for this?
msg114429 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2010-08-20 16:43
Here's a patch against the latest py3k. The following will call the new code, for example:

  str(memoryview(b'abc'), 'ASCII')

whereas bytes and bytesarray continue to use their own special casing code (which has also changed a bit since I wanted to avoid code duplication).

For testing, I wrote a short Cython module that implements the buffer protocol in an extension type and freshly allocates a new bytes object as buffer on each access:

  from cpython.ref cimport Py_INCREF, Py_DECREF, PyObject

  cdef class Test:
      def __getbuffer__(self, Py_buffer* buffer, int flags):
          s = b'abcdefg' * 10
          buffer.buf = <char*> s
          buffer.obj = self
          buffer.len = len(s)
          buffer.internal = <void*> s

      def __releasebuffer__(self, Py_buffer* buffer):

Put it into a file "buftest.pyx", build it, start up Python 3.x and call

    >>> import buftest
    >>> print(len( str(buftest.Test(), "ASCII") ))

Under the unpatched Py3, this raises a decoding exception for me when it tries to decode data from the deallocated bytes object. Other systems may happily crash here. The patched Python runtime prints '70' as expected.
msg114440 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-20 18:37
I think the bytearray special-casing should be removed. Otherwise one can reallocate the buffer in another thread while it is being used for decoding.
msg114444 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2010-08-20 19:52
Doesn't the GIL protect the bytearray buffer? Or does decoding free the GIL?
msg114446 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2010-08-20 20:04
Regardless of the answer, I think Antoine is right, special cases aren't special enough to break the rules, and this is a special case that's more safely handled as part of the normal buffer case.

Updated patch uploaded.
msg114447 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-20 20:08
> Doesn't the GIL protect the bytearray buffer? Or does decoding free the GIL?

Well, decoding can call arbitrary Python code and therefore, yes,
release the GIL.

Ironically, PyUnicode_Decode() itself (called from
PyUnicode_FromEncodedObject()) fills a dummy Py_buffer object before
wrapping it into a memoryview... ;)
msg114448 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2010-08-20 20:26
... and another complete patch that refactors the complete function to make it clearer what happens. Includes a small code duplication for the bytes object case, which I think it acceptable.
msg114453 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2010-08-20 20:45
Another updated patch with a readability fix (replacing the last one).
msg114454 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-08-20 20:55
Stefan Behnel wrote:
> Stefan Behnel <> added the comment:
> Another updated patch with a readability fix (replacing the last one).

While you're at it, you might as well remove references to the
"char buffer" - there's no such thing in Python3 anymore.

We only have read buffers in Python3.
msg114456 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2010-08-20 21:01
When I read the comments and exception texts in the function, it didn't occur to me that "char buffer" could have been used as a name for the old Py2 buffer interface. From the context, it totally makes sense to me that the function (which decodes a byte sequence into a unicode string) complains about not getting a "bytes object or char buffer" as input. Admittedly, this might sound slightly different when read in Python space.
msg115313 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-01 15:17
Patch committed in r84394 (py3k) and r84396 (3.1). Thank you Stefan!
Date User Action Args
2010-09-01 15:17:12pitrousetstatus: open -> closed
versions: - Python 2.7
messages: + msg115313

resolution: fixed
stage: needs patch -> resolved
2010-08-21 22:54:11pitrousetnosy: + vstinner
2010-08-20 21:01:38scodersetmessages: + msg114456
2010-08-20 20:55:59lemburgsetnosy: + lemburg
messages: + msg114454
2010-08-20 20:45:40scodersetfiles: + unicodeobject-PyUnicode_FromEncodedObject-buffer-refactored.patch

messages: + msg114453
2010-08-20 20:42:59scodersetfiles: - unicodeobject-PyUnicode_FromEncodedObject-buffer-refactored.patch
2010-08-20 20:26:52scodersetfiles: + unicodeobject-PyUnicode_FromEncodedObject-buffer-refactored.patch

messages: + msg114448
2010-08-20 20:08:15pitrousetmessages: + msg114447
2010-08-20 20:04:42scodersetfiles: + unicodeobject-PyUnicode_FromEncodedObject-buffer2.patch

messages: + msg114446
2010-08-20 19:52:56scodersetmessages: + msg114444
versions: + Python 2.7
2010-08-20 18:37:02pitrousetnosy: + pitrou

messages: + msg114440
versions: - Python 2.7
2010-08-20 16:43:10scodersetfiles: + unicodeobject-PyUnicode_FromEncodedObject-buffer.patch
keywords: + patch
messages: + msg114429
2010-08-04 21:07:42BreamoreBoysetversions: + Python 2.7, - Python 3.0
nosy: + BreamoreBoy

messages: + msg112895

stage: needs patch
2009-12-01 06:19:13scodercreate