Issue 7415: PyUnicode_FromEncodedObject() uses PyObject_AsCharBuffer()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51664

classification

Title:	PyUnicode_FromEncodedObject() uses PyObject_AsCharBuffer()
Type:	crash	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.1, Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	BreamoreBoy, lemburg, pitrou, scoder, vstinner
Priority:	normal	Keywords:	patch

Created on 2009-12-01 06:19 by scoder, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unicodeobject-PyUnicode_FromEncodedObject-buffer.patch	scoder, 2010-08-20 16:43	Patch to use the new buffer protocol
unicodeobject-PyUnicode_FromEncodedObject-buffer2.patch	scoder, 2010-08-20 20:04	Updated patch that additionally removes the PyByteArray special case
unicodeobject-PyUnicode_FromEncodedObject-buffer-refactored.patch	scoder, 2010-08-20 20:45	Refactored patch that splits all special cases into separate code blocks

Messages (12)
msg95847 - (view)	Author: Stefan Behnel (scoder) *	Date: 2009-12-01 06:19
PyUnicode_FromEncodedObject() currently calls PyObject_AsCharBuffer() to get the buffer pointer and length of a buffer supporting object. It should be changed to support the buffer protocol correctly instead. I filed this as a crash bug as the buffer protocol allows a buffer supporting object to discard its buffer when the release function is called. The decode function uses the buffer only after releasing it, thus provoking a crash for objects that implement the buffer protocol correctly in that they do not allow access to the buffer after the release.
msg112895 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-04 21:07
@Stefan can you provide a patch for this?
msg114429 - (view)	Author: Stefan Behnel (scoder) *	Date: 2010-08-20 16:43
Here's a patch against the latest py3k. The following will call the new code, for example: str(memoryview(b'abc'), 'ASCII') whereas bytes and bytesarray continue to use their own special casing code (which has also changed a bit since I wanted to avoid code duplication). For testing, I wrote a short Cython module that implements the buffer protocol in an extension type and freshly allocates a new bytes object as buffer on each access: from cpython.ref cimport Py_INCREF, Py_DECREF, PyObject cdef class Test: def __getbuffer__(self, Py_buffer* buffer, int flags): s = b'abcdefg' * 10 buffer.buf = <char> s buffer.obj = self buffer.len = len(s) Py_INCREF(s) buffer.internal = <void> s def __releasebuffer__(self, Py_buffer* buffer): Py_DECREF(<object>buffer.internal) Put it into a file "buftest.pyx", build it, start up Python 3.x and call >>> import buftest >>> print(len( str(buftest.Test(), "ASCII") )) Under the unpatched Py3, this raises a decoding exception for me when it tries to decode data from the deallocated bytes object. Other systems may happily crash here. The patched Python runtime prints '70' as expected.
msg114440 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-08-20 18:37
I think the bytearray special-casing should be removed. Otherwise one can reallocate the buffer in another thread while it is being used for decoding.
msg114444 - (view)	Author: Stefan Behnel (scoder) *	Date: 2010-08-20 19:52
Doesn't the GIL protect the bytearray buffer? Or does decoding free the GIL?
msg114446 - (view)	Author: Stefan Behnel (scoder) *	Date: 2010-08-20 20:04
Regardless of the answer, I think Antoine is right, special cases aren't special enough to break the rules, and this is a special case that's more safely handled as part of the normal buffer case. Updated patch uploaded.
msg114447 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-08-20 20:08
> Doesn't the GIL protect the bytearray buffer? Or does decoding free the GIL? Well, decoding can call arbitrary Python code and therefore, yes, release the GIL. Ironically, PyUnicode_Decode() itself (called from PyUnicode_FromEncodedObject()) fills a dummy Py_buffer object before wrapping it into a memoryview... ;)
msg114448 - (view)	Author: Stefan Behnel (scoder) *	Date: 2010-08-20 20:26
... and another complete patch that refactors the complete function to make it clearer what happens. Includes a small code duplication for the bytes object case, which I think it acceptable.
msg114453 - (view)	Author: Stefan Behnel (scoder) *	Date: 2010-08-20 20:45
Another updated patch with a readability fix (replacing the last one).
msg114454 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-08-20 20:55
Stefan Behnel wrote: > > Stefan Behnel <scoder@users.sourceforge.net> added the comment: > > Another updated patch with a readability fix (replacing the last one). While you're at it, you might as well remove references to the "char buffer" - there's no such thing in Python3 anymore. We only have read buffers in Python3.
msg114456 - (view)	Author: Stefan Behnel (scoder) *	Date: 2010-08-20 21:01
When I read the comments and exception texts in the function, it didn't occur to me that "char buffer" could have been used as a name for the old Py2 buffer interface. From the context, it totally makes sense to me that the function (which decodes a byte sequence into a unicode string) complains about not getting a "bytes object or char buffer" as input. Admittedly, this might sound slightly different when read in Python space.
msg115313 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-01 15:17
Patch committed in r84394 (py3k) and r84396 (3.1). Thank you Stefan!

History
Date	User	Action	Args
2022-04-11 14:56:55	admin	set	github: 51664
2010-09-01 15:17:12	pitrou	set	status: open -> closed versions: - Python 2.7 messages: + msg115313 resolution: fixed stage: needs patch -> resolved
2010-08-21 22:54:11	pitrou	set	nosy: + vstinner
2010-08-20 21:01:38	scoder	set	messages: + msg114456
2010-08-20 20:55:59	lemburg	set	nosy: + lemburg messages: + msg114454
2010-08-20 20:45:40	scoder	set	files: + unicodeobject-PyUnicode_FromEncodedObject-buffer-refactored.patch messages: + msg114453
2010-08-20 20:42:59	scoder	set	files: - unicodeobject-PyUnicode_FromEncodedObject-buffer-refactored.patch
2010-08-20 20:26:52	scoder	set	files: + unicodeobject-PyUnicode_FromEncodedObject-buffer-refactored.patch messages: + msg114448
2010-08-20 20:08:15	pitrou	set	messages: + msg114447
2010-08-20 20:04:42	scoder	set	files: + unicodeobject-PyUnicode_FromEncodedObject-buffer2.patch messages: + msg114446
2010-08-20 19:52:56	scoder	set	messages: + msg114444 versions: + Python 2.7
2010-08-20 18:37:02	pitrou	set	nosy: + pitrou messages: + msg114440 versions: - Python 2.7
2010-08-20 16:43:10	scoder	set	files: + unicodeobject-PyUnicode_FromEncodedObject-buffer.patch keywords: + patch messages: + msg114429
2010-08-04 21:07:42	BreamoreBoy	set	versions: + Python 2.7, - Python 3.0 nosy: + BreamoreBoy messages: + msg112895 stage: needs patch
2009-12-01 06:19:13	scoder	create