Issue 41377: memoryview of str (unicode)

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/85549

classification

Title:	memoryview of str (unicode)
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.10, Python 3.9, Python 3.8

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	eric.smith, gvanrossum, jakirkham, rhettinger, serhiy.storchaka, skrah
Priority:	normal	Keywords:

Created on 2020-07-23 20:46 by jakirkham, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (7)
msg374147 - (view)	Author: (jakirkham)	Date: 2020-07-23 20:46
When working with lower level C/C++ code, the Python Buffer Protocol[1] has been immensely useful as it allows common Python `bytes`-like objects to expose the underlying memory buffer in a pointer that C/C++ code can easily work with zero-copy. In fact `memoryview` objects can be quite handy when facilitating coercion of Python objects supporting the Python Buffer Protocol to something that Python and/or C/C++ code can use easily. This works with several Python objects, many Python APIs, and is relied on heavily by many performance conscious 3rd party libraries.

However one object that gets a lot of use in Python that doesn't support this API is the Python `str` (previously `unicode`) object (see code below).

```python
In [1]: s = "Hello World!"

In [2]: mv = memoryview(s)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-3403c1ca3811> in <module>
----> 1 mv = memoryview(s)

TypeError: memoryview: a bytes-like object is required, not 'str'
```

The canonical answer today is [to encode to `bytes` first](https://stackoverflow.com/a/54449407) and decode to `str` later. While this is ok for a smallish piece of text, it can start to slow down considerably for larger pieces of text. So being able to skip this encode/decode step can be quite impactful.

```python
In [1]: s = "Hello World!"

In [2]: %timeit s.encode();
54.9 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: s = 100_000_000 * "Hello World!"

In [4]: %timeit s.encode();
729 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

AIUI (though I could be misunderstanding things) `str` objects do use some kind of typed array of unicode characters (either 16-bit narrow or 32-bit wide). So it seems like it should be possible to expose this as a 1-D contiguous array that C/C++ code could use. Though I may be misunderstanding how `str`s actually work under-the-hood (if so apologies).

It would be quite helpful to bypass this encoding/decoding step and instead work directly with the underlying buffer in these situations where C/C++ is involved to help performance critical code.

[1]: https://docs.python.org/3/c-api/buffer.html
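For readers following along, the encode-first workaround described in this message can be sketched roughly as below. This is an illustration only, not code from the report; the choice of UTF-8 and the variable names are assumptions.

```python
# Sketch of the encode-first workaround; UTF-8 is an assumed encoding.
s = "Hello World!"

# str does not implement the buffer protocol, so memoryview(s) raises TypeError.
try:
    memoryview(s)
    error = None
except TypeError as exc:
    error = str(exc)

# Encoding copies the text into a bytes object, which does support the protocol.
data = s.encode("utf-8")
mv = memoryview(data)

# Slicing a memoryview gives a new view over the same bytes, without copying them.
head = mv[:5]

# Decoding at the end builds a new str, which is the second copy being discussed.
round_trip = str(mv, "utf-8")
```

The two copies (one in `encode`, one when decoding) are the cost the timings above are measuring for large inputs.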
msg374148 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-07-23 21:51
> AIUI (though I could be misunderstanding things) `str` objects do use some kind of typed array of unicode characters (either 16-bit narrow or 32-bit wide).

It's somewhat more complicated. The string data is stored differently depending on the maximum code point in the string. See PEP 393. The "kind" field describes this as:

1 byte (Latin-1)
2 byte (UCS-2)
4 byte (UCS-4)
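The rule described here (storage width chosen from the largest code point) can be approximated in pure Python. The helper below is hypothetical: `pep393_kind` is not a CPython API, and it only mirrors the thresholds that PEP 393 describes rather than inspecting the real object.

```python
def pep393_kind(s: str) -> int:
    """Hypothetical helper: bytes per code point that PEP 393 would select.

    It looks only at the largest code point in the string; it does not
    read CPython's internal "kind" field.
    """
    if not s:
        return 1  # assumption: treat the empty string like 1-byte text
    top = ord(max(s))
    if top < 0x100:      # fits in Latin-1
        return 1
    if top < 0x10000:    # fits in UCS-2
        return 2
    return 4             # needs UCS-4


kinds = {text: pep393_kind(text) for text in ("Hello", "héllo", "日本語", "🐍")}
```

A single character outside the Latin-1 range is enough to move the whole string to a wider representation, which is why the layout cannot be known from the type alone.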
msg374149 - (view)	Author: (jakirkham)	Date: 2020-07-23 22:06
Thanks for the clarification, Eric! :)

Is this the sort of thing that we could capture in the `format`[1] field (like with `"B"`, `"H"`, and `"I"`[2]) or are there potential issues there?

[1]: https://docs.python.org/3/c-api/buffer.html#c.Py_buffer.format
[2]: https://docs.python.org/3/library/struct.html#format-characters
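To make the question concrete, here is a rough sketch of how the `"B"`, `"H"`, and `"I"` format characters line up with 1-, 2-, and 4-byte items. The `KIND_TO_FORMAT` mapping is an assumption made for illustration, and the view is built over explicitly encoded `bytes`, not over a `str`, since `str` exposes no buffer.

```python
import struct
import sys

# Assumed mapping from item width in bytes to a struct format character.
KIND_TO_FORMAT = {1: "B", 2: "H", 4: "I"}

# "=" asks struct for standard sizes, independent of the platform.
sizes = {fmt: struct.calcsize("=" + fmt) for fmt in KIND_TO_FORMAT.values()}

# Encode to UTF-16 in native byte order (no BOM), then view the bytes as
# unsigned 16-bit items. This assumes native "H" is 2 bytes wide.
codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
text = "héllo"
units = memoryview(text.encode(codec)).cast("H")
code_units = list(units)
```

For text in the Basic Multilingual Plane, each 16-bit item equals the code point of the corresponding character; characters outside it would appear as surrogate pairs in UTF-16.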
msg374152 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-07-23 22:30
I don't think there's a Python-level API to find out the "kind", but I can't say I've looked closely. And there are no doubt problems with doing so in alternate implementations other than CPython. I'm not sure we want to expose this implementation detail, but maybe it's the case that all implementations could expose this. For example, Jython could always just say "I'm UCS-2", or something.
msg374154 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2020-07-24 00:25
I think we can close this. AFAICT, if we exposed the raw internal object with a memory view, there would be no practical way to use the data without a user having to substantially recreate the logic already present in encode() and the other string methods.
msg374158 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-07-24 04:27
I concur with Raymond. Also, it would not help to catch bugs where you get a string instead of the expected bytes object. It may "work" in tests while the string is ASCII, but fail miserably on real-world non-ASCII data.
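The failure mode described here can be illustrated without touching any internals. The sketch below uses a hypothetical check, `one_byte_per_char`, to stand in for code that silently assumes one byte per character; the sample strings are invented for the example.

```python
def one_byte_per_char(s: str) -> bool:
    """Hypothetical check: does the UTF-8 encoding have one byte per character?"""
    return len(s.encode("utf-8")) == len(s)


test_input = "hello world"        # typical ASCII-only test data
real_input = "naïve café, 10 €"   # real-world text with non-ASCII characters

passes_in_tests = one_byte_per_char(test_input)
passes_in_production = one_byte_per_char(real_input)
```

Code written against the first input would appear correct, and the assumption only breaks once non-ASCII data arrives.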
msg374202 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-07-24 18:35
We should not do this; it would expose internals that we need to keep private. The right approach would be to keep things as bytes.
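One way to read "keep things as bytes" is to hold the data as `bytes`, slice it through a `memoryview`, and decode only the pieces that must become text. The following is a sketch under that reading; the payload and field layout are made up for the example.

```python
# Assumed example payload: UTF-8 text that arrived as bytes (e.g. from a file or socket).
payload = "name=Zoë;city=Kraków".encode("utf-8")

view = memoryview(payload)

# Find the separator in the bytes, then slice the view without copying the data.
sep = payload.index(b";")
first = view[:sep]
second = view[sep + 1:]

# Decode only at the boundary where a str is actually needed.
name_field = str(first, "utf-8")
city_field = str(second, "utf-8")
```

The slices `first` and `second` can be handed to buffer-aware C/C++ code directly, which is the zero-copy path the original request was after.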

History
Date	User	Action	Args
2022-04-11 14:59:34	admin	set	github: 85549
2020-07-24 18:35:45	gvanrossum	set	status: open -> closed nosy: + gvanrossum messages: + msg374202 resolution: wont fix stage: resolved
2020-07-24 04:27:45	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg374158
2020-07-24 00:25:43	rhettinger	set	nosy: + rhettinger messages: + msg374154
2020-07-24 00:10:10	xtreak	set	nosy: + skrah
2020-07-23 22:30:05	eric.smith	set	messages: + msg374152
2020-07-23 22:06:11	jakirkham	set	messages: + msg374149
2020-07-23 21:51:21	eric.smith	set	nosy: + eric.smith messages: + msg374148
2020-07-23 20:46:27	jakirkham	create