Title: memoryview of str (unicode)
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.10, Python 3.9, Python 3.8
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: eric.smith, gvanrossum, jakirkham, rhettinger, serhiy.storchaka, skrah
Priority: normal Keywords:

Created on 2020-07-23 20:46 by jakirkham, last changed 2020-07-24 18:35 by gvanrossum. This issue is now closed.

Messages (7)
msg374147 - (view) Author: (jakirkham) Date: 2020-07-23 20:46
When working with lower-level C/C++ code, the Python Buffer Protocol[1] has been immensely useful, as it allows common Python `bytes`-like objects to expose their underlying memory buffer as a pointer that C/C++ code can work with zero-copy. In fact, `memoryview` objects can be quite handy for coercing Python objects that support the Buffer Protocol into something that Python and/or C/C++ code can use easily. This works with several Python objects and many Python APIs, and is relied on heavily by many performance-conscious third-party libraries.

However, one heavily used Python object that doesn't support this API is `str` (previously `unicode`); see the session below.

In [1]: s = "Hello World!"                                                      

In [2]: mv = memoryview(s)                                                      
TypeError                                 Traceback (most recent call last)
<ipython-input-2-3403c1ca3811> in <module>
----> 1 mv = memoryview(s)

TypeError: memoryview: a bytes-like object is required, not 'str'

The canonical answer today is to encode to `bytes` first and decode to `str` later. While this is OK for a smallish piece of text, it can slow down considerably for larger pieces of text. So being able to skip this encode/decode step can be quite impactful.

In [1]: s = "Hello World!"                                                      

In [2]: %timeit s.encode();                                                     
54.9 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: s = 100_000_000 * "Hello World!"                                        

In [4]: %timeit s.encode();                                                     
729 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
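
For reference, the encode/decode round trip being discussed can be sketched roughly as follows. This is only an illustration of the workaround, assuming UTF-8 as the encoding; the variable names are made up for the example.

```python
# Round trip: str -> bytes -> memoryview -> bytes -> str.
text = "Hello World!"

encoded = text.encode("utf-8")   # copies the text into a new bytes object
view = memoryview(encoded)       # zero-copy view over the encoded bytes

# Buffer-protocol consumers can read from the view without further copies.
first_word = view[:5].tobytes()  # tobytes() copies only this slice out

decoded = view.tobytes().decode("utf-8")  # back to str (another copy)
print(first_word, decoded)
```

The two copies (one in `encode()`, one in `decode()`) are the cost that grows with the size of the text.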

AIUI (though I could be misunderstanding things) `str` objects do use some kind of typed array of unicode characters (either 16-bit narrow or 32-bit wide). So it seems like it *should* be possible to expose this as a 1-D contiguous array that C/C++ code could use. Though I may be misunderstanding how `str` objects actually work under the hood (if so, apologies).

It would be quite helpful to bypass this encoding/decoding step and instead work directly with the underlying buffer in situations where C/C++ is involved, to help performance-critical code.

msg374148 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-07-23 21:51
> AIUI (though I could be misunderstanding things) `str` objects do use some kind of typed array of unicode characters (either 16-bit narrow or 32-bit wide). 

It's somewhat more complicated. The string data is stored differently depending on the maximum code point in the string. See PEP 393.

The "kind" field describes this as:
1 byte (Latin-1)
2 byte (UCS-2)
4 byte (UCS-4)
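
One way to observe the three kinds from Python is to compare `sys.getsizeof` for strings of different lengths. This is a rough sketch that relies on CPython-specific behaviour; `bytes_per_char` is a hypothetical helper written for this example, not an existing API.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per code point from the size difference of two strings.

    CPython-specific: getsizeof reports the header plus the character data,
    so the header cancels out when two strings of the same kind are compared.
    """
    return (sys.getsizeof(ch * (n + 2)) - sys.getsizeof(ch * 2)) // n

widths = {
    "ASCII": bytes_per_char("a"),
    "Latin-1": bytes_per_char("\xe9"),       # e with acute accent, U+00E9
    "BMP": bytes_per_char("\u20ac"),         # euro sign, U+20AC
    "non-BMP": bytes_per_char("\U0001f600"), # emoji, U+1F600
}
print(widths)
```

On a CPython build implementing PEP 393 the widths should come out as 1, 1, 2 and 4 bytes respectively; other implementations may report something else entirely.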
msg374149 - (view) Author: (jakirkham) Date: 2020-07-23 22:06
Thanks for the clarification, Eric! :)

Is this the sort of thing that we could capture in the `format`[1] field (like with `"B"`, `"H"`, and `"I"`[2]) or are there potential issues there?
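
To illustrate what the `format` field looks like for objects that already support the buffer protocol, here is a small sketch using `array.array` with the three type codes mentioned above. It says nothing about how `str` would behave; it only shows the existing mechanism.

```python
from array import array

# Build one memoryview per type code and inspect its format and item size.
views = {
    typecode: memoryview(array(typecode, [72, 105]))
    for typecode in ("B", "H", "I")
}

for typecode, view in views.items():
    # itemsize for "H" and "I" is platform-dependent (commonly 2 and 4).
    print(typecode, view.format, view.itemsize, view.tolist())
```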

msg374152 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-07-23 22:30
I don't think there's a Python-level API to find out the "kind", but I can't say I've looked closely. And there are no doubt problems with doing so in alternate implementations other than CPython. I'm not sure we want to expose this implementation detail, but maybe it's the case that all implementations could expose this. For example, Jython could always just say "I'm UCS-2", or something.
msg374154 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-07-24 00:25
I think we can close this.  AFAICT, if we exposed the raw internal object with a memory view, there would be no practical way to use the data without a user having to substantially recreate the logic already present in encode() and the other string methods.
msg374158 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-07-24 04:27
I concur with Raymond.

Also, it would not help to catch bugs where you get a string instead of the expected bytes object. It may "work" in tests while the string is ASCII, but fail miserably on real-world non-ASCII data.
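
A rough simulation of that failure mode, assuming a hypothetical `str` buffer that exposed the 1-byte (Latin-1) storage described in PEP 393. Both helper functions are invented for the illustration; no such buffer exists.

```python
def raw_one_byte_storage(s):
    """Hypothetical stand-in for a str buffer: Latin-1 bytes (1-byte kind)."""
    return s.encode("latin-1")

def consumer_expecting_utf8(buf):
    """A consumer that assumes it was handed UTF-8 encoded bytes."""
    return bytes(buf).decode("utf-8")

# ASCII text: Latin-1 and UTF-8 agree, so the mistake goes unnoticed.
ascii_ok = consumer_expecting_utf8(raw_one_byte_storage("Hello"))

# Non-ASCII text: the raw Latin-1 bytes are not valid UTF-8.
try:
    consumer_expecting_utf8(raw_one_byte_storage("H\xe9llo"))
    non_ascii_failed = False
except UnicodeDecodeError:
    non_ascii_failed = True

print(ascii_ok, non_ascii_failed)
```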
msg374202 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-07-24 18:35
We should not do this; it would expose internals that we need to keep private. The right approach would be to keep things as bytes.
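
A minimal sketch of what "keeping things as bytes" might look like in practice, assuming the data can stay in a `bytearray` for the whole pipeline and is only decoded at the very end. The names are illustrative.

```python
# Keep the payload as bytes end-to-end; decode only at the boundary.
payload = bytearray(b"Hello World!")
view = memoryview(payload)

word = view[6:11]            # zero-copy slice over the same buffer
shares_buffer = word.obj is payload

view[0:5] = b"HELLO"         # in-place update through the view, no copy

text = payload.decode("utf-8")  # single decode when a str is finally needed
print(bytes(word), shares_buffer, text)
```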
History
Date                 User              Action  Args
2020-07-24 18:35:45  gvanrossum        set     status: open -> closed; nosy: + gvanrossum; messages: + msg374202; resolution: wont fix; stage: resolved
2020-07-24 04:27:45  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg374158
2020-07-24 00:25:43  rhettinger        set     nosy: + rhettinger; messages: + msg374154
2020-07-24 00:10:10  xtreak            set     nosy: + skrah
2020-07-23 22:30:05  eric.smith        set     messages: + msg374152
2020-07-23 22:06:11  jakirkham         set     messages: + msg374149
2020-07-23 21:51:21  eric.smith        set     nosy: + eric.smith; messages: + msg374148
2020-07-23 20:46:27  jakirkham         create