classification
Title: [C API] No efficient C API to get UTF-8 string from unicode object.
Type: enhancement Stage: resolved
Components: C API Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: methane, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2019-12-18 12:10 by methane, last changed 2020-03-14 11:00 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
bench-asutf8.patch methane, 2020-02-03 12:14
Pull Requests
URL Status Linked Edit
PR 17659 merged methane, 2019-12-19 12:36
PR 17683 closed methane, 2019-12-23 11:56
PR 18327 merged methane, 2020-02-03 10:57
PR 18984 closed methane, 2020-03-14 03:45
PR 18985 merged methane, 2020-03-14 04:27
Messages (17)
msg358623 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-18 12:10
Assume you are writing an extension module that reads a string, for example for HTML escaping or JSON encoding.

There are two approaches:

(a) Support three KINDs in the flexible unicode representation.
(b) Get UTF-8 data from the unicode.

(a) will be the fastest on CPython, but it has a few drawbacks:

 * It is tightly coupled to the CPython implementation.  It will be slow on PyPy.
 * CPython may change its internal representation to UTF-8 in the future, like PyPy.
 * You cannot easily reuse algorithms written in C that handle `char*`.

So I believe (b) should be the preferred way.
But CPython doesn't provide an efficient way to get UTF-8 from the unicode object.

 * PyUnicode_AsUTF8AndSize(): When the unicode contains a non-ASCII character, it creates a UTF-8 cache.  The cache remains alive for longer than required, and there is an additional malloc + memcpy to create it.

 * PyUnicode_AsUTF8String(): It creates a bytes object even when the unicode object is ASCII-only or a UTF-8 cache already exists.

For speed and efficiency, I propose a new API:

```
  /* Borrow the UTF-8 C string from the unicode object.
   *
   * Stores a pointer to the UTF-8 encoding of the unicode in *utf8* and its
   * size in *size*.
   * The returned object is the owner of *utf8*.  You need to Py_DECREF() it
   * after you have finished using *utf8*.  The owner may not be the unicode
   * object itself.
   * Returns NULL if an error occurred while encoding the unicode.
   */
  PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *size);
```

When the unicode object is ASCII or already has a UTF-8 cache, this API increments the refcount of the unicode object and returns it.
Otherwise, it calls `_PyUnicode_AsUTF8String(unicode, NULL)` and returns the resulting bytes object.
msg358662 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-19 09:43
Do you have some concrete code in mind? Several times I have wished for a similar feature: get the UTF-8 cache if it exists, and encode to UTF-8 without creating a cache otherwise.

The private _PyUnicode_UTF8() macro could help:

```
if ((s = _PyUnicode_UTF8(str))) {
    size = _PyUnicode_UTF8_LENGTH(str);
    tmpbytes = NULL;
}
else {
    tmpbytes = _PyUnicode_AsUTF8String(str, "replace");
    s = PyBytes_AS_STRING(tmpbytes);
    size = PyBytes_GET_SIZE(tmpbytes);
}
```

but it is not even available outside of unicodeobject.c.

PyUnicode_BorrowUTF8() looks too complex for a public API. I am not sure that it will be easy to implement in PyPy. It also does not cover all use cases -- sometimes you want to convert to UTF-8 without using any memory allocation at all (either use an existing buffer, or raise an error if there is no cached UTF-8 and the string is not ASCII).
msg358663 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-12-19 09:46
> The returned object is the owner of *utf8*.  You need to Py_DECREF() it
> after you have finished using *utf8*.  The owner may not be the unicode object itself.

Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

Py_buffer would be nice since it already has a pointer attribute (data) and a length attribute, and there is an API to "release" a Py_buffer. It can be marked as read-only, etc.
msg358664 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-19 10:01
> Would it be possible to use a "container" object like a Py_buffer?

Looks like a good idea.

int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view, const char *errors)
msg358665 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-19 10:04
> Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

That looks like a nice idea!  Py_buffer.obj is decref-ed when the buffer is released.
https://docs.python.org/3/c-api/buffer.html#c.PyBuffer_Release


```
int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view)
{
    if (!PyUnicode_Check(unicode)) {
        PyErr_BadArgument();
        return NULL;
    }
    if (PyUnicode_READY(unicode) == -1) {
        return NULL;
    }

    if (PyUnicode_UTF8(unicode) != NULL) {
        return PyBuffer_FillInfo(view, unicode,
                                 PyUnicode_UTF8(unicode),
                                 PyUnicode_UTF8_LENGTH(unicode),
                                 1, PyBUF_CONTIG_RO);
    }
    PyObject *bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL) {
        return NULL;
    }
    return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
}
```
msg358666 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-19 10:05
s/return NULL/return -1/g
msg358670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-12-19 10:37
return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);

Don't you need to DECREF bytes somehow, at least, in case of failure?
msg358673 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-19 11:20
> Don't you need to DECREF bytes somehow, at least, in case of failure?

Thanks.  I will create a pull request with suggested changes.
msg358778 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-21 18:57
I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier.
msg358860 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-25 07:41
> I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier.

I created a post about this issue on discuss.python.org:
https://discuss.python.org/t/better-api-for-encoding-unicode-objects-with-utf-8/2909
msg361284 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-02-03 12:10
I am still not sure whether we should add a new API only to avoid the cache.

* PyUnicode_AsUTF8String: when we need a bytes object or want to avoid the cache.
* PyUnicode_AsUTF8AndSize: when we need a C string and the cache is acceptable.


With PR 18327, PyUnicode_AsUTF8AndSize becomes 10+% faster than the master branch, and the same speed as PyUnicode_AsUTF8String.


## vs master

$ ./python -m pyperf timeit --compare-to=../cpython/python --python-names master:patched -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
master: ..................... 96.6 us +- 3.3 us
patched: ..................... 83.3 us +- 0.3 us

Mean +- std dev: [master] 96.6 us +- 3.3 us -> [patched] 83.3 us +- 0.3 us: 1.16x faster (-14%)


## vs AsUTF8String

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 83.2 us +- 0.2 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 81.9 us +- 2.1 us


## vs AsUTF8String (ASCII)

If we cannot accept the cache, PyUnicode_AsUTF8String is slower than PyUnicode_AsUTF8 when the unicode is an ASCII string.  PyUnicode_GetUTF8Buffer helps only in this case.

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 37.5 us +- 1.7 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 46.4 us +- 1.6 us
msg361285 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-02-03 12:14
The attached patch contains the benchmark functions I used in the previous post.
msg362766 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-02-27 04:49
New changeset 02a4d57263a9846de35b0db12763ff9e7326f62c by Inada Naoki in branch 'master':
bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327)
https://github.com/python/cpython/commit/02a4d57263a9846de35b0db12763ff9e7326f62c
msg364141 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-03-14 03:43
New changeset c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b by Inada Naoki in branch 'master':
bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)
https://github.com/python/cpython/commit/c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b
msg364142 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-03-14 04:24
I'm sorry about merging PR 17659, but I cannot find enough usage examples of _PyUnicode_GetUTF8Buffer.

PyUnicode_AsUTF8AndSize is optimized now, and the utf8 cache is not so bad in most cases.  So _PyUnicode_GetUTF8Buffer does not seem worth it.

I will revert PR 17659.
msg364146 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-03-14 06:59
New changeset 3a8c56295d6272ad2177d2de8af4c3f824f3ef92 by Inada Naoki in branch 'master':
Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" (GH-18985)
https://github.com/python/cpython/commit/3a8c56295d6272ad2177d2de8af4c3f824f3ef92
msg364151 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-03-14 11:00
I thought there were at least 3-4 use cases in the core and stdlib.
History
Date                 User              Action  Args
2020-03-14 11:00:07  serhiy.storchaka  set     messages: + msg364151
2020-03-14 06:59:47  methane           set     status: open -> closed; resolution: fixed; stage: patch review -> resolved
2020-03-14 06:59:31  methane           set     messages: + msg364146
2020-03-14 04:27:50  methane           set     pull_requests: + pull_request18333
2020-03-14 04:24:17  methane           set     messages: + msg364142
2020-03-14 03:45:35  methane           set     pull_requests: + pull_request18332
2020-03-14 03:43:26  methane           set     messages: + msg364141
2020-02-27 04:49:03  methane           set     messages: + msg362766
2020-02-03 12:14:13  methane           set     files: + bench-asutf8.patch; messages: + msg361285
2020-02-03 12:10:03  methane           set     messages: + msg361284
2020-02-03 10:57:31  methane           set     pull_requests: + pull_request17701
2019-12-25 07:41:22  methane           set     messages: + msg358860
2019-12-23 11:56:16  methane           set     pull_requests: + pull_request17140
2019-12-21 18:57:08  serhiy.storchaka  set     messages: + msg358778
2019-12-19 12:36:22  methane           set     keywords: + patch; stage: patch review; pull_requests: + pull_request17127
2019-12-19 11:20:38  methane           set     messages: + msg358673
2019-12-19 10:37:20  vstinner          set     messages: + msg358670
2019-12-19 10:05:12  methane           set     messages: + msg358666
2019-12-19 10:04:29  methane           set     nosy: - skrah; messages: + msg358665
2019-12-19 10:01:31  serhiy.storchaka  set     nosy: + skrah; messages: + msg358664
2019-12-19 09:46:19  vstinner          set     messages: + msg358663
2019-12-19 09:43:13  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg358662
2019-12-18 15:32:26  vstinner          set     title: No efficient API to get UTF-8 string from unicode object. -> [C API] No efficient C API to get UTF-8 string from unicode object.
2019-12-18 15:29:27  vstinner          set     nosy: + vstinner
2019-12-18 12:10:15  methane           create