[C API] No efficient C API to get UTF-8 string from unicode object. #83268

methane · 2019-12-18T12:10:15Z

BPO	39087
Nosy	@vstinner, @methane, @serhiy-storchaka
PRs	bpo-39087: Add _PyUnicode_GetUTF8Buffer() #17659 bpo-39087: Make PyUnicode_AsUTF8AndSize faster. #17683 bpo-39087: Optimize PyUnicode_AsUTF8AndSize(). #18327 bpo-39087: Use _PyUnicode_GetUTF8Buffer() in fileobject.c #18984 Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" #18985
Files	bench-asutf8.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2020-03-14.06:59:47.614>
created_at = <Date 2019-12-18.12:10:15.464>
labels = ['expert-C-API', 'type-feature', '3.9']
title = '[C API] No efficient C API to get UTF-8 string from unicode object.'
updated_at = <Date 2020-03-14.11:00:07.223>
user = 'https://github.com/methane'

bugs.python.org fields:

activity = <Date 2020-03-14.11:00:07.223>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date = <Date 2020-03-14.06:59:47.614>
closer = 'methane'
components = ['C API']
creation = <Date 2019-12-18.12:10:15.464>
creator = 'methane'
dependencies = []
files = ['48879']
hgrepos = []
issue_num = 39087
keywords = ['patch']
message_count = 17.0
messages = ['358623', '358662', '358663', '358664', '358665', '358666', '358670', '358673', '358778', '358860', '361284', '361285', '362766', '364141', '364142', '364146', '364151']
nosy_count = 3.0
nosy_names = ['vstinner', 'methane', 'serhiy.storchaka']
pr_nums = ['17659', '17683', '18327', '18984', '18985']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue39087'
versions = ['Python 3.9']

methane · 2019-12-18T12:10:15Z

Assume you are writing an extension module that reads string. For example, HTML escape or JSON encode.

There are two courses:

(a) Support three KINDs in the flexible unicode representation.
(b) Get UTF-8 data from the unicode.

(a) will be the fastest on CPython, but there are few drawbacks:

This is tightly coupled with CPython implementation. It will be slow on PyPy.
CPython may change the internal representation to UTF-8 in the future, like PyPy.
You can not easily reuse algorithms written in C that handle char*.

So I believe (b) should be the preferred way.
But CPython doesn't provide an efficient way to get UTF-8 from the unicode object.

PyUnicode_AsUTF8AndSize(): When the unicode contains non-ASCII character, it will create a UTF-8 cache. The cache will be remained for longer than required. And there is additional malloc + memcpy to create the cache.
PyUnicode_DecodeUTF8(): It creates bytes object even when the unicode object is ASCII-only or there is a UTF-8 cache already.

For speed and efficiency, I propose a new API:

  /* Borrow the UTF-8 C string from the unicode.
   *
   * Store a pointer to the UTF-8 encoding of the unicode to *utf8* and its size to *size*.
   * The returned object is the owner of the *utf8*.  You need to Py_DECREF() it after
   * you finished to using the *utf8*.  The owner may be not the unicode.
   * Returns NULL when the error occurred while decoding the unicode.
   */
  PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *len);

When the unicode object is ASCII or has UTF-8 cache, this API increment refcnt of the unicode and return it.
Otherwise, this API calls _PyUnicode_AsUTF8String(unicode, NULL) and return it.

serhiy-storchaka · 2019-12-19T09:43:13Z

Do you mean some concrete code? Several times I wished similar feature. To get a UTF-8 cache if it exists and encode to UTF-8 without creating a cache otherwise.

The private _PyUnicode_UTF8() macro could help

if ((s = _PyUnicode_UTF8(str))) {
    size = _PyUnicode_UTF8_LENGTH(str);
    tmpbytes = NULL;
}
else {
    tmpbytes = _PyUnicode_AsUTF8String(str, "replace");
    s = PyBytes_AS_STRING(tmpbytes);
    size = PyBytes_GET_SIZE(tmpbytes);
}

but it is not even available outside of unicodeobject.c.

PyUnicode_BorrowUTF8() looks too complex for the public API. I am not sure that it will be easy to implement it in PyPy. It also does not cover all use cases -- sometimes you want to convert to UTF-8 but does not use any memory allocation at all (either use an existing buffer or raise an error if there is no cached UTF-8 or the string is not ASCII).

vstinner · 2019-12-19T09:46:19Z

The returned object is the owner of the *utf8*. You need to Py_DECREF() it after
you finished to using the *utf8*. The owner may be not the unicode.

Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

Py_buffer would be nice since it already has a pointer attribute (data) and a length attribute, and there is an API to "release" a Py_buffer. It can be marked as read-only, etc.

serhiy-storchaka · 2019-12-19T10:01:32Z

Would it be possible to use a "container" object like a Py_buffer?

Looks like a good idea.

int PyUnicode_GetUTF8Buffer(Py_buffer *view, const char *errors)

methane · 2019-12-19T10:04:30Z

Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

It looks nice idea! Py_buffer.obj is decref-ed when releasing the buffer.
https://docs.python.org/3/c-api/buffer.html#c.PyBuffer_Release

int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view)
{
    if (!PyUnicode_Check(unicode)) {
        PyErr_BadArgument();
        return NULL;
    }
    if (PyUnicode_READY(unicode) == -1) {
        return NULL;
    }

    if (PyUnicode_UTF8(unicode) != NULL) {
        return PyBuffer_FillInfo(view, unicode,
                                 PyUnicode_UTF8(unicode),
                                 PyUnicode_UTF8_LENGTH(unicode),
                                 1, PyBUF_CONTIG_RO);
    }
    PyObject *bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL) {
        return NULL;
    }
    return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
}

methane · 2019-12-19T10:05:12Z

s/return NULL/return -1/g

vstinner · 2019-12-19T10:37:21Z

return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);

Don't you need to DECREF bytes somehow, at least, in case of failure?

methane · 2019-12-19T11:20:38Z

Don't you need to DECREF bytes somehow, at least, in case of failure?

Thanks. I will create a pull request with suggested changes.

serhiy-storchaka · 2019-12-21T18:57:08Z

I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody have objections or better idea, it is better to know earlier.

methane · 2019-12-25T07:41:22Z

I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody have objections or better idea, it is better to know earlier.

I created a post about this issue in discuss.python.org.
https://discuss.python.org/t/better-api-for-encoding-unicode-objects-with-utf-8/2909

methane · 2020-02-03T12:10:03Z

I am still not sure about we should add new API only for avoiding cache.

PyUnicode_AsUTF8String : When we need bytes or want to avoid cache.
PyUnicode_AsUTF8AndSize : When we need C string, and cache is acceptable.

With PR-18327, PyUnicode_AsUTF8AndSize become 10+% faster than master branch, and same speed to PyUnicode_AsUTF8String.

## vs master

$ ./python -m pyperf timeit --compare-to=../cpython/python --python-names master:patched -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
master: ..................... 96.6 us +- 3.3 us
patched: ..................... 83.3 us +- 0.3 us

Mean +- std dev: [master] 96.6 us +- 3.3 us -> [patched] 83.3 us +- 0.3 us: 1.16x faster (-14%)

## vs AsUTF8String

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 83.2 us +- 0.2 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 81.9 us +- 2.1 us

## vs AsUTF8String (ASCII)

If we can not accept cache, PyUnicode_AsUTF8String is slower than PyUnicode_AsUTF8 when the unicode is ASCII string. PyUnicode_GetUTF8Buffer helps only this case.

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 37.5 us +- 1.7 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 46.4 us +- 1.6 us

methane · 2020-02-03T12:14:13Z

Attached patch is the benchmark function I used in previous post.

methane · 2020-02-27T04:49:03Z

New changeset 02a4d57 by Inada Naoki in branch 'master':
bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327)
02a4d57

methane · 2020-03-14T03:43:26Z

New changeset c7ad974 by Inada Naoki in branch 'master':
bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)
c7ad974

methane · 2020-03-14T04:24:17Z

I'm sorry about merging PR 18327, but I can not find enough usage example of the _PyUnicode_GetUTF8Buffer.

PyUnicode_AsUTF8AndSize is optimized, and utf8_cache is not so bad in most case. So _PyUnicode_GetUTF8Buffer seems not worth enough.

I will revert PR 18327.

methane · 2020-03-14T06:59:32Z

New changeset 3a8c562 by Inada Naoki in branch 'master':
Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" (GH-18985)
3a8c562

serhiy-storchaka · 2020-03-14T11:00:07Z

I though there are at least 3-4 use cases in the core and stdlib.

methane added 3.9 only security fixes topic-C-API type-feature A feature request or enhancement labels Dec 18, 2019

vstinner changed the title ~~No efficient API to get UTF-8 string from unicode object.~~ [C API] No efficient C API to get UTF-8 string from unicode object. Dec 18, 2019

methane closed this as completed Mar 14, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C API] No efficient C API to get UTF-8 string from unicode object. #83268

[C API] No efficient C API to get UTF-8 string from unicode object. #83268

methane commented Dec 18, 2019

methane commented Dec 18, 2019

serhiy-storchaka commented Dec 19, 2019

vstinner commented Dec 19, 2019

serhiy-storchaka commented Dec 19, 2019

methane commented Dec 19, 2019

methane commented Dec 19, 2019

vstinner commented Dec 19, 2019

methane commented Dec 19, 2019

serhiy-storchaka commented Dec 21, 2019

methane commented Dec 25, 2019

methane commented Feb 3, 2020

methane commented Feb 3, 2020

methane commented Feb 27, 2020

methane commented Mar 14, 2020

methane commented Mar 14, 2020

methane commented Mar 14, 2020

serhiy-storchaka commented Mar 14, 2020

[C API] No efficient C API to get UTF-8 string from unicode object. #83268

[C API] No efficient C API to get UTF-8 string from unicode object. #83268

Comments

methane commented Dec 18, 2019

methane commented Dec 18, 2019

serhiy-storchaka commented Dec 19, 2019

vstinner commented Dec 19, 2019

serhiy-storchaka commented Dec 19, 2019

methane commented Dec 19, 2019

methane commented Dec 19, 2019

vstinner commented Dec 19, 2019

methane commented Dec 19, 2019

serhiy-storchaka commented Dec 21, 2019

methane commented Dec 25, 2019

methane commented Feb 3, 2020

methane commented Feb 3, 2020

methane commented Feb 27, 2020

methane commented Mar 14, 2020

methane commented Mar 14, 2020

methane commented Mar 14, 2020

serhiy-storchaka commented Mar 14, 2020