msg358623 |
Author: Inada Naoki (methane) * |
Date: 2019-12-18 12:10 |
Assume you are writing an extension module that reads strings, for example to HTML-escape or JSON-encode them.
There are two approaches:
(a) Support three KINDs in the flexible unicode representation.
(b) Get UTF-8 data from the unicode.
(a) will be the fastest on CPython, but it has a few drawbacks:
* It is tightly coupled to the CPython implementation and will be slow on PyPy.
* CPython may change its internal representation to UTF-8 in the future, like PyPy.
* You cannot easily reuse algorithms written in C that operate on `char*`.
So I believe (b) should be the preferred way.
But CPython doesn't provide an efficient way to get UTF-8 from a unicode object.
* PyUnicode_AsUTF8AndSize(): when the unicode contains non-ASCII characters, it creates a UTF-8 cache on the object. The cache is kept alive longer than required, and creating it costs an extra malloc + memcpy.
* PyUnicode_AsUTF8String(): it creates a bytes object even when the unicode object is ASCII-only or already has a UTF-8 cache.
For speed and efficiency, I propose a new API:
```
/* Borrow the UTF-8 C string from a unicode object.
 *
 * Stores a pointer to the UTF-8 encoding of the unicode into *utf8* and its length into *len*.
 * The returned object owns *utf8*; you need to Py_DECREF() it after you have finished
 * using *utf8*. Note that the owner may not be the unicode object itself.
 * Returns NULL if an error occurred while encoding the unicode.
 */
PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *len);
```
When the unicode object is ASCII or already has a UTF-8 cache, this API increments the refcount of the unicode and returns it.
Otherwise, it calls `_PyUnicode_AsUTF8String(unicode, NULL)` and returns the resulting bytes object.
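To illustrate the intent, a caller would use it roughly like this (a minimal sketch of the *proposed* API; `escape_to_buffer` is a hypothetical helper, and PyUnicode_BorrowUTF8 does not exist in CPython today):
```
#include <Python.h>

/* Hypothetical caller of the proposed PyUnicode_BorrowUTF8(). */
static int
escape_to_buffer(PyObject *unicode)
{
    const char *utf8;
    Py_ssize_t len;
    PyObject *owner = PyUnicode_BorrowUTF8(unicode, &utf8, &len);
    if (owner == NULL) {
        return -1;              /* encoding failed */
    }
    /* ... process utf8[0 .. len) without copying ... */
    Py_DECREF(owner);           /* releases the borrowed UTF-8 data */
    return 0;
}
```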
|
msg358662 |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2019-12-19 09:43 |
Do you have some concrete code in mind? I have wished for a similar feature several times: get the UTF-8 cache if it exists, and encode to UTF-8 without creating a cache otherwise.
The private _PyUnicode_UTF8() macro could help:
```
    if ((s = _PyUnicode_UTF8(str))) {
        size = _PyUnicode_UTF8_LENGTH(str);
        tmpbytes = NULL;
    }
    else {
        tmpbytes = _PyUnicode_AsUTF8String(str, "replace");
        s = PyBytes_AS_STRING(tmpbytes);
        size = PyBytes_GET_SIZE(tmpbytes);
    }
```
but it is not even available outside of unicodeobject.c.
PyUnicode_BorrowUTF8() looks too complex for the public API, and I am not sure it would be easy to implement in PyPy. It also does not cover all use cases -- sometimes you want to convert to UTF-8 without any memory allocation at all (either by using an existing buffer, or by raising an error if there is no cached UTF-8 and the string is not ASCII).
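A minimal sketch of that "no allocation at all" use case (the helper name `get_utf8_noalloc` is mine, purely for illustration; with only public APIs it can succeed just for ASCII strings, which is part of the problem):
```
#include <Python.h>

/* Illustrative only: succeed when UTF-8 data is already available without
 * allocation, otherwise report failure. */
static int
get_utf8_noalloc(PyObject *str, const char **utf8, Py_ssize_t *size)
{
    if (PyUnicode_READY(str) == -1) {
        return -1;
    }
    if (PyUnicode_IS_ASCII(str)) {
        /* ASCII data is valid UTF-8 and is stored inline in the object. */
        *utf8 = (const char *)PyUnicode_1BYTE_DATA(str);
        *size = PyUnicode_GET_LENGTH(str);
        return 0;
    }
    /* A cached UTF-8 representation cannot be reached from outside
     * unicodeobject.c, which is the limitation discussed above. */
    PyErr_SetString(PyExc_ValueError, "no UTF-8 data available without allocation");
    return -1;
}
```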
|
msg358663 |
Author: STINNER Victor (vstinner) * |
Date: 2019-12-19 09:46 |
> The returned object owns *utf8*; you need to Py_DECREF() it after you have finished
> using *utf8*. Note that the owner may not be the unicode object itself.
Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?
Py_buffer would be nice since it already has a pointer attribute (data) and a length attribute, and there is an API to "release" a Py_buffer. It can be marked as read-only, etc.
|
msg358664 |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2019-12-19 10:01 |
> Would it be possible to use a "container" object like a Py_buffer?
Looks like a good idea.
int PyUnicode_GetUTF8Buffer(Py_buffer *view, const char *errors)
|
msg358665 |
Author: Inada Naoki (methane) * |
Date: 2019-12-19 10:04 |
> Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?
That looks like a nice idea! Py_buffer.obj is decref'ed when the buffer is released.
https://docs.python.org/3/c-api/buffer.html#c.PyBuffer_Release
```
int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view)
{
    if (!PyUnicode_Check(unicode)) {
        PyErr_BadArgument();
        return NULL;
    }
    if (PyUnicode_READY(unicode) == -1) {
        return NULL;
    }
    if (PyUnicode_UTF8(unicode) != NULL) {
        return PyBuffer_FillInfo(view, unicode,
                                 PyUnicode_UTF8(unicode),
                                 PyUnicode_UTF8_LENGTH(unicode),
                                 1, PyBUF_CONTIG_RO);
    }
    PyObject *bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL) {
        return NULL;
    }
    return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
}
```
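For reference, the caller side would look roughly like this (a minimal sketch assuming the signature above; `consume_utf8` is just an illustrative name):
```
#include <Python.h>

/* Hypothetical caller: borrow UTF-8 data through a Py_buffer and release it when done. */
static int
consume_utf8(PyObject *unicode)
{
    Py_buffer view;
    if (PyUnicode_GetUTF8Buffer(unicode, &view) < 0) {
        return -1;
    }
    const char *utf8 = (const char *)view.buf;
    Py_ssize_t size = view.len;
    /* ... process utf8[0 .. size) ... */
    PyBuffer_Release(&view);    /* drops the reference held in view.obj */
    return 0;
}
```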
|
msg358666 |
Author: Inada Naoki (methane) * |
Date: 2019-12-19 10:05 |
s/return NULL/return -1/g
|
msg358670 |
Author: STINNER Victor (vstinner) * |
Date: 2019-12-19 10:37 |
return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
Don't you need to DECREF bytes somehow, at least, in case of failure?
|
msg358673 |
Author: Inada Naoki (methane) * |
Date: 2019-12-19 11:20 |
> Don't you need to DECREF bytes somehow, at least, in case of failure?
Thanks. I will create a pull request with the suggested changes.
|
msg358778 |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2019-12-21 18:57 |
I like this idea, but I think we should at least notify Python-Dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier.
|
msg358860 |
Author: Inada Naoki (methane) * |
Date: 2019-12-25 07:41 |
> I like this idea, but I think we should at least notify Python-Dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier.
I created a post about this issue on discuss.python.org:
https://discuss.python.org/t/better-api-for-encoding-unicode-objects-with-utf-8/2909
|
msg361284 |
Author: Inada Naoki (methane) * |
Date: 2020-02-03 12:10 |
I am still not sure whether we should add a new API just to avoid creating the cache. The rule of thumb today (sketched below) would be:
* PyUnicode_AsUTF8String: when we need a bytes object, or want to avoid the cache.
* PyUnicode_AsUTF8AndSize: when we need a C string and the cache is acceptable.
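A minimal sketch of that choice (the helper name `write_utf8` and the `allow_cache` flag are mine, just for illustration):
```
#include <Python.h>

/* Illustrative only: pick between the two existing APIs depending on
 * whether caching the UTF-8 representation on the str object is acceptable. */
static int
write_utf8(PyObject *str, int allow_cache)
{
    if (allow_cache) {
        Py_ssize_t size;
        const char *utf8 = PyUnicode_AsUTF8AndSize(str, &size);
        if (utf8 == NULL) {
            return -1;
        }
        /* ... consume utf8[0 .. size); the cache stays on str ... */
        return 0;
    }
    /* Avoid the cache: pay for a temporary bytes object instead. */
    PyObject *bytes = PyUnicode_AsUTF8String(str);
    if (bytes == NULL) {
        return -1;
    }
    /* ... consume PyBytes_AS_STRING(bytes)[0 .. PyBytes_GET_SIZE(bytes)) ... */
    Py_DECREF(bytes);
    return 0;
}
```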
With PR 18327, PyUnicode_AsUTF8AndSize becomes 10+% faster than the master branch, and about the same speed as PyUnicode_AsUTF8String.
## vs master
$ ./python -m pyperf timeit --compare-to=../cpython/python --python-names master:patched -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
master: ..................... 96.6 us +- 3.3 us
patched: ..................... 83.3 us +- 0.3 us
Mean +- std dev: [master] 96.6 us +- 3.3 us -> [patched] 83.3 us +- 0.3 us: 1.16x faster (-14%)
## vs AsUTF8String
$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 83.2 us +- 0.2 us
$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 81.9 us +- 2.1 us
## vs AsUTF8String (ASCII)
If we cannot accept the cache, PyUnicode_AsUTF8String is slower than PyUnicode_AsUTF8 when the unicode is an ASCII string; PyUnicode_GetUTF8Buffer helps only in this case.
$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 37.5 us +- 1.7 us
$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 46.4 us +- 1.6 us
|
msg361285 |
Author: Inada Naoki (methane) * |
Date: 2020-02-03 12:14 |
The attached patch contains the benchmark functions I used in the previous post.
|
msg362766 |
Author: Inada Naoki (methane) * |
Date: 2020-02-27 04:49 |
New changeset 02a4d57263a9846de35b0db12763ff9e7326f62c by Inada Naoki in branch 'master':
bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327)
https://github.com/python/cpython/commit/02a4d57263a9846de35b0db12763ff9e7326f62c
|
msg364141 |
Author: Inada Naoki (methane) * |
Date: 2020-03-14 03:43 |
New changeset c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b by Inada Naoki in branch 'master':
bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)
https://github.com/python/cpython/commit/c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b
|
msg364142 |
Author: Inada Naoki (methane) * |
Date: 2020-03-14 04:24 |
I'm sorry about merging PR 17659, but I cannot find enough usage examples for _PyUnicode_GetUTF8Buffer.
PyUnicode_AsUTF8AndSize is now optimized, and the utf8 cache is not so bad in most cases, so _PyUnicode_GetUTF8Buffer does not seem worth it.
I will revert PR 17659.
|
msg364146 |
Author: Inada Naoki (methane) * |
Date: 2020-03-14 06:59 |
New changeset 3a8c56295d6272ad2177d2de8af4c3f824f3ef92 by Inada Naoki in branch 'master':
Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" (GH-18985)
https://github.com/python/cpython/commit/3a8c56295d6272ad2177d2de8af4c3f824f3ef92
|
msg364151 |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2020-03-14 11:00 |
I thought there were at least 3-4 use cases in the core and the stdlib.
|
Date | User | Action | Args
2022-04-11 14:59:24 | admin | set | github: 83268
2020-03-14 11:00:07 | serhiy.storchaka | set | messages: + msg364151
2020-03-14 06:59:47 | methane | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2020-03-14 06:59:31 | methane | set | messages: + msg364146
2020-03-14 04:27:50 | methane | set | pull_requests: + pull_request18333
2020-03-14 04:24:17 | methane | set | messages: + msg364142
2020-03-14 03:45:35 | methane | set | pull_requests: + pull_request18332
2020-03-14 03:43:26 | methane | set | messages: + msg364141
2020-02-27 04:49:03 | methane | set | messages: + msg362766
2020-02-03 12:14:13 | methane | set | files: + bench-asutf8.patch; messages: + msg361285
2020-02-03 12:10:03 | methane | set | messages: + msg361284
2020-02-03 10:57:31 | methane | set | pull_requests: + pull_request17701
2019-12-25 07:41:22 | methane | set | messages: + msg358860
2019-12-23 11:56:16 | methane | set | pull_requests: + pull_request17140
2019-12-21 18:57:08 | serhiy.storchaka | set | messages: + msg358778
2019-12-19 12:36:22 | methane | set | keywords: + patch; stage: patch review; pull_requests: + pull_request17127
2019-12-19 11:20:38 | methane | set | messages: + msg358673
2019-12-19 10:37:20 | vstinner | set | messages: + msg358670
2019-12-19 10:05:12 | methane | set | messages: + msg358666
2019-12-19 10:04:29 | methane | set | nosy: - skrah; messages: + msg358665
2019-12-19 10:01:31 | serhiy.storchaka | set | nosy: + skrah; messages: + msg358664
2019-12-19 09:46:19 | vstinner | set | messages: + msg358663
2019-12-19 09:43:13 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg358662
2019-12-18 15:32:26 | vstinner | set | title: No efficient API to get UTF-8 string from unicode object. -> [C API] No efficient C API to get UTF-8 string from unicode object.
2019-12-18 15:29:27 | vstinner | set | nosy: + vstinner
2019-12-18 12:10:15 | methane | create |