classification
Title: [C API] No efficient C API to get UTF-8 string from unicode object.
Type: enhancement Stage: resolved
Components: C API Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: methane, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2019-12-18 12:10 by methane, last changed 2020-03-14 11:00 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
bench-asutf8.patch methane, 2020-02-03 12:14
Pull Requests
URL Status Linked Edit
PR 17659 merged methane, 2019-12-19 12:36
PR 17683 closed methane, 2019-12-23 11:56
PR 18327 merged methane, 2020-02-03 10:57
PR 18984 closed methane, 2020-03-14 03:45
PR 18985 merged methane, 2020-03-14 04:27
Messages (17)
msg358623 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-18 12:10
Assume you are writing an extension module that reads a string, for example for HTML escaping or JSON encoding.

There are two approaches:

(a) Support three KINDs in the flexible unicode representation.
(b) Get UTF-8 data from the unicode.

(a) will be the fastest on CPython, but it has a few drawbacks:

 * It is tightly coupled to the CPython implementation.  It will be slow on PyPy.
 * CPython may change its internal representation to UTF-8 in the future, like PyPy.
 * You cannot easily reuse algorithms written in C that handle `char*`.

So I believe (b) should be the preferred way.
But CPython doesn't provide an efficient way to get UTF-8 from the unicode object.

 * PyUnicode_AsUTF8AndSize(): When the unicode contains a non-ASCII character, it creates a UTF-8 cache.  The cache remains alive for longer than required, and there is an additional malloc + memcpy to create it.

 * PyUnicode_AsUTF8String(): It creates a bytes object even when the unicode object is ASCII-only or a UTF-8 cache already exists.

For speed and efficiency, I propose a new API:

```
  /* Borrow the UTF-8 C string from the unicode object.
   *
   * Stores a pointer to the UTF-8 encoding of the unicode in *utf8* and its
   * size in *size*.
   * The returned object is the owner of *utf8*.  You need to Py_DECREF() it
   * after you have finished using *utf8*.  The owner may not be the unicode
   * object itself.
   * Returns NULL if an error occurred while encoding the unicode.
   */
  PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *size);
```

When the unicode object is ASCII or already has a UTF-8 cache, this API increments the refcount of the unicode object and returns it.
Otherwise, it calls `_PyUnicode_AsUTF8String(unicode, NULL)` and returns the resulting bytes object.
msg358662 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-19 09:43
Do you have some concrete code in mind? Several times I have wished for a similar feature: get the UTF-8 cache if it exists, and encode to UTF-8 without creating a cache otherwise.

The private _PyUnicode_UTF8() macro could help:

```
if ((s = _PyUnicode_UTF8(str))) {
    size = _PyUnicode_UTF8_LENGTH(str);
    tmpbytes = NULL;
}
else {
    tmpbytes = _PyUnicode_AsUTF8String(str, "replace");
    s = PyBytes_AS_STRING(tmpbytes);
    size = PyBytes_GET_SIZE(tmpbytes);
}
```

but it is not even available outside of unicodeobject.c.

PyUnicode_BorrowUTF8() looks too complex for a public API. I am not sure that it will be easy to implement in PyPy. It also does not cover all use cases -- sometimes you want to convert to UTF-8 without using any memory allocation at all (either use an existing buffer, or raise an error if there is no cached UTF-8 and the string is not ASCII).
msg358663 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-12-19 09:46
> The returned object is the owner of *utf8*.  You need to Py_DECREF() it
> after you have finished using *utf8*.  The owner may not be the unicode object itself.

Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

Py_buffer would be nice since it already has a pointer attribute (data) and a length attribute, and there is an API to "release" a Py_buffer. It can be marked as read-only, etc.
msg358664 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-19 10:01
> Would it be possible to use a "container" object like a Py_buffer?

Looks like a good idea.

int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view, const char *errors)
msg358665 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-19 10:04
> Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

That looks like a nice idea!  Py_buffer.obj is decref-ed when the buffer is released.
https://docs.python.org/3/c-api/buffer.html#c.PyBuffer_Release


```
int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view)
{
    if (!PyUnicode_Check(unicode)) {
        PyErr_BadArgument();
        return NULL;
    }
    if (PyUnicode_READY(unicode) == -1) {
        return NULL;
    }

    if (PyUnicode_UTF8(unicode) != NULL) {
        return PyBuffer_FillInfo(view, unicode,
                                 PyUnicode_UTF8(unicode),
                                 PyUnicode_UTF8_LENGTH(unicode),
                                 1, PyBUF_CONTIG_RO);
    }
    PyObject *bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL) {
        return NULL;
    }
    return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
}
```
msg358666 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-19 10:05
s/return NULL/return -1/g
msg358670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-12-19 10:37
return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);

Don't you need to DECREF bytes somehow, at least, in case of failure?
msg358673 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-19 11:20
> Don't you need to DECREF bytes somehow, at least, in case of failure?

Thanks.  I will create a pull request with suggested changes.
msg358778 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-21 18:57
I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier.
msg358860 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-12-25 07:41
> I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier.

I created a post about this issue on discuss.python.org:
https://discuss.python.org/t/better-api-for-encoding-unicode-objects-with-utf-8/2909
msg361284 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-02-03 12:10
I am still not sure whether we should add a new API only to avoid the cache.

* PyUnicode_AsUTF8String: when we need a bytes object or want to avoid the cache.
* PyUnicode_AsUTF8AndSize: when we need a C string and the cache is acceptable.


With PR 18327, PyUnicode_AsUTF8AndSize becomes 10+% faster than the master branch, and the same speed as PyUnicode_AsUTF8String.


## vs master

$ ./python -m pyperf timeit --compare-to=../cpython/python --python-names master:patched -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
master: ..................... 96.6 us +- 3.3 us
patched: ..................... 83.3 us +- 0.3 us

Mean +- std dev: [master] 96.6 us +- 3.3 us -> [patched] 83.3 us +- 0.3 us: 1.16x faster (-14%)


## vs AsUTF8String

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 83.2 us +- 0.2 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 81.9 us +- 2.1 us


## vs AsUTF8String (ASCII)

If we cannot accept the cache, PyUnicode_AsUTF8String is slower than PyUnicode_AsUTF8 when the unicode is an ASCII string.  PyUnicode_GetUTF8Buffer helps only in this case.

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 37.5 us +- 1.7 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 46.4 us +- 1.6 us
msg361285 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-02-03 12:14
The attached patch contains the benchmark functions I used in the previous post.
msg362766 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-02-27 04:49
New changeset 02a4d57263a9846de35b0db12763ff9e7326f62c by Inada Naoki in branch 'master':
bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327)
https://github.com/python/cpython/commit/02a4d57263a9846de35b0db12763ff9e7326f62c
msg364141 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-03-14 03:43
New changeset c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b by Inada Naoki in branch 'master':
bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)
https://github.com/python/cpython/commit/c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b
msg364142 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-03-14 04:24
I'm sorry about merging PR 17659, but I cannot find enough usage examples of _PyUnicode_GetUTF8Buffer.

PyUnicode_AsUTF8AndSize is optimized now, and the utf8 cache is not so bad in most cases.  So _PyUnicode_GetUTF8Buffer does not seem worth it.

I will revert PR 17659.
msg364146 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-03-14 06:59
New changeset 3a8c56295d6272ad2177d2de8af4c3f824f3ef92 by Inada Naoki in branch 'master':
Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" (GH-18985)
https://github.com/python/cpython/commit/3a8c56295d6272ad2177d2de8af4c3f824f3ef92
msg364151 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-03-14 11:00
I thought there were at least 3-4 use cases in the core and stdlib.
History
Date                 User              Action  Args
2020-03-14 11:00:07  serhiy.storchaka  set     messages: + msg364151
2020-03-14 06:59:47  methane           set     status: open -> closed; resolution: fixed; stage: patch review -> resolved
2020-03-14 06:59:31  methane           set     messages: + msg364146
2020-03-14 04:27:50  methane           set     pull_requests: + pull_request18333
2020-03-14 04:24:17  methane           set     messages: + msg364142
2020-03-14 03:45:35  methane           set     pull_requests: + pull_request18332
2020-03-14 03:43:26  methane           set     messages: + msg364141
2020-02-27 04:49:03  methane           set     messages: + msg362766
2020-02-03 12:14:13  methane           set     files: + bench-asutf8.patch; messages: + msg361285
2020-02-03 12:10:03  methane           set     messages: + msg361284
2020-02-03 10:57:31  methane           set     pull_requests: + pull_request17701
2019-12-25 07:41:22  methane           set     messages: + msg358860
2019-12-23 11:56:16  methane           set     pull_requests: + pull_request17140
2019-12-21 18:57:08  serhiy.storchaka  set     messages: + msg358778
2019-12-19 12:36:22  methane           set     keywords: + patch; stage: patch review; pull_requests: + pull_request17127
2019-12-19 11:20:38  methane           set     messages: + msg358673
2019-12-19 10:37:20  vstinner          set     messages: + msg358670
2019-12-19 10:05:12  methane           set     messages: + msg358666
2019-12-19 10:04:29  methane           set     nosy: - skrah; messages: + msg358665
2019-12-19 10:01:31  serhiy.storchaka  set     nosy: + skrah; messages: + msg358664
2019-12-19 09:46:19  vstinner          set     messages: + msg358663
2019-12-19 09:43:13  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg358662
2019-12-18 15:32:26  vstinner          set     title: No efficient API to get UTF-8 string from unicode object. -> [C API] No efficient C API to get UTF-8 string from unicode object.
2019-12-18 15:29:27  vstinner          set     nosy: + vstinner
2019-12-18 12:10:15  methane           create