Issue 35295: Please clarify whether PyUnicode_AsUTF8AndSize() or PyUnicode_AsUTF8String() is preferred

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/79476

classification

Title:	Please clarify whether PyUnicode_AsUTF8AndSize() or PyUnicode_AsUTF8String() is preferred
Type:	performance	Stage:	resolved
Components:	Documentation, Interpreter Core, Unicode	Versions:	Python 3.10

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	Marcin Kowalczyk, docs@python, ezio.melotti, methane, miss-islington, vstinner
Priority:	normal	Keywords:	patch

Created on 2018-11-22 12:12 by Marcin Kowalczyk, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 24453	merged	methane, 2021-02-05 03:55
PR 24454	merged	miss-islington, 2021-02-05 04:22

Messages (5)
msg330249 - (view)	Author: Marcin Kowalczyk (Marcin Kowalczyk)	Date: 2018-11-22 12:12
The documentation is silent on whether PyUnicode_AsUTF8AndSize() or PyUnicode_AsUTF8String() is preferred.

This assumes that both are acceptable for the given caller, i.e. the caller wants to access just the sequence of UTF-8 code units (e.g. for calling a C++ function which takes std::string_view or std::string as a parameter), and the caller either will copy the UTF-8 code units immediately or is willing to own a temporary object to ensure the lifetime of the UTF-8 code units.

File comments in unicodeobject.h about PyUnicode_AsUTF8AndSize() have a warning:

   * This API is for interpreter INTERNAL USE ONLY and will likely
   * be removed or changed in the future.
   * If you need to access the Unicode object as UTF-8 bytes string,
   * please use PyUnicode_AsUTF8String() instead.

The discrepancy between these comments and the documentation should be fixed. Either the documentation is correct and the comment is outdated, or the comment is correct and the documentation is lacking guidance.

It is not even clear which function is better technically:

- PyUnicode_AsUTF8String() always allocates the string. PyUnicode_AsUTF8AndSize() does not allocate the string if the unicode object is ASCII-only (this is common) or if PyUnicode_AsUTF8AndSize() was already called before.

- If conversion must be performed, then PyUnicode_AsUTF8String() makes a single allocation, while PyUnicode_AsUTF8AndSize() first calls PyUnicode_AsUTF8String() and then copies the string.

- If the string is converted multiple times, then PyUnicode_AsUTF8AndSize() caches the result, which is faster. If the string is converted only once, the cached result still persists as long as the string persists, which wastes memory.

I see the following possible resolutions:

1a. Declare both functions equally acceptable. Remove comments claiming that PyUnicode_AsUTF8AndSize() should be avoided.

1b. 1a, and change the implementation of PyUnicode_AsUTF8AndSize() to avoid allocating the string twice if it needs to be materialized, so that PyUnicode_AsUTF8AndSize() is never significantly slower than PyUnicode_AsUTF8String().

2a. Declare PyUnicode_AsUTF8String() preferred. Indicate this in the documentation.

2b. 2a, and provide a public interface to check and access UTF-8 code units without allocating a new string in case this is possible (I think PyUnicode_READY() + PyUnicode_IS_ASCII() + PyUnicode_DATA() + PyUnicode_GET_LENGTH() would work, but they are not documented; or possibly also check if the string has a cached UTF-8 representation without populating that cached representation), so that a combination of the check with PyUnicode_AsUTF8String() is rarely or never significantly slower than PyUnicode_AsUTF8AndSize().
msg386508 - (view)	Author: Inada Naoki (methane) *	Date: 2021-02-05 03:53
> 1a. Declare both functions equally acceptable. Remove comments claiming that PyUnicode_AsUTF8AndSize() should be avoided.
>
> 1b. 1a, and change the implementation of PyUnicode_AsUTF8AndSize() to avoid allocating the string twice if it needs to be materialized, so that PyUnicode_AsUTF8AndSize() is never significantly slower than PyUnicode_AsUTF8String().

I think 1b is the best approach.

PyUnicode_AsUTF8AndSize() is optimized already. See GH-18327.
It has also become part of the limited API. See bpo-41784.

So we should just remove the outdated comments.
msg386509 - (view)	Author: Inada Naoki (methane) *	Date: 2021-02-05 04:21
New changeset d938816acf71a74f1bd13fdf0534b3d9ea962e44 by Inada Naoki in branch 'master': bpo-35295: Remove outdated comment. (GH-24453) https://github.com/python/cpython/commit/d938816acf71a74f1bd13fdf0534b3d9ea962e44
msg386511 - (view)	Author: miss-islington (miss-islington)	Date: 2021-02-05 04:44
New changeset b0b01811bb28d3d6c70846e47fa2f6ba03ed03f1 by Miss Islington (bot) in branch '3.9': bpo-35295: Remove outdated comment. (GH-24453) https://github.com/python/cpython/commit/b0b01811bb28d3d6c70846e47fa2f6ba03ed03f1
msg386527 - (view)	Author: Marcin Kowalczyk (Marcin Kowalczyk)	Date: 2021-02-05 12:33
Thank you! This means that I can continue to use PyUnicode_AsUTF8AndSize() without worries: https://github.com/google/riegeli/commit/17ab36bfdd6cc55f37cfbb729bd43c9cbff4cd22

History
Date	User	Action	Args
2022-04-11 14:59:08	admin	set	github: 79476
2021-02-05 12:33:16	Marcin Kowalczyk	set	messages: + msg386527
2021-02-05 04:48:09	methane	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2021-02-05 04:44:24	miss-islington	set	messages: + msg386511
2021-02-05 04:22:00	miss-islington	set	nosy: + miss-islington pull_requests: + pull_request23254
2021-02-05 04:21:39	methane	set	messages: + msg386509
2021-02-05 03:55:35	methane	set	keywords: + patch stage: patch review pull_requests: + pull_request23253
2021-02-05 03:54:11	methane	set	versions: + Python 3.10, - Python 3.8
2021-02-05 03:53:56	methane	set	messages: + msg386508
2018-11-24 01:36:03	methane	set	nosy: + methane
2018-11-22 12:12:44	Marcin Kowalczyk	create