Message 358623 - Python tracker


Author	methane
Recipients	methane
Date	2019-12-18.12:10:14
Message-id	<1576671015.48.0.740860223179.issue39087@roundup.psfhosted.org>

Content

Assume you are writing an extension module that reads strings, for example an HTML escaper or a JSON encoder.

There are two approaches:

(a) Support all three KINDs (1-byte, 2-byte, and 4-byte) of the flexible unicode representation.
(b) Get UTF-8 data from the unicode.
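
To make the cost of (a) concrete, here is a minimal sketch of what "supporting three KINDs" tends to look like: the same scan written once per code-unit width. It is plain C and does not touch CPython. The `sketch_kind` enum and `count_specials()` are hypothetical names that only stand in for the kind and data pointer a real extension would read with `PyUnicode_KIND()` and `PyUnicode_DATA()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND
 * and PyUnicode_4BYTE_KIND; not part of any real API. */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* True for the characters an HTML escaper has to replace. */
static int
is_html_special(uint32_t ch)
{
    return ch == '&' || ch == '<' || ch == '>' || ch == '"' || ch == '\'';
}

/* Count the characters that need escaping.  `data` holds `length` code
 * units whose width is given by `kind`, so the loop is written three times. */
static size_t
count_specials(enum sketch_kind kind, const void *data, size_t length)
{
    size_t count = 0;
    assert(data != NULL || length == 0);

    switch (kind) {
    case SKETCH_1BYTE: {
        const uint8_t *p = data;
        for (size_t i = 0; i < length; i++) {
            count += is_html_special(p[i]);
        }
        break;
    }
    case SKETCH_2BYTE: {
        const uint16_t *p = data;
        for (size_t i = 0; i < length; i++) {
            count += is_html_special(p[i]);
        }
        break;
    }
    case SKETCH_4BYTE: {
        const uint32_t *p = data;
        for (size_t i = 0; i < length; i++) {
            count += is_html_special(p[i]);
        }
        break;
    }
    }
    return count;
}
```

Every algorithm in the module needs this three-way dispatch (or macros that generate it), which is where the coupling to CPython's layout comes from.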

(a) will be the fastest on CPython, but there are a few drawbacks:

 * This is tightly coupled with the CPython implementation.  It will be slow on PyPy.
 * CPython may change its internal representation to UTF-8 in the future, like PyPy.
 * You cannot easily reuse algorithms written in C that handle `char*`.
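
As an illustration of the last point, a routine that works on `char*` needs no knowledge of KINDs once it is handed UTF-8. The sketch below is a hypothetical HTML escaper (`html_escape_utf8()` is an illustrative name, not an existing function). It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a byte-wise scan for ASCII specials never splits a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the entity for an HTML-special byte, or NULL for any other byte. */
static const char *
html_entity(unsigned char c)
{
    switch (c) {
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '"':  return "&quot;";
    case '\'': return "&#x27;";
    default:   return NULL;
    }
}

/* Escape `len` bytes of UTF-8 from `src` into `dst`, which has room for
 * `cap` bytes including the terminating NUL.  Returns the number of bytes
 * written (excluding the NUL), or (size_t)-1 if `dst` is too small. */
static size_t
html_escape_utf8(const char *src, size_t len, char *dst, size_t cap)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);

    if (cap == 0) {
        return (size_t)-1;
    }
    for (size_t i = 0; i < len; i++) {
        const char *ent = html_entity((unsigned char)src[i]);
        size_t n = ent ? strlen(ent) : 1;
        if (out + n >= cap) {          /* keep one byte for the NUL */
            return (size_t)-1;
        }
        if (ent) {
            memcpy(dst + out, ent, n);
        }
        else {
            dst[out] = src[i];         /* non-special bytes pass through */
        }
        out += n;
    }
    dst[out] = '\0';
    return out;
}
```

The same function serves ASCII, Latin-1, BMP, and astral text without change, which is the reuse that (a) makes hard.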

So I believe (b) should be the preferred way.
But CPython doesn't provide an efficient way to get UTF-8 from the unicode object.

 * PyUnicode_AsUTF8AndSize(): When the unicode object contains a non-ASCII character, it creates a UTF-8 cache.  The cache is kept longer than required, because it lives as long as the unicode object itself.  Creating the cache also costs an additional malloc + memcpy.

 * PyUnicode_AsUTF8String(): It creates a bytes object even when the unicode object is ASCII-only or already has a UTF-8 cache.

For speed and efficiency, I propose a new API:

```
  /* Borrow the UTF-8 C string from the unicode.
   *
   * Store a pointer to the UTF-8 encoding of the unicode to *utf8* and its size to *len*.
   * The returned object is the owner of the *utf8*.  You need to Py_DECREF() it after
   * you have finished using the *utf8*.  The owner may not be the unicode object itself.
   * Returns NULL when an error occurs while encoding the unicode.
   */
  PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *len);
```

When the unicode object is ASCII or has a UTF-8 cache, this API increments the refcount of the unicode object and returns it.
Otherwise, this API calls `_PyUnicode_AsUTF8String(unicode, NULL)` and returns the resulting bytes object.
