Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C API] No efficient C API to get UTF-8 string from unicode object. #83268

Closed
methane opened this issue Dec 18, 2019 · 17 comments
Closed

[C API] No efficient C API to get UTF-8 string from unicode object. #83268

methane opened this issue Dec 18, 2019 · 17 comments
Labels
3.9 only security fixes topic-C-API type-feature A feature request or enhancement

Comments

@methane
Copy link
Member

methane commented Dec 18, 2019

BPO 39087
Nosy @vstinner, @methane, @serhiy-storchaka
PRs
  • bpo-39087: Add _PyUnicode_GetUTF8Buffer() #17659
  • bpo-39087: Make PyUnicode_AsUTF8AndSize faster. #17683
  • bpo-39087: Optimize PyUnicode_AsUTF8AndSize(). #18327
  • bpo-39087: Use _PyUnicode_GetUTF8Buffer() in fileobject.c #18984
  • Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" #18985
  • Files
  • bench-asutf8.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-03-14.06:59:47.614>
    created_at = <Date 2019-12-18.12:10:15.464>
    labels = ['expert-C-API', 'type-feature', '3.9']
    title = '[C API] No efficient C API to get UTF-8 string from unicode object.'
    updated_at = <Date 2020-03-14.11:00:07.223>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2020-03-14.11:00:07.223>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-03-14.06:59:47.614>
    closer = 'methane'
    components = ['C API']
    creation = <Date 2019-12-18.12:10:15.464>
    creator = 'methane'
    dependencies = []
    files = ['48879']
    hgrepos = []
    issue_num = 39087
    keywords = ['patch']
    message_count = 17.0
    messages = ['358623', '358662', '358663', '358664', '358665', '358666', '358670', '358673', '358778', '358860', '361284', '361285', '362766', '364141', '364142', '364146', '364151']
    nosy_count = 3.0
    nosy_names = ['vstinner', 'methane', 'serhiy.storchaka']
    pr_nums = ['17659', '17683', '18327', '18984', '18985']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue39087'
    versions = ['Python 3.9']

    @methane
    Copy link
    Member Author

    methane commented Dec 18, 2019

    Assume you are writing an extension module that reads string. For example, HTML escape or JSON encode.

    There are two courses:

    (a) Support three KINDs in the flexible unicode representation.
    (b) Get UTF-8 data from the unicode.

    (a) will be the fastest on CPython, but there are few drawbacks:

    • This is tightly coupled with CPython implementation. It will be slow on PyPy.
    • CPython may change the internal representation to UTF-8 in the future, like PyPy.
    • You can not easily reuse algorithms written in C that handle char*.

    So I believe (b) should be the preferred way.
    But CPython doesn't provide an efficient way to get UTF-8 from the unicode object.

    • PyUnicode_AsUTF8AndSize(): When the unicode contains non-ASCII character, it will create a UTF-8 cache. The cache will be remained for longer than required. And there is additional malloc + memcpy to create the cache.

    • PyUnicode_DecodeUTF8(): It creates bytes object even when the unicode object is ASCII-only or there is a UTF-8 cache already.

    For speed and efficiency, I propose a new API:

      /* Borrow the UTF-8 C string from the unicode.
       *
       * Store a pointer to the UTF-8 encoding of the unicode to *utf8* and its size to *size*.
       * The returned object is the owner of the *utf8*.  You need to Py_DECREF() it after
       * you finished to using the *utf8*.  The owner may be not the unicode.
       * Returns NULL when the error occurred while decoding the unicode.
       */
      PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *len);
    

    When the unicode object is ASCII or has UTF-8 cache, this API increment refcnt of the unicode and return it.
    Otherwise, this API calls _PyUnicode_AsUTF8String(unicode, NULL) and return it.

    @methane methane added 3.9 only security fixes topic-C-API type-feature A feature request or enhancement labels Dec 18, 2019
    @vstinner vstinner changed the title No efficient API to get UTF-8 string from unicode object. [C API] No efficient C API to get UTF-8 string from unicode object. Dec 18, 2019
    @vstinner vstinner changed the title No efficient API to get UTF-8 string from unicode object. [C API] No efficient C API to get UTF-8 string from unicode object. Dec 18, 2019
    @serhiy-storchaka
    Copy link
    Member

    Do you mean some concrete code? Several times I wished similar feature. To get a UTF-8 cache if it exists and encode to UTF-8 without creating a cache otherwise.

    The private _PyUnicode_UTF8() macro could help

    if ((s = _PyUnicode_UTF8(str))) {
        size = _PyUnicode_UTF8_LENGTH(str);
        tmpbytes = NULL;
    }
    else {
        tmpbytes = _PyUnicode_AsUTF8String(str, "replace");
        s = PyBytes_AS_STRING(tmpbytes);
        size = PyBytes_GET_SIZE(tmpbytes);
    }

    but it is not even available outside of unicodeobject.c.

    PyUnicode_BorrowUTF8() looks too complex for the public API. I am not sure that it will be easy to implement it in PyPy. It also does not cover all use cases -- sometimes you want to convert to UTF-8 but does not use any memory allocation at all (either use an existing buffer or raise an error if there is no cached UTF-8 or the string is not ASCII).

    @vstinner
    Copy link
    Member

    The returned object is the owner of the *utf8*. You need to Py_DECREF() it after
    you finished to using the *utf8*. The owner may be not the unicode.

    Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

    Py_buffer would be nice since it already has a pointer attribute (data) and a length attribute, and there is an API to "release" a Py_buffer. It can be marked as read-only, etc.

    @serhiy-storchaka
    Copy link
    Member

    Would it be possible to use a "container" object like a Py_buffer?

    Looks like a good idea.

    int PyUnicode_GetUTF8Buffer(Py_buffer *view, const char *errors)

    @methane
    Copy link
    Member Author

    methane commented Dec 19, 2019

    Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

    It looks nice idea! Py_buffer.obj is decref-ed when releasing the buffer.
    https://docs.python.org/3/c-api/buffer.html#c.PyBuffer_Release

    int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view)
    {
        if (!PyUnicode_Check(unicode)) {
            PyErr_BadArgument();
            return NULL;
        }
        if (PyUnicode_READY(unicode) == -1) {
            return NULL;
        }
    
        if (PyUnicode_UTF8(unicode) != NULL) {
            return PyBuffer_FillInfo(view, unicode,
                                     PyUnicode_UTF8(unicode),
                                     PyUnicode_UTF8_LENGTH(unicode),
                                     1, PyBUF_CONTIG_RO);
        }
        PyObject *bytes = _PyUnicode_AsUTF8String(unicode, NULL);
        if (bytes == NULL) {
            return NULL;
        }
        return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
    }

    @methane
    Copy link
    Member Author

    methane commented Dec 19, 2019

    s/return NULL/return -1/g

    @vstinner
    Copy link
    Member

    return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);

    Don't you need to DECREF bytes somehow, at least, in case of failure?

    @methane
    Copy link
    Member Author

    methane commented Dec 19, 2019

    Don't you need to DECREF bytes somehow, at least, in case of failure?

    Thanks. I will create a pull request with suggested changes.

    @serhiy-storchaka
    Copy link
    Member

    I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody have objections or better idea, it is better to know earlier.

    @methane
    Copy link
    Member Author

    methane commented Dec 25, 2019

    I like this idea, but I think that we should at least notify Python-Dev about all additions to the public C API. If somebody have objections or better idea, it is better to know earlier.

    I created a post about this issue in discuss.python.org.
    https://discuss.python.org/t/better-api-for-encoding-unicode-objects-with-utf-8/2909

    @methane
    Copy link
    Member Author

    methane commented Feb 3, 2020

    I am still not sure about we should add new API only for avoiding cache.

    • PyUnicode_AsUTF8String : When we need bytes or want to avoid cache.
    • PyUnicode_AsUTF8AndSize : When we need C string, and cache is acceptable.

    With PR-18327, PyUnicode_AsUTF8AndSize become 10+% faster than master branch, and same speed to PyUnicode_AsUTF8String.

    ## vs master

    $ ./python -m pyperf timeit --compare-to=../cpython/python --python-names master:patched -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
    master: ..................... 96.6 us +- 3.3 us
    patched: ..................... 83.3 us +- 0.3 us

    Mean +- std dev: [master] 96.6 us +- 3.3 us -> [patched] 83.3 us +- 0.3 us: 1.16x faster (-14%)

    ## vs AsUTF8String

    $ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
    .....................
    Mean +- std dev: 83.2 us +- 0.2 us
    
    $ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "こんにちは")'
    .....................
    Mean +- std dev: 81.9 us +- 2.1 us

    ## vs AsUTF8String (ASCII)

    If we can not accept cache, PyUnicode_AsUTF8String is slower than PyUnicode_AsUTF8 when the unicode is ASCII string. PyUnicode_GetUTF8Buffer helps only this case.

    $ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "world")'
    .....................
    Mean +- std dev: 37.5 us +- 1.7 us
    
    $ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "world")'
    .....................
    Mean +- std dev: 46.4 us +- 1.6 us

    @methane
    Copy link
    Member Author

    methane commented Feb 3, 2020

    Attached patch is the benchmark function I used in previous post.

    @methane
    Copy link
    Member Author

    methane commented Feb 27, 2020

    New changeset 02a4d57 by Inada Naoki in branch 'master':
    bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327)
    02a4d57

    @methane
    Copy link
    Member Author

    methane commented Mar 14, 2020

    New changeset c7ad974 by Inada Naoki in branch 'master':
    bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)
    c7ad974

    @methane
    Copy link
    Member Author

    methane commented Mar 14, 2020

    I'm sorry about merging PR 18327, but I can not find enough usage example of the _PyUnicode_GetUTF8Buffer.

    PyUnicode_AsUTF8AndSize is optimized, and utf8_cache is not so bad in most case. So _PyUnicode_GetUTF8Buffer seems not worth enough.

    I will revert PR 18327.

    @methane
    Copy link
    Member Author

    methane commented Mar 14, 2020

    New changeset 3a8c562 by Inada Naoki in branch 'master':
    Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" (GH-18985)
    3a8c562

    @serhiy-storchaka
    Copy link
    Member

    I though there are at least 3-4 use cases in the core and stdlib.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes topic-C-API type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants