Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize PyUnicode_FromString for short ASCII strings #81529

Closed
methane opened this issue Jun 20, 2019 · 15 comments
Closed

Optimize PyUnicode_FromString for short ASCII strings #81529

methane opened this issue Jun 20, 2019 · 15 comments
Labels
3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@methane
Copy link
Member

methane commented Jun 20, 2019

BPO 37348
Nosy @vstinner, @methane, @serhiy-storchaka
PRs
  • bpo-37348 : add _PyUnicode_FROM_ASCII #14264
  • bpo-37348: optimize PyUnicode_FromString #14273
  • bpo-37348: optimize decoding ASCII string #14283
  • bpo-37348: optimize decoding ASCII string #14291
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-06-24.03:30:43.577>
    created_at = <Date 2019-06-20.11:40:57.663>
    labels = ['interpreter-core', '3.9', 'performance']
    title = 'Optimize PyUnicode_FromString for short ASCII strings'
    updated_at = <Date 2019-06-24.07:08:11.005>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2019-06-24.07:08:11.005>
    actor = 'methane'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-06-24.03:30:43.577>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2019-06-20.11:40:57.663>
    creator = 'methane'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 37348
    keywords = ['patch']
    message_count = 15.0
    messages = ['346115', '346116', '346117', '346118', '346121', '346127', '346132', '346134', '346202', '346203', '346237', '346247', '346351', '346354', '346356']
    nosy_count = 3.0
    nosy_names = ['vstinner', 'methane', 'serhiy.storchaka']
    pr_nums = ['14264', '14273', '14283', '14291']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue37348'
    versions = ['Python 3.9']

    @methane
    Copy link
    Member Author

    methane commented Jun 20, 2019

    _PyUnicode_FromASCII(s, len) is faster than PyUnicode_FromString(s) because PyUnicode_FromString() uses temporary _PyUnicodeWriter to support UTF-8.

    But _PyUnicode_FromASCII("hello", strlen("hello"))is not as easy asPyUnicode_FromString("hello")`.

    _PyUnicode_FROM_ASCII() is simple macro which wraps _PyUnicode_FromASCII which calls strlen() automatically:

    /* Convenient wrapper for _PyUnicode_FromASCII */
    #define _PyUnicode_FROM_ASCII(s) _PyUnicode_FromASCII((s), strlen(s))
    

    I believe recent compilers optimize away calls of strlen().

    @methane methane added 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jun 20, 2019
    @vstinner
    Copy link
    Member

    #define _PyUnicode_FROM_ASCII(s) _PyUnicode_FromASCII((s), strlen(s))

    LGTM.

    @methane
    Copy link
    Member Author

    methane commented Jun 20, 2019

    I confirmed at least one measurable speedup:

    ./python -m pyperf timeit -s 'd={}' -- 'repr(d)'

    FromString: Mean +- std dev: 157 ns +- 3 ns
    FROM_ASCII: Mean +- std dev: 132 ns +- 3 ns

    >>> (157-132)/157
    0.1592356687898089

    @methane
    Copy link
    Member Author

    methane commented Jun 20, 2019

    Should we make these APIs public?

    • rename _PyUnicode_FromASCII to PyUnicode_FromASCIIAndSize.
    • add PyUnicode_FromASCII as static inline function (without ABI).
    • add #define _PyUnicode_FromASCII PyUnicode_FromASCIIAndSize for source code compatibility.

    @serhiy-storchaka
    Copy link
    Member

    Most of changes are in not performance sensitive code. I do not think there is a benefit of using new macro there.

    If PyUnicode_FromString() is slow I prefer to optimize it instead of adding yet one esoteric private function for internal needs.

    In case of dict.__repr__() we can get even more gain by using _Py_IDENTIFIER or more general API proposed by Victor.

    @methane
    Copy link
    Member Author

    methane commented Jun 20, 2019

    Most of changes are in not performance sensitive code. I do not think there is a benefit of using new macro there.

    Because I used just sed.

    If PyUnicode_FromString() is slow I prefer to optimize it instead of adding yet one esoteric private function for internal needs.

    OK. There are some code like PyUnicode_FromString(name). Optimizing PyUnicode_FromString will be more effective. I created PR 14273.

    In case of dict.__repr__() we can get even more gain by using _Py_IDENTIFIER or more general API proposed by Victor.

    Of course, I used it just for micro benchmarking. Optimizing it is not a goal. In case of PR 14273:

    $ ./python -m pyperf timeit -s 'd={}' -- 'repr(d)'
    .....................
    Mean +- std dev: 138 ns +- 2 ns

    @methane methane changed the title add _PyUnicode_FROM_ASCII macro Optimize PyUnicode_GetString for short ASCII strings Jun 20, 2019
    @methane
    Copy link
    Member Author

    methane commented Jun 20, 2019

    Oh, wait. Why we used _PyUnicodeWriter here?
    Decoding UTF-8 must not require it. 2-pass is enough.

    @vstinner
    Copy link
    Member

    _PyUnicode_FromASCII(s, len) is faster than PyUnicode_FromString(s) because PyUnicode_FromString() uses temporary _PyUnicodeWriter to support UTF-8.

    I don't understand how _PyUnicodeWriter could be slow. It does not overallocate by default. It's just wrapper to implement efficient memory management.

    Oh, wait. Why we used _PyUnicodeWriter here?

    To optimize decoding errors: the error handler can use replacement string longer than 1 character. Overallocation is used in this case.

    @methane
    Copy link
    Member Author

    methane commented Jun 21, 2019

    I don't understand how _PyUnicodeWriter could be slow. It does not overallocate by default. It's just wrapper to implement efficient memory management.

    I misunderstood _PyUnicodeWriter. I thought it caused one more allocation, but it doesn't.

    But _PyUnicodeWriter is still slow, because gcc and clang are not smart enough to optimize _PyUnicodeWriter_Init() & _PyUnicodeWriter_Prepare().

    See this example:

    #define PY_SSIZE_T_CLEAN
    #include <Python.h>
    
    #define S(s) (s),strlen(s)
    
    int
    main(int argc, char *argv[])
    {
        Py_Initialize();
    
        for (int i=0; i<100000000; i++) {
            //PyObject *s = PyUnicode_FromString("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
            PyObject *s = _PyUnicode_FromASCII(S("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
            Py_DECREF(s);
        }
        return 0;
    }
    

    PyUnicode_FromString() takes about 4 sec on my machine. _PyUnicode_FromASCII() is about 2 sec.
    By skipping _PyUnicodeWriter for ASCII string (GH-14283), PyUnicode_FromString() takes about 3 sec.

    $ time ./x  # PyUnicode_FromString
    
    real    0m4.085s
    user    0m4.081s
    sys     0m0.004s
    
    $ time ./y  # PyUnicode_FromString (skip _PyUnicode_Writer, python/cpython#58491)
    
    real    0m2.988s
    user    0m2.988s
    sys     0m0.000s
    
    $ time ./z  # _PyUnicode_FromASCII
    $ time ./z
    
    real    0m1.975s
    user    0m1.975s
    sys     0m0.000s
    

    @methane
    Copy link
    Member Author

    methane commented Jun 21, 2019

    Another micro benchmark:

    $ ./python-master -m pyperf timeit -o m1.json 'b=b"foobar"' -- 'b.decode()'
    .....................
    Mean +- std dev: 93.1 ns +- 2.4 ns
    
    $ ./python -m pyperf timeit -o m2.json 'b=b"foobar"' -- 'b.decode()'
    .....................
    Mean +- std dev: 83.1 ns +- 2.6 ns
    
    $ ./python -m pyperf compare_to m1.json m2.json
    Mean +- std dev: [m1] 93.1 ns +- 2.4 ns -> [m2] 83.1 ns +- 2.6 ns: 1.12x faster (-11%)
    

    @methane
    Copy link
    Member Author

    methane commented Jun 21, 2019

    PR 14291 seems simpler than PR 14283. But PR 14283 is faster because _PyUnicodeWriter is a learge struct.

    master: 3.7sec
    PR 14283: 2.9sec
    PR 14291: 3.45sec

    compiler: gcc (Ubuntu 8.3.0-6ubuntu1) 8.3.0
    without LTO, without PGO

    @vstinner
    Copy link
    Member

    I'm confused by the issue title: PyUnicode_GetString() doesn't exist, it's PyUnicode_FromString() :-) I changed the title.

    @vstinner vstinner changed the title Optimize PyUnicode_GetString for short ASCII strings Optimize PyUnicode_FromString for short ASCII strings Jun 21, 2019
    @methane
    Copy link
    Member Author

    methane commented Jun 24, 2019

    New changeset 770847a by Inada Naoki in branch 'master':
    bpo-37348: optimize decoding ASCII string (GH-14283)
    770847a

    @methane methane closed this as completed Jun 24, 2019
    @serhiy-storchaka
    Copy link
    Member

    Could you please measure the performance for long strings (1000, 10000 and 100000 characters): a long ASCII string and a long ASCII string ending with a non-ASCII character?

    @methane
    Copy link
    Member Author

    methane commented Jun 24, 2019

    This optimization is only for short strings. There is no significant difference for long and non-ASCII strings.

    # 1000 ASCII
    $ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"f"*1000' -- 'b.decode()'
    python-master: ..................... 196 ns +- 1 ns
    python: ..................... 185 ns +- 1 ns
    
    Mean +- std dev: [python-master] 196 ns +- 1 ns -> [python] 185 ns +- 1 ns: 1.06x faster (-6%)
    
    # 10000 ASCII
    $ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"f"*10000' -- 'b.decode()'
    python-master: ..................... 671 ns +- 4 ns
    python: ..................... 662 ns +- 1 ns
    
    Mean +- std dev: [python-master] 671 ns +- 4 ns -> [python] 662 ns +- 1 ns: 1.01x faster (-1%)
    
    # 100000 ASCII
    $ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"f"*100000' -- 'b.decode()'
    python-master: ..................... 5.86 us +- 0.03 us
    python: ..................... 5.85 us +- 0.02 us
    
    Mean +- std dev: [python-master] 5.86 us +- 0.03 us -> [python] 5.85 us +- 0.02 us: 1.00x faster (-0%)
    
    # 1 non-ASCII
    $ ./python -m pyperf timeit --compare-to=./python-master -s 'b="あ".encode()' -- 'b.decode()'
    python-master: ..................... 138 ns +- 2 ns
    python: ..................... 136 ns +- 2 ns
    
    Mean +- std dev: [python-master] 138 ns +- 2 ns -> [python] 136 ns +- 2 ns: 1.02x faster (-2%)
    
    # 1000 ASCII + 1 non-ASCII
    $ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"x"*1000 + "あ".encode()' -- 'b.decode()'
    python-master: ..................... 361 ns +- 9 ns
    python: ..................... 360 ns +- 5 ns
    
    Mean +- std dev: [python-master] 361 ns +- 9 ns -> [python] 360 ns +- 5 ns: 1.00x faster (-0%)
    Not significant!
    
    # 10000 ASCII + 1 non-ASCII
    $ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"x"*10000 + "あ".encode()' -- 'b.decode()'
    python-master: ..................... 2.83 us +- 0.02 us
    python: ..................... 2.83 us +- 0.03 us
    
    Mean +- std dev: [python-master] 2.83 us +- 0.02 us -> [python] 2.83 us +- 0.03 us: 1.00x slower (+0%)
    Not significant!
    

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants