Optimize UTF-8 decoder with error handlers #69488

Closed
vstinner opened this issue Oct 2, 2015 · 6 comments

@vstinner
Member

vstinner commented Oct 2, 2015

BPO 25301
Nosy @vstinner, @ezio-melotti
Files
  • utf8_decoder.patch
  • bench.py
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2015-10-05.11:44:37.727>
    created_at = <Date 2015-10-02.14:44:42.656>
    labels = ['expert-unicode', 'performance']
    title = 'Optimize UTF-8 decoder with error handlers'
    updated_at = <Date 2015-10-05.11:49:36.115>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2015-10-05.11:49:36.115>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-10-05.11:44:37.727>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2015-10-02.14:44:42.656>
    creator = 'vstinner'
    dependencies = []
    files = ['40663', '40671']
    hgrepos = []
    issue_num = 25301
    keywords = ['patch']
    message_count = 6.0
    messages = ['252117', '252181', '252264', '252319', '252320', '252321']
    nosy_count = 3.0
    nosy_names = ['vstinner', 'ezio.melotti', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue25301'
    versions = ['Python 3.6']

    @vstinner
    Member Author

    vstinner commented Oct 2, 2015

    The issue bpo-24870 optimized the ASCII decoder with error handlers:

    New changeset 3c430259873e by Victor Stinner in branch 'default':
    Issue bpo-24870: Optimize the ASCII decoder for error handlers: surrogateescape,
    https://hg.python.org/cpython/rev/3c430259873e

    We should also optimize the UTF-8 decoder with error handlers.

    I will work on a patch in the next few days.
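For readers unfamiliar with the error handlers discussed in this issue, the snippet below shows what each one produces when the UTF-8 decoder meets an undecodable byte. It is only an illustration of the public `bytes.decode()` behaviour; the sample input and the `results` name are made up for this example.

```python
# 0xFF can never appear in valid UTF-8, so it always triggers the error handler.
data = b"abc\xffdef"

handlers = ("ignore", "replace", "surrogateescape", "backslashreplace")
results = {handler: data.decode("utf-8", handler) for handler in handlers}

for handler, text in results.items():
    # ascii() keeps the output printable whatever the terminal encoding is.
    print(f"{handler:>16}: {ascii(text)}")

# surrogateescape is reversible: encoding with the same handler
# restores the original bytes.
assert results["surrogateescape"].encode("utf-8", "surrogateescape") == data
```

The "strict" handler (the default) raises `UnicodeDecodeError` on the same input instead of returning a string.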

    vstinner added the topic-unicode and performance labels on Oct 2, 2015
    @vstinner
    Member Author

    vstinner commented Oct 3, 2015

    Here is a first patch. It is written to keep the best performance for valid UTF-8 encoded strings, while speeding up the decoding of strings that contain a few undecodable bytes.
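The general idea of keeping a fast path for valid input and treating undecodable bytes as an exception can be pictured with a pure-Python sketch. This is not the C code from utf8_decoder.patch, and it is not tuned for speed (the slicing copies data); the function name `decode_utf8_inline` is hypothetical and only three handlers are covered.

```python
def decode_utf8_inline(data: bytes, errors: str = "replace") -> str:
    """Decode valid runs in bulk and handle each undecodable span inline."""
    parts = []
    pos = 0
    while pos < len(data):
        try:
            # Fast path: the rest of the input is valid UTF-8.
            parts.append(data[pos:].decode("utf-8", "strict"))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice data[pos:].
            start = pos + exc.start
            end = pos + exc.end
            # Everything before the error is known to be valid.
            parts.append(data[pos:start].decode("utf-8", "strict"))
            bad = data[start:end]
            if errors == "ignore":
                pass  # drop the undecodable bytes
            elif errors == "replace":
                parts.append("\ufffd")  # one replacement character per error
            elif errors == "surrogateescape":
                # Map each undecodable byte to a lone surrogate U+DC80..U+DCFF.
                parts.append("".join(chr(0xDC00 + byte) for byte in bad))
            else:
                raise  # any other handler is out of scope for this sketch
            pos = end
    return "".join(parts)
```

For the three handlers it supports, the sketch is meant to return the same string as `data.decode("utf-8", errors)`, for example on `b"abc\xffdef"`.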

    @vstinner
    Member Author

    vstinner commented Oct 4, 2015

    Results of the microbenchmark on the UTF-8 decoder.

    As expected, performance on valid UTF-8 is unchanged, which was an important goal for me.

    Decoding with the error handlers optimized by the patch is *much* faster.

    backslashreplace is still slow, because I didn't optimize it.
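A comparable measurement can be approximated with the standard `timeit` module. The sketch below is not the attached bench.py, whose contents are not shown in this thread; the input layout (one 0xFF byte every ten bytes) and the names `make_input` and `bench` are assumptions made for illustration.

```python
import timeit

HANDLERS = ("strict", "ignore", "replace", "surrogateescape", "backslashreplace")


def make_input(length: int, valid: bool) -> bytes:
    """Hypothetical test data: ASCII text, with one 0xFF every 10 bytes if not valid."""
    chunk = b"abcdefghij" if valid else b"abcdefghi\xff"
    return chunk * (length // 10)


def bench(length: int = 10**4, repeat: int = 3, number: int = 20) -> dict:
    """Return the best time per decode() call, in seconds, for each handler."""
    timings = {}
    for handler in HANDLERS:
        # "strict" would raise on invalid input, so it only gets valid data.
        data = make_input(length, valid=(handler == "strict"))
        timer = timeit.Timer(lambda: data.decode("utf-8", handler))
        timings[handler] = min(timer.repeat(repeat=repeat, number=number)) / number
    return timings


if __name__ == "__main__":
    for handler, seconds in bench().items():
        print(f"{handler:>16}: {seconds * 1e6:8.2f} us")
```

Absolute numbers depend on the machine and the Python version, so they are not expected to match the tables in this issue.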

    Common platform:
    Python unicode implementation: PEP-393
    Timer: time.perf_counter
    Platform: Linux-4.1.5-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
    CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
    Timer precision: 55 ns

    Platform of campaign before:
    SCM: hg revision=f51921883f50 tag=tip branch=default date="2015-10-04 01:19 -0400"
    Python version: 3.6.0a0 (default:f51921883f50, Oct 4 2015, 10:19:37) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
    Date: 2015-10-04 10:19:44

    Platform of campaign after:
    SCM: hg revision=f51921883f50+ tag=tip branch=default date="2015-10-04 01:19 -0400"
    Python version: 3.6.0a0 (default:f51921883f50+, Oct 4 2015, 10:14:05) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
    Date: 2015-10-04 10:18:55

    ---------------------+-------------+--------
    valid UTF-8 (strict) |      before |   after
    ---------------------+-------------+--------
    100 x 10**1 bytes    |  297 ns (*) |  297 ns
    100 x 10**3 bytes    |  7.4 us (*) | 7.44 us
    100 x 10**2 bytes    |  929 ns (*) |  924 ns
    100 x 10**4 bytes    | 80.4 us (*) | 80.4 us
    ---------------------+-------------+--------
    Total                | 89.1 us (*) |   89 us
    ---------------------+-------------+--------

    ------------------+-------------+---------------
    ignore            |      before |          after
    ------------------+-------------+---------------
    100 x 10**1 bytes | 6.68 us (*) |  743 ns (-89%)
    100 x 10**3 bytes |  561 us (*) | 42.6 us (-92%)
    100 x 10**2 bytes | 56.8 us (*) | 4.55 us (-92%)
    100 x 10**4 bytes | 6.02 ms (*) |  425 us (-93%)
    ------------------+-------------+---------------
    Total             | 6.65 ms (*) |  473 us (-93%)
    ------------------+-------------+---------------

    ------------------+-------------+---------------
    replace           |      before |          after
    ------------------+-------------+---------------
    100 x 10**1 bytes | 7.61 us (*) |  890 ns (-88%)
    100 x 10**3 bytes |  639 us (*) | 50.3 us (-92%)
    100 x 10**2 bytes | 64.8 us (*) | 5.37 us (-92%)
    100 x 10**4 bytes | 7.09 ms (*) |  505 us (-93%)
    ------------------+-------------+---------------
    Total             | 7.81 ms (*) |  561 us (-93%)
    ------------------+-------------+---------------

    ------------------+-------------+---------------
    surrogateescape   |      before |          after
    ------------------+-------------+---------------
    100 x 10**1 bytes | 7.96 us (*) |  855 ns (-89%)
    100 x 10**3 bytes |  674 us (*) | 50.2 us (-93%)
    100 x 10**2 bytes | 68.8 us (*) | 5.35 us (-92%)
    100 x 10**4 bytes | 7.38 ms (*) |  504 us (-93%)
    ------------------+-------------+---------------
    Total             | 8.13 ms (*) |  560 us (-93%)
    ------------------+-------------+---------------

    ------------------+-------------+--------
    backslashreplace  |      before |   after
    ------------------+-------------+--------
    100 x 10**1 bytes | 7.66 us (*) | 7.89 us
    100 x 10**3 bytes |  633 us (*) |  633 us
    100 x 10**2 bytes | 64.1 us (*) | 64.6 us
    100 x 10**4 bytes |  6.9 ms (*) | 6.93 ms
    ------------------+-------------+--------
    Total             | 7.61 ms (*) | 7.64 ms
    ------------------+-------------+--------

    ---------------------+-------------+---------------
    Summary              |      before |          after
    ---------------------+-------------+---------------
    valid UTF-8 (strict) | 89.1 us (*) |          89 us
    ignore               | 6.65 ms (*) |  473 us (-93%)
    replace              | 7.81 ms (*) |  561 us (-93%)
    surrogateescape      | 8.13 ms (*) |  560 us (-93%)
    backslashreplace     | 7.61 ms (*) |        7.64 ms
    ---------------------+-------------+---------------
    Total                | 30.3 ms (*) | 9.32 ms (-69%)
    ---------------------+-------------+---------------
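The percentages in the Summary table can be turned into speedup factors with a little arithmetic. The totals below are copied from that table and converted to microseconds; the helper names `change_percent` and `speedup` are made up for this example.

```python
# (before, after) totals in microseconds, copied from the Summary table.
summary_us = {
    "valid UTF-8 (strict)": (89.1, 89.0),
    "ignore": (6650.0, 473.0),
    "replace": (7810.0, 561.0),
    "surrogateescape": (8130.0, 560.0),
    "backslashreplace": (7610.0, 7640.0),
}


def change_percent(before: float, after: float) -> float:
    """Relative change of the timing; negative means faster."""
    return (after - before) / before * 100.0


def speedup(before: float, after: float) -> float:
    """How many times faster the 'after' build is."""
    return before / after


for name, (before, after) in summary_us.items():
    print(f"{name:>20}: {change_percent(before, after):+6.1f}%  "
          f"x{speedup(before, after):.1f}")
```

For the three optimized handlers, a change of about -93% corresponds to decoding roughly 14 times faster on the totals, while strict and backslashreplace stay at about x1.0.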

    @python-dev
    Mannequin

    python-dev mannequin commented Oct 5, 2015

    New changeset 3152e4038d97 by Victor Stinner in branch 'default':
    Issue bpo-25301: The UTF-8 decoder is now up to 15 times as fast for error
    https://hg.python.org/cpython/rev/3152e4038d97

    @vstinner
    Member Author

    vstinner commented Oct 5, 2015

    I pushed my optimization. I am closing the issue.

    @vstinner vstinner closed this as completed Oct 5, 2015
    @python-dev
    Mannequin

    python-dev mannequin commented Oct 5, 2015

    New changeset 5b9ffea7e7c3 by Victor Stinner in branch 'default':
    Issue bpo-25301: Fix compatibility with ISO C90
    https://hg.python.org/cpython/rev/5b9ffea7e7c3

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022