Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize UTF-8 encoder with error handlers #69454

Closed
vstinner opened this issue Sep 29, 2015 · 6 comments
Closed

Optimize UTF-8 encoder with error handlers #69454

vstinner opened this issue Sep 29, 2015 · 6 comments
Labels
performance Performance or resource usage topic-unicode

Comments

@vstinner
Copy link
Member

BPO 25267
Nosy @vstinner, @ezio-melotti, @methane, @serhiy-storchaka
Files
  • utf8_encoder_errors.patch
  • bench.py
  • utf8_encoder_errors-2.patch
  • utf8_encoder_errors-3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-10-01.21:27:18.328>
    created_at = <Date 2015-09-29.11:30:34.875>
    labels = ['expert-unicode', 'performance']
    title = 'Optimize UTF-8 encoder with error handlers'
    updated_at = <Date 2015-10-01.21:27:18.326>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2015-10-01.21:27:18.326>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-10-01.21:27:18.328>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2015-09-29.11:30:34.875>
    creator = 'vstinner'
    dependencies = []
    files = ['40619', '40645', '40646', '40647']
    hgrepos = []
    issue_num = 25267
    keywords = ['patch']
    message_count = 6.0
    messages = ['251845', '252021', '252022', '252024', '252058', '252060']
    nosy_count = 5.0
    nosy_names = ['vstinner', 'ezio.melotti', 'methane', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue25267'
    versions = ['Python 3.6']

    @vstinner
    Copy link
    Member Author

    Attached patch optimizes the UTF-8 encoder for error handlers: ignore, replace, surrogateescape, surrogatepass. It is based on the patch faster_surrogates_hadling.patch written by Serhiy Storchaka in the issue bpo-24870.

    It also modifies unicode_encode_ucs1() to use memset() for the replace error handler. It should be faster for long sequences of unencodable characters, but it may be slower for short sequences of unencodable characters.

    The patch adds new unit tests and fix unit tests to ensure that utf-8-sig codec is also well tested.

    TODO: write a benchmark.

    See also the issue bpo-25227 which optimized ASCII and latin1 encoders with the surrogateescape error handlers.

    @vstinner vstinner added topic-unicode performance Performance or resource usage labels Sep 29, 2015
    @vstinner
    Copy link
    Member Author

    vstinner commented Oct 1, 2015

    Oh, there is a bug in utf8_encoder() (not in my patch!), newpos was not used after calling the error handler. It's now fixed in the new patch.

    @vstinner
    Copy link
    Member Author

    vstinner commented Oct 1, 2015

    Benchmark results. Sorry for the very long output.

    There are some (corner?) cases where the patched Python is a little bit slower. I consider that it's ok since it's *much* faster in the other cases.

    What do you think?

    Common platform:
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    Python unicode implementation: PEP-393
    Timer: time.perf_counter
    SCM: hg revision=10efb1797e7b+ tag=tip branch=default date="2015-10-01 13:16 +0200"
    Platform: Linux-4.1.6-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
    CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes

    Platform of campaign before:
    Date: 2015-10-01 13:30:07
    Timer precision: 61 ns
    Python version: 3.6.0a0 (default:10efb1797e7b, Oct 1 2015, 13:30:06) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]

    Platform of campaign after:
    Timer precision: 63 ns
    Date: 2015-10-01 13:54:14
    Python version: 3.6.0a0 (default:10efb1797e7b+, Oct 1 2015, 13:53:51) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]

    --------------------------+-------------+----------------
    ignore: "\udcff" * length | before | after
    --------------------------+-------------+----------------

    length=10**1              | 3.16 us (*) |   279 ns (-91%)
    length=10**3              |  241 us (*) | 1.08 us (-100%)
    length=10**2              | 23.9 us (*) |   346 ns (-99%)
    length=10**4              | 2.39 ms (*) | 6.48 us (-100%)
    --------------------------+-------------+

    Total | 2.66 ms (*) | 8.19 us (-100%)
    --------------------------+-------------+----------------

    --------------------------------+-------------+---------------
    ignore: "a" * length + "\udcff" | before | after
    --------------------------------+-------------+---------------

    length=10**1                    | 1.12 us (*) |  295 ns (-74%)
    length=10**3                    |  2.2 us (*) | 1.57 us (-29%)
    length=10**2                    | 1.21 us (*) |  408 ns (-66%)
    length=10**4                    | 10.4 us (*) | 12.3 us (+18%)
    --------------------------------+-------------+

    Total | 15 us (*) | 14.6 us
    --------------------------------+-------------+---------------

    --------------------------------------------+-------------+---------------
    ignore: ("a" * 99 + "\udcff" * 99) * length | before | after
    --------------------------------------------+-------------+---------------

    length=10**1                                |  238 us (*) | 2.46 us (-99%)
    length=10**3                                | 23.7 ms (*) |  234 us (-99%)
    length=10**2                                | 2.38 ms (*) | 20.8 us (-99%)
    length=10**4                                |  238 ms (*) | 2.56 ms (-99%)
    --------------------------------------------+-------------+

    Total | 265 ms (*) | 2.82 ms (-99%)
    --------------------------------------------+-------------+---------------

    ---------------------------------------+-------------+----------------
    ignore: ("\udcff" * 99 + "a") * length | before | after
    ---------------------------------------+-------------+----------------

    length=10**1                           |  239 us (*) |  1.29 us (-99%)
    length=10**3                           | 23.8 ms (*) | 80.9 us (-100%)
    length=10**2                           |  2.4 ms (*) | 8.44 us (-100%)
    length=10**4                           |  236 ms (*) |  839 us (-100%)
    ---------------------------------------+-------------+

    Total | 263 ms (*) | 930 us (-100%)
    ---------------------------------------+-------------+----------------

    --------------------------------+-------------+---------------
    ignore: "\udcff" + "a" * length | before | after
    --------------------------------+-------------+---------------

    length=10**1                    | 1.09 us (*) |  297 ns (-73%)
    length=10**3                    | 2.19 us (*) | 1.58 us (-28%)
    length=10**2                    | 1.19 us (*) |  409 ns (-66%)
    length=10**4                    | 10.5 us (*) | 12.3 us (+17%)
    --------------------------------+-------------+

    Total | 14.9 us (*) | 14.6 us
    --------------------------------+-------------+---------------

    ---------------------------+-------------+----------------
    replace: "\udcff" * length | before | after
    ---------------------------+-------------+----------------

    length=10**1               | 3.47 us (*) |   317 ns (-91%)
    length=10**3               |  263 us (*) | 1.07 us (-100%)
    length=10**2               | 26.4 us (*) |   383 ns (-99%)
    length=10**4               | 2.65 ms (*) | 6.75 us (-100%)
    ---------------------------+-------------+

    Total | 2.94 ms (*) | 8.52 us (-100%)
    ---------------------------+-------------+----------------

    ---------------------------------+-------------+---------------
    replace: "a" * length + "\udcff" | before | after
    ---------------------------------+-------------+---------------

    length=10**1                     | 1.16 us (*) |  319 ns (-72%)
    length=10**3                     | 2.25 us (*) | 1.62 us (-28%)
    length=10**2                     | 1.25 us (*) |  432 ns (-65%)
    length=10**4                     | 13.4 us (*) |  12.4 us (-7%)
    ---------------------------------+-------------+

    Total | 18 us (*) | 14.7 us (-18%)
    ---------------------------------+-------------+---------------

    ---------------------------------------------+-------------+---------------
    replace: ("a" * 99 + "\udcff" * 99) * length | before | after
    ---------------------------------------------+-------------+---------------

    length=10**1                                 |  267 us (*) | 2.52 us (-99%)
    length=10**3                                 | 26.2 ms (*) |  210 us (-99%)
    length=10**2                                 | 2.63 ms (*) | 21.3 us (-99%)
    length=10**4                                 |  264 ms (*) | 2.98 ms (-99%)
    ---------------------------------------------+-------------+

    Total | 293 ms (*) | 3.21 ms (-99%)
    ---------------------------------------------+-------------+---------------

    ----------------------------------------+-------------+----------------
    replace: ("\udcff" * 99 + "a") * length | before | after
    ----------------------------------------+-------------+----------------

    length=10**1                            |  263 us (*) | 1.29 us (-100%)
    length=10**3                            | 26.1 ms (*) | 86.6 us (-100%)
    length=10**2                            | 2.63 ms (*) | 9.02 us (-100%)
    length=10**4                            |  261 ms (*) |  925 us (-100%)
    ----------------------------------------+-------------+

    Total | 290 ms (*) | 1.02 ms (-100%)
    ----------------------------------------+-------------+----------------

    ---------------------------------+-------------+---------------
    replace: "\udcff" + "a" * length | before | after
    ---------------------------------+-------------+---------------

    length=10**1                     | 1.14 us (*) |  317 ns (-72%)
    length=10**3                     | 2.24 us (*) |  1.6 us (-28%)
    length=10**2                     | 1.23 us (*) |  428 ns (-65%)
    length=10**4                     | 10.5 us (*) | 12.3 us (+17%)
    ---------------------------------+-------------+

    Total | 15.1 us (*) | 14.7 us
    ---------------------------------+-------------+---------------

    -----------------------------------+-------------+---------------
    surrogateescape: "\udcff" * length | before | after
    -----------------------------------+-------------+---------------

    length=10**1                       | 3.48 us (*) |  281 ns (-92%)
    length=10**3                       |  267 us (*) | 1.77 us (-99%)
    length=10**2                       | 26.7 us (*) |  424 ns (-98%)
    length=10**4                       | 2.67 ms (*) | 13.9 us (-99%)
    -----------------------------------+-------------+

    Total | 2.97 ms (*) | 16.3 us (-99%)
    -----------------------------------+-------------+---------------

    -----------------------------------------+-------------+---------------
    surrogateescape: "a" * length + "\udcff" | before | after
    -----------------------------------------+-------------+---------------

    length=10**1                             | 1.14 us (*) |  277 ns (-76%)
    length=10**3                             | 2.32 us (*) | 1.57 us (-32%)
    length=10**2                             | 1.24 us (*) |  391 ns (-68%)
    length=10**4                             | 10.6 us (*) | 12.3 us (+17%)
    -----------------------------------------+-------------+

    Total | 15.3 us (*) | 14.6 us
    -----------------------------------------+-------------+---------------

    -----------------------------------------------------+-------------+---------------
    surrogateescape: ("a" * 99 + "\udcff" * 99) * length | before | after
    -----------------------------------------------------+-------------+---------------

    length=10**1                                         |  266 us (*) | 3.26 us (-99%)
    length=10**3                                         | 26.4 ms (*) |  285 us (-99%)
    length=10**2                                         | 2.65 ms (*) | 28.9 us (-99%)
    length=10**4                                         |  266 ms (*) | 3.73 ms (-99%)
    -----------------------------------------------------+-------------+

    Total | 295 ms (*) | 4.04 ms (-99%)
    -----------------------------------------------------+-------------+---------------

    ------------------------------------------------+-------------+---------------
    surrogateescape: ("\udcff" * 99 + "a") * length | before | after
    ------------------------------------------------+-------------+---------------

    length=10**1                                    |  265 us (*) | 2.04 us (-99%)
    length=10**3                                    | 26.2 ms (*) |  165 us (-99%)
    length=10**2                                    | 2.64 ms (*) |   17 us (-99%)
    length=10**4                                    |  263 ms (*) | 1.75 ms (-99%)
    ------------------------------------------------+-------------+

    Total | 292 ms (*) | 1.93 ms (-99%)
    ------------------------------------------------+-------------+---------------

    -----------------------------------------+-------------+---------------
    surrogateescape: "\udcff" + "a" * length | before | after
    -----------------------------------------+-------------+---------------

    length=10**1                             | 1.12 us (*) |  278 ns (-75%)
    length=10**3                             | 2.25 us (*) | 1.59 us (-29%)
    length=10**2                             | 1.21 us (*) |  389 ns (-68%)
    length=10**4                             | 10.5 us (*) | 12.3 us (+17%)
    -----------------------------------------+-------------+

    Total | 15.1 us (*) | 14.6 us
    -----------------------------------------+-------------+---------------

    ---------------------------------+-------------+---------------
    surrogatepass: "\udcff" * length | before | after
    ---------------------------------+-------------+---------------

    length=10**1                     | 3.71 us (*) |  306 ns (-92%)
    length=10**3                     |  289 us (*) | 2.61 us (-99%)
    length=10**2                     | 28.9 us (*) |  532 ns (-98%)
    length=10**4                     | 2.88 ms (*) | 22.4 us (-99%)
    ---------------------------------+-------------+

    Total | 3.2 ms (*) | 25.8 us (-99%)
    ---------------------------------+-------------+---------------

    ---------------------------------------+-------------+---------------
    surrogatepass: "a" * length + "\udcff" | before | after
    ---------------------------------------+-------------+---------------

    length=10**1                           | 1.16 us (*) |  299 ns (-74%)
    length=10**3                           | 2.36 us (*) | 1.59 us (-32%)
    length=10**2                           | 1.27 us (*) |  413 ns (-68%)
    length=10**4                           | 10.6 us (*) | 12.3 us (+16%)
    ---------------------------------------+-------------+

    Total | 15.4 us (*) | 14.6 us (-5%)
    ---------------------------------------+-------------+---------------

    ---------------------------------------------------+-------------+---------------
    surrogatepass: ("a" * 99 + "\udcff" * 99) * length | before | after
    ---------------------------------------------------+-------------+---------------

    length=10**1                                       |  289 us (*) | 3.99 us (-99%)
    length=10**3                                       | 28.5 ms (*) |  362 us (-99%)
    length=10**2                                       | 2.86 ms (*) | 36.7 us (-99%)
    length=10**4                                       |  287 ms (*) | 5.18 ms (-98%)
    ---------------------------------------------------+-------------+

    Total | 319 ms (*) | 5.59 ms (-98%)
    ---------------------------------------------------+-------------+---------------

    ----------------------------------------------+-------------+---------------
    surrogatepass: ("\udcff" * 99 + "a") * length | before | after
    ----------------------------------------------+-------------+---------------

    length=10**1                                  |  288 us (*) | 2.91 us (-99%)
    length=10**3                                  | 28.5 ms (*) |  242 us (-99%)
    length=10**2                                  | 2.86 ms (*) | 24.7 us (-99%)
    length=10**4                                  |  284 ms (*) | 2.53 ms (-99%)
    ----------------------------------------------+-------------+

    Total | 316 ms (*) | 2.8 ms (-99%)
    ----------------------------------------------+-------------+---------------

    ---------------------------------------+-------------+---------------
    surrogatepass: "\udcff" + "a" * length | before | after
    ---------------------------------------+-------------+---------------

    length=10**1                           | 1.13 us (*) |  301 ns (-73%)
    length=10**3                           |  2.3 us (*) | 1.59 us (-31%)
    length=10**2                           | 1.24 us (*) |  409 ns (-67%)
    length=10**4                           | 10.6 us (*) | 12.1 us (+15%)
    ---------------------------------------+-------------+

    Total | 15.2 us (*) | 14.4 us (-5%)
    ---------------------------------------+-------------+---------------

    ------------------------------------+-------------+---------------
    backslashreplace: "\udcff" * length | before | after
    ------------------------------------+-------------+---------------

    length=10**1                        | 4.28 us (*) | 1.58 us (-63%)
    length=10**3                        |  320 us (*) | 11.1 us (-97%)
    length=10**2                        | 32.3 us (*) | 2.56 us (-92%)
    length=10**4                        | 3.17 ms (*) | 96.6 us (-97%)
    ------------------------------------+-------------+

    Total | 3.52 ms (*) | 112 us (-97%)
    ------------------------------------+-------------+---------------

    ------------------------------------------+-------------+---------------
    backslashreplace: "a" * length + "\udcff" | before | after
    ------------------------------------------+-------------+---------------

    length=10**1                              | 1.44 us (*) |        1.47 us
    length=10**3                              | 2.43 us (*) | 2.77 us (+14%)
    length=10**2                              | 1.52 us (*) |  1.64 us (+8%)
    length=10**4                              | 10.6 us (*) | 13.3 us (+25%)
    ------------------------------------------+-------------+

    Total | 16 us (*) | 19.2 us (+20%)
    ------------------------------------------+-------------+---------------

    ------------------------------------------------------+-------------+---------------
    backslashreplace: ("a" * 99 + "\udcff" * 99) * length | before | after
    ------------------------------------------------------+-------------+---------------

    length=10**1                                          |  316 us (*) |   16 us (-95%)
    length=10**3                                          | 31.3 ms (*) | 1.46 ms (-95%)
    length=10**2                                          | 3.14 ms (*) |  147 us (-95%)
    length=10**4                                          |  313 ms (*) | 15.3 ms (-95%)
    ------------------------------------------------------+-------------+

    Total | 347 ms (*) | 16.9 ms (-95%)
    ------------------------------------------------------+-------------+---------------

    -------------------------------------------------+-------------+---------------
    backslashreplace: ("\udcff" * 99 + "a") * length | before | after
    -------------------------------------------------+-------------+---------------

    length=10**1                                     |  317 us (*) | 14.7 us (-95%)
    length=10**3                                     | 31.3 ms (*) | 1.34 ms (-96%)
    length=10**2                                     | 3.17 ms (*) |  135 us (-96%)
    length=10**4                                     |  313 ms (*) | 13.8 ms (-96%)
    -------------------------------------------------+-------------+

    Total | 347 ms (*) | 15.3 ms (-96%)
    -------------------------------------------------+-------------+---------------

    ------------------------------------------+-------------+---------------
    backslashreplace: "\udcff" + "a" * length | before | after
    ------------------------------------------+-------------+---------------

    length=10**1                              | 1.43 us (*) |        1.45 us
    length=10**3                              | 2.36 us (*) |  2.58 us (+9%)
    length=10**2                              | 1.51 us (*) |        1.57 us
    length=10**4                              | 10.5 us (*) | 13.2 us (+26%)
    ------------------------------------------+-------------+

    Total | 15.8 us (*) | 18.8 us (+19%)
    ------------------------------------------+-------------+---------------

    ------------------------------------------------------+--------------+----------------
    Summary | before | after
    ------------------------------------------------------+--------------+----------------
    ignore: "\udcff" * length | 2.66 ms () | 8.19 us (-100%)
    ignore: "a" * length + "\udcff" | 15 us (
    ) | 14.6 us
    ignore: ("a" * 99 + "\udcff" * 99) * length | 265 ms () | 2.82 ms (-99%)
    ignore: ("\udcff" * 99 + "a") * length | 263 ms (
    ) | 930 us (-100%)
    ignore: "\udcff" + "a" * length | 14.9 us () | 14.6 us
    replace: "\udcff" * length | 2.94 ms (
    ) | 8.52 us (-100%)
    replace: "a" * length + "\udcff" | 18 us () | 14.7 us (-18%)
    replace: ("a" * 99 + "\udcff" * 99) * length | 293 ms (
    ) | 3.21 ms (-99%)
    replace: ("\udcff" * 99 + "a") * length | 290 ms () | 1.02 ms (-100%)
    replace: "\udcff" + "a" * length | 15.1 us (
    ) | 14.7 us
    surrogateescape: "\udcff" * length | 2.97 ms () | 16.3 us (-99%)
    surrogateescape: "a" * length + "\udcff" | 15.3 us (
    ) | 14.6 us
    surrogateescape: ("a" * 99 + "\udcff" * 99) * length | 295 ms () | 4.04 ms (-99%)
    surrogateescape: ("\udcff" * 99 + "a") * length | 292 ms (
    ) | 1.93 ms (-99%)
    surrogateescape: "\udcff" + "a" * length | 15.1 us () | 14.6 us
    surrogatepass: "\udcff" * length | 3.2 ms (
    ) | 25.8 us (-99%)
    surrogatepass: "a" * length + "\udcff" | 15.4 us () | 14.6 us (-5%)
    surrogatepass: ("a" * 99 + "\udcff" * 99) * length | 319 ms (
    ) | 5.59 ms (-98%)
    surrogatepass: ("\udcff" * 99 + "a") * length | 316 ms () | 2.8 ms (-99%)
    surrogatepass: "\udcff" + "a" * length | 15.2 us (
    ) | 14.4 us (-5%)
    backslashreplace: "\udcff" * length | 3.52 ms () | 112 us (-97%)
    backslashreplace: "a" * length + "\udcff" | 16 us (
    ) | 19.2 us (+20%)
    backslashreplace: ("a" * 99 + "\udcff" * 99) * length | 347 ms () | 16.9 ms (-95%)
    backslashreplace: ("\udcff" * 99 + "a") * length | 347 ms (
    ) | 15.3 ms (-96%)
    backslashreplace: "\udcff" + "a" * length | 15.8 us () | 18.8 us (+19%)
    ------------------------------------------------------+--------------+----------------
    Total | 3.04 sec (
    ) | 54.9 ms (-98%)
    ------------------------------------------------------+--------------+----------------

    @vstinner
    Copy link
    Member Author

    vstinner commented Oct 1, 2015

    Oh, the default handler for errror handlers uses a loop to check for non-ASCII characters. It can be replaced with PyUnicode_IS_ASCII(str) which has a complexity O(1). Done in new patch.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 1, 2015

    New changeset 2b5357b38366 by Victor Stinner in branch 'default':
    Issue bpo-25267: The UTF-8 encoder is now up to 75 times as fast for error
    https://hg.python.org/cpython/rev/2b5357b38366

    @vstinner
    Copy link
    Member Author

    vstinner commented Oct 1, 2015

    I pushed my optimization.

    @vstinner vstinner closed this as completed Oct 1, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage topic-unicode
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant