Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers #69058

Closed
methane opened this issue Aug 15, 2015 · 27 comments
Assignees
Labels
performance Performance or resource usage topic-unicode

Comments

@methane
Copy link
Member

methane commented Aug 15, 2015

BPO 24870
Nosy @vstinner, @ezio-melotti, @bitdancer, @methane, @serhiy-storchaka
Files
  • faster_surrogates_hadling.patch: Preliminary patch
  • faster-decode-ascii-surrogateescape.patch
  • faster-decode-ascii-surrogateescape.patch
  • bench.py
  • utf8.patch
  • utf8-2.patch
  • utf8-3.patch
  • bench_utf8.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-10-06.07:45:27.609>
    created_at = <Date 2015-08-15.03:08:44.326>
    labels = ['expert-unicode', 'performance']
    title = 'Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers'
    updated_at = <Date 2016-10-06.07:45:27.608>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2016-10-06.07:45:27.608>
    actor = 'methane'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-10-06.07:45:27.609>
    closer = 'methane'
    components = ['Unicode']
    creation = <Date 2015-08-15.03:08:44.326>
    creator = 'methane'
    dependencies = []
    files = ['40183', '40195', '40429', '40539', '40540', '40544', '40545', '40546']
    hgrepos = []
    issue_num = 24870
    keywords = ['patch']
    message_count = 27.0
    messages = ['248631', '248632', '248636', '248638', '248658', '248660', '248717', '248820', '248832', '250409', '251269', '251270', '251276', '251277', '251294', '251296', '251299', '251517', '251521', '251529', '252059', '252118', '252662', '252664', '252738', '257688', '257689']
    nosy_count = 6.0
    nosy_names = ['vstinner', 'ezio.melotti', 'r.david.murray', 'methane', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue24870'
    versions = ['Python 3.6']

    @methane
    Copy link
    Member Author

    methane commented Aug 15, 2015

    surrogateescape is recommended way to mix binary data in string protocol.
    But surrogateescape is too slow and it cause usability problem.

    One actual problem is: PyMySQL/PyMySQL#366

    surrogateescape is slow because errorhandler is called with UnicodeError object.
    bs.decode('utf-8', 'surrogateescape') may produce len(bs)/2 error objects internally when bs is random bytes.

    surrogateescape is used with ASCII and UTF-8 encoding in ordinal.
    Specialized implementation can make it faster.

    I want to Python 3.4 and Python 3.5 solve this issue since it's critical problem
    for some people.

    @methane methane added topic-unicode performance Performance or resource usage labels Aug 15, 2015
    @methane
    Copy link
    Member Author

    methane commented Aug 15, 2015

    On MacBook Pro (Core i5 2.6GHz), surrogateescape 1MB data takes 250ms.

    In [1]: bs = bytes(range(256)) * (4 * 1024)

    In [2]: len(bs)
    Out[2]: 1048576

    In [3]: %timeit x = bs.decode('ascii', 'surrogateescape')
    1 loops, best of 3: 249 ms per loop

    @bitdancer
    Copy link
    Member

    Why are bytes being escaped in a binary blob? The reason to use surrogateescape is when you have data that is mostly text, should be processed as text, but can have occasional binary data. That wouldn't seem to apply to a database binary blob.

    But that aside, if you want to submit a patch to speed up surrogateescape without changing its functionality, I'm sure it would be considered. It would certainly be useful for the email library, which currently does do the stupid thing of encoding binary message attachments using surrogateescape (and I'm guessing the reason pymysql does it is something similar to why email does it: the code would need to be significantly reorganized to do things right).

    @serhiy-storchaka
    Copy link
    Member

    Few months ago I wrote a patch that drastically speeds up encoding and decoding with surrogateescape and surrogatepass error handlers. However it causes 25% regression in decoding some UTF-8 data (U+0100-U+07FF if I remember correct) with strict error handler, so it needs some work. I hope that it is possible to rewrite UTF-8 decoder so that avoid a regression. The patch was postponed until 3.5 is released. In any case the patch is large and complex enough to be new feature that can appear only in 3.6.

    @serhiy-storchaka serhiy-storchaka self-assigned this Aug 15, 2015
    @serhiy-storchaka serhiy-storchaka changed the title surrogateescape is too slow Optimize coding with surrogateescape and surrogatepass error handlers Aug 15, 2015
    @vstinner
    Copy link
    Member

    Serhiy: maybe we can start with ascii?

    @vstinner vstinner changed the title Optimize coding with surrogateescape and surrogatepass error handlers surrogateescape is too slow Aug 15, 2015
    @vstinner
    Copy link
    Member

    Oh. I restored the old title because i replied by email with an old email.

    @vstinner vstinner changed the title surrogateescape is too slow Optimize coding with surrogateescape and surrogatepass error handlers Aug 15, 2015
    @methane
    Copy link
    Member Author

    methane commented Aug 17, 2015

    I've stripped Serhiy's patch for ascii.

    Here is benchmark result:
    https://gist.github.com/methane/2376ac5d20642c05a8b6#file-result-md

    Is there chance for applying this patch to 3.5.1?

    @methane
    Copy link
    Member Author

    methane commented Aug 19, 2015

    Why are bytes being escaped in a binary blob? The reason to use surrogateescape is when you have data that is mostly text, should be processed as text, but can have occasional binary data. That wouldn't seem to apply to a database binary blob.

    Since SQL may contain binary data.

    data = bytes(range(256))
    cursor.execute(u"INSERT INTO t (blob_col) values (%(data)s)", {"data": data})

    DB driver should escape properly for SQL syntax, then decode with surrogateescape for % operator.

    bytes in Python 3.5 supports % operator so I can use it instead of unicode %.
    But I'll continue to support Python 3.4 for some years.

    @bitdancer
    Copy link
    Member

    Since you already have to rewrite the string to do the escaping, I would judge it worth the extra effort to piece string together as binary, but I can understand wanting to use % notation. The performance issue seems to prevent that, though, and there's no guarantee the proposed optimization will get applied to 3.4, or even 3.5.

    @vstinner
    Copy link
    Member

    I rebased patch because faster-decode-ascii-surrogateescape.patch was generated in git format, and the git format is not accepted by Rietveld (the Review button).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 21, 2015

    New changeset 3c430259873e by Victor Stinner in branch 'default':
    Issue bpo-24870: Optimize the ASCII decoder for error handlers: surrogateescape,
    https://hg.python.org/cpython/rev/3c430259873e

    @vstinner
    Copy link
    Member

    I pushed a change to optimize the ASCII decoder.

    Attached bench.py script: microbenchmark on the ASCII decoder. My results follows.

    Common platform:
    Platform: Linux-4.1.5-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    Timer: time.perf_counter
    CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
    CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
    Python unicode implementation: PEP-393
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)

    Platform of campaign before:
    Timer precision: 57 ns
    Date: 2015-09-21 23:48:00
    SCM: hg revision=73867bf953e6 branch=default date="2015-09-21 22:40 +0200"
    Python version: 3.6.0a0 (default:73867bf953e6, Sep 21 2015, 23:47:35) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]

    Platform of campaign after:
    Timer precision: 55 ns
    Date: 2015-09-21 23:52:55
    Python version: 3.6.0a0 (default:bb0f55f1ec22+, Sep 21 2015, 23:52:43) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
    SCM: hg revision=bb0f55f1ec22+ tag=tip branch=default date="2015-09-21 23:06 +0200"

    ------------------+-------------+---------------
    ignore | before | after
    ------------------+-------------+---------------
    256 x 10**1 bytes | 314 us () | 5.25 us (-98%)
    256 x 10**3 bytes | 31.9 ms (
    ) | 509 us (-98%)
    256 x 10**2 bytes | 3.13 ms () | 50.2 us (-98%)
    256 x 10**4 bytes | 315 ms (
    ) | 5.12 ms (-98%)
    ------------------+-------------+---------------
    Total | 351 ms (*) | 5.69 ms (-98%)
    ------------------+-------------+---------------

    ------------------+-------------+---------------
    replace | before | after
    ------------------+-------------+---------------
    256 x 10**1 bytes | 352 us () | 6.32 us (-98%)
    256 x 10**3 bytes | 35.2 ms (
    ) | 608 us (-98%)
    256 x 10**2 bytes | 3.52 ms () | 60.9 us (-98%)
    256 x 10**4 bytes | 354 ms (
    ) | 6.19 ms (-98%)
    ------------------+-------------+---------------
    Total | 393 ms (*) | 6.87 ms (-98%)
    ------------------+-------------+---------------

    ------------------+-------------+---------------
    surrogateescape | before | after
    ------------------+-------------+---------------
    256 x 10**1 bytes | 369 us () | 5.92 us (-98%)
    256 x 10**3 bytes | 36.8 ms (
    ) | 570 us (-98%)
    256 x 10**2 bytes | 3.69 ms () | 56.9 us (-98%)
    256 x 10**4 bytes | 371 ms (
    ) | 5.79 ms (-98%)
    ------------------+-------------+---------------
    Total | 412 ms (*) | 6.43 ms (-98%)
    ------------------+-------------+---------------

    ------------------+-------------+--------
    backslashreplace | before | after
    ------------------+-------------+--------
    256 x 10**1 bytes | 357 us () | 361 us
    256 x 10**3 bytes | 35.1 ms (
    ) | 36.1 ms
    256 x 10**2 bytes | 3.52 ms () | 3.59 ms
    256 x 10**4 bytes | 357 ms (
    ) | 365 ms
    ------------------+-------------+--------
    Total | 396 ms (*) | 405 ms
    ------------------+-------------+--------

    -----------------+--------------+---------------
    Summary | before | after
    -----------------+--------------+---------------
    ignore | 351 ms () | 5.69 ms (-98%)
    replace | 393 ms (
    ) | 6.87 ms (-98%)
    surrogateescape | 412 ms () | 6.43 ms (-98%)
    backslashreplace | 396 ms (
    ) | 405 ms
    -----------------+--------------+---------------
    Total | 1.55 sec (*) | 424 ms (-73%)
    -----------------+--------------+---------------

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 21, 2015

    New changeset 2cf85e2834c2 by Victor Stinner in branch 'default':
    Issue bpo-24870: Reuse the new _Py_error_handler enum
    https://hg.python.org/cpython/rev/2cf85e2834c2

    New changeset aa247150a8b1 by Victor Stinner in branch 'default':
    Issue bpo-24870: Add _PyUnicodeWriter_PrepareKind() macro
    https://hg.python.org/cpython/rev/aa247150a8b1

    @vstinner
    Copy link
    Member

    Ok, I prepared the code for the UTF-8 optimization.

    @serhiy: would you like to rebase your patch faster_surrogates_hadling.patch?

    Attached utf8.patch is a less optimal implementation which only changes PyUnicode_DecodeUTF8Stateful(). Maybe it's enough?

    I would like to see a benchmark here to choose the good compromise between performance and code complexity.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 22, 2015

    New changeset 8317796ca004 by Victor Stinner in branch 'default':
    Issue bpo-24870: revert unwanted change
    https://hg.python.org/cpython/rev/8317796ca004

    @vstinner
    Copy link
    Member

    I pushed utf8.patch by mistake :-/ The advantage is that buildbots found bugs. Attached utf8-2.patch fixed bugs.

    The bug was how the "s" variable was set in the error handler. It's now set with:

       s = starts + endinpos;

    Bugs found by the buildbots:

    ======================================================================
    FAIL: test_invalid_cb_for_3bytes_seq (test.test_unicode.UnicodeTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/opt/python/3.x.langa-ubuntu/build/Lib/test/test_unicode.py", line 1897, in test_invalid_cb_for_3bytes_seq
        'invalid continuation byte')
      File "/opt/python/3.x.langa-ubuntu/build/Lib/test/test_unicode.py", line 1772, in assertCorrectUTF8Decoding
        self.assertEqual(seq.decode('utf-8', 'replace'), res)
    AssertionError: '��\x00' != '�\x00'
    - ��
    ? -
    +

    ======================================================================
    FAIL: test_unquote_with_unicode (test.test_urllib.UnquotingTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/opt/python/3.x.langa-ubuntu/build/Lib/test/test_urllib.py", line 1016, in test_unquote_with_unicode
        "using unquote(): %r != %r" % (expect, result))
    AssertionError: '�' != '��'
    -
    + ��
    ? +
     : using unquote(): '�' != '��'

    @vstinner
    Copy link
    Member

    Ok, here is a patch which optimizes surrogatepass too.

    Result of bench_utf8.py.

    Common platform:
    CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
    CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    Timer: time.perf_counter
    Platform: Linux-4.1.6-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
    Python unicode implementation: PEP-393

    Platform of campaign before:
    SCM: hg revision=8317796ca004 tag=tip branch=default date="2015-09-22 10:46 +0200"
    Python version: 3.6.0a0 (default:8317796ca004, Sep 22 2015, 11:33:04) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
    Date: 2015-09-22 11:33:12
    Timer precision: 73 ns

    Platform of campaign after:
    SCM: hg revision=8317796ca004+ tag=tip branch=default date="2015-09-22 10:46 +0200"
    Python version: 3.6.0a0 (default:8317796ca004+, Sep 22 2015, 11:31:04) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
    Date: 2015-09-22 11:31:58
    Timer precision: 54 ns

    ------------------+-------------+---------------
    ignore | before | after
    ------------------+-------------+---------------
    100 x 10**1 bytes | 8.85 us () | 857 ns (-90%)
    100 x 10**3 bytes | 780 us (
    ) | 45.5 us (-94%)
    100 x 10**2 bytes | 78.7 us () | 4.81 us (-94%)
    100 x 10**4 bytes | 8.13 ms (
    ) | 456 us (-94%)
    ------------------+-------------+---------------
    Total | 9 ms (*) | 507 us (-94%)
    ------------------+-------------+---------------

    ------------------+-------------+---------------
    replace | before | after
    ------------------+-------------+---------------
    100 x 10**1 bytes | 10 us () | 939 ns (-91%)
    100 x 10**3 bytes | 898 us (
    ) | 54.1 us (-94%)
    100 x 10**2 bytes | 90.5 us () | 5.72 us (-94%)
    100 x 10**4 bytes | 9.55 ms (
    ) | 536 us (-94%)
    ------------------+-------------+---------------
    Total | 10.6 ms (*) | 597 us (-94%)
    ------------------+-------------+---------------

    ------------------+-------------+---------------
    surrogateescape | before | after
    ------------------+-------------+---------------
    100 x 10**1 bytes | 10.4 us () | 919 ns (-91%)
    100 x 10**3 bytes | 930 us (
    ) | 55.2 us (-94%)
    100 x 10**2 bytes | 93.1 us () | 5.84 us (-94%)
    100 x 10**4 bytes | 9.85 ms (
    ) | 552 us (-94%)
    ------------------+-------------+---------------
    Total | 10.9 ms (*) | 614 us (-94%)
    ------------------+-------------+---------------

    ------------------+-------------+---------------
    surrogatepass | before | after
    ------------------+-------------+---------------
    100 x 10**1 bytes | 4.83 us () | 963 ns (-80%)
    100 x 10**3 bytes | 353 us (
    ) | 42.6 us (-88%)
    100 x 10**2 bytes | 36.5 us () | 4.69 us (-87%)
    100 x 10**4 bytes | 4.17 ms (
    ) | 424 us (-90%)
    ------------------+-------------+---------------
    Total | 4.56 ms (*) | 472 us (-90%)
    ------------------+-------------+---------------

    ------------------+-------------+--------------
    backslashreplace | before | after
    ------------------+-------------+--------------
    100 x 10**1 bytes | 10.9 us () | 10.1 us (-7%)
    100 x 10**3 bytes | 944 us (
    ) | 853 us (-10%)
    100 x 10**2 bytes | 93.8 us () | 87.5 us (-7%)
    100 x 10**4 bytes | 9.92 ms (
    ) | 9.04 ms (-9%)
    ------------------+-------------+--------------
    Total | 11 ms (*) | 9.99 ms (-9%)
    ------------------+-------------+--------------

    -----------------+-------------+---------------
    Summary | before | after
    -----------------+-------------+---------------
    ignore | 9 ms () | 507 us (-94%)
    replace | 10.6 ms (
    ) | 597 us (-94%)
    surrogateescape | 10.9 ms () | 614 us (-94%)
    surrogatepass | 4.56 ms (
    ) | 472 us (-90%)
    backslashreplace | 11 ms () | 9.99 ms (-9%)
    -----------------+-------------+---------------
    Total | 46 ms (
    ) | 12.2 ms (-73%)
    -----------------+-------------+---------------

    @vstinner
    Copy link
    Member

    I created the issue bpo-25227 to optimize the ASCII and Latin1 *encoders* for surrogateescape.

    @serhiy-storchaka
    Copy link
    Member

    I worked on UTF-16 and UTF-32 encoders, but now I'm off my developing computer. I'll provide updated patch soon.

    I think that only "surrogateescape" and "surrogatepass" error handlers have need in optimization, because they are used to interpolate with other programs, including old Python versions. "strict" stops processing, an optimization is not needed here. All other error handlers lose information and can't be used per se for transcoding bytes as string or string as bytes. They are used together with other slow code (for example for encoding string in XML or HTML you first need to escape '&', '<' and quotes). It is easy to add fast handling for 'ignore' and 'replace', but these error handlers are used largely for produce human-readable output, and adding it can slow down common case (no errors). That is why I limit my patch for "surrogateescape" and "surrogatepass" only.

    @vstinner vstinner changed the title Optimize coding with surrogateescape and surrogatepass error handlers Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers Sep 24, 2015
    @vstinner
    Copy link
    Member

    Serhiy wrote: "All other error handlers lose information and can't be used per se for transcoding bytes as string or string as bytes."

    Well, it was very simple to implement replace and ignore in decoders. I believe that the error handlers are commonly used.

    "(...) adding it can slow down common case (no errors). That is why I limit my patch for "surrogateescape" and "surrogatepass" only."

    We can start with benchmarks and see if modifying Objects/stringlib/ has a real impact on performances, or if modifying the "slower" decoder in Objects/unicodeobject.c is enough. IMHO it's fine to implement many error handlers in Objects/unicodeobject.c: it's the "slow" path when at least one error occurred, so it doesn't impact the path to decode valid UTF-8 strings.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 1, 2015

    I just pushed my patch to optimize the UTF-8 encoder with error handlers: see the issue bpo-25267. It's up to 70 times as fast. The patch was based on Serhiy's work: faster_surrogates_hadling.patch attached to this issue.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 2, 2015

    I created issue bpo-25301: "Optimize UTF-8 decoder with error handlers".

    @vstinner
    Copy link
    Member

    vstinner commented Oct 9, 2015

    Short summary.

    Ok, I optimized ASCII, Latin1 and UTF-8 codecs (encoders and decoders) for the most common error handlers.

    • ASCII and Latin1 encoders: surrogateescape, replace, ignore, backslashreplace, xmlcharrefreplace
    • ASCII decoder: surrogateescape, replace, ignore
    • (Latin1 decoder cannot fail)
    • UTF-8 encoder: surrogateescape, surrogatepass, replace, ignore, backslashreplace, xmlcharrefreplace
    • UTF-8 decoder: surrogateescape, replace, ignore

    The code to handle other error handlers in encoders has also be optimized.

    Surrogateescape has now an efficent implementation for ASCII, Latin1 and UTF-8 encoders and decoders.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 9, 2015

    INADA Naoki: "I want to Python 3.4 and Python 3.5 solve this issue since it's critical problem for some people."

    On microbenchmarks, the optimization that I just implemented in Python 3.6 are impressive. The problem is that the implementation is quite complex. If I understood correctly, you are asking to optimize decoders and encoders for ASCII and UTF-8 (modify 4 functions) for the surrogateescape error handler. Is that right? Would UTF-8 be enough or not?

    I don't like backporting optimizations which are not well tested right now. To optimize encoders, I wrote a full new _PyBytesWriter API. We cannot backport this new API, even if it's private. So the backport may be more complex than the code in the default branch.

    @methane
    Copy link
    Member Author

    methane commented Oct 10, 2015

    UTF-8 and Latin1 are typical encoding for MySQL query.
    When inserting BLOB:

    # Decode binary data
    x = data.decode('ascii', 'surrogateescape')
    # %-format query
    psql = sql % (escape(x),)  # sql is unicode
    # Encode sql to connection encoding (latin1 or utf8)
    query_bytes = psql.encode('utf-8', 'surrogateescape')

    So decoder speedup is required only for ascii, but encoder speedup is required for utf-8 and latin1.

    I'll consider other ways including creating speedup module and register it on PyPI.

    @methane
    Copy link
    Member Author

    methane commented Jan 7, 2016

    FYI, I found a workaround.
    PyMySQL/PyMySQL#409

    _table = [chr(i) for i in range(128)] + [chr(i) for i in range(0xdc80, 0xdd00)]
    
    def decode_surroundescape(s):
        return s.decode('latin1').translate(_table)

    In [15]: data = b'\xff' * 1024 * 1024

    In [16]: data.decode('ascii', 'surrogateescape') == decode_surroundescape(data)
    Out[16]: True

    In [17]: %timeit data.decode('ascii', 'surrogateescape')
    1 loops, best of 3: 394 ms per loop

    In [18]: %timeit decode_surroundescape(data)
    10 loops, best of 3: 40 ms per loop

    @vstinner
    Copy link
    Member

    vstinner commented Jan 7, 2016

    In [18]: %timeit decode_surroundescape(data)
    10 loops, best of 3: 40 ms per loop

    Cool! Good job.

    @methane methane closed this as completed Oct 6, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage topic-unicode
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants