Optimize UTF-8 decoder with error handlers #69488

vstinner · 2015-10-02T14:44:43Z

BPO	25301
Nosy	@vstinner, @ezio-melotti
Files	utf8_decoder.patch bench.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2015-10-05.11:44:37.727>
created_at = <Date 2015-10-02.14:44:42.656>
labels = ['expert-unicode', 'performance']
title = 'Optimize UTF-8 decoder with error handlers'
updated_at = <Date 2015-10-05.11:49:36.115>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2015-10-05.11:49:36.115>
actor = 'python-dev'
assignee = 'none'
closed = True
closed_date = <Date 2015-10-05.11:44:37.727>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2015-10-02.14:44:42.656>
creator = 'vstinner'
dependencies = []
files = ['40663', '40671']
hgrepos = []
issue_num = 25301
keywords = ['patch']
message_count = 6.0
messages = ['252117', '252181', '252264', '252319', '252320', '252321']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue25301'
versions = ['Python 3.6']

vstinner · 2015-10-02T14:44:42Z

The issue bpo-24870 optimized the ASCII decoder with error handlers:

New changeset 3c430259873e by Victor Stinner in branch 'default':
Issue bpo-24870: Optimize the ASCII decoder for error handlers: surrogateescape,
https://hg.python.org/cpython/rev/3c430259873e

We should also optimize the UTF-8 decoder with error handlers.

I will work on a patch in the next few days.
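For context, here is how the error handlers discussed in this issue behave on a byte string with one undecodable byte. This is only an illustration using the public `bytes.decode` API; the sample input is made up and is not taken from the patch or from bench.py.

```python
# Illustrative input (an assumption): valid ASCII with one invalid byte 0xFF.
data = b"abc\xffdef"

# Decode the same bytes with each error handler mentioned in this issue.
results = {
    errors: data.decode("utf-8", errors)
    for errors in ("ignore", "replace", "surrogateescape", "backslashreplace")
}

for errors, text in results.items():
    # ascii() keeps lone surrogates printable on any terminal encoding.
    print(f"{errors:17} {ascii(text)}")

# The default "strict" handler raises instead of producing a string.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict            raises", type(exc).__name__)
```

With "surrogateescape" the undecodable byte is mapped to a lone surrogate (U+DCFF here), so the original bytes can be recovered by encoding with the same handler.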

vstinner · 2015-10-03T00:01:15Z

Here is a first patch. It is written to keep the best performance for valid UTF-8 encoded strings, while speeding up the decoding of strings that contain a few undecodable bytes.

vstinner · 2015-10-04T08:30:32Z

Results of the microbenchmark on the UTF-8 decoder.

As expected, performance on valid UTF-8 is unchanged, which was an important goal for me.

Decoding with the error handlers optimized by the patch is *much* faster.

backslashreplace is still slow, because I didn't optimize it.

Common platform:
Python unicode implementation: PEP-393
Timer: time.perf_counter
Platform: Linux-4.1.5-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Bits: int=32, long=64, long long=64, size_t=64, void*=64
CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer precision: 55 ns

Platform of campaign before:
SCM: hg revision=f51921883f50 tag=tip branch=default date="2015-10-04 01:19 -0400"
Python version: 3.6.0a0 (default:f51921883f50, Oct 4 2015, 10:19:37) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
Date: 2015-10-04 10:19:44

Platform of campaign after:
SCM: hg revision=f51921883f50+ tag=tip branch=default date="2015-10-04 01:19 -0400"
Python version: 3.6.0a0 (default:f51921883f50+, Oct 4 2015, 10:14:05) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
Date: 2015-10-04 10:18:55

---------------------+-------------+--------
valid UTF-8 (strict) |      before |   after
---------------------+-------------+--------
100 x 10**1 bytes    |  297 ns (*) |  297 ns
100 x 10**3 bytes    |  7.4 us (*) | 7.44 us
100 x 10**2 bytes    |  929 ns (*) |  924 ns
100 x 10**4 bytes    | 80.4 us (*) | 80.4 us
---------------------+-------------+--------
Total                | 89.1 us (*) |   89 us
---------------------+-------------+--------

------------------+-------------+---------------
ignore            |      before |          after
------------------+-------------+---------------
100 x 10**1 bytes | 6.68 us (*) |  743 ns (-89%)
100 x 10**3 bytes |  561 us (*) | 42.6 us (-92%)
100 x 10**2 bytes | 56.8 us (*) | 4.55 us (-92%)
100 x 10**4 bytes | 6.02 ms (*) |  425 us (-93%)
------------------+-------------+---------------
Total             | 6.65 ms (*) |  473 us (-93%)
------------------+-------------+---------------

------------------+-------------+---------------
replace           |      before |          after
------------------+-------------+---------------
100 x 10**1 bytes | 7.61 us (*) |  890 ns (-88%)
100 x 10**3 bytes |  639 us (*) | 50.3 us (-92%)
100 x 10**2 bytes | 64.8 us (*) | 5.37 us (-92%)
100 x 10**4 bytes | 7.09 ms (*) |  505 us (-93%)
------------------+-------------+---------------
Total             | 7.81 ms (*) |  561 us (-93%)
------------------+-------------+---------------

------------------+-------------+---------------
surrogateescape   |      before |          after
------------------+-------------+---------------
100 x 10**1 bytes | 7.96 us (*) |  855 ns (-89%)
100 x 10**3 bytes |  674 us (*) | 50.2 us (-93%)
100 x 10**2 bytes | 68.8 us (*) | 5.35 us (-92%)
100 x 10**4 bytes | 7.38 ms (*) |  504 us (-93%)
------------------+-------------+---------------
Total             | 8.13 ms (*) |  560 us (-93%)
------------------+-------------+---------------

------------------+-------------+--------
backslashreplace  |      before |   after
------------------+-------------+--------
100 x 10**1 bytes | 7.66 us (*) | 7.89 us
100 x 10**3 bytes |  633 us (*) |  633 us
100 x 10**2 bytes | 64.1 us (*) | 64.6 us
100 x 10**4 bytes |  6.9 ms (*) | 6.93 ms
------------------+-------------+--------
Total             | 7.61 ms (*) | 7.64 ms
------------------+-------------+--------

---------------------+-------------+---------------
Summary              |      before |          after
---------------------+-------------+---------------
valid UTF-8 (strict) | 89.1 us (*) |          89 us
ignore               | 6.65 ms (*) |  473 us (-93%)
replace              | 7.81 ms (*) |  561 us (-93%)
surrogateescape      | 8.13 ms (*) |  560 us (-93%)
backslashreplace     | 7.61 ms (*) |        7.64 ms
---------------------+-------------+---------------
Total                | 30.3 ms (*) | 9.32 ms (-69%)
---------------------+-------------+---------------
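For anyone who wants to reproduce this kind of measurement without the attached bench.py, a minimal sketch with `timeit` could look like the following. It is not the script used for the numbers above: the input layout (one invalid byte every 10 bytes) and the helper names `make_input` and `bench` are assumptions made for illustration only.

```python
import timeit


def make_input(size, bad_every=10):
    """Build `size` bytes of ASCII with one invalid byte (0xFF) every
    `bad_every` bytes. The layout is an assumption, not bench.py's."""
    chunk = b"a" * (bad_every - 1) + b"\xff"
    return (chunk * (size // bad_every + 1))[:size]


def bench(data, errors, repeat=3, number=100):
    """Return the best observed time, in seconds, of one decode call."""
    timer = timeit.Timer(lambda: data.decode("utf-8", errors))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    handlers = ("ignore", "replace", "surrogateescape", "backslashreplace")
    for errors in handlers:
        for size in (10**1, 10**2, 10**3, 10**4):
            seconds = bench(make_input(size), errors)
            print(f"{errors:17} {size:6} bytes  {seconds * 1e6:10.2f} us")
```

Taking the minimum over several repeats reduces noise from other processes; absolute timings will differ from the tables above depending on the machine and the Python version.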

python-dev · 2015-10-05T11:44:03Z

New changeset 3152e4038d97 by Victor Stinner in branch 'default':
Issue bpo-25301: The UTF-8 decoder is now up to 15 times as fast for error
https://hg.python.org/cpython/rev/3152e4038d97
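As a rough cross-check of the percentages in the benchmark tables, the ratios can be recomputed from the reported totals. The figures below are copied from the tables in this issue; the helper names `percent_change` and `speedup` are made up for this sketch.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def speedup(before, after):
    """How many times faster `after` is compared to `before`."""
    return before / after


# Totals (in seconds) taken from the benchmark tables in this issue.
totals = {
    "ignore": (6.65e-3, 473e-6),
    "replace": (7.81e-3, 561e-6),
    "surrogateescape": (8.13e-3, 560e-6),
}

for name, (before, after) in totals.items():
    print(f"{name:16} {percent_change(before, after):6.1f}%  "
          f"{speedup(before, after):5.1f}x")
```

On these totals the speedup comes out at roughly 14x for each optimized handler, which matches the -93% shown in the summary table.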

vstinner · 2015-10-05T11:44:38Z

I pushed my optimization and am closing the issue.

python-dev · 2015-10-05T11:49:36Z

New changeset 5b9ffea7e7c3 by Victor Stinner in branch 'default':
Issue bpo-25301: Fix compatibility with ISO C90
https://hg.python.org/cpython/rev/5b9ffea7e7c3

vstinner added the topic-unicode and performance labels Oct 2, 2015

vstinner closed this as completed Oct 5, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022
