
classification
Title: Optimize UTF-8 encoder with error handlers
Type: performance Stage:
Components: Unicode Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, methane, python-dev, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2015-09-29 11:30 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
utf8_encoder_errors.patch vstinner, 2015-09-29 11:30 review
bench.py vstinner, 2015-10-01 11:58
utf8_encoder_errors-2.patch vstinner, 2015-10-01 12:01 review
utf8_encoder_errors-3.patch vstinner, 2015-10-01 12:47 review
Messages (6)
msg251845 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-09-29 11:30
The attached patch optimizes the UTF-8 encoder for these error handlers: ignore, replace, surrogateescape, and surrogatepass. It is based on the patch faster_surrogates_hadling.patch written by Serhiy Storchaka in issue #24870.
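For readers less familiar with these handlers, here is a small illustration of what each one produces when the UTF-8 encoder meets a lone surrogate. The sample string and the variable names are assumptions made for this example; they are not taken from the patch.

```python
# Illustration only: how each error handler treats a lone surrogate
# when encoding to UTF-8. The sample text is an assumption, not from the patch.
text = "a\udcffb"

results = {
    name: text.encode("utf-8", name)
    for name in ("ignore", "replace", "surrogateescape",
                 "surrogatepass", "backslashreplace")
}

for name, data in results.items():
    print(f"{name:17} {data!r}")

# The default handler ("strict") refuses lone surrogates.
strict_error = None
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc
    print(f"{'strict':17} {exc.reason}")
```

Each of these handlers has to be invoked once per run of unencodable characters, which is the path the patch speeds up.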

It also modifies unicode_encode_ucs1() to use memset() for the replace error handler. This should be faster for long sequences of unencodable characters, but it may be slower for short ones.
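The idea behind the memset() change can be modelled in pure Python as follows. This is only a rough sketch: encode_latin1_replace is a hypothetical helper written for this illustration, and the real code is C inside unicode_encode_ucs1(), which is not reproduced here.

```python
# Hypothetical pure-Python model of the idea; the real code is C.
def encode_latin1_replace(text: str) -> bytes:
    out = bytearray()
    i, size = 0, len(text)
    while i < size:
        code = ord(text[i])
        if code < 256:
            # Encodable in latin1: copy the byte as-is.
            out.append(code)
            i += 1
            continue
        # Find the end of the run of unencodable characters.
        end = i
        while end < size and ord(text[end]) >= 256:
            end += 1
        # One bulk fill for the whole run, analogous to memset(p, '?', n).
        out += b"?" * (end - i)
        i = end
    return bytes(out)


sample = "ab" + "\u20ac" * 5 + "\xe9" + "\u0100"
print(encode_latin1_replace(sample))
```

A single bulk fill has a fixed setup cost, which is consistent with the expectation that long runs benefit while very short runs may not.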

The patch adds new unit tests and fixes existing unit tests to ensure that the utf-8-sig codec is also well tested.

TODO: write a benchmark.

See also issue #25227, which optimized the ASCII and latin1 encoders with the surrogateescape error handler.
msg252021 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-01 12:01
Oh, there is a bug in utf8_encoder() (not in my patch!): newpos was not used after calling the error handler. It's now fixed in the new patch.
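To show why newpos matters: an error handler returns a (replacement, position) pair, and the encoder is expected to resume at that position. The sketch below registers a made-up handler (the name "example.skipnext" and the function skip_next are hypothetical, chosen for this illustration) that asks the encoder to resume one character past the end of the error.

```python
import codecs


# Hypothetical handler, registered under a made-up name for this example.
def skip_next(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Replace the bad span with "?" and resume one character *after*
    # exc.end, so the character following the error is skipped too.
    return ("?", min(exc.end + 1, len(exc.object)))


codecs.register_error("example.skipnext", skip_next)

encoded = "a\udcffbc".encode("utf-8", "example.skipnext")
print(encoded)
```

An encoder that honours newpos drops the "b" after the surrogate; one that ignores it would carry on from the end of the error and keep the "b".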
msg252022 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-01 12:04
Benchmark results. Sorry for the very long output.

There are some (corner?) cases where the patched Python is a little bit slower. I consider that it's ok since it's *much* faster in the other cases.

What do you think?
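The attached bench.py is not reproduced here. As a rough sketch of how a comparable measurement could be taken with the standard timeit module, assuming the string shapes named in the table titles below (time_encode and build_cases are hypothetical helper names, not part of bench.py):

```python
import timeit


# Hypothetical helper; bench.py from this issue is not reproduced here.
def time_encode(text, errors, number=1000):
    """Return the best average time in seconds for one encode call."""
    timer = timeit.Timer(lambda: text.encode("utf-8", errors))
    return min(timer.repeat(repeat=3, number=number)) / number


def build_cases(length):
    """Build the string shapes used as table titles in the results."""
    return {
        '"\\udcff" * length': "\udcff" * length,
        '"a" * length + "\\udcff"': "a" * length + "\udcff",
        '("a" * 99 + "\\udcff" * 99) * length':
            ("a" * 99 + "\udcff" * 99) * length,
        '("\\udcff" * 99 + "a") * length': ("\udcff" * 99 + "a") * length,
        '"\\udcff" + "a" * length': "\udcff" + "a" * length,
    }


if __name__ == "__main__":
    for errors in ("ignore", "replace", "surrogateescape",
                   "surrogatepass", "backslashreplace"):
        for title, text in build_cases(10**2).items():
            seconds = time_encode(text, errors, number=200)
            print(f"{errors}: {title}: {seconds * 1e6:.2f} us")
```

Absolute numbers from such a script depend on the machine and build flags, so only a before/after comparison on the same platform is meaningful.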


Common platform:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Bits: int=32, long=64, long long=64, size_t=64, void*=64
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Python unicode implementation: PEP 393
Timer: time.perf_counter
SCM: hg revision=10efb1797e7b+ tag=tip branch=default date="2015-10-01 13:16 +0200"
Platform: Linux-4.1.6-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes

Platform of campaign before:
Date: 2015-10-01 13:30:07
Timer precision: 61 ns
Python version: 3.6.0a0 (default:10efb1797e7b, Oct 1 2015, 13:30:06) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]

Platform of campaign after:
Timer precision: 63 ns
Date: 2015-10-01 13:54:14
Python version: 3.6.0a0 (default:10efb1797e7b+, Oct 1 2015, 13:53:51) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]

--------------------------+-------------+----------------
ignore: "\udcff" * length |      before |           after
--------------------------+-------------+----------------
length=10**1              | 3.16 us (*) |   279 ns (-91%)
length=10**3              |  241 us (*) | 1.08 us (-100%)
length=10**2              | 23.9 us (*) |   346 ns (-99%)
length=10**4              | 2.39 ms (*) | 6.48 us (-100%)
--------------------------+-------------+----------------
Total                     | 2.66 ms (*) | 8.19 us (-100%)
--------------------------+-------------+----------------

--------------------------------+-------------+---------------
ignore: "a" * length + "\udcff" |      before |          after
--------------------------------+-------------+---------------
length=10**1                    | 1.12 us (*) |  295 ns (-74%)
length=10**3                    |  2.2 us (*) | 1.57 us (-29%)
length=10**2                    | 1.21 us (*) |  408 ns (-66%)
length=10**4                    | 10.4 us (*) | 12.3 us (+18%)
--------------------------------+-------------+---------------
Total                           |   15 us (*) |        14.6 us
--------------------------------+-------------+---------------

--------------------------------------------+-------------+---------------
ignore: ("a" * 99 + "\udcff" * 99) * length |      before |          after
--------------------------------------------+-------------+---------------
length=10**1                                |  238 us (*) | 2.46 us (-99%)
length=10**3                                | 23.7 ms (*) |  234 us (-99%)
length=10**2                                | 2.38 ms (*) | 20.8 us (-99%)
length=10**4                                |  238 ms (*) | 2.56 ms (-99%)
--------------------------------------------+-------------+---------------
Total                                       |  265 ms (*) | 2.82 ms (-99%)
--------------------------------------------+-------------+---------------

---------------------------------------+-------------+----------------
ignore: ("\udcff" * 99 + "a") * length |      before |           after
---------------------------------------+-------------+----------------
length=10**1                           |  239 us (*) |  1.29 us (-99%)
length=10**3                           | 23.8 ms (*) | 80.9 us (-100%)
length=10**2                           |  2.4 ms (*) | 8.44 us (-100%)
length=10**4                           |  236 ms (*) |  839 us (-100%)
---------------------------------------+-------------+----------------
Total                                  |  263 ms (*) |  930 us (-100%)
---------------------------------------+-------------+----------------

--------------------------------+-------------+---------------
ignore: "\udcff" + "a" * length |      before |          after
--------------------------------+-------------+---------------
length=10**1                    | 1.09 us (*) |  297 ns (-73%)
length=10**3                    | 2.19 us (*) | 1.58 us (-28%)
length=10**2                    | 1.19 us (*) |  409 ns (-66%)
length=10**4                    | 10.5 us (*) | 12.3 us (+17%)
--------------------------------+-------------+---------------
Total                           | 14.9 us (*) |        14.6 us
--------------------------------+-------------+---------------

---------------------------+-------------+----------------
replace: "\udcff" * length |      before |           after
---------------------------+-------------+----------------
length=10**1               | 3.47 us (*) |   317 ns (-91%)
length=10**3               |  263 us (*) | 1.07 us (-100%)
length=10**2               | 26.4 us (*) |   383 ns (-99%)
length=10**4               | 2.65 ms (*) | 6.75 us (-100%)
---------------------------+-------------+----------------
Total                      | 2.94 ms (*) | 8.52 us (-100%)
---------------------------+-------------+----------------

---------------------------------+-------------+---------------
replace: "a" * length + "\udcff" |      before |          after
---------------------------------+-------------+---------------
length=10**1                     | 1.16 us (*) |  319 ns (-72%)
length=10**3                     | 2.25 us (*) | 1.62 us (-28%)
length=10**2                     | 1.25 us (*) |  432 ns (-65%)
length=10**4                     | 13.4 us (*) |  12.4 us (-7%)
---------------------------------+-------------+---------------
Total                            |   18 us (*) | 14.7 us (-18%)
---------------------------------+-------------+---------------

---------------------------------------------+-------------+---------------
replace: ("a" * 99 + "\udcff" * 99) * length |      before |          after
---------------------------------------------+-------------+---------------
length=10**1                                 |  267 us (*) | 2.52 us (-99%)
length=10**3                                 | 26.2 ms (*) |  210 us (-99%)
length=10**2                                 | 2.63 ms (*) | 21.3 us (-99%)
length=10**4                                 |  264 ms (*) | 2.98 ms (-99%)
---------------------------------------------+-------------+---------------
Total                                        |  293 ms (*) | 3.21 ms (-99%)
---------------------------------------------+-------------+---------------

----------------------------------------+-------------+----------------
replace: ("\udcff" * 99 + "a") * length |      before |           after
----------------------------------------+-------------+----------------
length=10**1                            |  263 us (*) | 1.29 us (-100%)
length=10**3                            | 26.1 ms (*) | 86.6 us (-100%)
length=10**2                            | 2.63 ms (*) | 9.02 us (-100%)
length=10**4                            |  261 ms (*) |  925 us (-100%)
----------------------------------------+-------------+----------------
Total                                   |  290 ms (*) | 1.02 ms (-100%)
----------------------------------------+-------------+----------------

---------------------------------+-------------+---------------
replace: "\udcff" + "a" * length |      before |          after
---------------------------------+-------------+---------------
length=10**1                     | 1.14 us (*) |  317 ns (-72%)
length=10**3                     | 2.24 us (*) |  1.6 us (-28%)
length=10**2                     | 1.23 us (*) |  428 ns (-65%)
length=10**4                     | 10.5 us (*) | 12.3 us (+17%)
---------------------------------+-------------+---------------
Total                            | 15.1 us (*) |        14.7 us
---------------------------------+-------------+---------------

-----------------------------------+-------------+---------------
surrogateescape: "\udcff" * length |      before |          after
-----------------------------------+-------------+---------------
length=10**1                       | 3.48 us (*) |  281 ns (-92%)
length=10**3                       |  267 us (*) | 1.77 us (-99%)
length=10**2                       | 26.7 us (*) |  424 ns (-98%)
length=10**4                       | 2.67 ms (*) | 13.9 us (-99%)
-----------------------------------+-------------+---------------
Total                              | 2.97 ms (*) | 16.3 us (-99%)
-----------------------------------+-------------+---------------

-----------------------------------------+-------------+---------------
surrogateescape: "a" * length + "\udcff" |      before |          after
-----------------------------------------+-------------+---------------
length=10**1                             | 1.14 us (*) |  277 ns (-76%)
length=10**3                             | 2.32 us (*) | 1.57 us (-32%)
length=10**2                             | 1.24 us (*) |  391 ns (-68%)
length=10**4                             | 10.6 us (*) | 12.3 us (+17%)
-----------------------------------------+-------------+---------------
Total                                    | 15.3 us (*) |        14.6 us
-----------------------------------------+-------------+---------------

-----------------------------------------------------+-------------+---------------
surrogateescape: ("a" * 99 + "\udcff" * 99) * length |      before |          after
-----------------------------------------------------+-------------+---------------
length=10**1                                         |  266 us (*) | 3.26 us (-99%)
length=10**3                                         | 26.4 ms (*) |  285 us (-99%)
length=10**2                                         | 2.65 ms (*) | 28.9 us (-99%)
length=10**4                                         |  266 ms (*) | 3.73 ms (-99%)
-----------------------------------------------------+-------------+---------------
Total                                                |  295 ms (*) | 4.04 ms (-99%)
-----------------------------------------------------+-------------+---------------

------------------------------------------------+-------------+---------------
surrogateescape: ("\udcff" * 99 + "a") * length |      before |          after
------------------------------------------------+-------------+---------------
length=10**1                                    |  265 us (*) | 2.04 us (-99%)
length=10**3                                    | 26.2 ms (*) |  165 us (-99%)
length=10**2                                    | 2.64 ms (*) |   17 us (-99%)
length=10**4                                    |  263 ms (*) | 1.75 ms (-99%)
------------------------------------------------+-------------+---------------
Total                                           |  292 ms (*) | 1.93 ms (-99%)
------------------------------------------------+-------------+---------------

-----------------------------------------+-------------+---------------
surrogateescape: "\udcff" + "a" * length |      before |          after
-----------------------------------------+-------------+---------------
length=10**1                             | 1.12 us (*) |  278 ns (-75%)
length=10**3                             | 2.25 us (*) | 1.59 us (-29%)
length=10**2                             | 1.21 us (*) |  389 ns (-68%)
length=10**4                             | 10.5 us (*) | 12.3 us (+17%)
-----------------------------------------+-------------+---------------
Total                                    | 15.1 us (*) |        14.6 us
-----------------------------------------+-------------+---------------

---------------------------------+-------------+---------------
surrogatepass: "\udcff" * length |      before |          after
---------------------------------+-------------+---------------
length=10**1                     | 3.71 us (*) |  306 ns (-92%)
length=10**3                     |  289 us (*) | 2.61 us (-99%)
length=10**2                     | 28.9 us (*) |  532 ns (-98%)
length=10**4                     | 2.88 ms (*) | 22.4 us (-99%)
---------------------------------+-------------+---------------
Total                            |  3.2 ms (*) | 25.8 us (-99%)
---------------------------------+-------------+---------------

---------------------------------------+-------------+---------------
surrogatepass: "a" * length + "\udcff" |      before |          after
---------------------------------------+-------------+---------------
length=10**1                           | 1.16 us (*) |  299 ns (-74%)
length=10**3                           | 2.36 us (*) | 1.59 us (-32%)
length=10**2                           | 1.27 us (*) |  413 ns (-68%)
length=10**4                           | 10.6 us (*) | 12.3 us (+16%)
---------------------------------------+-------------+---------------
Total                                  | 15.4 us (*) |  14.6 us (-5%)
---------------------------------------+-------------+---------------

---------------------------------------------------+-------------+---------------
surrogatepass: ("a" * 99 + "\udcff" * 99) * length |      before |          after
---------------------------------------------------+-------------+---------------
length=10**1                                       |  289 us (*) | 3.99 us (-99%)
length=10**3                                       | 28.5 ms (*) |  362 us (-99%)
length=10**2                                       | 2.86 ms (*) | 36.7 us (-99%)
length=10**4                                       |  287 ms (*) | 5.18 ms (-98%)
---------------------------------------------------+-------------+---------------
Total                                              |  319 ms (*) | 5.59 ms (-98%)
---------------------------------------------------+-------------+---------------

----------------------------------------------+-------------+---------------
surrogatepass: ("\udcff" * 99 + "a") * length |      before |          after
----------------------------------------------+-------------+---------------
length=10**1                                  |  288 us (*) | 2.91 us (-99%)
length=10**3                                  | 28.5 ms (*) |  242 us (-99%)
length=10**2                                  | 2.86 ms (*) | 24.7 us (-99%)
length=10**4                                  |  284 ms (*) | 2.53 ms (-99%)
----------------------------------------------+-------------+---------------
Total                                         |  316 ms (*) |  2.8 ms (-99%)
----------------------------------------------+-------------+---------------

---------------------------------------+-------------+---------------
surrogatepass: "\udcff" + "a" * length |      before |          after
---------------------------------------+-------------+---------------
length=10**1                           | 1.13 us (*) |  301 ns (-73%)
length=10**3                           |  2.3 us (*) | 1.59 us (-31%)
length=10**2                           | 1.24 us (*) |  409 ns (-67%)
length=10**4                           | 10.6 us (*) | 12.1 us (+15%)
---------------------------------------+-------------+---------------
Total                                  | 15.2 us (*) |  14.4 us (-5%)
---------------------------------------+-------------+---------------

------------------------------------+-------------+---------------
backslashreplace: "\udcff" * length |      before |          after
------------------------------------+-------------+---------------
length=10**1                        | 4.28 us (*) | 1.58 us (-63%)
length=10**3                        |  320 us (*) | 11.1 us (-97%)
length=10**2                        | 32.3 us (*) | 2.56 us (-92%)
length=10**4                        | 3.17 ms (*) | 96.6 us (-97%)
------------------------------------+-------------+---------------
Total                               | 3.52 ms (*) |  112 us (-97%)
------------------------------------+-------------+---------------

------------------------------------------+-------------+---------------
backslashreplace: "a" * length + "\udcff" |      before |          after
------------------------------------------+-------------+---------------
length=10**1                              | 1.44 us (*) |        1.47 us
length=10**3                              | 2.43 us (*) | 2.77 us (+14%)
length=10**2                              | 1.52 us (*) |  1.64 us (+8%)
length=10**4                              | 10.6 us (*) | 13.3 us (+25%)
------------------------------------------+-------------+---------------
Total                                     |   16 us (*) | 19.2 us (+20%)
------------------------------------------+-------------+---------------

------------------------------------------------------+-------------+---------------
backslashreplace: ("a" * 99 + "\udcff" * 99) * length |      before |          after
------------------------------------------------------+-------------+---------------
length=10**1                                          |  316 us (*) |   16 us (-95%)
length=10**3                                          | 31.3 ms (*) | 1.46 ms (-95%)
length=10**2                                          | 3.14 ms (*) |  147 us (-95%)
length=10**4                                          |  313 ms (*) | 15.3 ms (-95%)
------------------------------------------------------+-------------+---------------
Total                                                 |  347 ms (*) | 16.9 ms (-95%)
------------------------------------------------------+-------------+---------------

-------------------------------------------------+-------------+---------------
backslashreplace: ("\udcff" * 99 + "a") * length |      before |          after
-------------------------------------------------+-------------+---------------
length=10**1                                     |  317 us (*) | 14.7 us (-95%)
length=10**3                                     | 31.3 ms (*) | 1.34 ms (-96%)
length=10**2                                     | 3.17 ms (*) |  135 us (-96%)
length=10**4                                     |  313 ms (*) | 13.8 ms (-96%)
-------------------------------------------------+-------------+---------------
Total                                            |  347 ms (*) | 15.3 ms (-96%)
-------------------------------------------------+-------------+---------------

------------------------------------------+-------------+---------------
backslashreplace: "\udcff" + "a" * length |      before |          after
------------------------------------------+-------------+---------------
length=10**1                              | 1.43 us (*) |        1.45 us
length=10**3                              | 2.36 us (*) |  2.58 us (+9%)
length=10**2                              | 1.51 us (*) |        1.57 us
length=10**4                              | 10.5 us (*) | 13.2 us (+26%)
------------------------------------------+-------------+---------------
Total                                     | 15.8 us (*) | 18.8 us (+19%)
------------------------------------------+-------------+---------------

------------------------------------------------------+--------------+----------------
Summary                                               |       before |           after
------------------------------------------------------+--------------+----------------
ignore: "\udcff" * length                             |  2.66 ms (*) | 8.19 us (-100%)
ignore: "a" * length + "\udcff"                       |    15 us (*) |         14.6 us
ignore: ("a" * 99 + "\udcff" * 99) * length           |   265 ms (*) |  2.82 ms (-99%)
ignore: ("\udcff" * 99 + "a") * length                |   263 ms (*) |  930 us (-100%)
ignore: "\udcff" + "a" * length                       |  14.9 us (*) |         14.6 us
replace: "\udcff" * length                            |  2.94 ms (*) | 8.52 us (-100%)
replace: "a" * length + "\udcff"                      |    18 us (*) |  14.7 us (-18%)
replace: ("a" * 99 + "\udcff" * 99) * length          |   293 ms (*) |  3.21 ms (-99%)
replace: ("\udcff" * 99 + "a") * length               |   290 ms (*) | 1.02 ms (-100%)
replace: "\udcff" + "a" * length                      |  15.1 us (*) |         14.7 us
surrogateescape: "\udcff" * length                    |  2.97 ms (*) |  16.3 us (-99%)
surrogateescape: "a" * length + "\udcff"              |  15.3 us (*) |         14.6 us
surrogateescape: ("a" * 99 + "\udcff" * 99) * length  |   295 ms (*) |  4.04 ms (-99%)
surrogateescape: ("\udcff" * 99 + "a") * length       |   292 ms (*) |  1.93 ms (-99%)
surrogateescape: "\udcff" + "a" * length              |  15.1 us (*) |         14.6 us
surrogatepass: "\udcff" * length                      |   3.2 ms (*) |  25.8 us (-99%)
surrogatepass: "a" * length + "\udcff"                |  15.4 us (*) |   14.6 us (-5%)
surrogatepass: ("a" * 99 + "\udcff" * 99) * length    |   319 ms (*) |  5.59 ms (-98%)
surrogatepass: ("\udcff" * 99 + "a") * length         |   316 ms (*) |   2.8 ms (-99%)
surrogatepass: "\udcff" + "a" * length                |  15.2 us (*) |   14.4 us (-5%)
backslashreplace: "\udcff" * length                   |  3.52 ms (*) |   112 us (-97%)
backslashreplace: "a" * length + "\udcff"             |    16 us (*) |  19.2 us (+20%)
backslashreplace: ("a" * 99 + "\udcff" * 99) * length |   347 ms (*) |  16.9 ms (-95%)
backslashreplace: ("\udcff" * 99 + "a") * length      |   347 ms (*) |  15.3 ms (-96%)
backslashreplace: "\udcff" + "a" * length             |  15.8 us (*) |  18.8 us (+19%)
------------------------------------------------------+--------------+----------------
Total                                                 | 3.04 sec (*) |  54.9 ms (-98%)
------------------------------------------------------+--------------+----------------
msg252024 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-01 12:47
Oh, the default handler for error handlers uses a loop to check for non-ASCII characters. It can be replaced with PyUnicode_IS_ASCII(str), which has O(1) complexity. Done in the new patch.
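The difference between the two checks can be illustrated at the Python level. This is an analogy rather than the patched C code: it assumes Python 3.7 or later, where str.isascii() exists and, in CPython, answers from the flag stored on the string object instead of scanning it. The function names are made up for this example.

```python
def is_ascii_scan(text: str) -> bool:
    # O(n): inspect every character, like the loop being replaced.
    return all(ord(ch) < 128 for ch in text)


def is_ascii_flag(text: str) -> bool:
    # In CPython this does not walk the string; it reads a flag kept
    # on the str object (assumption: Python 3.7+, CPython).
    return text.isascii()


for sample in ("", "abc", "a" * 1000, "caf\xe9", "a\udcff"):
    print(repr(sample[:10]), is_ascii_scan(sample), is_ascii_flag(sample))
```

Both functions return the same answer; only the cost differs as the string grows.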
msg252058 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-10-01 21:20
New changeset 2b5357b38366 by Victor Stinner in branch 'default':
Issue #25267: The UTF-8 encoder is now up to 75 times as fast for error
https://hg.python.org/cpython/rev/2b5357b38366
msg252060 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-01 21:27
I pushed my optimization.
History
Date | User | Action | Args
2022-04-11 14:58:21 | admin | set | github: 69454
2015-10-01 21:27:18 | vstinner | set | status: open -> closed; resolution: fixed; messages: + msg252060
2015-10-01 21:20:10 | python-dev | set | nosy: + python-dev; messages: + msg252058
2015-10-01 12:47:04 | vstinner | set | files: + utf8_encoder_errors-3.patch; messages: + msg252024
2015-10-01 12:04:37 | vstinner | set | messages: + msg252022
2015-10-01 12:01:26 | vstinner | set | files: + utf8_encoder_errors-2.patch; messages: + msg252021
2015-10-01 11:58:15 | vstinner | set | files: + bench.py
2015-09-29 11:30:34 | vstinner | create |