Issue25267
Created on 2015-09-29 11:30 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Files
File name | Uploaded | Description
---|---|---
utf8_encoder_errors.patch | vstinner, 2015-09-29 11:30 |
bench.py | vstinner, 2015-10-01 11:58 |
utf8_encoder_errors-2.patch | vstinner, 2015-10-01 12:01 |
utf8_encoder_errors-3.patch | vstinner, 2015-10-01 12:47 |
Messages (6) | |||
---|---|---|---|
msg251845 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-09-29 11:30 | |
Attached patch optimizes the UTF-8 encoder for error handlers: ignore, replace, surrogateescape, surrogatepass. It is based on the patch faster_surrogates_hadling.patch written by Serhiy Storchaka in issue #24870. It also modifies unicode_encode_ucs1() to use memset() for the replace error handler. This should be faster for long sequences of unencodable characters, but it may be slower for short ones. The patch adds new unit tests and fixes existing unit tests to ensure that the utf-8-sig codec is also well tested. TODO: write a benchmark. See also issue #25227, which optimized the ASCII and latin1 encoders with the surrogateescape error handler. |
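For readers less familiar with these error handlers, here is a small sketch of what each one produces when the UTF-8 encoder meets a lone surrogate. The sample string and the `results` name are illustrative only and do not come from the patch or its unit tests.

```python
# Illustrative input: a lone surrogate (U+DCFF) between two ASCII letters.
# Lone surrogates cannot be encoded to UTF-8 in strict mode.
text = "a\udcffb"

try:
    text.encode("utf-8")  # default errors="strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The handlers discussed in this issue, plus backslashreplace,
# which appears in the benchmark further down.
results = {
    name: text.encode("utf-8", name)
    for name in ("ignore", "replace", "surrogateescape",
                 "surrogatepass", "backslashreplace")
}

for name, encoded in results.items():
    print(f"{name:>16}: {encoded!r}")
```

The handlers differ only in what they emit for the unencodable character: nothing, a `?`, the original byte 0xFF, the raw three-byte surrogate encoding, or a backslash escape.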
|||
msg252021 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-10-01 12:01 | |
Oh, there is a bug in utf8_encoder() (not in my patch!): newpos was not used after calling the error handler. It's now fixed in the new patch. |
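To make the newpos remark concrete: an error handler returns a pair (replacement, newpos), and the encoder is expected to resume at newpos. The sketch below registers a custom handler under the hypothetical name `skip_next` (not part of the patch) that resumes one character past the end of the unencodable run. It assumes an interpreter in which the UTF-8 encoder honours newpos, as described in the message above.

```python
import codecs


def skip_next(exc):
    """Hypothetical handler: drop the bad run and the character after it."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.end is the index just past the unencodable run; resume one
    # character later, without going past the end of the input.
    newpos = min(exc.end + 1, len(exc.object))
    return ("", newpos)


# "skip_next" is an illustrative name chosen for this sketch.
codecs.register_error("skip_next", skip_next)

# The surrogate is at index 1, so the handler asks to resume at index 3.
encoded = "a\udcffbc".encode("utf-8", "skip_next")
print(encoded)
```

If the encoder ignored newpos and simply continued at the end of the bad run, the `b` would still appear in the output.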
|||
msg252022 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-10-01 12:04 | |
Benchmark results. Sorry for the very long output. There are some (corner?) cases where the patched Python is a little bit slower. I consider that it's ok since it's *much* faster in the other cases. What do you think?

Common platform:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Bits: int=32, long=64, long long=64, size_t=64, void*=64
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Python unicode implementation: PEP 393
Timer: time.perf_counter
SCM: hg revision=10efb1797e7b+ tag=tip branch=default date="2015-10-01 13:16 +0200"
Platform: Linux-4.1.6-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes

Platform of campaign before:
Date: 2015-10-01 13:30:07
Timer precision: 61 ns
Python version: 3.6.0a0 (default:10efb1797e7b, Oct 1 2015, 13:30:06) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]

Platform of campaign after:
Timer precision: 63 ns
Date: 2015-10-01 13:54:14
Python version: 3.6.0a0 (default:10efb1797e7b+, Oct 1 2015, 13:53:51) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]

ignore: "\udcff" * length | before | after
---|---|---
length=10**1 | 3.16 us (*) | 279 ns (-91%)
length=10**3 | 241 us (*) | 1.08 us (-100%)
length=10**2 | 23.9 us (*) | 346 ns (-99%)
length=10**4 | 2.39 ms (*) | 6.48 us (-100%)
Total | 2.66 ms (*) | 8.19 us (-100%)

ignore: "a" * length + "\udcff" | before | after
---|---|---
length=10**1 | 1.12 us (*) | 295 ns (-74%)
length=10**3 | 2.2 us (*) | 1.57 us (-29%)
length=10**2 | 1.21 us (*) | 408 ns (-66%)
length=10**4 | 10.4 us (*) | 12.3 us (+18%)
Total | 15 us (*) | 14.6 us

ignore: ("a" * 99 + "\udcff" * 99) * length | before | after
---|---|---
length=10**1 | 238 us (*) | 2.46 us (-99%)
length=10**3 | 23.7 ms (*) | 234 us (-99%)
length=10**2 | 2.38 ms (*) | 20.8 us (-99%)
length=10**4 | 238 ms (*) | 2.56 ms (-99%)
Total | 265 ms (*) | 2.82 ms (-99%)

ignore: ("\udcff" * 99 + "a") * length | before | after
---|---|---
length=10**1 | 239 us (*) | 1.29 us (-99%)
length=10**3 | 23.8 ms (*) | 80.9 us (-100%)
length=10**2 | 2.4 ms (*) | 8.44 us (-100%)
length=10**4 | 236 ms (*) | 839 us (-100%)
Total | 263 ms (*) | 930 us (-100%)

ignore: "\udcff" + "a" * length | before | after
---|---|---
length=10**1 | 1.09 us (*) | 297 ns (-73%)
length=10**3 | 2.19 us (*) | 1.58 us (-28%)
length=10**2 | 1.19 us (*) | 409 ns (-66%)
length=10**4 | 10.5 us (*) | 12.3 us (+17%)
Total | 14.9 us (*) | 14.6 us

replace: "\udcff" * length | before | after
---|---|---
length=10**1 | 3.47 us (*) | 317 ns (-91%)
length=10**3 | 263 us (*) | 1.07 us (-100%)
length=10**2 | 26.4 us (*) | 383 ns (-99%)
length=10**4 | 2.65 ms (*) | 6.75 us (-100%)
Total | 2.94 ms (*) | 8.52 us (-100%)

replace: "a" * length + "\udcff" | before | after
---|---|---
length=10**1 | 1.16 us (*) | 319 ns (-72%)
length=10**3 | 2.25 us (*) | 1.62 us (-28%)
length=10**2 | 1.25 us (*) | 432 ns (-65%)
length=10**4 | 13.4 us (*) | 12.4 us (-7%)
Total | 18 us (*) | 14.7 us (-18%)

replace: ("a" * 99 + "\udcff" * 99) * length | before | after
---|---|---
length=10**1 | 267 us (*) | 2.52 us (-99%)
length=10**3 | 26.2 ms (*) | 210 us (-99%)
length=10**2 | 2.63 ms (*) | 21.3 us (-99%)
length=10**4 | 264 ms (*) | 2.98 ms (-99%)
Total | 293 ms (*) | 3.21 ms (-99%)

replace: ("\udcff" * 99 + "a") * length | before | after
---|---|---
length=10**1 | 263 us (*) | 1.29 us (-100%)
length=10**3 | 26.1 ms (*) | 86.6 us (-100%)
length=10**2 | 2.63 ms (*) | 9.02 us (-100%)
length=10**4 | 261 ms (*) | 925 us (-100%)
Total | 290 ms (*) | 1.02 ms (-100%)

replace: "\udcff" + "a" * length | before | after
---|---|---
length=10**1 | 1.14 us (*) | 317 ns (-72%)
length=10**3 | 2.24 us (*) | 1.6 us (-28%)
length=10**2 | 1.23 us (*) | 428 ns (-65%)
length=10**4 | 10.5 us (*) | 12.3 us (+17%)
Total | 15.1 us (*) | 14.7 us

surrogateescape: "\udcff" * length | before | after
---|---|---
length=10**1 | 3.48 us (*) | 281 ns (-92%)
length=10**3 | 267 us (*) | 1.77 us (-99%)
length=10**2 | 26.7 us (*) | 424 ns (-98%)
length=10**4 | 2.67 ms (*) | 13.9 us (-99%)
Total | 2.97 ms (*) | 16.3 us (-99%)

surrogateescape: "a" * length + "\udcff" | before | after
---|---|---
length=10**1 | 1.14 us (*) | 277 ns (-76%)
length=10**3 | 2.32 us (*) | 1.57 us (-32%)
length=10**2 | 1.24 us (*) | 391 ns (-68%)
length=10**4 | 10.6 us (*) | 12.3 us (+17%)
Total | 15.3 us (*) | 14.6 us

surrogateescape: ("a" * 99 + "\udcff" * 99) * length | before | after
---|---|---
length=10**1 | 266 us (*) | 3.26 us (-99%)
length=10**3 | 26.4 ms (*) | 285 us (-99%)
length=10**2 | 2.65 ms (*) | 28.9 us (-99%)
length=10**4 | 266 ms (*) | 3.73 ms (-99%)
Total | 295 ms (*) | 4.04 ms (-99%)

surrogateescape: ("\udcff" * 99 + "a") * length | before | after
---|---|---
length=10**1 | 265 us (*) | 2.04 us (-99%)
length=10**3 | 26.2 ms (*) | 165 us (-99%)
length=10**2 | 2.64 ms (*) | 17 us (-99%)
length=10**4 | 263 ms (*) | 1.75 ms (-99%)
Total | 292 ms (*) | 1.93 ms (-99%)

surrogateescape: "\udcff" + "a" * length | before | after
---|---|---
length=10**1 | 1.12 us (*) | 278 ns (-75%)
length=10**3 | 2.25 us (*) | 1.59 us (-29%)
length=10**2 | 1.21 us (*) | 389 ns (-68%)
length=10**4 | 10.5 us (*) | 12.3 us (+17%)
Total | 15.1 us (*) | 14.6 us

surrogatepass: "\udcff" * length | before | after
---|---|---
length=10**1 | 3.71 us (*) | 306 ns (-92%)
length=10**3 | 289 us (*) | 2.61 us (-99%)
length=10**2 | 28.9 us (*) | 532 ns (-98%)
length=10**4 | 2.88 ms (*) | 22.4 us (-99%)
Total | 3.2 ms (*) | 25.8 us (-99%)

surrogatepass: "a" * length + "\udcff" | before | after
---|---|---
length=10**1 | 1.16 us (*) | 299 ns (-74%)
length=10**3 | 2.36 us (*) | 1.59 us (-32%)
length=10**2 | 1.27 us (*) | 413 ns (-68%)
length=10**4 | 10.6 us (*) | 12.3 us (+16%)
Total | 15.4 us (*) | 14.6 us (-5%)

surrogatepass: ("a" * 99 + "\udcff" * 99) * length | before | after
---|---|---
length=10**1 | 289 us (*) | 3.99 us (-99%)
length=10**3 | 28.5 ms (*) | 362 us (-99%)
length=10**2 | 2.86 ms (*) | 36.7 us (-99%)
length=10**4 | 287 ms (*) | 5.18 ms (-98%)
Total | 319 ms (*) | 5.59 ms (-98%)

surrogatepass: ("\udcff" * 99 + "a") * length | before | after
---|---|---
length=10**1 | 288 us (*) | 2.91 us (-99%)
length=10**3 | 28.5 ms (*) | 242 us (-99%)
length=10**2 | 2.86 ms (*) | 24.7 us (-99%)
length=10**4 | 284 ms (*) | 2.53 ms (-99%)
Total | 316 ms (*) | 2.8 ms (-99%)

surrogatepass: "\udcff" + "a" * length | before | after
---|---|---
length=10**1 | 1.13 us (*) | 301 ns (-73%)
length=10**3 | 2.3 us (*) | 1.59 us (-31%)
length=10**2 | 1.24 us (*) | 409 ns (-67%)
length=10**4 | 10.6 us (*) | 12.1 us (+15%)
Total | 15.2 us (*) | 14.4 us (-5%)

backslashreplace: "\udcff" * length | before | after
---|---|---
length=10**1 | 4.28 us (*) | 1.58 us (-63%)
length=10**3 | 320 us (*) | 11.1 us (-97%)
length=10**2 | 32.3 us (*) | 2.56 us (-92%)
length=10**4 | 3.17 ms (*) | 96.6 us (-97%)
Total | 3.52 ms (*) | 112 us (-97%)

backslashreplace: "a" * length + "\udcff" | before | after
---|---|---
length=10**1 | 1.44 us (*) | 1.47 us
length=10**3 | 2.43 us (*) | 2.77 us (+14%)
length=10**2 | 1.52 us (*) | 1.64 us (+8%)
length=10**4 | 10.6 us (*) | 13.3 us (+25%)
Total | 16 us (*) | 19.2 us (+20%)

backslashreplace: ("a" * 99 + "\udcff" * 99) * length | before | after
---|---|---
length=10**1 | 316 us (*) | 16 us (-95%)
length=10**3 | 31.3 ms (*) | 1.46 ms (-95%)
length=10**2 | 3.14 ms (*) | 147 us (-95%)
length=10**4 | 313 ms (*) | 15.3 ms (-95%)
Total | 347 ms (*) | 16.9 ms (-95%)

backslashreplace: ("\udcff" * 99 + "a") * length | before | after
---|---|---
length=10**1 | 317 us (*) | 14.7 us (-95%)
length=10**3 | 31.3 ms (*) | 1.34 ms (-96%)
length=10**2 | 3.17 ms (*) | 135 us (-96%)
length=10**4 | 313 ms (*) | 13.8 ms (-96%)
Total | 347 ms (*) | 15.3 ms (-96%)

backslashreplace: "\udcff" + "a" * length | before | after
---|---|---
length=10**1 | 1.43 us (*) | 1.45 us
length=10**3 | 2.36 us (*) | 2.58 us (+9%)
length=10**2 | 1.51 us (*) | 1.57 us
length=10**4 | 10.5 us (*) | 13.2 us (+26%)
Total | 15.8 us (*) | 18.8 us (+19%)

Summary | before | after
---|---|---
ignore: "\udcff" * length | 2.66 ms (*) | 8.19 us (-100%)
ignore: "a" * length + "\udcff" | 15 us (*) | 14.6 us
ignore: ("a" * 99 + "\udcff" * 99) * length | 265 ms (*) | 2.82 ms (-99%)
ignore: ("\udcff" * 99 + "a") * length | 263 ms (*) | 930 us (-100%)
ignore: "\udcff" + "a" * length | 14.9 us (*) | 14.6 us
replace: "\udcff" * length | 2.94 ms (*) | 8.52 us (-100%)
replace: "a" * length + "\udcff" | 18 us (*) | 14.7 us (-18%)
replace: ("a" * 99 + "\udcff" * 99) * length | 293 ms (*) | 3.21 ms (-99%)
replace: ("\udcff" * 99 + "a") * length | 290 ms (*) | 1.02 ms (-100%)
replace: "\udcff" + "a" * length | 15.1 us (*) | 14.7 us
surrogateescape: "\udcff" * length | 2.97 ms (*) | 16.3 us (-99%)
surrogateescape: "a" * length + "\udcff" | 15.3 us (*) | 14.6 us
surrogateescape: ("a" * 99 + "\udcff" * 99) * length | 295 ms (*) | 4.04 ms (-99%)
surrogateescape: ("\udcff" * 99 + "a") * length | 292 ms (*) | 1.93 ms (-99%)
surrogateescape: "\udcff" + "a" * length | 15.1 us (*) | 14.6 us
surrogatepass: "\udcff" * length | 3.2 ms (*) | 25.8 us (-99%)
surrogatepass: "a" * length + "\udcff" | 15.4 us (*) | 14.6 us (-5%)
surrogatepass: ("a" * 99 + "\udcff" * 99) * length | 319 ms (*) | 5.59 ms (-98%)
surrogatepass: ("\udcff" * 99 + "a") * length | 316 ms (*) | 2.8 ms (-99%)
surrogatepass: "\udcff" + "a" * length | 15.2 us (*) | 14.4 us (-5%)
backslashreplace: "\udcff" * length | 3.52 ms (*) | 112 us (-97%)
backslashreplace: "a" * length + "\udcff" | 16 us (*) | 19.2 us (+20%)
backslashreplace: ("a" * 99 + "\udcff" * 99) * length | 347 ms (*) | 16.9 ms (-95%)
backslashreplace: ("\udcff" * 99 + "a") * length | 347 ms (*) | 15.3 ms (-96%)
backslashreplace: "\udcff" + "a" * length | 15.8 us (*) | 18.8 us (+19%)
Total | 3.04 sec (*) | 54.9 ms (-98%) |
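The attached bench.py is not reproduced in this thread. As a rough stand-in, the following sketch times the same five input shapes with the standard `timeit` module. The function names (`make_inputs`, `bench`, `run`), the dictionary labels and the repeat counts are assumptions made for illustration, so absolute numbers will not match the tables above.

```python
import timeit

HANDLERS = ("ignore", "replace", "surrogateescape",
            "surrogatepass", "backslashreplace")


def make_inputs(length):
    """Build the five input shapes that appear in the result tables."""
    bad = "\udcff"  # lone surrogate, unencodable to UTF-8 in strict mode
    return {
        "bad * length": bad * length,
        "a * length + bad": "a" * length + bad,
        "(a * 99 + bad * 99) * length": ("a" * 99 + bad * 99) * length,
        "(bad * 99 + a) * length": (bad * 99 + "a") * length,
        "bad + a * length": bad + "a" * length,
    }


def bench(text, errors, repeat=3, number=20):
    """Return the best observed time, in seconds, for one encode call."""
    timer = timeit.Timer(lambda: text.encode("utf-8", errors))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def run(lengths=(10, 100)):
    """Print one line per (handler, shape, length) combination."""
    for errors in HANDLERS:
        for length in lengths:
            for label, text in make_inputs(length).items():
                micros = bench(text, errors) * 1e6
                print(f"{errors}: {label}, length={length}: {micros:.2f} us")
```

Calling `run()` on two interpreter builds and comparing the output gives a before/after view similar in spirit to the one above. Taking the minimum over several repeats is a common way to reduce timing noise.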
|||
msg252024 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-10-01 12:47 | |
Oh, the default handler for error handlers uses a loop to check for non-ASCII characters. It can be replaced with PyUnicode_IS_ASCII(str), which has O(1) complexity. Done in the new patch. |
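The same idea can be illustrated at the Python level. This is only an analogy to the C change, not the patch itself: `str.isascii()` (available since Python 3.7) plays the role of PyUnicode_IS_ASCII(), while the explicit scan mirrors the loop being replaced. Both helper names are hypothetical.

```python
def is_ascii_scan(text: str) -> bool:
    """O(n) check: look at every character, like the loop being replaced."""
    return all(ord(ch) < 128 for ch in text)


def is_ascii_flag(text: str) -> bool:
    """Ask the string object directly, avoiding a Python-level scan.

    In CPython the answer is recorded when the string is created,
    so no per-character work is needed here.
    """
    return text.isascii()


samples = ["", "abc", "abc\xe9", "a\udcffb"]
for sample in samples:
    # Both approaches must agree; only their cost differs.
    assert is_ascii_scan(sample) == is_ascii_flag(sample)
```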
|||
msg252058 - (view) | Author: Roundup Robot (python-dev) | Date: 2015-10-01 21:20 | |
New changeset 2b5357b38366 by Victor Stinner in branch 'default': Issue #25267: The UTF-8 encoder is now up to 75 times as fast for error https://hg.python.org/cpython/rev/2b5357b38366 |
|||
msg252060 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-10-01 21:27 | |
I pushed my optimization. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:21 | admin | set | github: 69454 |
2015-10-01 21:27:18 | vstinner | set | status: open -> closed resolution: fixed messages: + msg252060 |
2015-10-01 21:20:10 | python-dev | set | nosy: + python-dev messages: + msg252058 |
2015-10-01 12:47:04 | vstinner | set | files: + utf8_encoder_errors-3.patch messages: + msg252024 |
2015-10-01 12:04:37 | vstinner | set | messages: + msg252022 |
2015-10-01 12:01:26 | vstinner | set | files: + utf8_encoder_errors-2.patch messages: + msg252021 |
2015-10-01 11:58:15 | vstinner | set | files: + bench.py |
2015-09-29 11:30:34 | vstinner | create |