Title: Amazingly faster UTF-8 decoding
Type: performance
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.3
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, ezio.melotti, janssen, jcea, loewis, mark.dickinson, ned.deily, pitrou, python-dev, ronaldoussoren, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2012-05-06 18:00 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

File name            Uploaded
decode_utf8_4.patch  serhiy.storchaka, 2012-05-06 18:00
decode_utf8_5.patch  serhiy.storchaka, 2012-05-06 22:11
Messages (15)
msg160103 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 18:00
I propose a complex patch that significantly speeds up UTF-8 decoding. The decoder is now faster even than the decoder in 3.2 (except in a few unrealistic pathological cases).

The decoder code is also reduced and simplified (formerly the decoding code was repeated in at least three places).

As a side effect, ASCII decoding is now faster on some platforms (issue14419).

Related issues:
[issue4868] Faster utf-8 decoding
[issue13417] faster utf-8 decoding
[issue14419] Faster ascii decoding
[issue14624] Faster utf-16 decoder
[issue14625] Faster utf-32 decoder
[issue14654] Faster utf-8 decoding

Here are the benchmark results (numbers are speeds in MB/s; the percentage in parentheses is the change of the patched build relative to that column).

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

                                          3.2           3.3(vanilla)  patched

utf-8     'A'*10000                       1199 (+69%)   1721 (+18%)   2032
utf-8         'A'*9999+'\x80'             1189 (+25%)   996 (+49%)    1488
utf-8         'A'*9999+'\u0100'           1192 (-25%)   887 (+1%)     894
utf-8         'A'*9999+'\u8000'           1178 (-24%)   888 (+0%)     890
utf-8         'A'*9999+'\U00010000'       1177 (-29%)   872 (-4%)     837
utf-8     '\x80'*10000                    220 (+74%)    172 (+122%)   382
utf-8       '\x80'+'A'*9999               1192 (+5%)    376 (+232%)   1250
utf-8         '\x80'*9999+'\u0100'        220 (+54%)    160 (+112%)   339
utf-8         '\x80'*9999+'\u8000'        220 (+54%)    160 (+112%)   339
utf-8         '\x80'*9999+'\U00010000'    221 (+49%)    176 (+88%)    330
utf-8     '\u0100'*10000                  220 (+74%)    163 (+134%)   382
utf-8       '\u0100'+'A'*9999             1177 (+4%)    382 (+219%)   1220
utf-8       '\u0100'+'\x80'*9999          220 (+74%)    163 (+134%)   382
utf-8         '\u0100'*9999+'\u8000'      220 (+74%)    163 (+134%)   382
utf-8         '\u0100'*9999+'\U00010000'  220 (+50%)    180 (+83%)    330
utf-8     '\u8000'*10000                  261 (+66%)    191 (+126%)   432
utf-8       '\u8000'+'A'*9999             1197 (+1%)    384 (+216%)   1212
utf-8       '\u8000'+'\x80'*9999          216 (+77%)    163 (+134%)   382
utf-8       '\u8000'+'\u0100'*9999        215 (+77%)    164 (+132%)   381
utf-8         '\u8000'*9999+'\U00010000'  261 (+46%)    201 (+89%)    380
utf-8     '\U00010000'*10000              248 (+44%)    198 (+80%)    357
utf-8       '\U00010000'+'A'*9999         1192 (-5%)    383 (+196%)   1135
utf-8       '\U00010000'+'\x80'*9999      220 (+73%)    180 (+111%)   380
utf-8       '\U00010000'+'\u0100'*9999    220 (+73%)    180 (+111%)   380
utf-8       '\U00010000'+'\u8000'*9999    261 (+54%)    201 (+100%)   403

ascii     'A'*10000                       233 (+971%)   1876 (+33%)   2496

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

                                          3.2           3.3(vanilla)  patched

utf-8     'A'*10000                       345 (+81%)    596 (+5%)     623
utf-8         'A'*9999+'\x80'             335 (+41%)    303 (+56%)    474
utf-8         'A'*9999+'\u0100'           336 (-23%)    123 (+110%)   258
utf-8         'A'*9999+'\u8000'           337 (-24%)    123 (+108%)   256
utf-8         'A'*9999+'\U00010000'       336 (-24%)    261 (-3%)     254
utf-8     '\x80'*10000                    88 (+66%)     65 (+125%)    146
utf-8       '\x80'+'A'*9999               334 (+8%)     124 (+190%)   360
utf-8         '\x80'*9999+'\u0100'        88 (+43%)     65 (+94%)     126
utf-8         '\x80'*9999+'\u8000'        88 (+43%)     65 (+94%)     126
utf-8         '\x80'*9999+'\U00010000'    89 (+40%)     65 (+92%)     125
utf-8     '\u0100'*10000                  88 (+85%)     65 (+151%)    163
utf-8       '\u0100'+'A'*9999             336 (+2%)     77 (+345%)    343
utf-8       '\u0100'+'\x80'*9999          88 (+86%)     65 (+152%)    164
utf-8         '\u0100'*9999+'\u8000'      88 (+86%)     65 (+152%)    164
utf-8         '\u0100'*9999+'\U00010000'  88 (+57%)     65 (+112%)    138
utf-8     '\u8000'*10000                  98 (+79%)     69 (+154%)    175
utf-8       '\u8000'+'A'*9999             339 (+3%)     77 (+353%)    349
utf-8       '\u8000'+'\x80'*9999          89 (+84%)     66 (+148%)    164
utf-8       '\u8000'+'\u0100'*9999        88 (+86%)     65 (+152%)    164
utf-8         '\u8000'*9999+'\U00010000'  98 (+58%)     69 (+125%)    155
utf-8     '\U00010000'*10000              104 (+46%)    79 (+92%)     152
utf-8       '\U00010000'+'A'*9999         339 (-5%)     124 (+160%)   323
utf-8       '\U00010000'+'\x80'*9999      88 (+84%)     68 (+138%)    162
utf-8       '\U00010000'+'\u0100'*9999    88 (+83%)     68 (+137%)    161
utf-8       '\U00010000'+'\u8000'*9999    98 (+63%)     72 (+122%)    160

ascii     'A'*10000                       132 (+499%)   758 (+4%)     791
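
The percentages in parentheses appear to be the change of the patched build relative to the column they sit in. A minimal sketch of that arithmetic, assuming rounding to the nearest integer (the helper name `speedup_percent` is hypothetical and not part of the benchmark tools):

```python
def speedup_percent(baseline, patched):
    """Percent change of `patched` relative to `baseline`, rounded to an integer."""
    return round((patched / baseline - 1) * 100)


# First row of the Athlon table: utf-8 'A'*10000
vs_32 = speedup_percent(1199, 2032)       # patched compared with 3.2
vs_vanilla = speedup_percent(1721, 2032)  # patched compared with 3.3 (vanilla)
print(vs_32, vs_vanilla)  # prints: 69 18

# A regression shows up as a negative value: utf-8 'A'*9999+'\u0100' compared with 3.2
print(speedup_percent(1192, 894))  # prints: -25
```

A positive value means the patched build is faster than the build in that column; a negative value means it is slower.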
msg160107 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-06 20:01
64-bit Linux, Intel Core i5 2500K:

                                          3.2           3.3           patched

utf-8     'A'*10000                       2550 (+198%)  6828 (+11%)   7607
utf-8         'A'*9999+'\x80'             2501 (+118%)  2415 (+126%)  5456
utf-8         'A'*9999+'\u0100'           2501 (-20%)   2297 (-13%)   1996
utf-8         'A'*9999+'\u8000'           2494 (-14%)   2291 (-7%)    2133
utf-8         'A'*9999+'\U00010000'       2494 (-11%)   2293 (-3%)    2219
utf-8     '\x80'*10000                    422 (+135%)   517 (+92%)    991
utf-8       '\x80'+'A'*9999               2513 (+12%)   860 (+228%)   2820
utf-8         '\x80'*9999+'\u0100'        426 (+102%)   525 (+64%)    862
utf-8         '\x80'*9999+'\u8000'        426 (+104%)   538 (+62%)    871
utf-8         '\x80'*9999+'\U00010000'    428 (+105%)   523 (+68%)    878
utf-8     '\u0100'*10000                  425 (+140%)   517 (+97%)    1019
utf-8       '\u0100'+'A'*9999             2488 (+2%)    820 (+211%)   2549
utf-8       '\u0100'+'\x80'*9999          426 (+139%)   517 (+97%)    1019
utf-8         '\u0100'*9999+'\u8000'      426 (+139%)   529 (+93%)    1019
utf-8         '\u0100'*9999+'\U00010000'  426 (+106%)   509 (+72%)    876
utf-8     '\u8000'*10000                  573 (+28%)    490 (+50%)    733
utf-8       '\u8000'+'A'*9999             2500 (+1%)    822 (+208%)   2528
utf-8       '\u8000'+'\x80'*9999          426 (+139%)   530 (+92%)    1018
utf-8       '\u8000'+'\u0100'*9999        428 (+138%)   509 (+100%)   1018
utf-8         '\u8000'*9999+'\U00010000'  573 (+17%)    447 (+51%)    673
utf-8     '\U00010000'*10000              562 (+24%)    552 (+26%)    696
utf-8       '\U00010000'+'A'*9999         2512 (+3%)    939 (+175%)   2584
utf-8       '\U00010000'+'\x80'*9999      423 (+140%)   553 (+84%)    1017
utf-8       '\U00010000'+'\u0100'*9999    426 (+139%)   549 (+85%)    1017
utf-8       '\U00010000'+'\u8000'*9999    572 (+18%)    479 (+41%)    674
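
For readers wondering why the tables use these five characters: they cover the 1-, 2-, 3- and 4-byte UTF-8 sequence lengths. The sketch below only prints the encoded length of each one. Note that '\x80' and '\u0100' both encode to two bytes; presumably both are included because, in 3.3, the decoded string is stored at a different character width (code points up to U+00FF fit in one byte per character, '\u0100' does not). That reading is an assumption, not something stated in this thread.

```python
# Characters used in the benchmark rows, keyed by how the tables spell them.
samples = {
    "'A'": "A",
    r"'\x80'": "\x80",
    r"'\u0100'": "\u0100",
    r"'\u8000'": "\u8000",
    r"'\U00010000'": "\U00010000",
}

# Number of bytes each character occupies once encoded as UTF-8.
encoded_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

for label, length in encoded_lengths.items():
    print(f"{label:14} U+{ord(samples[label]):06X}  {length} byte(s)")
```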
msg160110 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 21:48
Thank you, Antoine. Finally, Intel Core is defeated!

If someone wants to repeat the tests, see the benchmark tools in issue14624.
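
The actual benchmark tools are attached to issue14624 and are not reproduced here. As a rough stand-in, the following is a minimal sketch of how a decoding speed in MB/s could be measured with the standard `timeit` module. The function name `decode_speed`, the repeat counts, and the choice of 1 MB = 10**6 bytes are all assumptions, so the numbers it prints are not directly comparable with the tables above.

```python
import timeit


def decode_speed(text, encoding="utf-8", repeat=5, number=1000):
    """Return the best observed decoding speed for `text`, in MB/s."""
    data = text.encode(encoding)
    # Take the fastest of several runs to reduce noise from other processes.
    best = min(timeit.repeat(lambda: data.decode(encoding),
                             repeat=repeat, number=number))
    # Assumption: 1 MB = 10**6 bytes.
    return len(data) * number / best / 1e6


# A few of the test strings from the tables above.
cases = [
    ("'A'*10000", "A" * 10000),
    (r"'A'*9999+'\x80'", "A" * 9999 + "\x80"),
    (r"'\x80'*10000", "\x80" * 10000),
    (r"'\u0100'*10000", "\u0100" * 10000),
    (r"'\u8000'*10000", "\u8000" * 10000),
    (r"'\U00010000'*10000", "\U00010000" * 10000),
]

for label, text in cases:
    print(f"utf-8  {label:22} {decode_speed(text):8.0f} MB/s")
```

Running the same script under each interpreter build gives one column of such a table per build.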
msg160112 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 22:11
The patch is updated in accordance with Antoine's cosmetic comments.
msg160305 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-09 16:50
There's a Mac-specific portion in the patch; it would be nice if someone could check that it works.
msg160306 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 18:05
It would be good if someone checked on a Mac that command-line arguments work, including invalid UTF-8. The difficulty is that you need to check on Macs with both 16-bit and 32-bit wchar_t.
msg160307 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 18:32
Issue4388 is related to this Mac-specific portion of the patch.
msg160308 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-09 18:41
> It would be good if someone checked on Macs work with command line
> arguments, including non-valid utf8. The difficulty is that you need
> to check on both Macs with 16-bit and with 32-bit wchar_t.

Actually, it should be enough to run the test suite, since we should
have tests for this.
As for different wchar_t widths, that's the kind of thing we can leave
to the buildbots (assuming our OS X buildbots come back alive some
day :-)).
msg160309 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 19:29
I hacked the code (commented out "#if __APPLE__" in
Objects/unicodeobject.c and Modules/python.c) to exercise this branch on
Linux and ran the test (test_cmd_line) with the C locale. It passed. Then I
broke the decoder and ran the test again to get an error. I can now confirm
that the code works correctly on a platform with a 32-bit wchar_t.
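
For context on what "non-valid utf8" arguments involve: the sketch below shows, at the Python level, how undecodable bytes can survive a UTF-8 decode through the `surrogateescape` error handler (PEP 383). It assumes the Mac-specific C branch follows the same convention; it does not call that branch and is not a substitute for test_cmd_line.

```python
# Bytes as they might arrive on a command line: valid UTF-8 plus one stray 0xFF.
raw = b"caf\xc3\xa9 \xff"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
decoded = raw.decode("utf-8", "surrogateescape")
print(ascii(decoded))  # ascii() avoids printing a lone surrogate directly

# Encoding with the same handler restores the original bytes exactly.
restored = decoded.encode("utf-8", "surrogateescape")
print(restored == raw)
```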
msg160311 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-05-09 20:13
> Actually, it should be enough to run the test suite, since we should
> have tests for this.

I just ran the test suite ("python -m test") on OS X 10.6.8 with 'decode_utf8_5.patch' applied.  (64-bit --with-pydebug build of Python.)  No test failures.

test header:

== CPython 3.3.0a3+ (default:840cb46d0395+, May 9 2012, 20:55:18) [GCC 4.2.1 (Apple Inc. build 5664)]
==   Darwin-10.8.0-i386-64bit little-endian
==   /Users/mdickinson/Python/cpython/build/test_python_39794

Fragment of configure output relevant to wchar looked like this:

checking wchar.h usability... yes
checking wchar.h presence... yes
checking for wchar.h... yes
checking size of wchar_t... 4
checking for UCS-4 tcl... no
checking whether wchar_t is signed... yes
no usable wchar_t found
msg160312 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-09 20:18
> The difficulty is that you need to check on both Macs
> with 16-bit and with 32-bit wchar_t.

I don't think that the size of wchar_t is configurable: it should always be 32 bits on Mac OS X.
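
The wchar_t width on a given machine can also be checked without a rebuild. A minimal sketch using `ctypes`, which reports the size of the platform C compiler's wchar_t (the variable name `wchar_bits` is just for illustration):

```python
import ctypes

# ctypes.c_wchar mirrors the platform's C wchar_t type.
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8
print(f"wchar_t is {wchar_bits} bits wide")
```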
msg160346 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-05-10 14:38
New changeset e08c3791f035 by Antoine Pitrou in branch 'default':
Issue #14738: Speed-up UTF-8 decoding on non-ASCII data.  Patch by Serhiy Storchaka.
msg160347 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-10 14:38
The patch is now committed. Well done and thanks for your contribution.
msg160447 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-11 19:45
Thanks to Martin for the review, which allowed me to make a quality patch, and for encouraging further research. Thanks to Antoine for the review, benchmarks, commit, and for the original optimization, which served as the basis for my patch.
msg160462 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-12 07:09
If the commit makes Python 3.3 faster than Python 3.2, it is an
optimisation that should be documented in the What's New in Python 3.3
document.
Date                 User              Action  Args
2022-04-11 14:57:29  admin             set     github: 58943
2012-05-12 07:09:09  vstinner          set     messages: + msg160462
2012-05-11 21:58:22  pitrou            link    issue14419 superseder
2012-05-11 21:58:22  pitrou            unlink  issue14419 dependencies
2012-05-11 21:58:14  pitrou            link    issue14419 dependencies
2012-05-11 19:45:44  serhiy.storchaka  set     messages: + msg160447
2012-05-10 14:38:47  pitrou            set     status: open -> closed; resolution: fixed; messages: + msg160347; stage: patch review -> resolved
2012-05-10 14:38:11  python-dev        set     nosy: + python-dev; messages: + msg160346
2012-05-09 20:18:21  vstinner          set     messages: + msg160312
2012-05-09 20:13:57  mark.dickinson    set     nosy: + mark.dickinson; messages: + msg160311
2012-05-09 19:29:53  serhiy.storchaka  set     messages: + msg160309
2012-05-09 18:41:36  pitrou            set     nosy: + janssen
2012-05-09 18:41:16  pitrou            set     messages: + msg160308
2012-05-09 18:32:09  serhiy.storchaka  set     messages: + msg160307
2012-05-09 18:05:08  serhiy.storchaka  set     messages: + msg160306
2012-05-09 16:50:50  pitrou            set     nosy: + ronaldoussoren, ned.deily; messages: + msg160305
2012-05-06 22:11:07  serhiy.storchaka  set     files: + decode_utf8_5.patch; messages: + msg160112
2012-05-06 21:48:10  serhiy.storchaka  set     messages: + msg160110
2012-05-06 20:01:02  pitrou            set     messages: + msg160107
2012-05-06 18:30:06  ezio.melotti      set     nosy: + ezio.melotti; components: + Unicode; stage: patch review
2012-05-06 18:00:54  serhiy.storchaka  create