classification
Title: Amazingly faster UTF-8 decoding
Type: performance
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, ezio.melotti, haypo, janssen, jcea, loewis, mark.dickinson, ned.deily, pitrou, python-dev, ronaldoussoren, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2012-05-06 18:00 by serhiy.storchaka, last changed 2012-05-12 07:09 by haypo. This issue is now closed.

Files
File name Uploaded Description
decode_utf8_4.patch serhiy.storchaka, 2012-05-06 18:00
decode_utf8_5.patch serhiy.storchaka, 2012-05-06 22:11
Messages (15)
msg160103 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 18:00
I propose a complex patch that significantly speeds up UTF-8 decoding. The decoder is now faster even than the decoder in 3.2 (except in a few unrealistic pathological cases).

The decoder code is also reduced and simplified (formerly the decoding code was repeated in at least three places).

As a side effect, ASCII decoding is now faster on some platforms (issue14419).

Related issues:
[issue4868] Faster utf-8 decoding
[issue13417] faster utf-8 decoding
[issue14419] Faster ascii decoding
[issue14624] Faster utf-16 decoder
[issue14625] Faster utf-32 decoder
[issue14654] Faster utf-8 decoding


Here are the benchmark results (numbers are speeds in MB/s; the percentage in parentheses is how much faster or slower the patched build is than that column).

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

                                          3.2           3.3(vanilla)  patched

utf-8     'A'*10000                       1199 (+69%)   1721 (+18%)   2032
utf-8         'A'*9999+'\x80'             1189 (+25%)   996 (+49%)    1488
utf-8         'A'*9999+'\u0100'           1192 (-25%)   887 (+1%)     894
utf-8         'A'*9999+'\u8000'           1178 (-24%)   888 (+0%)     890
utf-8         'A'*9999+'\U00010000'       1177 (-29%)   872 (-4%)     837
utf-8     '\x80'*10000                    220 (+74%)    172 (+122%)   382
utf-8       '\x80'+'A'*9999               1192 (+5%)    376 (+232%)   1250
utf-8         '\x80'*9999+'\u0100'        220 (+54%)    160 (+112%)   339
utf-8         '\x80'*9999+'\u8000'        220 (+54%)    160 (+112%)   339
utf-8         '\x80'*9999+'\U00010000'    221 (+49%)    176 (+88%)    330
utf-8     '\u0100'*10000                  220 (+74%)    163 (+134%)   382
utf-8       '\u0100'+'A'*9999             1177 (+4%)    382 (+219%)   1220
utf-8       '\u0100'+'\x80'*9999          220 (+74%)    163 (+134%)   382
utf-8         '\u0100'*9999+'\u8000'      220 (+74%)    163 (+134%)   382
utf-8         '\u0100'*9999+'\U00010000'  220 (+50%)    180 (+83%)    330
utf-8     '\u8000'*10000                  261 (+66%)    191 (+126%)   432
utf-8       '\u8000'+'A'*9999             1197 (+1%)    384 (+216%)   1212
utf-8       '\u8000'+'\x80'*9999          216 (+77%)    163 (+134%)   382
utf-8       '\u8000'+'\u0100'*9999        215 (+77%)    164 (+132%)   381
utf-8         '\u8000'*9999+'\U00010000'  261 (+46%)    201 (+89%)    380
utf-8     '\U00010000'*10000              248 (+44%)    198 (+80%)    357
utf-8       '\U00010000'+'A'*9999         1192 (-5%)    383 (+196%)   1135
utf-8       '\U00010000'+'\x80'*9999      220 (+73%)    180 (+111%)   380
utf-8       '\U00010000'+'\u0100'*9999    220 (+73%)    180 (+111%)   380
utf-8       '\U00010000'+'\u8000'*9999    261 (+54%)    201 (+100%)   403

ascii     'A'*10000                       233 (+971%)   1876 (+33%)   2496

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

                                          3.2           3.3(vanilla)  patched

utf-8     'A'*10000                       345 (+81%)    596 (+5%)     623
utf-8         'A'*9999+'\x80'             335 (+41%)    303 (+56%)    474
utf-8         'A'*9999+'\u0100'           336 (-23%)    123 (+110%)   258
utf-8         'A'*9999+'\u8000'           337 (-24%)    123 (+108%)   256
utf-8         'A'*9999+'\U00010000'       336 (-24%)    261 (-3%)     254
utf-8     '\x80'*10000                    88 (+66%)     65 (+125%)    146
utf-8       '\x80'+'A'*9999               334 (+8%)     124 (+190%)   360
utf-8         '\x80'*9999+'\u0100'        88 (+43%)     65 (+94%)     126
utf-8         '\x80'*9999+'\u8000'        88 (+43%)     65 (+94%)     126
utf-8         '\x80'*9999+'\U00010000'    89 (+40%)     65 (+92%)     125
utf-8     '\u0100'*10000                  88 (+85%)     65 (+151%)    163
utf-8       '\u0100'+'A'*9999             336 (+2%)     77 (+345%)    343
utf-8       '\u0100'+'\x80'*9999          88 (+86%)     65 (+152%)    164
utf-8         '\u0100'*9999+'\u8000'      88 (+86%)     65 (+152%)    164
utf-8         '\u0100'*9999+'\U00010000'  88 (+57%)     65 (+112%)    138
utf-8     '\u8000'*10000                  98 (+79%)     69 (+154%)    175
utf-8       '\u8000'+'A'*9999             339 (+3%)     77 (+353%)    349
utf-8       '\u8000'+'\x80'*9999          89 (+84%)     66 (+148%)    164
utf-8       '\u8000'+'\u0100'*9999        88 (+86%)     65 (+152%)    164
utf-8         '\u8000'*9999+'\U00010000'  98 (+58%)     69 (+125%)    155
utf-8     '\U00010000'*10000              104 (+46%)    79 (+92%)     152
utf-8       '\U00010000'+'A'*9999         339 (-5%)     124 (+160%)   323
utf-8       '\U00010000'+'\x80'*9999      88 (+84%)     68 (+138%)    162
utf-8       '\U00010000'+'\u0100'*9999    88 (+83%)     68 (+137%)    161
utf-8       '\U00010000'+'\u8000'*9999    98 (+63%)     72 (+122%)    160

ascii     'A'*10000                       132 (+499%)   758 (+4%)     791
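
For readers checking the tables: the percentages appear to be the patched speed relative to the speed in the same column. The sketch below reproduces the first row of the Athlon table. The helper name `relative_change` is made up for illustration, and rounding to the nearest integer is an assumption (the original benchmark tool may round differently).

```python
def relative_change(patched: float, baseline: float) -> int:
    """Percent change of the patched speed relative to a baseline speed.

    Rounding to the nearest integer is an assumption, not taken from
    the benchmark tool used in this issue.
    """
    return round((patched / baseline - 1) * 100)


# First row of the Athlon table, in MB/s:
# 3.2 = 1199, 3.3 (vanilla) = 1721, patched = 2032
print(relative_change(2032, 1199))  # patched versus 3.2
print(relative_change(2032, 1721))  # patched versus 3.3 (vanilla)
```

A positive value means the patched build is faster than that column; a negative value means it is slower.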
msg160107 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-06 20:01
64-bit Linux, Intel Core i5 2500K:

                                          3.2           3.3           patched

utf-8     'A'*10000                       2550 (+198%)  6828 (+11%)   7607
utf-8         'A'*9999+'\x80'             2501 (+118%)  2415 (+126%)  5456
utf-8         'A'*9999+'\u0100'           2501 (-20%)   2297 (-13%)   1996
utf-8         'A'*9999+'\u8000'           2494 (-14%)   2291 (-7%)    2133
utf-8         'A'*9999+'\U00010000'       2494 (-11%)   2293 (-3%)    2219
utf-8     '\x80'*10000                    422 (+135%)   517 (+92%)    991
utf-8       '\x80'+'A'*9999               2513 (+12%)   860 (+228%)   2820
utf-8         '\x80'*9999+'\u0100'        426 (+102%)   525 (+64%)    862
utf-8         '\x80'*9999+'\u8000'        426 (+104%)   538 (+62%)    871
utf-8         '\x80'*9999+'\U00010000'    428 (+105%)   523 (+68%)    878
utf-8     '\u0100'*10000                  425 (+140%)   517 (+97%)    1019
utf-8       '\u0100'+'A'*9999             2488 (+2%)    820 (+211%)   2549
utf-8       '\u0100'+'\x80'*9999          426 (+139%)   517 (+97%)    1019
utf-8         '\u0100'*9999+'\u8000'      426 (+139%)   529 (+93%)    1019
utf-8         '\u0100'*9999+'\U00010000'  426 (+106%)   509 (+72%)    876
utf-8     '\u8000'*10000                  573 (+28%)    490 (+50%)    733
utf-8       '\u8000'+'A'*9999             2500 (+1%)    822 (+208%)   2528
utf-8       '\u8000'+'\x80'*9999          426 (+139%)   530 (+92%)    1018
utf-8       '\u8000'+'\u0100'*9999        428 (+138%)   509 (+100%)   1018
utf-8         '\u8000'*9999+'\U00010000'  573 (+17%)    447 (+51%)    673
utf-8     '\U00010000'*10000              562 (+24%)    552 (+26%)    696
utf-8       '\U00010000'+'A'*9999         2512 (+3%)    939 (+175%)   2584
utf-8       '\U00010000'+'\x80'*9999      423 (+140%)   553 (+84%)    1017
utf-8       '\U00010000'+'\u0100'*9999    426 (+139%)   549 (+85%)    1017
utf-8       '\U00010000'+'\u8000'*9999    572 (+18%)    479 (+41%)    674
msg160110 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 21:48
Thank you, Antoine. Finally, Intel Core is defeated!

If someone wants to repeat the tests, see the benchmark tools in issue14624.
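
For a quick sanity check without those tools, a rough measurement can be sketched as follows. This is not the benchmark script from issue14624; the function name `decode_speed_mb_s`, the repeat count, and the choice of 1 MB = 10**6 bytes are all assumptions made for illustration.

```python
import time


def decode_speed_mb_s(text: str, encoding: str = "utf-8", repeat: int = 200) -> float:
    """Return a rough decoding throughput in MB/s (1 MB = 10**6 bytes, assumed)."""
    data = text.encode(encoding)
    start = time.perf_counter()
    for _ in range(repeat):
        data.decode(encoding)
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(data) * repeat / elapsed / 1e6


# A few of the test strings used in the tables above.
samples = [
    (r"'A'*10000", "A" * 10000),
    (r"'A'*9999+'\x80'", "A" * 9999 + "\x80"),
    (r"'\u8000'*10000", "\u8000" * 10000),
    (r"'\U00010000'+'A'*9999", "\U00010000" + "A" * 9999),
]

for label, text in samples:
    print("utf-8  %-28s %8.0f MB/s" % (label, decode_speed_mb_s(text)))
```

A single timing loop like this is noisy; taking the best of several runs gives more stable numbers, and absolute figures will not be comparable with the tables in this issue.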
msg160112 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 22:11
The patch is updated in accordance with Antoine's cosmetic comments.
msg160305 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-09 16:50
There's a Mac-specific portion in the patch; it would be nice if someone could check that it works.
msg160306 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 18:05
It would be good if someone checked on a Mac that command-line arguments work, including invalid UTF-8. The difficulty is that you need to check on both Macs with 16-bit and with 32-bit wchar_t.
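
To illustrate what an invalid UTF-8 byte looks like at the Python level, here is a minimal sketch. It assumes the bytes are decoded with the "surrogateescape" error handler, which CPython uses for OS-level data on POSIX platforms; whether the Mac-specific branch of the patch behaves identically is exactly what needs checking. The sample byte string is made up.

```python
# Hypothetical argument bytes: valid UTF-8 for "café", then a stray 0xFF byte.
raw = b"caf\xc3\xa9 \xff"

# A strict decode rejects the stray byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, the invalid byte 0xFF becomes the lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")

print(strict_ok, ascii(text), round_trip == raw)
```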
msg160307 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 18:32
Issue4388 is related to this Mac-specific portion of the patch.
msg160308 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-09 18:41
> It would be good if someone checked on a Mac that command-line
> arguments work, including invalid UTF-8. The difficulty is that you need
> to check on both Macs with 16-bit and with 32-bit wchar_t.

Actually, it should be enough to run the test suite, since we should
have tests for this.
As for different wchar_t widths, that's the kind of thing we can leave
to the buildbots (assuming our OS X buildbots come back alive some
day :-)).
msg160309 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 19:29
I hacked the code (commented out "#if __APPLE__" in
Objects/unicodeobject.c and Modules/python.c) to enable this branch on
Linux and ran the test (test_cmd_line) with the C locale. It passed. Then I
broke the decoder and ran the test again to check that it fails. I can now
confirm that the code works correctly on a platform with a 32-bit wchar_t.
msg160311 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-05-09 20:13
> Actually, it should be enough to run the test suite, since we should
> have tests for this.

I just ran the test suite ("python -m test") on OS X 10.6.8 with 'decode_utf8_5.patch' applied.  (64-bit --with-pydebug build of Python.)  No test failures.


test header:

== CPython 3.3.0a3+ (default:840cb46d0395+, May 9 2012, 20:55:18) [GCC 4.2.1 (Apple Inc. build 5664)]
==   Darwin-10.8.0-i386-64bit little-endian
==   /Users/mdickinson/Python/cpython/build/test_python_39794

Fragment of configure output relevant to wchar looked like this:

checking wchar.h usability... yes
checking wchar.h presence... yes
checking for wchar.h... yes
checking size of wchar_t... 4
checking for UCS-4 tcl... no
checking whether wchar_t is signed... yes
no usable wchar_t found
msg160312 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-05-09 20:18
> The difficulty is that you need to check on both Macs
> with 16-bit and with 32-bit wchar_t.

I don't think that the size of wchar_t is configurable: it should always be 32 bits on Mac OS X.
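
The width can also be checked from Python itself, as a rough cross-check of the "checking size of wchar_t... 4" line in the configure output above. This sketch uses ctypes and reports the C compiler's wchar_t on the running platform; the variable name is illustrative.

```python
import ctypes

# Size of the platform's C wchar_t, in bits (typically 32 on Linux and
# Mac OS X, 16 on Windows).
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8
print("wchar_t is %d bits wide" % wchar_bits)
```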
msg160346 - (view) Author: Roundup Robot (python-dev) Date: 2012-05-10 14:38
New changeset e08c3791f035 by Antoine Pitrou in branch 'default':
Issue #14738: Speed-up UTF-8 decoding on non-ASCII data.  Patch by Serhiy Storchaka.
http://hg.python.org/cpython/rev/e08c3791f035
msg160347 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-10 14:38
The patch is now committed. Well done and thanks for your contribution.
msg160447 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-11 19:45
Thanks to Martin for the review, which allowed me to make a quality patch, and for encouraging further research. Thanks to Antoine for the review, benchmarks, commit, and the original optimization, which served as the basis for my patch.
msg160462 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-05-12 07:09
If the commit makes Python 3.3 faster than Python 3.2, it is an
optimisation that should be documented in the What's New in Python 3.3
document.
History
Date User Action Args
2012-05-12 07:09:09 haypo set messages: + msg160462
2012-05-11 21:58:22 pitrou link issue14419 superseder
2012-05-11 21:58:22 pitrou unlink issue14419 dependencies
2012-05-11 21:58:14 pitrou link issue14419 dependencies
2012-05-11 19:45:44 serhiy.storchaka set messages: + msg160447
2012-05-10 14:38:47 pitrou set status: open -> closed; resolution: fixed; messages: + msg160347; stage: patch review -> resolved
2012-05-10 14:38:11 python-dev set nosy: + python-dev; messages: + msg160346
2012-05-09 20:18:21 haypo set messages: + msg160312
2012-05-09 20:13:57 mark.dickinson set nosy: + mark.dickinson; messages: + msg160311
2012-05-09 19:29:53 serhiy.storchaka set messages: + msg160309
2012-05-09 18:41:36 pitrou set nosy: + janssen
2012-05-09 18:41:16 pitrou set messages: + msg160308
2012-05-09 18:32:09 serhiy.storchaka set messages: + msg160307
2012-05-09 18:05:08 serhiy.storchaka set messages: + msg160306
2012-05-09 16:50:50 pitrou set nosy: + ronaldoussoren, ned.deily; messages: + msg160305
2012-05-06 22:11:07 serhiy.storchaka set files: + decode_utf8_5.patch; messages: + msg160112
2012-05-06 21:48:10 serhiy.storchaka set messages: + msg160110
2012-05-06 20:01:02 pitrou set messages: + msg160107
2012-05-06 18:30:06 ezio.melotti set nosy: + ezio.melotti; components: + Unicode; stage: patch review
2012-05-06 18:00:54 serhiy.storchaka create