This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Even faster UTF-8 decoding
Type: performance Stage: patch review
Components: Interpreter Core, Unicode Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: mark.dickinson Nosy List: Arfrever, ezio.melotti, janssen, jcea, loewis, mark.dickinson, ned.deily, pitrou, python-dev, ronaldoussoren, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2012-05-26 09:11 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
decode_utf8_signed_byte.patch serhiy.storchaka, 2012-05-26 09:11 Two's complement representation trick review
decodebench.py serhiy.storchaka, 2012-05-26 09:13 Benchmark script
bench-diff.py serhiy.storchaka, 2012-05-26 09:14 Benchmark results comparator
decode_utf8_range_check.patch serhiy.storchaka, 2012-05-27 15:49 Simple range check review
decode_utf8_signed_byte-2.patch serhiy.storchaka, 2012-06-22 22:31 review
Messages (17)
msg161652 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-26 09:11
As strange as it may seem, but using a simple trick was made UTF-8 decoding even more speed up.

Here are the benchmark results.

On 32-bit Linux, AMD Athlon 64 X2:

                                          vanilla      patched

utf-8     'A'*10000                       2061 (+3%)   2115
utf-8     '\x80'*10000                    383 (-7%)    355
utf-8       '\x80'+'A'*9999               1273 (+1%)   1290
utf-8     '\u0100'*10000                  382 (+47%)   562
utf-8       '\u0100'+'A'*9999             1239 (+1%)   1253
utf-8       '\u0100'+'\x80'*9999          383 (+47%)   562
utf-8     '\u8000'*10000                  434 (-6%)    409
utf-8       '\u8000'+'A'*9999             1245 (+1%)   1256
utf-8       '\u8000'+'\x80'*9999          382 (+47%)   560
utf-8       '\u8000'+'\u0100'*9999        383 (+44%)   553
utf-8     '\U00010000'*10000              358 (+4%)    373
utf-8       '\U00010000'+'A'*9999         1171 (+0%)   1176
utf-8       '\U00010000'+'\x80'*9999      381 (+44%)   548
utf-8       '\U00010000'+'\u0100'*9999    381 (+44%)   548
utf-8       '\U00010000'+'\u8000'*9999    404 (+0%)    406

On 32-bit Linux, Intel Atom N570:

                                          vanilla      patched

utf-8     'A'*10000                       623 (+0%)    626
utf-8     '\x80'*10000                    145 (+15%)   167
utf-8       '\x80'+'A'*9999               354 (+2%)    362
utf-8     '\u0100'*10000                  164 (+10%)   181
utf-8       '\u0100'+'A'*9999             343 (-0%)    342
utf-8       '\u0100'+'\x80'*9999          164 (+11%)   182
utf-8     '\u8000'*10000                  175 (+5%)    183
utf-8       '\u8000'+'A'*9999             349 (+0%)    349
utf-8       '\u8000'+'\x80'*9999          164 (+11%)   182
utf-8       '\u8000'+'\u0100'*9999        164 (+10%)   181
utf-8     '\U00010000'*10000              152 (+11%)   168
utf-8       '\U00010000'+'A'*9999         313 (+0%)    313
utf-8       '\U00010000'+'\x80'*9999      161 (+11%)   179
utf-8       '\U00010000'+'\u0100'*9999    161 (+11%)   179
utf-8       '\U00010000'+'\u8000'*9999    160 (+11%)   177
msg161653 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-26 09:19
I see a slight increase under 64-bit Linux with gcc 4.5.2, too:

                                          vanilla       patched

utf-8     'A'*10000                       7857 (+4%)	8210
utf-8         'A'*9999+'\x80'             5392 (+8%)	5843
utf-8         'A'*9999+'\u0100'           2119 (+3%)	2173
utf-8         'A'*9999+'\u8000'           2121 (+2%)	2172
utf-8         'A'*9999+'\U00010000'       2248 (+2%)	2293
utf-8     '\x80'*10000                    1015 (+1%)	1021
utf-8       '\x80'+'A'*9999               2747 (+5%)	2877
utf-8         '\x80'*9999+'\u0100'        868 (+0%)	869
utf-8         '\x80'*9999+'\u8000'        857 (+2%)	870
utf-8         '\x80'*9999+'\U00010000'    877 (+0%)	881
utf-8     '\u0100'*10000                  1016 (+16%)	1181
utf-8       '\u0100'+'A'*9999             2506 (+3%)	2592
utf-8       '\u0100'+'\x80'*9999          1015 (+16%)	1179
utf-8         '\u0100'*9999+'\u8000'      1015 (+16%)	1182
utf-8         '\u0100'*9999+'\U00010000'  875 (+13%)	992
utf-8     '\u8000'*10000                  836 (+18%)	985
utf-8       '\u8000'+'A'*9999             2508 (+3%)	2588
utf-8       '\u8000'+'\x80'*9999          1015 (+16%)	1182
utf-8       '\u8000'+'\u0100'*9999        1014 (+17%)	1182
utf-8         '\u8000'*9999+'\U00010000'  767 (+17%)	894
utf-8     '\U00010000'*10000              730 (+0%)	732
utf-8       '\U00010000'+'A'*9999         2542 (+2%)	2599
utf-8       '\U00010000'+'\x80'*9999      1013 (+17%)	1182
utf-8       '\U00010000'+'\u0100'*9999    1013 (+17%)	1181
utf-8       '\U00010000'+'\u8000'*9999    727 (+0%)	728
msg161654 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-26 09:20
It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok?
msg161656 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-26 09:52
> It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok?

Yes, the patch depends on two facts -- 8-bit bytes and a two's
complement representation of integers. That's why I call it a trick.
However, today CPython will not work on other platforms. However, we can
wrap macro definition in #if/#else/#end and provide the traditional form
(but I don't remember how to test a two's complement representation in
compile time).
msg161657 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-26 10:02
The C standard says, in 6.3.1.3/3

Otherwise [*], the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.

[*]: the value cannot be exactly converted, and the target type is not unsigned.

We shouldn't be using unsigned->signed conversions where the source value is out of range for the signed type.
msg161711 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-27 15:49
Yes, this is an implementation-dependent behavior (and on the supported platforms it is implemented well in a certain way).

However, if the continuation byte check to do the simplest way ((ch) >= 0x80 && (ch) < 0xC0), this has the same effect (speed up to +45%) on AMD Athlon.

                                          vanilla      patched

utf-8     'A'*10000                       2061 (-2%)   2018
utf-8     '\x80'*10000                    383 (+9%)    416
utf-8       '\x80'+'A'*9999               1273 (+3%)   1315
utf-8     '\u0100'*10000                  382 (+46%)   558
utf-8       '\u0100'+'A'*9999             1239 (+0%)   1245
utf-8       '\u0100'+'\x80'*9999          383 (+46%)   558
utf-8     '\u8000'*10000                  434 (-6%)    408
utf-8       '\u8000'+'A'*9999             1245 (+0%)   1245
utf-8       '\u8000'+'\x80'*9999          382 (+46%)   556
utf-8       '\u8000'+'\u0100'*9999        383 (+45%)   556
utf-8     '\U00010000'*10000              358 (+0%)    359
utf-8       '\U00010000'+'A'*9999         1171 (-0%)   1170
utf-8       '\U00010000'+'\x80'*9999      381 (+30%)   495
utf-8       '\U00010000'+'\u0100'*9999    381 (+30%)   495
utf-8       '\U00010000'+'\u8000'*9999    404 (-5%)    385

On Intel Atom the results did not change or become a little better.

                                          vanilla      patched

utf-8     'A'*10000                       623 (+3%)    642
utf-8     '\x80'*10000                    145 (+9%)    158
utf-8       '\x80'+'A'*9999               354 (+4%)    367
utf-8     '\u0100'*10000                  164 (+0%)    164
utf-8       '\u0100'+'A'*9999             343 (+2%)    351
utf-8       '\u0100'+'\x80'*9999          164 (+1%)    165
utf-8     '\u8000'*10000                  175 (-2%)    171
utf-8       '\u8000'+'A'*9999             349 (+3%)    359
utf-8       '\u8000'+'\x80'*9999          164 (+0%)    164
utf-8       '\u8000'+'\u0100'*9999        164 (+0%)    164
utf-8     '\U00010000'*10000              152 (-1%)    150
utf-8       '\U00010000'+'A'*9999         313 (+2%)    319
utf-8       '\U00010000'+'\x80'*9999      161 (+1%)    162
utf-8       '\U00010000'+'\u0100'*9999    161 (+1%)    162
utf-8       '\U00010000'+'\u8000'*9999    160 (-2%)    156
msg161715 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-27 18:37
> However, if the continuation byte check to do the simplest way ((ch) >= 
> 0x80 && (ch) < 0xC0), this has the same effect (speed up to +45%) on 
> AMD Athlon.

Doesn't produce any significant speedup on Intel Core i5-2500.
msg161759 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-05-28 06:42
> It seems the patch relies on a two's complement representation of
> integers. Mark, do you think that's ok?

(1) Relying on two's complement integers seems fine to me:  we're already relying on it in other places in Python (e.g., bitwise operations for ints in Python 2.x); it seems unlikely Python's going to run into current or future hardware that uses anything else; and any special-case code for ones' complement or sign-magnitude is going to be essentially unused and awkward to test, so it's probably better not to have it in the codebase at all.

In an ideal world, I guess we'd add some configure-time tests for two's complement so that in the unlikely event that Python *does* meet a non two's complement platform the build fails early with a clear message rather than the tests failing in strange ways.

(2) The bit that Martin identifies: relying on conversion from unsigned to signed doing a wraparound modulo 2**<suitable n> is a bit more troubling;  it's something that I try to avoid where possible, but that's not always easy.  I don't recall ever having encountered this causing problems in practice, though---it feels like a leftover from a non-two's complement world where processors would have a hard time giving wraparound semantics, so the C standard didn't want to require it.  Again, if we're going to rely on this, it would probably make sense to have some configure-time checks; and it would be better not to rely on it at all without a really good reason.
msg163501 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-22 22:31
Here is a patch that uses some sort of autodetection.
msg163583 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 11:30
Any chance to commit the patch before final feature freeze?
msg163586 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-23 11:36
> Any chance to commit the patch before final feature freeze?

I'll defer to Mark :-)
msg163589 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-06-23 11:43
Okay, will look at this this afternoon.
msg163607 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-06-23 13:33
I'm happy to apply the 'decode_utf8_range_check.patch';  I'll do that unless there are objections.  The code is clearer than the original, and if we get a speedup into the bargain then I don't see a reason not to apply this.

I'm less comfortable with either the original patch, or the most recent one (decode_utf8_signed_byte-2.patch).
msg163611 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-06-23 13:51
Serhiy, does this patch also fix #8271?
If so, can you also include the tests I wrote for it?
msg163672 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-06-23 20:45
New changeset 3214c9ebcf5e by Mark Dickinson in branch 'default':
Issue #14923: Optimize continuation-byte check in UTF-8 decoding.  Patch by Serhiy Storchaka.
http://hg.python.org/cpython/rev/3214c9ebcf5e
msg163673 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-06-23 20:47
Patch applied.  Closing.

Ezio:  the patch is pure optimization, with no change in semantics;  I don't see how it could fix #8271.
msg163675 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 21:10
> Serhiy, does this patch also fix #8271?

No, this patch not change behavior. But updated patch for issue 8271 now
contains this patch (I hope this will help merge).

> If so, can you also include the tests I wrote for it?

Your tests included in patch for issue 8271.
History
Date User Action Args
2022-04-11 14:57:30adminsetgithub: 59128
2012-06-23 21:10:43serhiy.storchakasetmessages: + msg163675
2012-06-23 20:47:20mark.dickinsonsetstatus: open -> closed
resolution: fixed
messages: + msg163673
2012-06-23 20:45:26python-devsetmessages: + msg163672
2012-06-23 13:51:00ezio.melottisetmessages: + msg163611
2012-06-23 13:33:24mark.dickinsonsetmessages: + msg163607
2012-06-23 11:43:32mark.dickinsonsetassignee: mark.dickinson
messages: + msg163589
2012-06-23 11:36:11pitrousetmessages: + msg163586
2012-06-23 11:30:38serhiy.storchakasetmessages: + msg163583
2012-06-22 22:31:57serhiy.storchakasetfiles: + decode_utf8_signed_byte-2.patch

messages: + msg163501
2012-05-28 06:42:50mark.dickinsonsetmessages: + msg161759
2012-05-27 18:37:04pitrousetmessages: + msg161715
2012-05-27 15:49:48serhiy.storchakasetfiles: + decode_utf8_range_check.patch

messages: + msg161711
2012-05-26 10:02:09loewissetmessages: + msg161657
2012-05-26 09:52:09serhiy.storchakasetmessages: + msg161656
2012-05-26 09:20:52pitrousetstage: commit review -> patch review
2012-05-26 09:20:38pitrousetmessages: + msg161654
stage: commit review
2012-05-26 09:19:46pitrousetmessages: + msg161653
2012-05-26 09:14:53serhiy.storchakasetfiles: + bench-diff.py
2012-05-26 09:13:51serhiy.storchakasetfiles: + decodebench.py
2012-05-26 09:11:07serhiy.storchakacreate