Issue 14923: Even faster UTF-8 decoding

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/59128

classification

Title:	Even faster UTF-8 decoding
Type:	performance	Stage:	patch review
Components:	Interpreter Core, Unicode	Versions:	Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	mark.dickinson	Nosy List:	Arfrever, ezio.melotti, janssen, jcea, loewis, mark.dickinson, ned.deily, pitrou, python-dev, ronaldoussoren, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2012-05-26 09:11 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
decode_utf8_signed_byte.patch	serhiy.storchaka, 2012-05-26 09:11	Two's complement representation trick	review
decodebench.py	serhiy.storchaka, 2012-05-26 09:13	Benchmark script
bench-diff.py	serhiy.storchaka, 2012-05-26 09:14	Benchmark results comparator
decode_utf8_range_check.patch	serhiy.storchaka, 2012-05-27 15:49	Simple range check	review
decode_utf8_signed_byte-2.patch	serhiy.storchaka, 2012-06-22 22:31		review

Messages (17)
msg161652 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-26 09:11
As strange as it may seem, but using a simple trick was made UTF-8 decoding even more speed up. Here are the benchmark results. On 32-bit Linux, AMD Athlon 64 X2: vanilla patched utf-8 'A'10000 2061 (+3%) 2115 utf-8 '\x80'10000 383 (-7%) 355 utf-8 '\x80'+'A'9999 1273 (+1%) 1290 utf-8 '\u0100'10000 382 (+47%) 562 utf-8 '\u0100'+'A'9999 1239 (+1%) 1253 utf-8 '\u0100'+'\x80'9999 383 (+47%) 562 utf-8 '\u8000'10000 434 (-6%) 409 utf-8 '\u8000'+'A'9999 1245 (+1%) 1256 utf-8 '\u8000'+'\x80'9999 382 (+47%) 560 utf-8 '\u8000'+'\u0100'9999 383 (+44%) 553 utf-8 '\U00010000'10000 358 (+4%) 373 utf-8 '\U00010000'+'A'9999 1171 (+0%) 1176 utf-8 '\U00010000'+'\x80'9999 381 (+44%) 548 utf-8 '\U00010000'+'\u0100'9999 381 (+44%) 548 utf-8 '\U00010000'+'\u8000'9999 404 (+0%) 406 On 32-bit Linux, Intel Atom N570: vanilla patched utf-8 'A'10000 623 (+0%) 626 utf-8 '\x80'10000 145 (+15%) 167 utf-8 '\x80'+'A'9999 354 (+2%) 362 utf-8 '\u0100'10000 164 (+10%) 181 utf-8 '\u0100'+'A'9999 343 (-0%) 342 utf-8 '\u0100'+'\x80'9999 164 (+11%) 182 utf-8 '\u8000'10000 175 (+5%) 183 utf-8 '\u8000'+'A'9999 349 (+0%) 349 utf-8 '\u8000'+'\x80'9999 164 (+11%) 182 utf-8 '\u8000'+'\u0100'9999 164 (+10%) 181 utf-8 '\U00010000'10000 152 (+11%) 168 utf-8 '\U00010000'+'A'9999 313 (+0%) 313 utf-8 '\U00010000'+'\x80'9999 161 (+11%) 179 utf-8 '\U00010000'+'\u0100'9999 161 (+11%) 179 utf-8 '\U00010000'+'\u8000'9999 160 (+11%) 177
msg161653 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-26 09:19
I see a slight increase under 64-bit Linux with gcc 4.5.2, too: vanilla patched utf-8 'A'10000 7857 (+4%) 8210 utf-8 'A'9999+'\x80' 5392 (+8%) 5843 utf-8 'A'9999+'\u0100' 2119 (+3%) 2173 utf-8 'A'9999+'\u8000' 2121 (+2%) 2172 utf-8 'A'9999+'\U00010000' 2248 (+2%) 2293 utf-8 '\x80'10000 1015 (+1%) 1021 utf-8 '\x80'+'A'9999 2747 (+5%) 2877 utf-8 '\x80'9999+'\u0100' 868 (+0%) 869 utf-8 '\x80'9999+'\u8000' 857 (+2%) 870 utf-8 '\x80'9999+'\U00010000' 877 (+0%) 881 utf-8 '\u0100'10000 1016 (+16%) 1181 utf-8 '\u0100'+'A'9999 2506 (+3%) 2592 utf-8 '\u0100'+'\x80'9999 1015 (+16%) 1179 utf-8 '\u0100'9999+'\u8000' 1015 (+16%) 1182 utf-8 '\u0100'9999+'\U00010000' 875 (+13%) 992 utf-8 '\u8000'10000 836 (+18%) 985 utf-8 '\u8000'+'A'9999 2508 (+3%) 2588 utf-8 '\u8000'+'\x80'9999 1015 (+16%) 1182 utf-8 '\u8000'+'\u0100'9999 1014 (+17%) 1182 utf-8 '\u8000'9999+'\U00010000' 767 (+17%) 894 utf-8 '\U00010000'10000 730 (+0%) 732 utf-8 '\U00010000'+'A'9999 2542 (+2%) 2599 utf-8 '\U00010000'+'\x80'9999 1013 (+17%) 1182 utf-8 '\U00010000'+'\u0100'9999 1013 (+17%) 1181 utf-8 '\U00010000'+'\u8000'*9999 727 (+0%) 728
msg161654 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-26 09:20
It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok?
msg161656 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-26 09:52
> It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok? Yes, the patch depends on two facts -- 8-bit bytes and a two's complement representation of integers. That's why I call it a trick. However, today CPython will not work on other platforms. However, we can wrap macro definition in #if/#else/#end and provide the traditional form (but I don't remember how to test a two's complement representation in compile time).
msg161657 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-05-26 10:02
The C standard says, in 6.3.1.3/3 Otherwise [], the new type is signed and the value cannot be represented in it; either the result is implementation-deﬁned or an implementation-deﬁned signal is raised. []: the value cannot be exactly converted, and the target type is not unsigned. We shouldn't be using unsigned->signed conversions where the source value is out of range for the signed type.
msg161711 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-27 15:49
Yes, this is an implementation-dependent behavior (and on the supported platforms it is implemented well in a certain way). However, if the continuation byte check to do the simplest way ((ch) >= 0x80 && (ch) < 0xC0), this has the same effect (speed up to +45%) on AMD Athlon. vanilla patched utf-8 'A'10000 2061 (-2%) 2018 utf-8 '\x80'10000 383 (+9%) 416 utf-8 '\x80'+'A'9999 1273 (+3%) 1315 utf-8 '\u0100'10000 382 (+46%) 558 utf-8 '\u0100'+'A'9999 1239 (+0%) 1245 utf-8 '\u0100'+'\x80'9999 383 (+46%) 558 utf-8 '\u8000'10000 434 (-6%) 408 utf-8 '\u8000'+'A'9999 1245 (+0%) 1245 utf-8 '\u8000'+'\x80'9999 382 (+46%) 556 utf-8 '\u8000'+'\u0100'9999 383 (+45%) 556 utf-8 '\U00010000'10000 358 (+0%) 359 utf-8 '\U00010000'+'A'9999 1171 (-0%) 1170 utf-8 '\U00010000'+'\x80'9999 381 (+30%) 495 utf-8 '\U00010000'+'\u0100'9999 381 (+30%) 495 utf-8 '\U00010000'+'\u8000'9999 404 (-5%) 385 On Intel Atom the results did not change or become a little better. vanilla patched utf-8 'A'10000 623 (+3%) 642 utf-8 '\x80'10000 145 (+9%) 158 utf-8 '\x80'+'A'9999 354 (+4%) 367 utf-8 '\u0100'10000 164 (+0%) 164 utf-8 '\u0100'+'A'9999 343 (+2%) 351 utf-8 '\u0100'+'\x80'9999 164 (+1%) 165 utf-8 '\u8000'10000 175 (-2%) 171 utf-8 '\u8000'+'A'9999 349 (+3%) 359 utf-8 '\u8000'+'\x80'9999 164 (+0%) 164 utf-8 '\u8000'+'\u0100'9999 164 (+0%) 164 utf-8 '\U00010000'10000 152 (-1%) 150 utf-8 '\U00010000'+'A'9999 313 (+2%) 319 utf-8 '\U00010000'+'\x80'9999 161 (+1%) 162 utf-8 '\U00010000'+'\u0100'9999 161 (+1%) 162 utf-8 '\U00010000'+'\u8000'9999 160 (-2%) 156
msg161715 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-27 18:37
> However, if the continuation byte check to do the simplest way ((ch) >= > 0x80 && (ch) < 0xC0), this has the same effect (speed up to +45%) on > AMD Athlon. Doesn't produce any significant speedup on Intel Core i5-2500.
msg161759 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2012-05-28 06:42
> It seems the patch relies on a two's complement representation of > integers. Mark, do you think that's ok? (1) Relying on two's complement integers seems fine to me: we're already relying on it in other places in Python (e.g., bitwise operations for ints in Python 2.x); it seems unlikely Python's going to run into current or future hardware that uses anything else; and any special-case code for ones' complement or sign-magnitude is going to be essentially unused and awkward to test, so it's probably better not to have it in the codebase at all. In an ideal world, I guess we'd add some configure-time tests for two's complement so that in the unlikely event that Python does meet a non two's complement platform the build fails early with a clear message rather than the tests failing in strange ways. (2) The bit that Martin identifies: relying on conversion from unsigned to signed doing a wraparound modulo 2**<suitable n> is a bit more troubling; it's something that I try to avoid where possible, but that's not always easy. I don't recall ever having encountered this causing problems in practice, though---it feels like a leftover from a non-two's complement world where processors would have a hard time giving wraparound semantics, so the C standard didn't want to require it. Again, if we're going to rely on this, it would probably make sense to have some configure-time checks; and it would be better not to rely on it at all without a really good reason.
msg163501 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-06-22 22:31
Here is a patch that uses some sort of autodetection.
msg163583 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-06-23 11:30
Any chance to commit the patch before final feature freeze?
msg163586 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-06-23 11:36
> Any chance to commit the patch before final feature freeze? I'll defer to Mark :-)
msg163589 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2012-06-23 11:43
Okay, will look at this this afternoon.
msg163607 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2012-06-23 13:33
I'm happy to apply the 'decode_utf8_range_check.patch'; I'll do that unless there are objections. The code is clearer than the original, and if we get a speedup into the bargain then I don't see a reason not to apply this. I'm less comfortable with either the original patch, or the most recent one (decode_utf8_signed_byte-2.patch).
msg163611 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-06-23 13:51
Serhiy, does this patch also fix #8271? If so, can you also include the tests I wrote for it?
msg163672 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-06-23 20:45
New changeset 3214c9ebcf5e by Mark Dickinson in branch 'default': Issue #14923: Optimize continuation-byte check in UTF-8 decoding. Patch by Serhiy Storchaka. http://hg.python.org/cpython/rev/3214c9ebcf5e
msg163673 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2012-06-23 20:47
Patch applied. Closing. Ezio: the patch is pure optimization, with no change in semantics; I don't see how it could fix #8271.
msg163675 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-06-23 21:10
> Serhiy, does this patch also fix #8271? No, this patch not change behavior. But updated patch for issue 8271 now contains this patch (I hope this will help merge). > If so, can you also include the tests I wrote for it? Your tests included in patch for issue 8271.

History
Date	User	Action	Args
2022-04-11 14:57:30	admin	set	github: 59128
2012-06-23 21:10:43	serhiy.storchaka	set	messages: + msg163675
2012-06-23 20:47:20	mark.dickinson	set	status: open -> closed resolution: fixed messages: + msg163673
2012-06-23 20:45:26	python-dev	set	messages: + msg163672
2012-06-23 13:51:00	ezio.melotti	set	messages: + msg163611
2012-06-23 13:33:24	mark.dickinson	set	messages: + msg163607
2012-06-23 11:43:32	mark.dickinson	set	assignee: mark.dickinson messages: + msg163589
2012-06-23 11:36:11	pitrou	set	messages: + msg163586
2012-06-23 11:30:38	serhiy.storchaka	set	messages: + msg163583
2012-06-22 22:31:57	serhiy.storchaka	set	files: + decode_utf8_signed_byte-2.patch messages: + msg163501
2012-05-28 06:42:50	mark.dickinson	set	messages: + msg161759
2012-05-27 18:37:04	pitrou	set	messages: + msg161715
2012-05-27 15:49:48	serhiy.storchaka	set	files: + decode_utf8_range_check.patch messages: + msg161711
2012-05-26 10:02:09	loewis	set	messages: + msg161657
2012-05-26 09:52:09	serhiy.storchaka	set	messages: + msg161656
2012-05-26 09:20:52	pitrou	set	stage: commit review -> patch review
2012-05-26 09:20:38	pitrou	set	messages: + msg161654 stage: commit review
2012-05-26 09:19:46	pitrou	set	messages: + msg161653
2012-05-26 09:14:53	serhiy.storchaka	set	files: + bench-diff.py
2012-05-26 09:13:51	serhiy.storchaka	set	files: + decodebench.py
2012-05-26 09:11:07	serhiy.storchaka	create