Faster UTF-32 encoding #59232

serhiy-storchaka · 2012-06-07T13:57:31Z

BPO	15027
Nosy	@gpshead, @pitrou, @vstinner, @larryhastings, @ezio-melotti, @asvetlov, @serhiy-storchaka
Files	encode_utf32_2.patch encode_utf32_3.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2015-05-12.20:26:46.786>
created_at = <Date 2012-06-07.13:57:30.888>
labels = ['interpreter-core', 'expert-unicode', 'performance']
title = 'Faster UTF-32 encoding'
updated_at = <Date 2015-05-18.19:22:44.031>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2015-05-18.19:22:44.031>
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2015-05-12.20:26:46.786>
closer = 'serhiy.storchaka'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2012-06-07.13:57:30.888>
creator = 'serhiy.storchaka'
dependencies = []
files = ['27637', '33096']
hgrepos = []
issue_num = 15027
keywords = ['patch', 'needs review']
message_count = 21.0
messages = ['162474', '162823', '173404', '205912', '205934', '205940', '207292', '207294', '207302', '207305', '207306', '207311', '210147', '210148', '242871', '242954', '242981', '243005', '243008', '243523', '243524']
nosy_count = 12.0
nosy_names = ['gregory.p.smith', 'pitrou', 'vstinner', 'larry', 'ezio.melotti', 'Arfrever', 'asvetlov', 'neologix', 'BreamoreBoy', 'python-dev', 'serhiy.storchaka', 'kmike']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue15027'
versions = ['Python 3.5']

serhiy-storchaka · 2012-06-07T13:57:30Z

As a companion to bpo-14625, here is a patch that speeds up UTF-32 encoding several times over. In addition, it fixes an unsafe check for integer overflow.

Here are the benchmark results. The benchmark tools are in the https://bitbucket.org/storchaka/cpython-stuff repository.

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

Py2.7         Py3.2         Py3.3         patched  test

541 (+1032%)  541 (+1032%)  844 (+626%)   6125     encode utf-32le 'A'*10000
543 (+1056%)  541 (+1060%)  844 (+643%)   6275     encode utf-32le '\x80'*10000
544 (+1010%)  542 (+1014%)  843 (+616%)   6037     encode utf-32le '\x80'+'A'*9999
541 (+799%)   542 (+797%)   764 (+537%)   4864     encode utf-32le '\u0100'*10000
544 (+781%)   542 (+784%)   767 (+525%)   4793     encode utf-32le '\u0100'+'A'*9999
544 (+789%)   542 (+792%)   766 (+531%)   4834     encode utf-32le '\u0100'+'\x80'*9999
542 (+799%)   541 (+801%)   764 (+538%)   4874     encode utf-32le '\u8000'*10000
544 (+779%)   542 (+782%)   767 (+523%)   4780     encode utf-32le '\u8000'+'A'*9999
544 (+793%)   542 (+796%)   766 (+534%)   4859     encode utf-32le '\u8000'+'\x80'*9999
544 (+819%)   542 (+823%)   766 (+553%)   5001     encode utf-32le '\u8000'+'\u0100'*9999
430 (+867%)   427 (+874%)   860 (+383%)   4157     encode utf-32le '\U00010000'*10000
543 (+655%)   543 (+655%)   861 (+376%)   4101     encode utf-32le '\U00010000'+'A'*9999
543 (+658%)   543 (+658%)   861 (+378%)   4116     encode utf-32le '\U00010000'+'\x80'*9999
543 (+670%)   543 (+670%)   859 (+387%)   4180     encode utf-32le '\U00010000'+'\u0100'*9999
543 (+666%)   543 (+666%)   860 (+383%)   4158     encode utf-32le '\U00010000'+'\u8000'*9999

541 (+880%)   543 (+876%)   844 (+528%)   5300     encode utf-32be 'A'*10000
541 (+872%)   542 (+870%)   844 (+523%)   5256     encode utf-32be '\x80'*10000
544 (+843%)   542 (+846%)   843 (+509%)   5130     encode utf-32be '\x80'+'A'*9999
541 (+363%)   542 (+362%)   764 (+228%)   2505     encode utf-32be '\u0100'*10000
544 (+366%)   542 (+368%)   766 (+231%)   2534     encode utf-32be '\u0100'+'A'*9999
544 (+363%)   542 (+365%)   766 (+229%)   2519     encode utf-32be '\u0100'+'\x80'*9999
542 (+363%)   541 (+364%)   764 (+228%)   2509     encode utf-32be '\u8000'*10000
544 (+366%)   542 (+368%)   766 (+231%)   2534     encode utf-32be '\u8000'+'A'*9999
544 (+363%)   542 (+364%)   766 (+229%)   2517     encode utf-32be '\u8000'+'\x80'*9999
544 (+372%)   542 (+374%)   766 (+235%)   2568     encode utf-32be '\u8000'+'\u0100'*9999
430 (+428%)   427 (+432%)   860 (+164%)   2270     encode utf-32be '\U00010000'*10000
543 (+317%)   541 (+318%)   861 (+163%)   2262     encode utf-32be '\U00010000'+'A'*9999
543 (+320%)   541 (+321%)   861 (+165%)   2279     encode utf-32be '\U00010000'+'\x80'*9999
543 (+322%)   541 (+323%)   859 (+167%)   2290     encode utf-32be '\U00010000'+'\u0100'*9999
543 (+322%)   541 (+324%)   860 (+167%)   2292     encode utf-32be '\U00010000'+'\u8000'*9999
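
For readers of the tables: the percentage in parentheses appears to be how much faster the patched build is than the version in that column (higher figures are better). A minimal sketch of that arithmetic, assuming plain rounding; `speedup_percent` is a hypothetical helper name and is not part of the benchmark tools:

```python
def speedup_percent(baseline: float, patched: float) -> int:
    """Return how much faster `patched` is than `baseline`, as a rounded percentage."""
    return round((patched / baseline - 1) * 100)


# First utf-32le row above: Py2.7, Py3.2 and Py3.3 figures against the patched one.
for baseline in (541, 541, 844):
    print(f"{baseline} (+{speedup_percent(baseline, 6125)}%)")
```

The published figures are themselves rounded, so a recomputed percentage may differ by one in places.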

serhiy-storchaka · 2012-06-14T20:30:11Z

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

Py2.7         Py3.2         Py3.3         patched  test

214 (+718%)   215 (+714%)   363 (+382%)   1750     encode utf-32le 'A'*10000
214 (+704%)   214 (+704%)   362 (+375%)   1720     encode utf-32le '\x80'*10000
214 (+712%)   215 (+708%)   363 (+379%)   1738     encode utf-32le '\x80'+'A'*9999
214 (+698%)   214 (+698%)   342 (+399%)   1707     encode utf-32le '\u0100'*10000
214 (+688%)   215 (+684%)   343 (+392%)   1686     encode utf-32le '\u0100'+'A'*9999
214 (+699%)   215 (+695%)   342 (+400%)   1710     encode utf-32le '\u0100'+'\x80'*9999
214 (+694%)   214 (+694%)   342 (+397%)   1699     encode utf-32le '\u8000'*10000
214 (+688%)   215 (+685%)   343 (+392%)   1687     encode utf-32le '\u8000'+'A'*9999
214 (+700%)   214 (+700%)   342 (+401%)   1713     encode utf-32le '\u8000'+'\x80'*9999
214 (+682%)   215 (+679%)   342 (+389%)   1674     encode utf-32le '\u8000'+'\u0100'*9999
121 (+2237%)  121 (+2237%)  333 (+749%)   2828     encode utf-32le '\U00010000'*10000
214 (+1108%)  214 (+1108%)  333 (+676%)   2585     encode utf-32le '\U00010000'+'A'*9999
214 (+1112%)  214 (+1112%)  333 (+679%)   2594     encode utf-32le '\U00010000'+'\x80'*9999
214 (+1208%)  214 (+1208%)  333 (+741%)   2799     encode utf-32le '\U00010000'+'\u0100'*9999
214 (+1214%)  215 (+1208%)  333 (+745%)   2813     encode utf-32le '\U00010000'+'\u8000'*9999

214 (+556%)   214 (+556%)   363 (+287%)   1404     encode utf-32be 'A'*10000
214 (+558%)   214 (+558%)   363 (+288%)   1408     encode utf-32be '\x80'*10000
214 (+550%)   214 (+550%)   363 (+283%)   1390     encode utf-32be '\x80'+'A'*9999
214 (+224%)   214 (+224%)   342 (+103%)   693      encode utf-32be '\u0100'*10000
214 (+229%)   214 (+229%)   343 (+105%)   703      encode utf-32be '\u0100'+'A'*9999
214 (+221%)   214 (+221%)   342 (+101%)   688      encode utf-32be '\u0100'+'\x80'*9999
214 (+224%)   214 (+224%)   342 (+103%)   694      encode utf-32be '\u8000'*10000
215 (+227%)   214 (+229%)   343 (+105%)   704      encode utf-32be '\u8000'+'A'*9999
214 (+221%)   214 (+221%)   342 (+101%)   686      encode utf-32be '\u8000'+'\x80'*9999
214 (+222%)   214 (+222%)   341 (+102%)   690      encode utf-32be '\u8000'+'\u0100'*9999
121 (+387%)   121 (+387%)   333 (+77%)    589      encode utf-32be '\U00010000'*10000
214 (+174%)   215 (+173%)   333 (+76%)    587      encode utf-32be '\U00010000'+'A'*9999
214 (+183%)   214 (+183%)   333 (+82%)    606      encode utf-32be '\U00010000'+'\x80'*9999
214 (+184%)   214 (+184%)   333 (+82%)    607      encode utf-32be '\U00010000'+'\u0100'*9999
214 (+183%)   214 (+183%)   333 (+82%)    605      encode utf-32be '\U00010000'+'\u8000'*9999
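
For anyone who wants comparable numbers on their own machine without the benchmark tools, here is a rough sketch using only the standard `timeit` module. It is not the script that produced the tables above; `encode_rate`, its parameters, and the unit (millions of characters per second) are assumptions made for illustration, so the absolute values will not line up with the figures reported here.

```python
import timeit


def encode_rate(text: str, encoding: str, repeat: int = 5, number: int = 200) -> float:
    """Return the best observed rate, in millions of characters encoded per second."""
    # timeit.repeat returns one total time per run; the minimum is the least noisy.
    best = min(timeit.repeat(lambda: text.encode(encoding), repeat=repeat, number=number))
    return len(text) * number / best / 1e6


# Three of the test strings from the tables: ASCII, BMP, and non-BMP text.
samples = {
    "'A'*10000": "A" * 10000,
    "'\\u0100'*10000": "\u0100" * 10000,
    "'\\U00010000'*10000": "\U00010000" * 10000,
}

for encoding in ("utf-32-le", "utf-32-be"):
    for label, text in samples.items():
        print(f"{encode_rate(text, encoding):8.0f}  encode {encoding} {label}")
```

Running the same script under two interpreter builds and comparing the printed rates gives a relative speedup, which is the quantity the tables report.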

serhiy-storchaka · 2012-10-20T19:05:07Z

Patch updated to 3.4.

Is anyone interested in a 7x speedup of the UTF-32 encoder?

BreamoreBoy · 2013-12-11T18:28:29Z

From http://kmike.ru/python-data-structures/, under the heading "DATrie": "Python wrapper uses utf_32_le codec internally; this codec is currently slow and it is the bottleneck for datrie. There is a ticket with a patch in the CPython bug tracker (http://bugs.python.org/issue15027) that should make this codec fast, so there is a hope datrie will become faster with future Pythons."

serhiy-storchaka · 2013-12-11T22:17:01Z

Here is an updated patch, synchronized with trunk. The UTF-32 encoder now checks for surrogates, so the speedup is smaller (only up to 5 times). But this compensates for the regression in 3.4.

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

Py3.3         Py3.4         patched  test

531 (+245%)   489 (+274%)   1831     encode utf-32le 'A'*10000
383 (+158%)   223 (+344%)   990      encode utf-32le '\u0100'*10000
325 (+262%)   229 (+414%)   1177     encode utf-32le '\U00010000'*10000

544 (+166%)   494 (+193%)   1448     encode utf-32be 'A'*10000
384 (+67%)    223 (+188%)   642      encode utf-32be '\u0100'*10000
323 (+108%)   229 (+193%)   671      encode utf-32be '\U00010000'*10000
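
To illustrate the surrogate check mentioned above, here is a small sketch of the behaviour as seen from Python code on 3.4 and later. It only exercises the public `str.encode` API; `encodes_strictly` is a hypothetical helper name, and nothing here reflects how the patch implements the check in C.

```python
def encodes_strictly(text: str, encoding: str) -> bool:
    """Return True if `text` encodes with the default 'strict' error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


lone_surrogate = "\ud800"

# A lone surrogate is rejected by the strict UTF-32 encoder...
print(encodes_strictly(lone_surrogate, "utf-32-le"))
# ...but the 'surrogatepass' handler writes it out as an ordinary 4-byte unit.
print(lone_surrogate.encode("utf-32-le", "surrogatepass"))

# Every code point takes exactly four bytes; only the byte order differs.
print("A\U00010000".encode("utf-32-le"))
print("A\U00010000".encode("utf-32-be"))
```

The encoder therefore has to inspect each code point instead of copying it blindly, which is presumably why the surrogate check costs some of the speedup.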

gpshead · 2013-12-11T23:05:08Z

There is one comment to address in the review; once that is addressed, I believe this is ready to go in for 3.4.

python-dev · 2014-01-04T17:26:00Z

New changeset b72c5573c5e7 by Serhiy Storchaka in branch 'default':
Issue bpo-15027: Rewrite the UTF-32 encoder. It is now 1.6x to 3.5x faster.
http://hg.python.org/cpython/rev/b72c5573c5e7

serhiy-storchaka · 2014-01-04T17:32:35Z

Thank you, Gregory, for your review.

larryhastings · 2014-01-04T18:41:17Z

Isn't this a new feature?

serhiy-storchaka · 2014-01-04T19:59:31Z

Sorry if I missed something. Should I revert changeset b72c5573c5e7?

This patch doesn't introduce new functions and doesn't change behavior. Without this patch the UTF-32 encoder is up to 2.5x slower in 3.4 than in 3.3 (due to bpo-12892).

larryhastings · 2014-01-04T20:10:40Z

Would you describe it as a "bug fix" or a "security fix"? If it's neither of those things, then you need special permission to add it during beta. And given that this patch has the possibility of causing bugs, I'd prefer to not accept it for 3.4.

Please revert it for now. If you think it should go in to 3.4, you may ask on python-dev that it be considered and take a poll. (Note that the poll is not binding on me; this is still solely my decision. However if there was an uproar of support for your patch, that would certainly cause me to reconsider.)

python-dev · 2014-01-04T20:51:12Z

New changeset 1e345924f7ea by Serhiy Storchaka in branch 'default':
Reverted changeset b72c5573c5e7 (issue bpo-15027).
http://hg.python.org/cpython/rev/1e345924f7ea

larryhastings · 2014-02-03T16:02:10Z

BreamoreBoy: why did you remove Arfrever from this issue?

neologix · 2014-02-03T16:28:03Z

> BreamoreBoy: why did you remove Arfrever from this issue?

Nosy list members are sorted in alphabetical order: since Arfrever comes just before BreamoreBoy, I assume his fingers tripped ;-)

BreamoreBoy · 2015-05-10T23:21:31Z

As this appears to be a performance improvement only, can it go into 3.5, or do we wait for 3.x?

serhiy-storchaka · 2015-05-12T10:22:08Z

Can I commit the patch now, Larry?

larryhastings · 2015-05-12T15:44:00Z

We're still in alpha, so it's fine for 3.5 right now. The cutoff for new features for 3.5 will be May 23.

python-dev · 2015-05-12T20:13:09Z

New changeset 80cf7723c4cf by Serhiy Storchaka in branch 'default':
Issue bpo-15027: The UTF-32 encoder is now 3x to 7x faster.
https://hg.python.org/cpython/rev/80cf7723c4cf

serhiy-storchaka · 2015-05-12T20:26:47Z

And that's not all...

Arfrever · 2015-05-18T19:14:25Z

In Objects/stringlib/codecs.h in 2 comments U+DC800 should be changed into U+D800 (from definition of Py_UNICODE_IS_SURROGATE) or U+DC80 (from result of b"\x80".decode(errors="surrogateescape")).
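
Both candidate values can be checked from Python. The sketch below assumes the surrogate range is U+D800 through U+DFFF, which is my reading of Py_UNICODE_IS_SURROGATE; `is_surrogate` is a hypothetical pure-Python stand-in for that macro, not CPython code.

```python
def is_surrogate(ch: str) -> bool:
    """Pure-Python stand-in for the surrogate range check (U+D800..U+DFFF)."""
    return 0xD800 <= ord(ch) <= 0xDFFF


# The undecodable byte 0x80 is mapped to a lone low surrogate.
escaped = b"\x80".decode(errors="surrogateescape")
print(f"U+{ord(escaped):04X}", is_surrogate(escaped))

# U+DC800 (five hex digits) is an ordinary code point well outside that range.
print(is_surrogate(chr(0xDC800)))
```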

serhiy-storchaka · 2015-05-18T19:22:44Z

Thank you, Arfrever. That was an old copy-pasted typo. Fixed in 3d5bf6174c4b and bc6ed8360312.

serhiy-storchaka added the interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-unicode, and performance (Performance or resource usage) labels Jun 7, 2012

serhiy-storchaka self-assigned this Jan 7, 2013

serhiy-storchaka closed this as completed Jan 4, 2014

serhiy-storchaka reopened this Jan 4, 2014

serhiy-storchaka closed this as completed May 12, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022
