Faster UTF-32 encoding #59232

Closed
serhiy-storchaka opened this issue Jun 7, 2012 · 21 comments

@serhiy-storchaka
Member

BPO 15027
Nosy @gpshead, @pitrou, @vstinner, @larryhastings, @ezio-melotti, @asvetlov, @serhiy-storchaka
Files
  • encode_utf32_2.patch
  • encode_utf32_3.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-05-12.20:26:46.786>
    created_at = <Date 2012-06-07.13:57:30.888>
    labels = ['interpreter-core', 'expert-unicode', 'performance']
    title = 'Faster UTF-32 encoding'
    updated_at = <Date 2015-05-18.19:22:44.031>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2015-05-18.19:22:44.031>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-05-12.20:26:46.786>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2012-06-07.13:57:30.888>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['27637', '33096']
    hgrepos = []
    issue_num = 15027
    keywords = ['patch', 'needs review']
    message_count = 21.0
    messages = ['162474', '162823', '173404', '205912', '205934', '205940', '207292', '207294', '207302', '207305', '207306', '207311', '210147', '210148', '242871', '242954', '242981', '243005', '243008', '243523', '243524']
    nosy_count = 12.0
    nosy_names = ['gregory.p.smith', 'pitrou', 'vstinner', 'larry', 'ezio.melotti', 'Arfrever', 'asvetlov', 'neologix', 'BreamoreBoy', 'python-dev', 'serhiy.storchaka', 'kmike']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue15027'
    versions = ['Python 3.5']

    @serhiy-storchaka
    Member Author

    As a companion to bpo-14625, here is a patch that speeds up UTF-32 encoding several times over. In addition, it fixes an unsafe integer overflow check.
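
The issue does not show the overflow check itself. Purely as an illustration, the sketch below mirrors in Python the kind of size computation a UTF-32 encoder needs (4 bytes per code point, plus an optional BOM) and tests the bound by division before multiplying, which is the usual way to keep such a check safe in C. The function name `utf32_output_size` is hypothetical, and `sys.maxsize` stands in for C's `PY_SSIZE_T_MAX`; this is not the code from the patch.

```python
import sys

def utf32_output_size(n_chars, with_bom=False):
    """Return the byte size of a UTF-32 encoding of n_chars code points.

    Hypothetical helper, not CPython code: it checks the bound *before*
    multiplying, so the check itself could not overflow in fixed-width C
    arithmetic (Python ints never overflow; this only mirrors the pattern).
    """
    bom = 1 if with_bom else 0
    # Safe form: compare against MAX // 4 instead of computing 4 * n first.
    if n_chars > sys.maxsize // 4 - bom:
        raise OverflowError("string is too long to encode as UTF-32")
    return 4 * (n_chars + bom)

print(utf32_output_size(10000))        # 40000
print(utf32_output_size(10000, True))  # 40004
```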

    Benchmark results follow. The benchmark tools are in the https://bitbucket.org/storchaka/cpython-stuff repository.

    On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

    Py2.7         Py3.2         Py3.3        patched  test
    541 (+1032%)  541 (+1032%)  844 (+626%)  6125     encode utf-32le 'A'*10000
    543 (+1056%)  541 (+1060%)  844 (+643%)  6275     encode utf-32le '\x80'*10000
    544 (+1010%)  542 (+1014%)  843 (+616%)  6037     encode utf-32le '\x80'+'A'*9999
    541 (+799%)   542 (+797%)   764 (+537%)  4864     encode utf-32le '\u0100'*10000
    544 (+781%)   542 (+784%)   767 (+525%)  4793     encode utf-32le '\u0100'+'A'*9999
    544 (+789%)   542 (+792%)   766 (+531%)  4834     encode utf-32le '\u0100'+'\x80'*9999
    542 (+799%)   541 (+801%)   764 (+538%)  4874     encode utf-32le '\u8000'*10000
    544 (+779%)   542 (+782%)   767 (+523%)  4780     encode utf-32le '\u8000'+'A'*9999
    544 (+793%)   542 (+796%)   766 (+534%)  4859     encode utf-32le '\u8000'+'\x80'*9999
    544 (+819%)   542 (+823%)   766 (+553%)  5001     encode utf-32le '\u8000'+'\u0100'*9999
    430 (+867%)   427 (+874%)   860 (+383%)  4157     encode utf-32le '\U00010000'*10000
    543 (+655%)   543 (+655%)   861 (+376%)  4101     encode utf-32le '\U00010000'+'A'*9999
    543 (+658%)   543 (+658%)   861 (+378%)  4116     encode utf-32le '\U00010000'+'\x80'*9999
    543 (+670%)   543 (+670%)   859 (+387%)  4180     encode utf-32le '\U00010000'+'\u0100'*9999
    543 (+666%)   543 (+666%)   860 (+383%)  4158     encode utf-32le '\U00010000'+'\u8000'*9999

    541 (+880%)   543 (+876%)   844 (+528%)  5300     encode utf-32be 'A'*10000
    541 (+872%)   542 (+870%)   844 (+523%)  5256     encode utf-32be '\x80'*10000
    544 (+843%)   542 (+846%)   843 (+509%)  5130     encode utf-32be '\x80'+'A'*9999
    541 (+363%)   542 (+362%)   764 (+228%)  2505     encode utf-32be '\u0100'*10000
    544 (+366%)   542 (+368%)   766 (+231%)  2534     encode utf-32be '\u0100'+'A'*9999
    544 (+363%)   542 (+365%)   766 (+229%)  2519     encode utf-32be '\u0100'+'\x80'*9999
    542 (+363%)   541 (+364%)   764 (+228%)  2509     encode utf-32be '\u8000'*10000
    544 (+366%)   542 (+368%)   766 (+231%)  2534     encode utf-32be '\u8000'+'A'*9999
    544 (+363%)   542 (+364%)   766 (+229%)  2517     encode utf-32be '\u8000'+'\x80'*9999
    544 (+372%)   542 (+374%)   766 (+235%)  2568     encode utf-32be '\u8000'+'\u0100'*9999
    430 (+428%)   427 (+432%)   860 (+164%)  2270     encode utf-32be '\U00010000'*10000
    543 (+317%)   541 (+318%)   861 (+163%)  2262     encode utf-32be '\U00010000'+'A'*9999
    543 (+320%)   541 (+321%)   861 (+165%)  2279     encode utf-32be '\U00010000'+'\x80'*9999
    543 (+322%)   541 (+323%)   859 (+167%)  2290     encode utf-32be '\U00010000'+'\u0100'*9999
    543 (+322%)   541 (+324%)   860 (+167%)  2292     encode utf-32be '\U00010000'+'\u8000'*9999
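
The actual benchmark tools live in the repository linked in this comment and are not reproduced here. As a minimal stand-in, the following sketch times `str.encode` with the standard `timeit` module on a few of the same inputs. The helper name `encode_throughput` and the choice to report encoded megabytes per second are assumptions of this sketch; the issue does not state the units of its own figures.

```python
import timeit

def encode_throughput(text, encoding, repeat=3, number=200):
    """Return the best observed encode throughput in MB/s (10**6 bytes)."""
    encoded_size = len(text.encode(encoding))
    # timeit.repeat returns one total time per run of `number` calls.
    best = min(timeit.repeat(lambda: text.encode(encoding),
                             repeat=repeat, number=number))
    return encoded_size * number / best / 1e6

# A subset of the inputs used in the issue's tables.
samples = {
    "'A'*10000": 'A' * 10000,
    "'\\u0100'*10000": '\u0100' * 10000,
    "'\\U00010000'*10000": '\U00010000' * 10000,
}

for encoding in ('utf-32-le', 'utf-32-be'):
    for label, text in samples.items():
        rate = encode_throughput(text, encoding)
        print(f"{rate:10.1f} MB/s  encode {encoding} {label}")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; absolute numbers will differ by machine and Python version.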

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode performance Performance or resource usage labels Jun 7, 2012
    @serhiy-storchaka
    Member Author

    On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

    Py2.7         Py3.2         Py3.3        patched  test
    214 (+718%)   215 (+714%)   363 (+382%)  1750     encode utf-32le 'A'*10000
    214 (+704%)   214 (+704%)   362 (+375%)  1720     encode utf-32le '\x80'*10000
    214 (+712%)   215 (+708%)   363 (+379%)  1738     encode utf-32le '\x80'+'A'*9999
    214 (+698%)   214 (+698%)   342 (+399%)  1707     encode utf-32le '\u0100'*10000
    214 (+688%)   215 (+684%)   343 (+392%)  1686     encode utf-32le '\u0100'+'A'*9999
    214 (+699%)   215 (+695%)   342 (+400%)  1710     encode utf-32le '\u0100'+'\x80'*9999
    214 (+694%)   214 (+694%)   342 (+397%)  1699     encode utf-32le '\u8000'*10000
    214 (+688%)   215 (+685%)   343 (+392%)  1687     encode utf-32le '\u8000'+'A'*9999
    214 (+700%)   214 (+700%)   342 (+401%)  1713     encode utf-32le '\u8000'+'\x80'*9999
    214 (+682%)   215 (+679%)   342 (+389%)  1674     encode utf-32le '\u8000'+'\u0100'*9999
    121 (+2237%)  121 (+2237%)  333 (+749%)  2828     encode utf-32le '\U00010000'*10000
    214 (+1108%)  214 (+1108%)  333 (+676%)  2585     encode utf-32le '\U00010000'+'A'*9999
    214 (+1112%)  214 (+1112%)  333 (+679%)  2594     encode utf-32le '\U00010000'+'\x80'*9999
    214 (+1208%)  214 (+1208%)  333 (+741%)  2799     encode utf-32le '\U00010000'+'\u0100'*9999
    214 (+1214%)  215 (+1208%)  333 (+745%)  2813     encode utf-32le '\U00010000'+'\u8000'*9999

    214 (+556%)   214 (+556%)   363 (+287%)  1404     encode utf-32be 'A'*10000
    214 (+558%)   214 (+558%)   363 (+288%)  1408     encode utf-32be '\x80'*10000
    214 (+550%)   214 (+550%)   363 (+283%)  1390     encode utf-32be '\x80'+'A'*9999
    214 (+224%)   214 (+224%)   342 (+103%)  693      encode utf-32be '\u0100'*10000
    214 (+229%)   214 (+229%)   343 (+105%)  703      encode utf-32be '\u0100'+'A'*9999
    214 (+221%)   214 (+221%)   342 (+101%)  688      encode utf-32be '\u0100'+'\x80'*9999
    214 (+224%)   214 (+224%)   342 (+103%)  694      encode utf-32be '\u8000'*10000
    215 (+227%)   214 (+229%)   343 (+105%)  704      encode utf-32be '\u8000'+'A'*9999
    214 (+221%)   214 (+221%)   342 (+101%)  686      encode utf-32be '\u8000'+'\x80'*9999
    214 (+222%)   214 (+222%)   341 (+102%)  690      encode utf-32be '\u8000'+'\u0100'*9999
    121 (+387%)   121 (+387%)   333 (+77%)   589      encode utf-32be '\U00010000'*10000
    214 (+174%)   215 (+173%)   333 (+76%)   587      encode utf-32be '\U00010000'+'A'*9999
    214 (+183%)   214 (+183%)   333 (+82%)   606      encode utf-32be '\U00010000'+'\x80'*9999
    214 (+184%)   214 (+184%)   333 (+82%)   607      encode utf-32be '\U00010000'+'\u0100'*9999
    214 (+183%)   214 (+183%)   333 (+82%)   605      encode utf-32be '\U00010000'+'\u8000'*9999
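
The benchmark inputs differ mainly in their largest code point. As background that the issue itself does not state: since Python 3.3 (PEP 393), CPython stores a string with 1, 2, or 4 bytes per character depending on that maximum, so these inputs plausibly exercise different encoder code paths. The helper `storage_width` below is a hypothetical illustration of that rule, not an existing API.

```python
def storage_width(text):
    """Bytes per character CPython 3.3+ would use for `text` under PEP 393.

    Hypothetical helper: it only applies the documented thresholds to the
    largest code point; it does not inspect the real object layout.
    """
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1   # ASCII / Latin-1 range
    if top < 0x10000:
        return 2   # Basic Multilingual Plane
    return 4       # supplementary planes

# Shortened versions of the benchmark inputs (same leading character).
for sample in ('A' * 4, '\x80' + 'A' * 3, '\u0100' + '\x80' * 3,
               '\u8000' + '\u0100' * 3, '\U00010000' + '\u8000' * 3):
    print(ascii(sample), '->', storage_width(sample), 'byte(s) per character')
```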

    @serhiy-storchaka
    Member Author

    Patch updated to 3.4.

    Is anyone interested in a 7x speedup of the UTF-32 encoder?

    @serhiy-storchaka serhiy-storchaka self-assigned this Jan 7, 2013
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Dec 11, 2013

    From http://kmike.ru/python-data-structures/, under the heading "DATrie": "Python wrapper uses utf_32_le codec internally; this codec is currently slow and it is the bottleneck for datrie. There is a ticket with a patch in the CPython bug tracker (http://bugs.python.org/issue15027) that should make this codec fast, so there is a hope datrie will become faster with future Pythons."
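
For context, `utf_32_le` is a fixed-width encoding: every code point becomes one 4-byte little-endian integer, and no byte order mark is written. The sketch below rebuilds that output by hand with `struct` and compares it with the codec. The name `encode_utf32_le_by_hand` is hypothetical and used only here; it says nothing about how datrie or CPython implement the encoding.

```python
import struct

def encode_utf32_le_by_hand(text):
    """Pack each code point as a 4-byte little-endian unsigned integer."""
    return b''.join(struct.pack('<I', ord(ch)) for ch in text)

word = 'A\u0100\U00010000'
encoded = word.encode('utf_32_le')

print(encoded)                                   # 12 bytes, 4 per code point
print(encoded == encode_utf32_le_by_hand(word))  # True
print(len(encoded) == 4 * len(word))             # True
```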

    @serhiy-storchaka
    Member Author

    Here is an updated patch, synchronized with trunk. The UTF-32 encoder now checks for surrogates, so the speedup is smaller (only up to 5 times), but it compensates for the regression in 3.4.

    On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

    Py3.3        Py3.4        patched  test
    531 (+245%)  489 (+274%)  1831     encode utf-32le 'A'*10000
    383 (+158%)  223 (+344%)  990      encode utf-32le '\u0100'*10000
    325 (+262%)  229 (+414%)  1177     encode utf-32le '\U00010000'*10000

    544 (+166%)  494 (+193%)  1448     encode utf-32be 'A'*10000
    384 (+67%)   223 (+188%)  642      encode utf-32be '\u0100'*10000
    323 (+108%)  229 (+193%)  671      encode utf-32be '\U00010000'*10000
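
The surrogate check mentioned in this comment is visible from Python: on Python 3.4 and later, the UTF-32 codecs refuse lone surrogate code points unless an error handler such as `surrogatepass` is given. A short demonstration, assuming Python 3.4 or newer:

```python
lone_surrogate = '\ud800'

# Strict mode (the default) rejects surrogate code points.
try:
    lone_surrogate.encode('utf-32-le')
except UnicodeEncodeError as exc:
    print('rejected:', exc.reason)

# 'surrogatepass' writes the surrogate through as a plain 4-byte value.
passed = lone_surrogate.encode('utf-32-le', errors='surrogatepass')
print(passed)  # b'\x00\xd8\x00\x00'

# Ordinary text is unaffected by the check.
print('A'.encode('utf-32-le'))  # b'A\x00\x00\x00'
```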

    @gpshead
    Member

    gpshead commented Dec 11, 2013

    There is one comment to address on the review; once that is addressed, I believe this is ready to go in for 3.4.

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 4, 2014

    New changeset b72c5573c5e7 by Serhiy Storchaka in branch 'default':
    Issue bpo-15027: Rewrite the UTF-32 encoder. It is now 1.6x to 3.5x faster.
    http://hg.python.org/cpython/rev/b72c5573c5e7

    @serhiy-storchaka
    Member Author

    Thank you Gregory for your review.

    @larryhastings
    Contributor

    Isn't this a new feature?

    @serhiy-storchaka
    Member Author

    Sorry if I missed something. Should I revert changeset b72c5573c5e7?

    This patch doesn't introduce new functions and doesn't change behavior. Without this patch the UTF-32 encoder is up to 2.5x slower in 3.4 than in 3.3 (due to bpo-12892).

    @larryhastings
    Contributor

    Would you describe it as a "bug fix" or a "security fix"? If it's neither of those things, then you need special permission to add it during beta. And given that this patch has the possibility of causing bugs, I'd prefer to not accept it for 3.4.

    Please revert it for now. If you think it should go into 3.4, you may ask on python-dev that it be considered and take a poll. (Note that the poll is not binding on me; this is still solely my decision. However, if there was an uproar of support for your patch, that would certainly cause me to reconsider.)

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 4, 2014

    New changeset 1e345924f7ea by Serhiy Storchaka in branch 'default':
    Reverted changeset b72c5573c5e7 (issue bpo-15027).
    http://hg.python.org/cpython/rev/1e345924f7ea

    @larryhastings
    Contributor

    BreamoreBoy: why did you remove Arfrever from this issue?

    @neologix
    Mannequin

    neologix mannequin commented Feb 3, 2014

    > BreamoreBoy: why did you remove Arfrever from this issue?

    Nosy list members are sorted in alphabetical order: since Arfrever comes just before BreamoreBoy, I assume his fingers tripped ;-)

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented May 10, 2015

    As this appears to be a performance improvement only, can it go into 3.5 or do we wait for 3.x?

    @serhiy-storchaka
    Member Author

    Can I commit the patch now, Larry?

    @larryhastings
    Contributor

    We're still in alpha, so it's fine for 3.5 right now. The cutoff for new features for 3.5 will be May 23.

    @python-dev
    Mannequin

    python-dev mannequin commented May 12, 2015

    New changeset 80cf7723c4cf by Serhiy Storchaka in branch 'default':
    Issue bpo-15027: The UTF-32 encoder is now 3x to 7x faster.
    https://hg.python.org/cpython/rev/80cf7723c4cf

    @serhiy-storchaka
    Member Author

    And that's not all...

    @Arfrever
    Mannequin

    Arfrever mannequin commented May 18, 2015

    In Objects/stringlib/codecs.h in 2 comments U+DC800 should be changed into U+D800 (from definition of Py_UNICODE_IS_SURROGATE) or U+DC80 (from result of b"\x80".decode(errors="surrogateescape")).
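
Both candidate values can be checked from Python. Surrogate code points span U+D800 to U+DFFF, and the `surrogateescape` handler maps an undecodable byte `0xNN` to U+DCNN, so `b"\x80"` becomes U+DC80. The helper `is_surrogate` below is a hypothetical Python counterpart of the C macro named in the comment, written only for this check.

```python
def is_surrogate(ch):
    """True if ch is in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF

escaped = b"\x80".decode(errors="surrogateescape")

print('U+%04X' % ord(escaped))   # U+DC80
print(is_surrogate(escaped))     # True
print(is_surrogate('\ud800'))    # True: first surrogate code point
print(is_surrogate('\ud7ff'))    # False: just below the range
# U+DC800 is an ordinary supplementary-plane code point, not a surrogate.
print(is_surrogate(chr(0xDC800)))  # False
```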

    @serhiy-storchaka
    Member Author

    Thank you, Arfrever. That was an old typo that had been copy-pasted. Fixed in 3d5bf6174c4b and bc6ed8360312.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022