Amazingly faster UTF-8 decoding #58943

Closed
serhiy-storchaka opened this issue May 6, 2012 · 15 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage topic-unicode

Comments

@serhiy-storchaka
Member

BPO 14738
Nosy @loewis, @jcea, @ronaldoussoren, @mdickinson, @pitrou, @vstinner, @ned-deily, @ezio-melotti, @serhiy-storchaka
Files
  • decode_utf8_4.patch
  • decode_utf8_5.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-05-10.14:38:47.809>
    created_at = <Date 2012-05-06.18:00:54.170>
    labels = ['interpreter-core', 'expert-unicode', 'performance']
    title = 'Amazingly faster UTF-8 decoding'
    updated_at = <Date 2012-05-12.07:09:09.219>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2012-05-12.07:09:09.219>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-05-10.14:38:47.809>
    closer = 'pitrou'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2012-05-06.18:00:54.170>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['25484', '25485']
    hgrepos = []
    issue_num = 14738
    keywords = ['patch']
    message_count = 15.0
    messages = ['160103', '160107', '160110', '160112', '160305', '160306', '160307', '160308', '160309', '160311', '160312', '160346', '160347', '160447', '160462']
    nosy_count = 12.0
    nosy_names = ['loewis', 'jcea', 'ronaldoussoren', 'mark.dickinson', 'janssen', 'pitrou', 'vstinner', 'ned.deily', 'ezio.melotti', 'Arfrever', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue14738'
    versions = ['Python 3.3']

    @serhiy-storchaka
    Member Author

    I propose a complex patch that significantly speeds up UTF-8 decoding. The decoder is now even faster than the decoder in 3.2 (except in a few unrealistic pathological cases).

    The decoder code is also reduced and simplified (formerly the decoding code was repeated in at least three places).

    As a side effect, ASCII decoding is now faster on some platforms (bpo-14419).

    Related issues:
    [bpo-4868] Faster utf-8 decoding
    [bpo-13417] faster utf-8 decoding
    [bpo-14419] Faster ascii decoding
    [bpo-14624] Faster utf-16 decoder
    [bpo-14625] Faster utf-32 decoder
    [bpo-14654] Faster utf-8 decoding

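    Here are the benchmark results. Numbers are speeds in MB/s; the percentage in parentheses after each 3.2 and 3.3 figure is the change of the patched build relative to that figure.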

    On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

    codec, input                      3.2           3.3 (vanilla)  patched

    utf-8 'A'*10000                   1199 (+69%)   1721 (+18%)    2032
    utf-8 'A'*9999+'\x80'             1189 (+25%)   996 (+49%)     1488
    utf-8 'A'*9999+'\u0100'           1192 (-25%)   887 (+1%)      894
    utf-8 'A'*9999+'\u8000'           1178 (-24%)   888 (+0%)      890
    utf-8 'A'*9999+'\U00010000'       1177 (-29%)   872 (-4%)      837
    utf-8 '\x80'*10000                220 (+74%)    172 (+122%)    382
    utf-8 '\x80'+'A'*9999             1192 (+5%)    376 (+232%)    1250
    utf-8 '\x80'*9999+'\u0100'        220 (+54%)    160 (+112%)    339
    utf-8 '\x80'*9999+'\u8000'        220 (+54%)    160 (+112%)    339
    utf-8 '\x80'*9999+'\U00010000'    221 (+49%)    176 (+88%)     330
    utf-8 '\u0100'*10000              220 (+74%)    163 (+134%)    382
    utf-8 '\u0100'+'A'*9999           1177 (+4%)    382 (+219%)    1220
    utf-8 '\u0100'+'\x80'*9999        220 (+74%)    163 (+134%)    382
    utf-8 '\u0100'*9999+'\u8000'      220 (+74%)    163 (+134%)    382
    utf-8 '\u0100'*9999+'\U00010000'  220 (+50%)    180 (+83%)     330
    utf-8 '\u8000'*10000              261 (+66%)    191 (+126%)    432
    utf-8 '\u8000'+'A'*9999           1197 (+1%)    384 (+216%)    1212
    utf-8 '\u8000'+'\x80'*9999        216 (+77%)    163 (+134%)    382
    utf-8 '\u8000'+'\u0100'*9999      215 (+77%)    164 (+132%)    381
    utf-8 '\u8000'*9999+'\U00010000'  261 (+46%)    201 (+89%)     380
    utf-8 '\U00010000'*10000          248 (+44%)    198 (+80%)     357
    utf-8 '\U00010000'+'A'*9999       1192 (-5%)    383 (+196%)    1135
    utf-8 '\U00010000'+'\x80'*9999    220 (+73%)    180 (+111%)    380
    utf-8 '\U00010000'+'\u0100'*9999  220 (+73%)    180 (+111%)    380
    utf-8 '\U00010000'+'\u8000'*9999  261 (+54%)    201 (+100%)    403

    ascii 'A'*10000                   233 (+971%)   1876 (+33%)    2496

    On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

    codec, input                      3.2           3.3 (vanilla)  patched

    utf-8 'A'*10000                   345 (+81%)    596 (+5%)      623
    utf-8 'A'*9999+'\x80'             335 (+41%)    303 (+56%)     474
    utf-8 'A'*9999+'\u0100'           336 (-23%)    123 (+110%)    258
    utf-8 'A'*9999+'\u8000'           337 (-24%)    123 (+108%)    256
    utf-8 'A'*9999+'\U00010000'       336 (-24%)    261 (-3%)      254
    utf-8 '\x80'*10000                88 (+66%)     65 (+125%)     146
    utf-8 '\x80'+'A'*9999             334 (+8%)     124 (+190%)    360
    utf-8 '\x80'*9999+'\u0100'        88 (+43%)     65 (+94%)      126
    utf-8 '\x80'*9999+'\u8000'        88 (+43%)     65 (+94%)      126
    utf-8 '\x80'*9999+'\U00010000'    89 (+40%)     65 (+92%)      125
    utf-8 '\u0100'*10000              88 (+85%)     65 (+151%)     163
    utf-8 '\u0100'+'A'*9999           336 (+2%)     77 (+345%)     343
    utf-8 '\u0100'+'\x80'*9999        88 (+86%)     65 (+152%)     164
    utf-8 '\u0100'*9999+'\u8000'      88 (+86%)     65 (+152%)     164
    utf-8 '\u0100'*9999+'\U00010000'  88 (+57%)     65 (+112%)     138
    utf-8 '\u8000'*10000              98 (+79%)     69 (+154%)     175
    utf-8 '\u8000'+'A'*9999           339 (+3%)     77 (+353%)     349
    utf-8 '\u8000'+'\x80'*9999        89 (+84%)     66 (+148%)     164
    utf-8 '\u8000'+'\u0100'*9999      88 (+86%)     65 (+152%)     164
    utf-8 '\u8000'*9999+'\U00010000'  98 (+58%)     69 (+125%)     155
    utf-8 '\U00010000'*10000          104 (+46%)    79 (+92%)      152
    utf-8 '\U00010000'+'A'*9999       339 (-5%)     124 (+160%)    323
    utf-8 '\U00010000'+'\x80'*9999    88 (+84%)     68 (+138%)     162
    utf-8 '\U00010000'+'\u0100'*9999  88 (+83%)     68 (+137%)     161
    utf-8 '\U00010000'+'\u8000'*9999  98 (+63%)     72 (+122%)     160

    ascii 'A'*10000                   132 (+499%)   758 (+4%)      791
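
The percentages in the tables can be reproduced from the raw speeds. The sketch below assumes they are the rounded relative change of the patched speed against the column they follow; `relative_change` is a name chosen here for illustration, not something from the patch or the benchmark tools.

```python
def relative_change(patched, baseline):
    """Rounded percentage change of the patched speed relative to a baseline speed."""
    return round((patched / baseline - 1) * 100)

# First row of the Athlon table: utf-8 'A'*10000 -> 1199 (3.2), 1721 (3.3), 2032 (patched)
vs_32 = relative_change(2032, 1199)  # patched against 3.2
vs_33 = relative_change(2032, 1721)  # patched against 3.3 (vanilla)
print(f"{vs_32:+d}% {vs_33:+d}%")    # +69% +18%
```

A negative value, as in the 'A'*9999+'\u0100' rows, means the patched build is slower than that column for that input.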

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels May 6, 2012
    @pitrou
    Member

    pitrou commented May 6, 2012

    64-bit Linux, Intel Core i5 2500K:

    codec, input                      3.2           3.3            patched

    utf-8 'A'*10000                   2550 (+198%)  6828 (+11%)    7607
    utf-8 'A'*9999+'\x80'             2501 (+118%)  2415 (+126%)   5456
    utf-8 'A'*9999+'\u0100'           2501 (-20%)   2297 (-13%)    1996
    utf-8 'A'*9999+'\u8000'           2494 (-14%)   2291 (-7%)     2133
    utf-8 'A'*9999+'\U00010000'       2494 (-11%)   2293 (-3%)     2219
    utf-8 '\x80'*10000                422 (+135%)   517 (+92%)     991
    utf-8 '\x80'+'A'*9999             2513 (+12%)   860 (+228%)    2820
    utf-8 '\x80'*9999+'\u0100'        426 (+102%)   525 (+64%)     862
    utf-8 '\x80'*9999+'\u8000'        426 (+104%)   538 (+62%)     871
    utf-8 '\x80'*9999+'\U00010000'    428 (+105%)   523 (+68%)     878
    utf-8 '\u0100'*10000              425 (+140%)   517 (+97%)     1019
    utf-8 '\u0100'+'A'*9999           2488 (+2%)    820 (+211%)    2549
    utf-8 '\u0100'+'\x80'*9999        426 (+139%)   517 (+97%)     1019
    utf-8 '\u0100'*9999+'\u8000'      426 (+139%)   529 (+93%)     1019
    utf-8 '\u0100'*9999+'\U00010000'  426 (+106%)   509 (+72%)     876
    utf-8 '\u8000'*10000              573 (+28%)    490 (+50%)     733
    utf-8 '\u8000'+'A'*9999           2500 (+1%)    822 (+208%)    2528
    utf-8 '\u8000'+'\x80'*9999        426 (+139%)   530 (+92%)     1018
    utf-8 '\u8000'+'\u0100'*9999      428 (+138%)   509 (+100%)    1018
    utf-8 '\u8000'*9999+'\U00010000'  573 (+17%)    447 (+51%)     673
    utf-8 '\U00010000'*10000          562 (+24%)    552 (+26%)     696
    utf-8 '\U00010000'+'A'*9999       2512 (+3%)    939 (+175%)    2584
    utf-8 '\U00010000'+'\x80'*9999    423 (+140%)   553 (+84%)     1017
    utf-8 '\U00010000'+'\u0100'*9999  426 (+139%)   549 (+85%)     1017
    utf-8 '\U00010000'+'\u8000'*9999  572 (+18%)    479 (+41%)     674
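
The benchmark inputs mix five characters: 'A', '\x80', '\u0100', '\u8000' and '\U00010000'. The thread does not say why these were picked; presumably each one sits at a boundary, either in the length of its UTF-8 sequence or in the widest code point the resulting string must hold. The sketch below only prints those properties and makes no claim about the patch internals.

```python
# The five characters used to build the benchmark inputs.
chars = ["A", "\x80", "\u0100", "\u8000", "\U00010000"]

for ch in chars:
    encoded = ch.encode("utf-8")
    # Code point, number of UTF-8 bytes, and the bytes themselves in hex.
    print(f"U+{ord(ch):06X}  {len(encoded)} byte(s)  {encoded.hex()}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in chars]
```

U+0080 and U+0100 both take two bytes in UTF-8, but only the first still fits in the Latin-1 range, so a string containing U+0100 needs a wider character type than one containing only U+0080.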

    @serhiy-storchaka
    Member Author

    Thank you, Antoine. Finally, Intel Core is defeated!

    If anyone wants to repeat the tests, see the benchmark tools in bpo-14624.
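
For a rough idea of how such figures can be measured, here is a minimal sketch using the standard `timeit` module. It is not the benchmark tool attached to bpo-14624; the function name `decode_speed`, the choice of five inputs, and the definition of 1 MB as 10**6 bytes are all assumptions made for this example.

```python
import timeit

def decode_speed(data, encoding="utf-8", number=1000, repeat=5):
    """Best-of-`repeat` decoding throughput in MB/s (1 MB = 10**6 bytes here)."""
    timings = timeit.repeat(lambda: data.decode(encoding), number=number, repeat=repeat)
    seconds_per_call = min(timings) / number
    return len(data) / seconds_per_call / 1e6

# A few of the input shapes that appear in the tables.
cases = {
    "'A'*10000": "A" * 10000,
    "'A'*9999+'\\x80'": "A" * 9999 + "\x80",
    "'\\x80'*10000": "\x80" * 10000,
    "'\\u8000'*10000": "\u8000" * 10000,
    "'\\U00010000'*10000": "\U00010000" * 10000,
}

for label, text in cases.items():
    data = text.encode("utf-8")
    print(f"utf-8 {label:<22} {decode_speed(data):8.0f} MB/s")
```

Absolute numbers depend heavily on the CPU, the compiler, and the Python version, so only comparisons between builds run on the same machine are meaningful.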

    @serhiy-storchaka
    Member Author

    The patch is updated in accordance with Antoine's cosmetic comments.

    @pitrou
    Member

    pitrou commented May 9, 2012

    There's a Mac-specific portion in the patch; it would be nice if someone could check that it works.

    @serhiy-storchaka
    Member Author

    It would be good if someone checked on a Mac that command-line arguments work, including invalid UTF-8. The difficulty is that you need to check on Macs with both 16-bit and 32-bit wchar_t.

    @serhiy-storchaka
    Member Author

    bpo-4388 is related to this Mac-specific portion of the patch.

    @pitrou
    Member

    pitrou commented May 9, 2012

    > It would be good if someone checked on a Mac that command-line
    > arguments work, including invalid UTF-8. The difficulty is that you
    > need to check on Macs with both 16-bit and 32-bit wchar_t.

    Actually, it should be enough to run the test suite, since we should
    have tests for this.
    As for different wchar_t widths, that's the kind of thing we can leave
    to the buildbots (assuming our OS X buildbots come back alive some
    day :-)).

    @serhiy-storchaka
    Member Author

    I hacked the code (commented out "#if __APPLE__" in
    Objects/unicodeobject.c and Modules/python.c) to enable this branch on
    Linux and ran the test (test_cmd_line) with the C locale. It passed. Then I
    broke the decoder and ran the test again to confirm that it fails. I can now
    confirm that the code works correctly on a platform with a 32-bit wchar_t.
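
The Mac-specific code itself is not shown in this thread. As an illustration of what "invalid UTF-8 in a command-line argument" involves at the Python level, the sketch below uses the `surrogateescape` error handler (PEP 383), which maps each undecodable byte to a lone surrogate so the original bytes can be recovered. That the Mac branch behaves this way is an assumption here, not something the thread states.

```python
raw = b"caf\xc3\xa9 \xff"  # valid UTF-8 ("café ") followed by the invalid byte 0xFF

# Strict decoding rejects the input.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape keeps the bad byte as U+DCFF instead of failing.
text = raw.decode("utf-8", "surrogateescape")
round_trip = text.encode("utf-8", "surrogateescape")

print(strict_ok, ascii(text), round_trip == raw)
```

The round trip back to the original bytes is the property a test with invalid arguments would most likely rely on.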

    @mdickinson
    Member

    > Actually, it should be enough to run the test suite, since we should
    > have tests for this.

    I just ran the test suite ("python -m test") on OS X 10.6.8 with 'decode_utf8_5.patch' applied. (64-bit --with-pydebug build of Python.) No test failures.

    test header:

    == CPython 3.3.0a3+ (default:840cb46d0395+, May 9 2012, 20:55:18) [GCC 4.2.1 (Apple Inc. build 5664)]
    == Darwin-10.8.0-i386-64bit little-endian
    == /Users/mdickinson/Python/cpython/build/test_python_39794

    Fragment of configure output relevant to wchar looked like this:

    checking wchar.h usability... yes
    checking wchar.h presence... yes
    checking for wchar.h... yes
    checking size of wchar_t... 4
    checking for UCS-4 tcl... no
    checking whether wchar_t is signed... yes
    no usable wchar_t found

    @vstinner
    Member

    vstinner commented May 9, 2012

    > The difficulty is that you need to check on Macs
    > with both 16-bit and 32-bit wchar_t.

    I don't think that the size of wchar_t is configurable: it should always be 32 bits on Mac OS X.
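
The width of wchar_t on a given machine can be checked without reading the configure output. A minimal sketch using `ctypes` from the standard library, which reports the size of the platform C compiler's wchar_t:

```python
import ctypes

# Size of the C wchar_t type on this platform, in bytes.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
print(f"wchar_t is {wchar_bytes * 8} bits wide")
```

This prints 32 bits on platforms where "checking size of wchar_t" reports 4, as in the OS X configure output quoted earlier in the thread.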

    @python-dev
    Mannequin

    python-dev mannequin commented May 10, 2012

    New changeset e08c3791f035 by Antoine Pitrou in branch 'default':
    Issue bpo-14738: Speed-up UTF-8 decoding on non-ASCII data. Patch by Serhiy Storchaka.
    http://hg.python.org/cpython/rev/e08c3791f035

    @pitrou
    Member

    pitrou commented May 10, 2012

    The patch is now committed. Well done and thanks for your contribution.

    @pitrou pitrou closed this as completed May 10, 2012
    @serhiy-storchaka
    Member Author

    Thanks, Martin, for the review, which allowed me to make a quality patch, and for encouraging further research. Thanks, Antoine, for the review, benchmarks, and commit, and for the original optimization, which served as the basis for my patch.

    @vstinner
    Member

    If the commit makes Python 3.3 faster than Python 3.2, it is an
    optimisation that should be documented in the What's New in Python 3.3
    document.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022