The optimization of string search can cause pessimization #69009

Closed
serhiy-storchaka opened this issue Aug 7, 2015 · 17 comments
Assignees
serhiy-storchaka
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@serhiy-storchaka
Member

BPO 24821
Nosy @pitrou, @vstinner, @ned-deily, @methane, @serhiy-storchaka, @animalize, @zhangyangyu, @mlouielu
PRs
  • bpo-24821: Fixed the slowing down to 25 times in the searching of some #505
Files
  • find_char.patch
  • find_char_false_positives.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2017-03-30.06:21:11.700>
    created_at = <Date 2015-08-07.06:11:25.937>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'The optimization of string search can cause pessimization'
    updated_at = <Date 2017-03-30.06:21:11.699>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2017-03-30.06:21:11.699>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2017-03-30.06:21:11.700>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2015-08-07.06:11:25.937>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['41035', '41040']
    hgrepos = []
    issue_num = 24821
    keywords = ['patch']
    message_count = 17.0
    messages = ['248179', '250410', '250419', '254627', '254653', '254659', '254661', '269020', '274614', '276001', '276005', '290772', '290774', '290775', '290779', '290781', '290819']
    nosy_count = 10.0
    nosy_names = ['pitrou', 'vstinner', 'ned.deily', 'methane', 'python-dev', 'serhiy.storchaka', 'malin', 'azsorkin', 'xiang.zhang', 'louielu']
    pr_nums = ['505']
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue24821'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Member Author

Search in strings is highly optimized for the common case. However, for some input data, searching in a non-ASCII string becomes unexpectedly slow. Compare:

    $ ./python -m timeit -s 's = "АБВГД"*10**4' -- '"є" in s'
    100000 loops, best of 3: 11.7 usec per loop
    $ ./python -m timeit -s 's = "АБВГД"*10**4' -- '"Є" in s'
    1000 loops, best of 3: 769 usec per loop

This happens because the lowest byte of the code point of the Ukrainian capital letter Є (U+0404) matches the highest byte of the code points of most Cyrillic letters (U+04xx). There are similar issues with some other scripts.
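To make the failure mode concrete, here is a minimal standalone C sketch (not CPython's code) that scans the raw bytes of the UCS-2 representation of "АБВГД" with memchr(), the way the fast path uses the low byte of the needle; every hit is a false positive because Є itself never occurs:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    /* "АБВГД" as UCS-2 code units; every unit contains the byte 0x04. */
    const uint16_t text[] = {0x0410, 0x0411, 0x0412, 0x0413, 0x0414};
    const unsigned char needle_low = 0x0404 & 0xFF;   /* low byte of "Є" */
    const unsigned char *p = (const unsigned char *)text;
    const unsigned char *end = p + sizeof(text);
    int false_hits = 0;

    while ((p = memchr(p, needle_low, (size_t)(end - p))) != NULL) {
        false_hits++;                 /* every hit here is a false positive */
        p++;
    }
    printf("false positives: %d\n", false_hits);   /* prints 5: one per letter */
    return 0;
}
```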

I think we should use a more robust optimization.

    @serhiy-storchaka serhiy-storchaka self-assigned this Aug 7, 2015
    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Aug 7, 2015
    @vstinner
    Member

    I think we should use more robust optimization.

    What do you propose? I don't understand the purpose of the issue. Do you want to remove the optimization?

    @serhiy-storchaka
    Member Author

I was going to propose another optimization (I had an idea), but it is not as easy as it looked at first glance. This issue exists mostly as a reminder to myself. I should make a second attempt.

    @serhiy-storchaka
    Member Author

The proposed patch makes the degenerate case less severe while preserving the optimization for the common case.

Unpatched:
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("є")'
    1000 loops, best of 3: 330 usec per loop
    $ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("є")'
    1000 loops, best of 3: 325 usec per loop
    $ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
    100 loops, best of 3: 7.81 msec per loop
    $ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("Є")'
    100 loops, best of 3: 8.5 msec per loop
    
Patched:
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("є")'
    1000 loops, best of 3: 317 usec per loop
    $ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("є")'
    1000 loops, best of 3: 327 usec per loop
    $ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
    1000 loops, best of 3: 1.1 msec per loop
    $ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("Є")'
    1000 loops, best of 3: 964 usec per loop

The slowdown is reduced from 25 times to 3 times.

The idea is that after memchr() finds a false positive, the code runs a few tens of iterations of a simple loop before calling memchr() again. This amortizes the cost of each memchr() call over tens of characters.
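A rough, self-contained sketch of that strategy for a UCS-2 buffer, as I read the description (an illustration only, not the actual patch; the function name and the MEMCHR_CUT_OFF value here are placeholders):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEMCHR_CUT_OFF 15   /* placeholder; the real value depends on width */

static ptrdiff_t
find_char_ucs2(const uint16_t *s, size_t n, uint16_t ch)
{
    const uint16_t *p = s, *end = s + n;

    while (p < end) {
        /* Simple loop over at most MEMCHR_CUT_OFF characters; this also
         * verifies or rejects the candidate position found by memchr(). */
        size_t chunk = (size_t)(end - p);
        if (chunk > MEMCHR_CUT_OFF)
            chunk = MEMCHR_CUT_OFF;
        for (const uint16_t *limit = p + chunk; p < limit; p++) {
            if (*p == ch)
                return p - s;
        }
        if (p >= end)
            break;
        /* memchr() over the raw bytes: a hit may be a false positive
         * (another character containing the same byte), never a miss. */
        const unsigned char *hit = memchr(p, ch & 0xFF,
                                          (size_t)(end - p) * sizeof(*p));
        if (hit == NULL)
            return -1;
        /* Jump to the character containing the matching byte; the simple
         * loop at the top of the next iteration checks it. */
        p = s + (hit - (const unsigned char *)s) / sizeof(*p);
    }
    return -1;
}
```

The real code uses separate cut-off constants per character width and has a mirror-image variant for the reverse search (rfind_char).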

The patch also does a little refactoring. STRINGLIB(fastsearch_memchr_1char) is now renamed and split into two functions, STRINGLIB(find_char) and STRINGLIB(rfind_char), with a simpler interface. All precondition checks are moved into these functions, and they are now used directly in other files.

    @vstinner
    Member

The patch also does a little refactoring. STRINGLIB(fastsearch_memchr_1char) is now renamed and split into two functions, STRINGLIB(find_char) and STRINGLIB(rfind_char), with a simpler interface.

    Could you please push this part of the change? It looks good to me.

    --

    The C library provides a wmemchr() function which can be used to search for a wchar_t character inside a wchar_t* string.

    • for 16-bit wchar_t (ex: Windows, AIX), it can be used for Python UCS2 strings
    • for 32-bit wchar_t (ex: Linux, Mac OS X), it can be used for Python UCS4 strings

    I don't know if the type is signed or not. If the type is signed, we have to ensure that wmemchr() works as expected.
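For illustration, a hedged sketch of how wmemchr() could be wired in when the platform's wchar_t width matches the code-unit width; HAVE_WMEMCHR is a hypothetical configure macro (such a check would be needed since the function is C99), and the WCHAR_MAX test is one way to sidestep the signedness question. The pointer cast's alignment and aliasing are caveats that would still need verifying:

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

static ptrdiff_t
find_char_ucs4(const uint32_t *s, size_t n, uint32_t ch)
{
#if defined(HAVE_WMEMCHR) && WCHAR_MAX >= 0x10FFFF
    if (sizeof(wchar_t) == sizeof(uint32_t)) {
        /* 32-bit wchar_t (e.g. Linux, Mac OS X): search whole code units,
         * so the low-byte false-positive problem cannot occur. */
        const wchar_t *hit = wmemchr((const wchar_t *)s, (wchar_t)ch, n);
        return hit ? hit - (const wchar_t *)s : -1;
    }
#endif
    /* Portable fallback: a plain loop. */
    for (size_t i = 0; i < n; i++) {
        if (s[i] == ch)
            return (ptrdiff_t)i;
    }
    return -1;
}
```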

    --

I didn't carefully review your new heuristic for searching for a character.

    Did you try to not use memchr() but a simple C loop for UCS-2 and UCS-4?

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 14, 2015

    New changeset 1412be96faf0 by Serhiy Storchaka in branch 'default':
    Issue bpo-24821: Refactor STRINGLIB(fastsearch_memchr_1char) and split it on
    https://hg.python.org/cpython/rev/1412be96faf0

    @serhiy-storchaka
    Member Author

    Could you please push this part of the change? It looks good to me.

Done. The patch that makes the optimization now looks much simpler.

    The C library provides a wmemchr() function which can be used to search for a wchar_t character inside a wchar_t* string.

Yes, I know. But it is a C99 function, so we would have to add a configure test for it, and I don't know whether it exists on Windows. I did not test its performance.

    Did you try to not use memchr() but a simple C loop for UCS-2 and UCS-4?

A simple C loop is 25-50% slower. My patch tries to use memchr(), but if it finds a false positive, it runs a simple C loop for a limited length. This makes performance only insignificantly worse when there are few false positives, but much better when false positives are frequent or grouped together. The common case, with no false positives, is not affected.

MEMCHR_CUT_OFF is a heuristic parameter. It is equal to the number of characters for which memchr()/memrchr() and a simple loop take the same time. On my machine it is 10-15 for UCS1 and 20-50 for UCS2 and UCS4 (depending on the direction and the character width).

    @serhiy-storchaka
    Member Author

    Ping.

    @serhiy-storchaka
    Member Author

Could you please make a review, Victor?

    @serhiy-storchaka
    Member Author

    I would commit the patch before beta 1, but:

1. The issue can be considered not a new feature but a fix for a performance bug (we just don't intend to backport the fix to 3.5). I think the patch can be committed at the beta stage.

2. I want to tune the magic constants. I experimented on my two computers and think they are good enough, but it would be nice to test them on other computers. I need to write an automatic benchmark that can be run on other people's machines. This will take some time.

    @ned-deily
    Member

    If it has waited this long and is truly low priority, I think it can wait a little longer until 3.7.

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Mar 6, 2017
    @serhiy-storchaka
    Member Author

Ukrainian "Є" is not the only character that suffers from this issue. I suppose many CJK characters suffer too. The problem occurs when the lower byte of the searched character matches the upper byte of most characters in the text. For example, searching for "乎" (U+4E4E) in a sequence of U+4EXX characters (not containing U+4E4E or U+4E4F):

    $ ./python -m perf timeit -s 's = "一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争 亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜 仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿"*100' -- 's.find("乎")'

    Unpatched: Median +- std dev: 761 us +- 108 us
    Patched: Median +- std dev: 117 us +- 9 us

    For comparison, searching "乏" (U+4e4f) in the same sequence:

    $ ./python -m perf timeit -s 's = "一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争 亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜 仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿"*100' -- 's.find("乏")'

    Unpatched: Median +- std dev: 12.8 us +- 1.4 us
    Patched: Median +- std dev: 12.6 us +- 1.7 us

    Sorry, I don't know Chinese or Japanese and can't provide more realistic examples.

    @mlouielu
    Mannequin

    mlouielu mannequin commented Mar 29, 2017

I can currently only test on Python 3.6, using a more meaningful sentence; I'm still trying to use perf on the CPython master branch.

    ---------------------------------------------------

    $ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 's.find("乎")'

    Median +- std dev: 228 ns +- 7 ns

    $ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 's.find("乏")'  

    Median +- std dev: 143 ns +- 3 ns

    ---------------------------------------------------

    $ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其哀哉"*1000' -- 's.find("乎")'  
(the setup string above is missing "乎")
    Median +- std dev: 100 us +- 3 us
    
    $ python -m perf timeit -s 's="一件乒乓事事亏, 不串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 's.find("乏")'
(the setup string above is missing "乏")
    Median +- std dev: 1.67 us +- 0.05 us

    @zhangyangyu
    Member

    I can't give a "realistic" example. A more meaningful example may be:

'。' ('\u3002'), the Chinese period, which is used in almost every paragraph, and
'地' ('\u5730'), which is a commonly used word.

    ./python3 -m perf timeit -s 's = "你好,我叫李雷。"*1000' 's.find("地")'
Unpatched: Mean +- std dev: 6.65 us +- 0.27 us
Patched: Mean +- std dev: 4.08 us +- 0.22 us

    It works! :-)

    @vstinner
    Member

Would it completely kill performance to remove the optimization?

    @serhiy-storchaka
    Member Author

Thank you for testing, Louie. Thank you for your review and testing, Xiang.

Would it completely kill performance to remove the optimization?

Yes. The optimization is not used if the lowest byte is 0, so you can search for "\0" to get the result without the optimization. On my netbook the optimization speeds up searching by 5.5 times.

    Search with optimization:
    $ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Ж")'
    Median +- std dev: 296 us +- 33 us

    Search without optimization:
    $ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("\0")'
    Median +- std dev: 1.65 ms +- 0.10 ms

    Search an unlucky character (unpatched):
    $ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
    Median +- std dev: 14.7 ms +- 1.8 ms

    Search an unlucky character (patched):
    $ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
    Median +- std dev: 2.17 ms +- 0.24 ms

    @serhiy-storchaka
    Member Author

    New changeset 0a58f72 by Serhiy Storchaka in branch 'master':
    bpo-24821: Fixed the slowing down to 25 times in the searching of some (#505)
    0a58f72

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    goldsteinn added a commit to goldsteinn/cpython that referenced this issue May 21, 2022
    This was brought up a bit in python#69009 but the larger issue is mostly
    different.
    
    Generally comparable perf for the "good" case where memchr doesn't
    return any collisions (false matches on lower byte) but clearly faster
    with collisions.
    
    Some notes on correctness:
    
    wchar_t being signed/unsigned shouldn't matter here BUT wmemchr (along
    with just about all the other wide-char string functions) can and
    often does (x86_64 for example) assume that the input is aligned
    relative to the sizeof(wchar_t). If this is not the case for
    Py_UCS{2|4} then this patch is broken.
    
Also I think the way I implemented `#define STRINGLIB_FAST_MEMCHR` for
ucs{2|4}lib breaks strict aliasing. If this is an issue but the patch is
otherwise fine, any suggestions for how to fix it?
    
    Test results:
    ```
    $> ./python -m test -j4
    ...
    == Tests result: SUCCESS ==
    
    406 tests OK.
    
    30 tests skipped:
        test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_devpoll
        test_idle test_ioctl test_kqueue test_launcher test_msilib
        test_nis test_ossaudiodev test_readline test_smtpnet
        test_socketserver test_sqlite3 test_startfile test_tcl test_tix
        test_tk test_ttk_guionly test_ttk_textonly test_turtle
        test_urllib2net test_urllibnet test_winconsoleio test_winreg
        test_winsound test_xmlrpc_net test_zipfile64
    ```
    
    Benchmarked on:
    model name	: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
    
    sizeof(wchar_t) == 4
    
    GLIBC 2.35
    
    ```
    ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018210")' ## Long, No match, No collision
    No wmemchr  : 1000 loops, best of 100: 127 nsec per loop
    With wmemchr: 1000 loops, best of 100: 123 nsec per loop
    
    ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018208")' ## Long, No match, High collision
    No wmemchr  : 1000 loops, best of 100: 1.29 usec per loop
    With wmemchr: 1000 loops, best of 100: 123 nsec per loop
    
    ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018210"' -- 's.find("\U00018210")' ## Long, match, No collision
    No wmemchr  : 1000 loops, best of 100: 136 nsec per loop
    With wmemchr: 1000 loops, best of 100: 130 nsec per loop
    
    ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018208"' -- 's.find("\U00018208")' ## Long, match, High collision
    No wmemchr  : 1000 loops, best of 100: 1.35 usec per loop
    With wmemchr: 1000 loops, best of 100: 131 nsec per loop
    
    ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018210")' ## Short, No match, No collision
    No wmemchr  : 1000 loops, best of 100: 50.2 nsec per loop
    With wmemchr: 1000 loops, best of 100: 52.9 nsec per loop
    
    ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018208")' ## Short, No match, High collision
    No wmemchr  : 1000 loops, best of 100: 69.1 nsec per loop
    With wmemchr: 1000 loops, best of 100: 53.7 nsec per loop
    
    ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018210"' -- 's.find("\U00018210")' ## Short, match, No collision
    No wmemchr  : 1000 loops, best of 100: 53.6 nsec per loop
    With wmemchr: 1000 loops, best of 100: 53.6 nsec per loop
    
    ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018208"' -- 's.find("\U00018208")' ## Short, match, High collision
    No wmemchr  : 1000 loops, best of 100: 69 nsec per loop
    With wmemchr: 1000 loops, best of 100: 50.9 nsec per loop
    ```