This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use optimized string search function in mmap.find()
Type: performance Stage: resolved
Components: Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Dennis Sweeney, benrg, rumpelsepp, vstinner
Priority: normal Keywords: patch

Created on 2022-02-24 12:43 by rumpelsepp, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 31554 closed rumpelsepp, 2022-02-24 12:52
PR 31625 merged Dennis Sweeney, 2022-03-01 01:09
PR 31642 merged vstinner, 2022-03-02 10:34
Messages (7)
msg413902 - (view) Author: Stefan Tatschner (rumpelsepp) * Date: 2022-02-24 12:43
The mmap.find() in  function uses a naive loop to search string matches. This can be optimized “for free” by using libc's memmap(3) function instead.

The relevant file is Modules/mmapmodule.c, the relevant function is mmap_gfind().
msg413903 - (view) Author: Stefan Tatschner (rumpelsepp) * Date: 2022-02-24 12:45
Sorry, I mean memmem(3). :)
msg413968 - (view) Author: (benrg) Date: 2022-02-25 07:14
memmem isn't a standard C function, and some libraries don't have it, notably Microsoft's.

newlib's memmem seems to be the same as glibc's, but is under a BSD 3-clause license instead of LGPL. An older version of newlib's memmem (prior to 2019-01-01) has the license "Permission to use, copy, modify, and distribute this software is freely granted, provided that this notice is preserved", and is still highly optimized and much better than a naive implementation.

Of course, bundling it would no longer be quite so "free".

Old newlib memmem: https://sourceware.org/git/?p=newlib-cygwin.git;a=blob_plain;f=newlib/libc/string/memmem.c;h=25704e467decff5971b34f4189ddfff04ac5fa8e

New newlib memmem: https://sourceware.org/git/?p=newlib-cygwin.git;a=blob_plain;f=newlib/libc/string/memmem.c

Helper file for both: https://sourceware.org/git/?p=newlib-cygwin.git;a=blob_plain;f=newlib/libc/string/str-two-way.h
msg414237 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-03-01 03:30
PR 31625 is an alternative proposal.

It uses the Crochemore and Perrin's Two-Way algorithm that @benrg references (see Objects/stringlib/fastsearch.h and Objects/stringlib/stringlib_find_two_way_notes.txt), and is platform-independent.
msg414326 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-03-02 04:46
New changeset 6ddb09f35b922a3bbb59e408a3ca7636a6938468 by Dennis Sweeney in branch 'main':
bpo-46848: Use stringlib/fastsearch in mmap (GH-31625)
https://github.com/python/cpython/commit/6ddb09f35b922a3bbb59e408a3ca7636a6938468
msg414327 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-03-02 04:49
Thanks for the report!
msg414342 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-03-02 13:15
New changeset b6b711a1aa233001c1874af1d920e459b6bf962c by Victor Stinner in branch 'main':
bpo-46848: Move _PyBytes_Find() to internal C API (GH-31642)
https://github.com/python/cpython/commit/b6b711a1aa233001c1874af1d920e459b6bf962c
History
Date User Action Args
2022-04-11 14:59:56adminsetgithub: 91004
2022-03-02 13:15:30vstinnersetmessages: + msg414342
2022-03-02 10:34:06vstinnersetnosy: + vstinner

pull_requests: + pull_request29762
2022-03-02 04:49:27Dennis Sweeneysetstatus: open -> closed
type: performance
messages: + msg414327

resolution: fixed
stage: patch review -> resolved
2022-03-02 04:46:33Dennis Sweeneysetmessages: + msg414326
2022-03-01 03:30:19Dennis Sweeneysetmessages: + msg414237
2022-03-01 01:09:24Dennis Sweeneysetnosy: + Dennis Sweeney
pull_requests: + pull_request29749
2022-02-25 07:14:36benrgsetnosy: + benrg
messages: + msg413968
2022-02-24 12:52:28rumpelseppsetkeywords: + patch
stage: patch review
pull_requests: + pull_request29675
2022-02-24 12:45:31rumpelseppsetmessages: + msg413903
2022-02-24 12:43:50rumpelseppcreate