Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[patch] improve unicode methods: split() rsplit() and replace() #51871

Closed
florentx mannequin opened this issue Jan 3, 2010 · 24 comments
Closed

[patch] improve unicode methods: split() rsplit() and replace() #51871

florentx mannequin opened this issue Jan 3, 2010 · 24 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage topic-unicode

Comments

@florentx
Copy link
Mannequin

florentx mannequin commented Jan 3, 2010

BPO 7622
Nosy @malemburg, @pitrou, @ericvsmith, @ezio-melotti, @florentx
Files
  • benchmark_split_replace.diff: benchmark results
  • issue7622_test_splitlines.diff
  • stringlib_split_replace_v4c.diff: Patch, apply to trunk
  • stringlib_split_replace_py3k.diff: Patch, apply to py3k
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-01-13.08:09:56.647>
    created_at = <Date 2010-01-03.17:09:44.179>
    labels = ['interpreter-core', 'expert-unicode', 'performance']
    title = '[patch] improve unicode methods: split() rsplit() and replace()'
    updated_at = <Date 2010-01-13.08:09:56.646>
    user = 'https://github.com/florentx'

    bugs.python.org fields:

    activity = <Date 2010-01-13.08:09:56.646>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-01-13.08:09:56.647>
    closer = 'pitrou'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2010-01-03.17:09:44.179>
    creator = 'flox'
    dependencies = []
    files = ['15736', '15744', '15749', '15750']
    hgrepos = []
    issue_num = 7622
    keywords = ['patch']
    message_count = 24.0
    messages = ['97168', '97172', '97173', '97174', '97184', '97194', '97197', '97204', '97208', '97211', '97212', '97213', '97214', '97215', '97216', '97218', '97219', '97220', '97224', '97232', '97267', '97280', '97281', '97698']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'pitrou', 'eric.smith', 'ezio.melotti', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue7622'
    versions = ['Python 2.7', 'Python 3.2']

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 3, 2010

    Content of the patch:

    • removed code duplication between bytearray/string/unicode
    • new header "stringlib/split.h" with common methods:
      stringlib_split/_rsplit/_splitlines
    • added "maxcount" argument to "stringlib_count"
    • better performance for split/replace unicode methods

    Benchmark coming soon...

    @florentx florentx mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jan 3, 2010
    @pitrou
    Copy link
    Member

    pitrou commented Jan 3, 2010

    The "split.h" file is missing from your patch.

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 3, 2010

    You're right. Oups.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 3, 2010

    The patch looks wrong for bytearrays. They are mutable, so you shouldn't return the original object as an optimization. Here is the current (unpatched) behaviour:

    >>> a = bytearray(b"abc")
    >>> b, = a.split()
    >>> b is a
    False

    On the other hand, you aren't doing this optimization at all in the general case of stringlib_split() and stringlib_rsplit(), while it could be done.

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 3, 2010

    Mutable methods split() splitlines() and partition() fixed.
    And added optimization for all immutables methods.

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    added "Makefile.pre.in".

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    There's some reference leaking somewhere...
    Will investigate.

    ~ $ ./python Lib/test/regrtest.py -R 2:3: test_unicode
    test_unicode leaked [7, 7, 7] references, sum=21

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    Refleak fixed in PyUnicode_Splitlines.

    @malemburg
    Copy link
    Member

    A few comments on coding style:

    • please keep the existing argument formats as they are, e.g.
        count = countstring(self_s, self_len,
                            from_s, from_len,
                            0, self_len, FORWARD, maxcount);

    or

    /* helper macro to fixup start/end slice values */
    -#define FIX_START_END(obj) \

    • if (start < 0) \
    •    start += (obj)-\>length;                 \\
      
    • if (start < 0) \
    •    start = 0;                              \\
      
    • if (end > (obj)->length) \
    •    end = (obj)-\>length;                    \\
      
    • if (end < 0) \
    •    end += (obj)-\>length;                   \\
      
    • if (end < 0) \
    •    end = 0;
      

    +#define ADJUST_INDICES(start, end, len) \
    + if (end > len) end = len; \
    + else if (end < 0) { end += len; if (end < 0) end = 0; } \
    + if (start < 0) { start += len; if (start < 0) start = 0; }

    and use similar formatting for the replacement function
    calls/macros

    • make sure that the name of a symbol matches the value, e.g.
       #define LONG_BITMASK (LONG_BIT-1)
       #define BLOOM(mask, ch) ((mask & (1 << ((ch) & LONG_BITMASK))))

    LONG_BITMASK has a value of 0x1f (31) - that's a single byte, not
    a long value. In this case, 0x1f is an implementation detail of
    the simplified Bloom filter used for set membership tests in the
    Unicode implementation.

    When adjusting the value to be platform dependent, please check
    that the implementation does work for platforms that have
    more than 31 bits available for (signed) longs.

    Note that you don't need to expose that value separately if
    you stick to using BLOOM() directly.

    • use BLOOM() macro in fastsearch.c

    • when declaring variables with initial values, keep these on separate lines, e.g. don't use this style:

        Py_ssize_t i, j, count=0;
        PyObject *list = PyList_New(PREALLOC_SIZE(maxcount)), *sub;

    instead, write:

        Py_ssize_t i, j;
        Py_ssize_t count=0;
        PyObject *list = PyList_New(PREALLOC_SIZE(maxcount))
        PyObject *sub;
    • always place variable declarations at the top of a function, not into the function body:

    +stringlib_split(
    + PyObject* str_obj, const STRINGLIB_CHAR* str, Py_ssize_t str_len,
    + const STRINGLIB_CHAR* sep, Py_ssize_t sep_len, Py_ssize_t maxcount
    + )
    +{
    + if (sep_len == 0) {
    + PyErr_SetString(PyExc_ValueError, "empty separator");
    + return NULL;
    + }
    + else if (sep_len == 1)
    + return stringlib_split_char(str_obj, str, str_len, sep[0], maxcount);
    +

    • Py_ssize_t i, j, pos, count=0;
      + PyObject *list = PyList_New(PREALLOC_SIZE(maxcount)), *sub;
      + if (list == NULL)
      + return NULL;
    • function declarations should not put parameters on new lines:

    +stringlib_splitlines(
    + PyObject* str_obj, const STRINGLIB_CHAR* str, Py_ssize_t str_len,
    + int keepends
    + )
    +{

    instead use this style:

    -static
    -PyObject *rsplit_substring(PyUnicodeObject *self,

    •                       PyObject \*list,
      
    •                       PyUnicodeObject \*substring,
      
    •                       Py_ssize_t maxcount)
      

    -{

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    A few comments on coding style:

    Thank you for your remarks. I will update the patch accordingly.

    • make sure that the name of a symbol matches the value, e.g.

      #define LONG_BITMASK (LONG_BIT-1)
      #define BLOOM(mask, ch) ((mask & (1 << ((ch) & LONG_BITMASK))))

      LONG_BITMASK has a value of 0x1f (31) - that's a single byte, not
      a long value. In this case, 0x1f is an implementation detail of
      the simplified Bloom filter used for set membership tests in the
      Unicode implementation.

      When adjusting the value to be platform dependent, please check
      that the implementation does work for platforms that have
      more than 31 bits available for (signed) longs.

      Note that you don't need to expose that value separately if
      you stick to using BLOOM() directly.

    Since the same value is used to build the mask, I assume it's better to keep the value around (or use (LONG_BIT-1) directly?).
    mask |= (1 << (ptr[i] & LONG_BITMASK));

    s/LONG_BITMASK/BLOOM_BITMASK/ is not confusing?

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    • function declarations should not put parameters on new lines:

    +stringlib_splitlines(

    • PyObject* str_obj, const STRINGLIB_CHAR* str, Py_ssize_t str_len,
    • int keepends
    • )
      +{

    I copied the style of "stringlib/partition.h" for this part.
    Should I update style of "partition.h" too?

    @pitrou
    Copy link
    Member

    pitrou commented Jan 4, 2010

    I copied the style of "stringlib/partition.h" for this part.
    Should I update style of "partition.h" too?

    No, it's ok for stringlib to have its own consistent style and there's no reason to change it IMO.

    More interesting would be benchmark results showing how much this improves the various methods :-)

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    Patch updated:

    • coding style
    • added macros BLOOM_ADD to unicodeobject.c and fastsearch.h
      (and removed LONG_BITMASK)

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    And now, the figures.

    There's no gain for the string methods.
    Some unicode methods are faster: split/rsplit/replace:

    Most significant results:

    --- bench_slow.log Trunk
    +++ bench_fast.log Patched

    string unicode
    (ms) (ms) comment

     ========== late match, 100 characters
    -13.30  20.51   s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
    -16.12  29.88   s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
    +13.27  14.38   s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
    +16.19  17.61   s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)

    ========== quick replace multiple character match
    -4.51 159.78 ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*100)
    +3.67 7.30 ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*100)

    ========== quick replace single character match
    -3.73 50.61 ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*100)
    +3.72 7.18 ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*100)

    (full benchmark diff is attached)

    And we save 1000 lines of code cumulated
    (stringobject.c/unicodeobject.c/bytearrayobject.c)

    @pitrou
    Copy link
    Member

    pitrou commented Jan 4, 2010

    I don't think you should remove such blocks:

    -/*

    • Local variables:
    • c-basic-offset: 4
    • indent-tabs-mode: nil
    • End:
      -*/

    There probably are people relying on them :-)

    @malemburg
    Copy link
    Member

    Florent Xicluna wrote:

    >
    > Florent Xicluna <laxyf@yahoo.fr> added the comment:
    >
    > >> * function declarations should not put parameters on new lines:
    > >>
    > >> +stringlib_splitlines(
    > >> + PyObject* str_obj, const STRINGLIB_CHAR* str, Py_ssize_t str_len,
    > >> + int keepends
    > >> + )
    > >> +{
    >
    > I copied the style of "stringlib/partition.h" for this part.
    > Should I update style of "partition.h" too?

    I'd prefer if you change the coding style to what we use elsewhere
    in Python C code.

    See http://www.python.org/dev/peps/pep-0007/ for more C coding
    style suggestions.

    @ericvsmith
    Copy link
    Member

    I think we should use whatever style is currently being used in the code. If we want to go back through this code (or any other code) and PEP-7-ify it, that should be a separate task.

    Alternately, we could PEP-7-ify it first, then apply these changes.

    @malemburg
    Copy link
    Member

    Eric Smith wrote:

    >
    > Eric Smith <eric@trueblade.com> added the comment:
    >
    > I think we should use whatever style is currently being used in the code. If we want to go back through this code (or any other code) and PEP-7-ify it, that should be a separate task.
    >
    > Alternately, we could PEP-7-ify it first, then apply these changes.

    For any new files added, PEP-7 should always be used.

    For PEP-7-ifying the existing code, we could open a separate ticket or just apply the change as separate patch.

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    Fixed a problem with the splitlines optimization: use PyList_Append instead of PyList_SET_ITEM because there's no preallocated list
    in this case.

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 4, 2010

    The test case for the previous issue.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 5, 2010

    This looks generally good. Can you produce a separate patch for py3k? stringobject.c has been replaced with bytesobject.c there.

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 5, 2010

    Slight update:

    @florentx
    Copy link
    Mannequin Author

    florentx mannequin commented Jan 5, 2010

    And the Py3k patch.

    (note: previous update v4b -> v4c minimize the differences
    between Py2 and Py3 implementations)

    @pitrou
    Copy link
    Member

    pitrou commented Jan 13, 2010

    Committed in r77461 (trunk), r77462 (py3k). Thank you very much!

    @pitrou pitrou closed this as completed Jan 13, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage topic-unicode
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants