This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Optimization/refactoring for {bytearray, bytes, unicode}.strip()
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.3, Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, ezio.melotti, jcon, msoucy, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2011-08-22 02:59 by jcon, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description Edit
stringlib_strip.patch jcon, 2011-09-10 01:05 Move strip() to stringlib review
stringlib_strip2.patch jcon, 2011-10-12 00:38 v2 review
Messages (9)
msg142660 - (view) Author: John O'Connor (jcon) Date: 2011-08-22 02:58
bytearray() is using a less efficient implementation of split() than its similar friends bytes and unicode.

I added a couple extra checks to return early in {bytes,unicode} split().
 - if length is 0
 - if left strip removed all bytes

Looking at just how similar these 3 implementations are I feel that we could also unify/generalize them into one generic helper implementation and leave the object marshaling up to the type specific methods.
msg142882 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-24 13:32
+    if (len == 0) {
+        if (PyByteArray_CheckExact(self)) {
+            Py_INCREF(self);
+            return (PyObject *)self;
+        }
+        return PyByteArray_FromStringAndSize(NULL, 0);
+    }

This looks like a dubious micro-optimization. If len == 0,
all loops will exit early anyway (same for similar snippets in bytesobject.c and unicodeobject.c).

+    if (i == 0 && j == len && PyByteArray_CheckExact(self)) {
+        Py_INCREF(self);
+        return (PyObject *)self;
+    }

bytearray objects are mutable, so you can't return the same object without breaking expected semantics. e.g.:

>>> a = bytearray()
>>> b = a.strip()
>>> b.append(49)
>>> a
bytearray(b'')
>>> b
bytearray(b'1')

> diff -r b5ccdf7c032a Python/bltinmodule.c
> Binary file Python/bltinmodule.c has changed

Uh, what's this? bltinmodule.c shouldn't be considered a binary file.
It probably means you added a NUL byte in it by mistake.

> Looking at just how similar these 3 implementations are I feel that we 
> could also unify/generalize them into one generic helper implementation 
> and leave the object marshaling up to the type specific methods.

You could put it in Objects/stringlib.
msg142941 - (view) Author: John O'Connor (jcon) Date: 2011-08-24 22:53
You are right about those lines in particular; 'dubious' as you say.
Though, the point overall was that the general implementation was noticeably
faster (regardless of those smaller changes).

However, I do think that the single check for len == 0 saves time
particularly
from do_xstrip where we call _getbuffer().

I see. I must have been trigger happy.

> > diff -r b5ccdf7c032a Python/bltinmodule.c
> > Binary file Python/bltinmodule.c has changed
>
> Uh, what's this? bltinmodule.c shouldn't be considered a binary file.
> It probably means you added a NUL byte in it by mistake.
>
> I didn't touch that file at all. I'm not sure what that is. :-P

> Looking at just how similar these 3 implementations are I feel that we
> > could also unify/generalize them into one generic helper implementation
> > and leave the object marshaling up to the type specific methods.
>
> You could put it in Objects/stringlib.
>

I figured. I will when I am able to get to it.

- John
msg142942 - (view) Author: John O'Connor (jcon) Date: 2011-08-24 22:57
Antoine Pitrou <pitrou@free.fr> added the comment:

> This looks like a dubious micro-optimization. If len == 0,
> all loops will exit early anyway (same for similar snippets in bytesobject.c and unicodeobject.c).

You are right about those lines in particular; 'dubious' as you say. 
Though, the point overall was that the general implementation was noticeably
faster (regardless of those smaller changes). 

However, I do think that the single check for len == 0 saves time particularly
from do_xstrip where we call _getbuffer().


> bytearray objects are mutable, so you can't return the same object without breaking expected semantics. e.g.:

I see. I must have been trigger happy.


> diff -r b5ccdf7c032a Python/bltinmodule.c
> Binary file Python/bltinmodule.c has changed

> Uh, what's this? bltinmodule.c shouldn't be considered a binary file.
> It probably means you added a NUL byte in it by mistake.

I didn't touch that file at all. I'm not sure what that is. :-P

> You could put it in Objects/stringlib.

I figured. I will when I am able to get to it.
msg143822 - (view) Author: John O'Connor (jcon) Date: 2011-09-10 01:05
Moved {l,r,}strip to stringlib as well as the bloom filters from unicodeobject. I think it cleans things up nicely. Now there is one implementation for all 3 types. This patch also improves performance for bytearray and slightly for bytes. I see no delta for unicodeobject.

Added a test case for the mistake I made before (returning inc ref for mutable type).

grep -R couldn't find any other references to _PyUnicode_XStrip so I removed it. Though, someone with better knowledge of this should confirm.
msg144752 - (view) Author: John O'Connor (jcon) Date: 2011-10-01 22:41
The patch no longer applies cleanly. Is there enough interest in this to justify rebasing?
msg144871 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-04 10:40
> The patch no longer applies cleanly. Is there enough interest in this to justify rebasing?

Yes, I think it's worth it.
msg145379 - (view) Author: John O'Connor (jcon) Date: 2011-10-12 00:38
New patch.

Please double check the removal of _PyUnicode_XStrip.
msg145445 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-13 00:20
> Please double check the removal of _PyUnicode_XStrip.

I think it's ok. It is private, undocumented, and the only references
returned by Google Code Search are in Python itself.

I'll review the patch later, thank you.
History
Date User Action Args
2022-04-11 14:57:20adminsetgithub: 57016
2015-04-14 20:21:27msoucysetnosy: + msoucy
2013-07-05 23:23:35christian.heimessetversions: + Python 3.4
2011-10-13 00:20:03pitrousetmessages: + msg145445
title: Move strip() to stringlib -> Optimization/refactoring for {bytearray, bytes, unicode}.strip()
2011-10-12 00:42:24jconsettitle: Optimization/refactoring for {bytearray, bytes, unicode}.strip() -> Move strip() to stringlib
2011-10-12 00:39:00jconsetfiles: + stringlib_strip2.patch

messages: + msg145379
2011-10-04 10:40:42pitrousetmessages: + msg144871
title: Optimization/refactoring for {bytearray,bytes,unicode}.strip() -> Optimization/refactoring for {bytearray, bytes, unicode}.strip()
2011-10-01 22:41:32jconsetmessages: + msg144752
title: Optimizations for {bytearray,bytes,unicode}.strip() -> Optimization/refactoring for {bytearray,bytes,unicode}.strip()
2011-09-10 01:06:30jconsetfiles: - strip_perf.patch
2011-09-10 01:05:41jconsetfiles: + stringlib_strip.patch

messages: + msg143822
2011-08-24 22:57:39jconsetmessages: + msg142942
2011-08-24 22:56:08jconsetfiles: - unnamed
2011-08-24 22:53:52jconsetfiles: + unnamed

messages: + msg142941
2011-08-24 13:32:12pitrousetmessages: + msg142882
2011-08-22 02:59:00jconcreate