classification
Title: Add a version of str.split which returns an iterator
Type: enhancement
Stage: resolved
Components:
Versions: Python 3.7
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: Paweł Miech, alex, apalala, georg.brandl, giampaolo.rodola, gregory.p.smith, rhettinger, santoso.wijaya, serhiy.storchaka, terry.reedy, tshepang, uwinx
Priority: normal
Keywords:

Created on 2013-03-04 01:04 by alex, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
main.py Paweł Miech, 2021-02-26 12:05
Messages (20)
msg183411 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2013-03-04 01:04
str.split returns a list, which is inefficient when you just want to process the items one by one. You could emulate this with str.find and tracking indexes manually, but this should really be built-in behavior.
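
For illustration, the manual str.find emulation mentioned above might look roughly like the following. This is only a sketch: the name iter_split is hypothetical, and it handles an explicit, non-empty separator only (no sep=None whitespace mode and no maxsplit).

```python
from typing import Iterator


def iter_split(text: str, sep: str) -> Iterator[str]:
    """Hypothetical helper: lazily yield the pieces of text.split(sep)."""
    if not sep:
        # Mirror str.split, which rejects an empty separator.
        raise ValueError("empty separator")
    start = 0
    while True:
        end = text.find(sep, start)
        if end == -1:
            # No separator left: the remainder is the final piece.
            yield text[start:]
            return
        yield text[start:end]
        # Resume the search just past the separator that was found.
        start = end + len(sep)
```

Only one piece exists at a time, so a caller that stops early never pays for the rest of the string.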
msg183414 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-03-04 02:31
The bytes (and bytearray?) version of this should generate memoryviews instead of new bytes objects to avoid a copy.

While not required, it'd be useful if the implementation pre-scanned the data internally so that the length of the generated sequence was known up front.  This could imply that an internal bitset or vector of split indices is kept for the life of the generator (implementation detail left up to the implementor) if scanning over the input data more than once is undesirable.

Start with a pure Python proof of concept that works everywhere, then write a native version.
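
A pure Python proof of concept for the bytes case could be sketched as below. The name iter_split_views is hypothetical, and the sketch skips the pre-scan idea: it yields zero-copy memoryview slices as it finds each separator.

```python
from typing import Iterator


def iter_split_views(data: bytes, sep: bytes) -> Iterator[memoryview]:
    """Hypothetical helper: lazily yield memoryview slices of data split on sep."""
    if not sep:
        raise ValueError("empty separator")
    view = memoryview(data)  # slicing a memoryview does not copy the bytes
    start = 0
    while True:
        end = data.find(sep, start)
        if end == -1:
            yield view[start:]
            return
        yield view[start:end]
        start = end + len(sep)
```

Callers that need real bytes objects can still write bytes(piece), which is where the copy would then happen.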
msg183423 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-04 07:55
> While not required, it'd be useful if the implementation pre-scanned the data internally so that the length of the generated sequence was known up front.  This could imply that an internal bitset or vector of split indices is kept for the life of the generator (implementation detail left up to the implementor) if scanning over the input data more than once is undesirable.

bytearray can be modified between iterations.
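
A small sketch of why that matters for a pre-scanning design: separator offsets recorded up front stop being valid once the caller edits the bytearray. The variable names here are illustrative only.

```python
data = bytearray(b"a,b,c")

# Pre-scan: remember where every separator sits before iteration starts.
offsets = [i for i, byte in enumerate(data) if byte == ord(",")]

# Between two iterations the caller deletes the first comma in place.
data[1:2] = b""

# The recorded offsets now point at ordinary payload bytes, not separators.
stale = [chr(data[i]) for i in offsets]
```

After the edit, data is b"ab,c" and the remembered offsets 1 and 3 land on "b" and "c".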
msg183498 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-03-04 20:53
Indeed, a bytearray version would require support for locking a buffer against other mutations, which was discussed for PEP 3118 but not implemented due to complexity.  Best to concentrate on bytes, then.

Do we have a memoryview equivalent for PyUnicode?  If not, we should... (a separate enhancement request)
msg183501 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2013-03-04 22:13
There is no string view that I know of. Interesting idea, though, thanks to the immutability of strings. Would much have to be different other than boundary checking and __hash__ (and hoping extension authors are changing things in-place)?

I say go ahead and open the issue for the idea.
msg183528 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-05 13:54
> Indeed, a bytearray version would require support for locking a buffer against other mutations, which was discussed for PEP 3118 but not implemented due to complexity.

I rather think that a bytearray version can't pre-scan the data. Note that an array of pre-scanned results can be larger than the input data (if we split into a large number of small items). Also note that an iterative split is useful when we do not want to process all of the input, but only the first several items.

Actually, I think that in the most common cases a non-iterative split will be faster than an iterative one.
msg183529 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-05 14:05
> There is no string view that I know of. Interesting idea, though, thanks to the immutability of strings. Would much have to be different other than boundary checking and __hash__ (and hoping extension authors are changing things in-place)?

Objects/stringlib/unicode_format.h contains an internal SubString structure which could be taken as a basis. But it is unlikely to be useful. Every API that accepts strings would require converting substring views to regular strings. And a substring object can consume more memory than the full string. This looks like a step backwards from PEP 393.
msg183782 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-03-09 02:02
I personally would have changed both str.split and os.walk to return iterators in 3.0, like many other builtins. The rationale for os.walk continuing to produce a list is that there would be little time saving as the list is not *that* long and most uses look at all the items anyway. (See tracker issue.) This argument might be even stronger for str.split.

When I expressed my view about str.split (after 3.0, I think), Guido said that if people do not look at all the items, they typically need random access, and hence a list anyway, so why make them write list(xxx.split()) all the time?

I will also note that Guido has opposed string views and partial object views in general on the basis that a tiny view can keep a mega object alive, and that it is too much to ask people to keep track of when they should copy the view to release the underlying object.

Iterators also (should) keep the underlying object alive, but they are usually short-lived and not passed around hither and thither.

I personally would prefer that the API for this proposal simply be a new parameter (iter(able)=False/True). But that runs into 'argument values should not change the return type' (but lists and iterators are *both iterables*!). Instead there supposedly should be a new, almost identical function with a new, almost identical name. But that runs into 'we should not add builtins without a really good reason' (which I agree with) and more directly 'we should not repeat the confusing range/xrange mess' (which I also agree with). So the status quo is a Catch-22 situation with conflicting principles that produce paralysis. As I said, I prefer redefining the return as an iterable.
msg183830 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-03-09 16:38
It'd perhaps have been better if things like memoryview were never exposed to the user at all as a distinct type and became an internal implementation detail behind PyBytes and PyUnicode objects (they could hold a reference to something else or collapse that down to their own copy on their own terms, up to the particulars of the Python VM).

Anyways, the above is getting off topic for this issue.  I retract my memoryview suggestion; that belongs in its own issue.

An iterating version of str.split is indeed hard to add today IFF we are against a str.itersplit() method name or against an optional keyword only argument that'd cause split(iterator=True)'s return type to potentially be different.
msg186104 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-04-05 20:02
-1 on os.walk returning an iterator.  The API is already a bit challenging for some and our experience with itertools.groupby() is that returning an inner iterator can be very confusing.
msg186105 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2013-04-05 20:03
Raymond: Is that for the wrong ticket, or was the message incorrect? :)
msg186106 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-04-05 21:02
Alex, it was a response to Terry's message: http://bugs.python.org/issue17343#msg183782

FWIW, I'm +1 on an iterator version of str.split().

I'm not sure yet that it would be worthwhile to propagate the idea to other string-like objects though.
msg186160 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-04-06 20:02
If someone wants to whip up a patch for str.iter_index(), I would be happy to review it.   Be sure to add a test case to make sure that the results are non-overlapping:   list('aaaa'.iter_index('aa')) == [0, 2]
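
As a rough pure Python sketch of the behavior described above: str.iter_index() does not exist, so this is written as a hypothetical free function iter_index(), and the empty-substring handling is an assumption.

```python
from typing import Iterator


def iter_index(text: str, sub: str) -> Iterator[int]:
    """Hypothetical stand-in for the proposed str.iter_index()."""
    if not sub:
        # Assumption: reject an empty substring rather than loop forever.
        raise ValueError("empty substring")
    pos = text.find(sub)
    while pos != -1:
        yield pos
        # Skip past the whole match so that results never overlap.
        pos = text.find(sub, pos + len(sub))
```

With this sketch, list(iter_index('aaaa', 'aa')) gives [0, 2], which matches the non-overlapping test case requested above.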
msg186192 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-04-07 11:00
I'm guessing Terry wanted to say "os.listdir" instead of "os.walk".
msg186203 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-07 12:47
Maybe str.iter_indices() or even just str.indices()?
msg186213 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-04-07 15:11
> I'm guessing Terry wanted to say "os.listdir" instead of "os.walk".
yes, sorry.
msg281537 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-11-23 06:32
No one has submitted a patch for this or expressed an interest in a long time.  Perhaps the use case is already served by re.finditer().

Unassigning.  Feel free to push this forward or to close due to lack of interest.
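
For the whitespace-splitting case, the re.finditer() approach could be sketched as follows. The helper name iter_words is hypothetical, and the sketch covers str input only.

```python
import re
from typing import Iterator


def iter_words(text: str) -> Iterator[str]:
    """Hypothetical helper: lazily yield whitespace-separated tokens."""
    # Each match is one maximal run of non-whitespace characters.
    return (match.group(0) for match in re.finditer(r"\S+", text))
```

For ordinary text this should produce the same tokens as text.split(), one at a time. Other separators need a different pattern each time.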
msg384280 - (view) Author: Martin Winks (uwinx) Date: 2021-01-03 14:23
> Perhaps the use case is already served by re.finditer()

import re

def split_whitespace_ascii(s: str):
    return (pt.group(0) for pt in re.finditer(r"[A-Za-z']+", s))

The solution above does not cover all possible data and is incorrect for bytes-like objects.

Writing regular expressions for different separators and data would add considerable overhead, so a one-case solution doesn't go very far: each different separator requires a bigger change in the code.

Let's try to revive this one :)
msg387721 - (view) Author: Paweł Miech (Paweł Miech) Date: 2021-02-26 12:05
Making str.split return an iterator sounds like an interesting task. I found this issue because we recently discussed in our project that str.split returns a list, which can increase memory usage for some tasks when there is a large response to parse.

Here is a small script, created by my friend Juancarlo Añez, with an iterator version of str.split. Compared with the default str.split it uses much less memory when run with the memory-profiler tool: https://pypi.org/project/memory-profiler/

It produces this output:
3299999
Filename: main.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    24   39.020 MiB   39.020 MiB           1   @profile
    25                                         def generate_string():
    26   39.020 MiB    0.000 MiB           1       n = 100000
    27   49.648 MiB    4.281 MiB      100003       long_string = " ".join([uuid.uuid4().hex.upper() for _ in range(n)])
    28   43.301 MiB   -6.348 MiB           1       print(len(long_string))
    29                                         
    30   43.301 MiB    0.000 MiB           1       z = isplit(long_string)
    31   43.301 MiB    0.000 MiB      100001       for line in z:
    32   43.301 MiB    0.000 MiB      100000           continue
    33                                         
    34   52.281 MiB    0.297 MiB      100001       for line in long_string.split():
    35   52.281 MiB    0.000 MiB      100000           continue


You can see that the default str.split uses much more memory.
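
A similar comparison can be sketched with the standard library's tracemalloc instead of memory-profiler. This is not the main.py attached to this issue. The helper name peak_bytes and the generated test string are assumptions, and re.finditer() stands in for the iterator version.

```python
import re
import tracemalloc


def peak_bytes(consume) -> int:
    """Hypothetical helper: peak traced allocation while consume() runs."""
    tracemalloc.start()
    consume()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


# Built before tracing starts, so the input itself is not counted.
text = " ".join(f"{i:032X}" for i in range(20_000))


def consume_list():
    # str.split materializes every token before the loop begins.
    for _token in text.split():
        pass


def consume_lazy():
    # re.finditer produces one match object at a time.
    for _match in re.finditer(r"\S+", text):
        pass


list_peak = peak_bytes(consume_list)
lazy_peak = peak_bytes(consume_lazy)
print(f"list split peak: {list_peak} bytes, lazy split peak: {lazy_peak} bytes")
```

The exact numbers depend on the interpreter and platform. The point is only that the list version holds all tokens at once.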
msg387728 - (view) Author: Juancarlo Añez (apalala) * Date: 2021-02-26 15:28
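import re


def isplit(text, sep=None, maxsplit=-1):
    """
    A low-memory-footprint version of:

        iter(text.split(sep, maxsplit))

    Adapted from https://stackoverflow.com/a/9770397
    """
    # Caveat: with sep=None, leading or trailing whitespace still produces
    # empty strings, which str.split() would drop.

    if maxsplit == 0:
        yield text
    else:
        # Match each field together with the separator (or start of text) before it.
        rsep = re.escape(sep) if sep else r'\s+'
        regex = fr'(?:^|{rsep})((?:(?!{rsep}).)*)'

        # re.DOTALL lets '.' match newlines, so fields containing '\n' are not cut short.
        for n, p in enumerate(re.finditer(regex, text, re.DOTALL)):
            if 0 <= maxsplit <= n:
                # Split limit reached: yield the unsplit remainder and stop.
                yield p.string[p.start(1):]
                return
            yield p.group(1)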
History
Date | User | Action | Args
2022-04-11 14:57:42 | admin | set | github: 61545
2021-02-26 15:28:56 | apalala | set | nosy: + apalala; messages: + msg387728
2021-02-26 12:05:12 | Paweł Miech | set | files: + main.py; nosy: + Paweł Miech; messages: + msg387721
2021-01-04 21:13:16 | brett.cannon | set | nosy: - brett.cannon
2021-01-03 14:23:20 | uwinx | set | nosy: + uwinx; messages: + msg384280
2021-01-03 13:58:27 | serhiy.storchaka | link | issue42816 superseder
2017-03-07 18:43:11 | serhiy.storchaka | set | status: pending -> closed; resolution: rejected; stage: needs patch -> resolved
2016-11-23 06:48:41 | serhiy.storchaka | set | status: open -> pending
2016-11-23 06:32:22 | rhettinger | set | assignee: rhettinger -> ; messages: + msg281537; versions: + Python 3.7, - Python 3.4
2013-06-08 11:34:30 | giampaolo.rodola | set | nosy: + giampaolo.rodola
2013-04-07 15:11:40 | terry.reedy | set | messages: + msg186213
2013-04-07 12:47:07 | serhiy.storchaka | set | messages: + msg186203
2013-04-07 11:00:51 | georg.brandl | set | nosy: + georg.brandl; messages: + msg186192
2013-04-06 20:02:39 | rhettinger | set | assignee: rhettinger; messages: + msg186160
2013-04-05 21:02:29 | rhettinger | set | messages: + msg186106
2013-04-05 20:03:31 | alex | set | messages: + msg186105
2013-04-05 20:02:28 | rhettinger | set | nosy: + rhettinger; messages: + msg186104
2013-03-09 16:38:24 | gregory.p.smith | set | messages: + msg183830
2013-03-09 02:02:58 | terry.reedy | set | nosy: + terry.reedy; messages: + msg183782
2013-03-09 00:11:46 | tshepang | set | nosy: + tshepang
2013-03-05 14:05:44 | serhiy.storchaka | set | messages: + msg183529
2013-03-05 13:54:18 | serhiy.storchaka | set | messages: + msg183528
2013-03-04 22:13:20 | brett.cannon | set | messages: + msg183501
2013-03-04 20:53:49 | gregory.p.smith | set | messages: + msg183498
2013-03-04 18:51:21 | santoso.wijaya | set | nosy: + santoso.wijaya
2013-03-04 16:26:35 | brett.cannon | set | nosy: + brett.cannon
2013-03-04 07:55:45 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg183423
2013-03-04 02:31:47 | gregory.p.smith | set | versions: + Python 3.4; nosy: + gregory.p.smith; messages: + msg183414; stage: needs patch
2013-03-04 01:04:26 | alex | create |