This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Adding manually offset parameter to str/bytes split function
Type: enhancement
Stage: resolved
Components:
Versions: Python 3.5
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: cwr, r.david.murray, serhiy.storchaka, steven.daprano
Priority: normal
Keywords:

Created on 2014-09-08 09:29 by cwr, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)
msg226564 - Author: Christoph Wruck (cwr) Date: 2014-09-08 09:29
Currently we have a "split" function which splits a str/bytes object into
chunks of its underlying data. This works great for most trivial jobs.
But there is no way to pass an offset parameter to the split
function that indicates the next "user-defined" starting index.

Currently the next starting position is built from the last starting
position (of the found sep.) + separator length + 1.

It should be possible to adjust the next starting index by changing this
behavior to:

last starting position (of found sep.) + separator length + OFFSET.

NOTE: The slicing start index (for the substring) stays untouched.

This would help with splitting sequences that contain one or more consecutive
separators. The following demonstrates the current behavior.

>>> s = 'abc;;def;hij'
>>> s.split(';')
['abc', '', 'def', 'hij']

This works fine for both str and bytes values.
The following demonstrates an "offset variant" of the split function.

>>> s = 'abc;;def;hij'
>>> s.split(';', offset=1)
['abc', ';def', 'hij']

The maxcount parameter and a None separator should produce the same
output as before.

A change would affect (as far as I can see):
- split.h
    - split_char/rsplit_char
    - split/rsplit
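
The proposed behavior can be sketched in pure Python. This is only an illustration of the semantics described above, not the C change to split.h; the function name `split_with_offset` is hypothetical, and the sketch assumes that the first search starts at index 0 and that maxcount and a None separator are out of scope.

```python
def split_with_offset(s, sep, offset=0):
    """Hypothetical pure-Python sketch of the proposed split(sep, offset=N).

    After a separator is found, the next chunk still starts right behind
    it, but the search for the next separator resumes `offset` characters
    later. With offset=0 this matches s.split(sep).
    """
    if not sep:
        raise ValueError("empty separator")
    if offset < 0:
        raise ValueError("offset must be non-negative")

    parts = []
    start = 0        # slicing start index of the current chunk
    search_from = 0  # index at which the next separator search begins
    while True:
        idx = s.find(sep, search_from)
        if idx == -1:
            # No further separator: the rest of the input is the last chunk.
            parts.append(s[start:])
            return parts
        parts.append(s[start:idx])
        start = idx + len(sep)
        search_from = start + offset


print(split_with_offset('abc;;def;hij', ';'))            # ['abc', '', 'def', 'hij']
print(split_with_offset('abc;;def;hij', ';', offset=1))  # ['abc', ';def', 'hij']
```

Because the sketch only uses `find` and slicing, it accepts bytes as well as str.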
msg226568 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-09-08 10:29
I'm afraid I don't understand the purpose of this feature request, or what the behaviour is.

You show a simple example:

>>> s = 'abc;;def;hij'
>>> s.split(';', offset=1)
['abc', ';def', 'hij']

but I don't understand why you want to keep the second semi-colon. I would have thought this would be more useful:

# treat runs of the separator as if it were a single separator
['abc', 'def', 'hij']
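
For comparison, treating a run of separators as a single separator is already possible with the standard library. A minimal sketch, using either `re.split` with a `+` quantifier or a filter over the `str.split` result (the filter also drops empty strings at the start and end of the input, which may or may not be wanted):

```python
import re

s = 'abc;;def;hij'

# One or more consecutive ';' count as a single separator.
collapsed = re.split(';+', s)

# Alternative without re: split normally, then drop the empty chunks.
filtered = [chunk for chunk in s.split(';') if chunk]

print(collapsed)  # ['abc', 'def', 'hij']
print(filtered)   # ['abc', 'def', 'hij']
```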


It might help if you explain under what circumstances you would use this. Also, how does the caller choose a value for offset? Say, I read a string from a data file, or from the user. How do I know what offset to use?

I'm not sure I understand what this offset parameter is supposed to do in general. Here are some examples showing what I think you want, can you tell me if I'm right?

'spam--eggs--cheese----toast'.split('-', offset=1)
--> ['spam', '-eggs', '-cheese', '-', '-toast']


'spam--eggs--cheese--toast'.split('-', offset=8)
--> ['spam', '-eggs--cheese', '-toast']
msg226571 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-08 11:18
Such problems are solved by using regular expressions.

>>> re.findall('(?:^|(?<=;)).?[^;]*', 'abc;;def;hij')
['abc', ';def', 'hij']
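
A self-contained version of the snippet above, with the import it needs and the pattern broken into commented pieces. It has only been checked against the example string from this issue and is not presented as a general replacement for the proposed offset parameter.

```python
import re

pattern = re.compile(
    r'(?:^|(?<=;))'  # start of the string, or a position right after a ';'
    r'.?'            # optionally take one character, even if it is a ';'
    r'[^;]*'         # then everything up to the next ';'
)

print(pattern.findall('abc;;def;hij'))  # ['abc', ';def', 'hij']
```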
msg226577 - Author: Christoph Wruck (cwr) Date: 2014-09-08 12:43
Hi Steven,

Exactly, you're right about this.

'spam--eggs--cheese----toast'.split('-', offset=1)
--> ['spam', '-eggs', '-cheese', '-', '-toast']

'spam--eggs--cheese--toast'.split('-', offset=8)
--> ['spam', '-eggs--cheese', '-toast']

Okay - the name "offset" might be an unfortunate choice, and you are right that this could be hard for a caller to understand.

One more example:

The following removes all escape characters so that the octal escape sequences can be processed in a second step if the first three characters are digits.

'spam\\055\\\\055-eggs-\\\\rest'.split('\\', offset=1)
--> ['spam', '055', '\\055-eggs-', '\\rest']

# could speed up the split built-in func if a caller knows that every chunk is 3 chars long?
'tic-tac-toe'.split('-', offset=3)

A caller could use the offset parameter to keep every separator that lies between
the last separator found and the offset, so that it stays part of a chunk. The same applies if the caller expects a separator followed by itself that should be kept; when in doubt, use an offset equal to the length of the separator.
msg226579 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-09-08 13:10
If you want to do complex splitting, the supported way to do so is re.split.  Feel free to take this to python-ideas if you think there is sufficient reason for baking a particular additional splitting functionality into str.split.
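
As an illustration of what `re.split` offers beyond `str.split`, here is a short sketch. The input strings are made up for the example and do not come from this issue.

```python
import re

# Split on ';' or ',' and swallow any whitespace after the separator.
fields = re.split(r'[;,]\s*', 'abc; def,ghi;jkl')
print(fields)  # ['abc', 'def', 'ghi', 'jkl']

# A capturing group keeps the separators in the result.
with_seps = re.split(r'([;,])', 'a;b,c')
print(with_seps)  # ['a', ';', 'b', ',', 'c']
```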
msg226580 - Author: Christoph Wruck (cwr) Date: 2014-09-08 13:42
Serhiy, you are right if you have to split a complex string, such as splitting strings with more than one separator. In that case I would prefer a regex-based solution too. Otherwise we could use the re library for every one of those jobs and never use the fast built-in str/bytes split function. Unfortunately, re.split/findall lag behind the str/bytes split function.
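
The speed difference mentioned here can be measured with `timeit`. A rough sketch: the test string and repetition count are arbitrary choices, and no timings are quoted because they depend on the machine and the Python version.

```python
import re
import timeit

# Arbitrary test input: 1000 short chunks joined by a single-character separator.
data = ';'.join(['abc'] * 1000)
pattern = re.compile(';')

# Time the built-in split and the precompiled regex split on the same input.
t_str = timeit.timeit(lambda: data.split(';'), number=1000)
t_re = timeit.timeit(lambda: pattern.split(data), number=1000)

print('str.split: {:.4f}s  re.split: {:.4f}s'.format(t_str, t_re))
```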
msg226623 - Author: Christoph Wruck (cwr) Date: 2014-09-09 05:44
David, I'll reflect on it. @ALL: thanks for all the answers.
Should I close this ticket?
History
Date                 User              Action  Args
2022-04-11 14:58:07  admin             set     github: 66556
2014-09-12 21:51:58  terry.reedy       set     status: open -> closed
                                               resolution: rejected
2014-09-09 05:44:28  cwr               set     messages: + msg226623
2014-09-08 13:42:36  cwr               set     status: closed -> open
                                               resolution: rejected -> (no value)
                                               messages: + msg226580
2014-09-08 13:10:43  r.david.murray    set     status: open -> closed
                                               nosy: + r.david.murray
                                               messages: + msg226579
                                               resolution: rejected
                                               stage: resolved
2014-09-08 12:43:44  cwr               set     messages: + msg226577
2014-09-08 11:18:54  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg226571
2014-09-08 10:29:18  steven.daprano    set     nosy: + steven.daprano
                                               messages: + msg226568
2014-09-08 09:29:02  cwr               create