classification
Title: re.split loses characters matching ungrouped parts of a pattern
Type: behavior
Stage: resolved
Components: Regular Expressions
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: docs@python, ezio.melotti, mikehoy, mrabarnett, r.david.murray, triquetra011
Priority: normal
Keywords:

Created on 2013-04-08 18:18 by triquetra011, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (12)
msg186324 - Author: Tomasz J. Kotarba (triquetra011) Date: 2013-04-08 18:18
Tested in 2.7 but possibly affects the other versions as well.

A real life example (note the first character '>' being lost):

>>> import re
>>> re.split(r'^>(.*)$', '>Homo sapiens catenin (cadherin-associated)')

produces:

['', 'Homo sapiens catenin (cadherin-associated)', '']


Expected (and IMHO most useful) behaviour would be for it to return:

['', '>Homo sapiens catenin (cadherin-associated)', '']

or (IMHO much less useful, as one can already get this one just by adding external grouping parentheses):

['', '>Homo sapiens catenin (cadherin-associated)', 'Homo sapiens catenin (cadherin-associated)', '']

Not sure whether it can be changed in such a mature and widely used module without breaking compatibility but just adding a new optional parameter for deciding how re.split() deals with patterns containing grouping parentheses and making it default to the current behaviour would be very helpful.
Best Regards
msg186328 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-04-08 18:55
It's not a bug.

The documentation says """Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."""

You're splitting on r'^>(.*)$', but not capturing the '>', therefore it's excluded.

If you want the '>' included, then put it inside the capture group too.
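The effect of the group placement can be checked side by side. The snippet below is a minimal sketch using the string from the report (the variable names are illustrative): the matched text is always treated as the separator and dropped, and only the text of capturing groups is added back to the result.

```python
import re

header = '>Homo sapiens catenin (cadherin-associated)'

# No capturing group: the whole match is the separator, so it is dropped.
no_group = re.split(r'^>.*$', header)

# Group around the text after '>': only that text is returned.
inner_group = re.split(r'^>(.*)$', header)

# '>' moved inside the group: it is now part of the returned text.
outer_group = re.split(r'^(>.*)$', header)

print(no_group)     # ['', '']
print(inner_group)  # ['', 'Homo sapiens catenin (cadherin-associated)', '']
print(outer_group)  # ['', '>Homo sapiens catenin (cadherin-associated)', '']
```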
msg186329 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-08 19:13
Thanks for the report, but as Matt said it doesn't look like there is any bug here.  The behavior you report is what the docs say it is, and it seems to me that your "most useful" suggestion would discard the information about the group match, making specifying groups in the separator pointless.
msg186330 - Author: Tomasz J. Kotarba (triquetra011) Date: 2013-04-08 19:20
Hi Matthew,

Thanks for such a quick reply.  I know I can get the > by putting it in grouping parentheses.  That's not the issue here.  The documentation you quoted says that it splits the string by the occurrences _OF_PATTERN_ and that texts of all groups are _ALSO_ returned as _PART_ of the resulting list.  It does not say anywhere (nor does it even suggest that) that parts of the pattern not grouped with parentheses are REMOVED.

That said, I did not report this issue to split hairs (I would rather split strings with regular expressions ;)) and perform linguistic analysis of the current documentation (which is not set in stone and has been changed before).  I did that because I spotted an issue which slightly limits the usefulness of re.split() and suggested a potential improvement which would solve the problem and make re.split() even better than it already is.  Whether the powers that be do something with this and improve re.split() is of course not my decision.

Cheers,
T
msg186333 - Author: Tomasz J. Kotarba (triquetra011) Date: 2013-04-08 19:25
Hi R. David Murray,
Thanks for your reply.  I just explained in my previous message to Matthew that the documentation does actually support my view (i.e. it is an issue according to the documentation).  Re. the issue you mentioned (discarding information concerning group matching), that (plus maintaining compatibility with legacy code) is why I suggested adding a new optional argument to re.split.  Apropos discarding information, the current behaviour results in discarding information about parts of the string not enclosed in grouping parentheses.
Cheers,
T
msg186334 - Author: Tomasz J. Kotarba (triquetra011) Date: 2013-04-08 19:27
Marking as open till I get your response.  I hope you reconsider.
msg186338 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-08 19:30
>>> re.split('-', 'abc-def-jlk')
['abc', 'def', 'jlk']
>>> re.split('(-)', 'abc-def-jlk')
['abc', '-', 'def', '-', 'jlk']

Does that make it a bit clearer?  Maybe we need an actual example in the docs.
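One consequence of returning the text of every group is worth seeing with more than one group in the separator. The sketch below (the pattern and input string are illustrative) shows that a group which did not take part in a given match still occupies a slot in the result, filled with None.

```python
import re

# Two alternative separators, each in its own capturing group.
parts = re.split(r'(-)|(\+)', 'abc-def+jlk')
print(parts)    # ['abc', '-', None, 'def', None, '+', 'jlk']

# Drop the placeholders left by groups that did not participate.
cleaned = [p for p in parts if p is not None]
print(cleaned)  # ['abc', '-', 'def', '+', 'jlk']
```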
msg186341 - Author: Tomasz J. Kotarba (triquetra011) Date: 2013-04-08 19:52
I agree that introducing an example like that plus making some slight changes in wording would be a welcome change to the docs to clearly explain the current behaviour.  Still, I maintain it would be useful to give users the option I described to allow them to decide what output they get (i.e. also get texts matching the whole pattern and/or those matching the pattern and groups (e.g. pattern returned as kind of "group 0")).  As I said though, I realise that it is not for me to decide so I am just suggesting it to the powers that be.
Cheers,
T
msg186343 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-08 20:05
As you pointed out, you can already get that behavior by enclosing the entire split expression in a group.  I don't see that there is any functionality missing here.
msg186374 - Author: Tomasz J. Kotarba (triquetra011) Date: 2013-04-09 04:59
Hi,
I can still see one piece of functionality I have mentioned missing. Using my first example, even when one uses '^(>(.*))$' one cannot get ['', '>Homo sapiens catenin (cadherin-associated)', '']: one gets a four-element list and needs to deal with the third element of the returned list (i.e. the match for the inner group).  Having the parameter I described before, which allows for getting output similar to what one gets for groups but for the whole pattern (and only that), would be very convenient in some scenarios (like when writing a procedure which processes texts using different regex patterns, unknown at the time of writing the procedure, and which uses a variable number of groups but also the pattern as a whole, including for performing the split operation).
Of course it can be worked around using many different approaches but still, as I said at the start, I believe it would be useful (and would not break compatibility).  Another possible solution (i.e. different from the one I suggested at the start) would be to have a parameter to tell re.split to ignore the groups (or, going even further, to select which groups to ignore).  Anyway, I am not the developer of this module so if you feel it would be too much of a bother to add such a parameter just for the sake of convenience then, by all means, please feel free to disregard my comments and just close this report.
Cheers,
T
P.S.  It is very late so I can only hope I have been sane enough to properly / clearly express my thoughts.  Apologies if not.
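For readers who want the behaviour requested above without any change to re.split, one possible workaround is sketched here. The helper split_keep_match and its keep_match parameter are hypothetical names, not part of the re module; the sketch assumes it is acceptable to walk the matches with re.finditer and to emit only the whole match ("group 0") for each separator, whatever groups the pattern contains. Its handling of empty matches has not been compared with re.split.

```python
import re

def split_keep_match(pattern, string, keep_match=True, flags=0):
    """Hypothetical helper, not part of the re module.

    Splits like re.split but ignores the pattern's capturing groups.
    With keep_match=True the whole matched text of each separator is
    kept in the result; with keep_match=False it is dropped.
    """
    parts = []
    last = 0
    for m in re.finditer(pattern, string, flags):
        parts.append(string[last:m.start()])  # text before the separator
        if keep_match:
            parts.append(m.group(0))          # whole match only, no groups
        last = m.end()
    parts.append(string[last:])               # text after the last separator
    return parts

header = '>Homo sapiens catenin (cadherin-associated)'
kept = split_keep_match(r'^(>(.*))$', header)
dropped = split_keep_match(r'(-)|(\+)', 'abc-def+jlk', keep_match=False)
print(kept)     # ['', '>Homo sapiens catenin (cadherin-associated)', '']
print(dropped)  # ['abc', 'def', 'jlk']
```

Because the groups are never consulted, the result has the same shape no matter how many groups an unknown pattern happens to define.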
msg186383 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-09 06:21
Only group the stuff you want to see in the result:

>>> re.split(r'(^>.*$)', '>Homo sapiens catenin (cadherin-associated)')
['', '>Homo sapiens catenin (cadherin-associated)', '']

>>> re.split(r'^(>.*)$', '>Homo sapiens catenin (cadherin-associated)')
['', '>Homo sapiens catenin (cadherin-associated)', '']

If you are using grouping to get alternatives, you can use a non-capturing group:

>>> re.split(r'(ca(?:t|d))', '>Homo sapiens catenin (cadherin-associated)')
['>Homo sapiens ', 'cat', 'enin (', 'cad', 'herin-associated)']

(By the way, I'm a bit confused as to what exactly you are splitting in your original example, since you seem to be matching the whole string, and only if it is the whole string.  On the other hand, regular expressions regularly confuse me... :)

I indeed do not think it is worth complicating the interface to handle the unusual case of accepting and applying unknown regexes.  The one change I could see as a possibility would be to allow all of the groups matched by the split regex to appear as a single sublist.  But I'm not the maintainer of this module either :)
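The "single sublist" idea mentioned above can be approximated in user code. The following is only a sketch of what such output might look like: the helper split_with_group_lists is a hypothetical name and does not exist in the re module, and it assumes that collecting Match.groups() per separator is the intended meaning.

```python
import re

def split_with_group_lists(pattern, string, flags=0):
    """Hypothetical helper, not part of the re module.

    Splits like re.split, but the groups of each separator match are
    collected into one sublist instead of being spliced into the result.
    """
    parts = []
    last = 0
    for m in re.finditer(pattern, string, flags):
        parts.append(string[last:m.start()])  # text before the separator
        parts.append(list(m.groups()))        # all groups of this match
        last = m.end()
    parts.append(string[last:])               # text after the last separator
    return parts

text = '>Homo sapiens catenin (cadherin-associated)'
result = split_with_group_lists(r'(ca)(t|d)', text)
print(result)
# ['>Homo sapiens ', ['ca', 't'], 'enin (', ['ca', 'd'], 'herin-associated)']
```

With this shape, plain strings are always the text between separators and lists are always group data, so a caller does not need to know how many groups the pattern defines.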
msg186509 - Author: Tomasz J. Kotarba (triquetra011) Date: 2013-04-10 15:40
The example I gave was the simplest possible to illustrate my point but yes, you are correct, I often match the whole string as I do recursive matches.  I do use non-capturing groups but they would not solve the problem I talked about.  Anyway, I had solved my problem before I reported this issue so I would be all right with whatever the outcome of this discussion was but I am glad we have managed to contribute to improving the docs.  Thanks and nice talking to you :)!
History
Date  User  Action  Args
2022-04-11 14:57:44  admin  set  github: 61868
2014-09-14 19:40:02  serhiy.storchaka  set  status: pending -> closed
    resolution: not a bug
    stage: needs patch -> resolved
2013-10-27 16:46:55  serhiy.storchaka  set  status: open -> pending
2013-04-14 08:25:58  mikehoy  set  nosy: + mikehoy
2013-04-10 15:40:01  triquetra011  set  messages: + msg186509
2013-04-09 06:21:24  r.david.murray  set  messages: + msg186383
2013-04-09 04:59:52  triquetra011  set  messages: + msg186374
2013-04-08 20:05:43  r.david.murray  set  messages: + msg186343
2013-04-08 19:52:20  triquetra011  set  messages: + msg186341
    components: - Documentation
2013-04-08 19:30:05  r.david.murray  set  nosy: + docs@python
    messages: + msg186338
    assignee: docs@python
    components: + Documentation
    stage: resolved -> needs patch
2013-04-08 19:27:27  triquetra011  set  status: closed -> open
    resolution: not a bug -> (no value)
    messages: + msg186334
2013-04-08 19:25:14  triquetra011  set  messages: + msg186333
2013-04-08 19:20:16  triquetra011  set  messages: + msg186330
2013-04-08 19:13:32  r.david.murray  set  status: open -> closed
    nosy: + r.david.murray
    messages: + msg186329
    resolution: not a bug
    stage: resolved
2013-04-08 18:55:39  mrabarnett  set  messages: + msg186328
2013-04-08 18:18:58  triquetra011  create