^$ won't split on empty line #39646

jburgy · 2003-12-02T11:01:38Z

BPO	852532
Nosy	@tim-one, @freddrake, @smontanaro
PRs	#4471 (bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns); #4678 (bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns, alternate version)

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/smontanaro'
closed_at = <Date 2008-04-13.03:29:45.158>
created_at = <Date 2003-12-02.11:01:38.000>
labels = ['expert-regex']
title = "^$ won't split on empty line"
updated_at = <Date 2017-12-02.17:32:37.093>
user = 'https://bugs.python.org/jburgy'

bugs.python.org fields:

activity = <Date 2017-12-02.17:32:37.093>
actor = 'serhiy.storchaka'
assignee = 'skip.montanaro'
closed = True
closed_date = <Date 2008-04-13.03:29:45.158>
closer = 'skip.montanaro'
components = ['Regular Expressions']
creation = <Date 2003-12-02.11:01:38.000>
creator = 'jburgy'
dependencies = []
files = []
hgrepos = []
issue_num = 852532
keywords = []
message_count = 9.0
messages = ['19230', '19231', '19232', '19233', '19234', '19235', '55563', '55625', '65475']
nosy_count = 6.0
nosy_names = ['tim.peters', 'fdrake', 'effbot', 'skip.montanaro', 'mkc', 'jburgy']
pr_nums = ['4471', '4678']
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue852532'
versions = ['Python 2.3']

jburgy · 2003-12-02T11:01:38Z

Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32

>>> import re
>>> re.compile('^$', re.MULTILINE).split('foo\n\nbar')
['foo\n\nbar']

I expect ['foo\n', '\nbar'], since, according to the
documentation $ "in MULTILINE mode also matches
before a newline".

Thanks, Jan

tim-one · 2003-12-02T15:20:27Z

Confirmed on Pythons 2.1.3, 2.2.3, 2.3.2, and current CVS.

More generally, split() doesn't appear to split on any empty
(0-length) match. For example,

>>> pat = re.compile(r'\b')
>>> pat.split('(a b)')
['(a b)']
>>> pat.findall('(a b)')  # but the pattern matches 4 places
['', '', '', '']
>>>

That's probably a design constraint, but isn't documented.
For example, if you split "abc" by the pattern x*, what do you
expect? The pattern matches (with length 0) at 4 places,
but I bet most people would be surprised to get

['', 'a', 'b', 'c', '']

back instead of (as they do get)

['abc']
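
To make the 0-length matches above concrete, here is a minimal sketch that cuts a string at every match reported by `finditer`, including empty ones. The helper name `split_on_every_match` is hypothetical and not part of `re`; it ignores capture groups and has no `maxsplit`, so it only illustrates what splitting on empty matches would produce for the three patterns discussed so far.

```python
import re

def split_on_every_match(pattern, text, flags=0):
    """Cut text at every match of pattern, including 0-length matches.

    Hypothetical helper for illustration only; unlike re.split it
    ignores capture groups and has no maxsplit argument.
    """
    pieces = []
    last = 0
    for match in re.finditer(pattern, text, flags):
        # Keep everything between the previous match and this one.
        pieces.append(text[last:match.start()])
        last = match.end()
    # Keep whatever follows the final match.
    pieces.append(text[last:])
    return pieces

print(split_on_every_match(r'x*', 'abc'))                       # ['', 'a', 'b', 'c', '']
print(split_on_every_match(r'\b', '(a b)'))                     # ['(', 'a', ' ', 'b', ')']
print(split_on_every_match(r'^$', 'foo\n\nbar', re.MULTILINE))  # ['foo\n', '\nbar']
```

The last call gives the result the original report expected, and the first gives the list Tim describes as surprising.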

effbot · 2003-12-11T13:42:10Z

Split never splits on empty substrings; see Tim's answer for a
brief discussion.

Fred, can you perhaps add something to the documentation?

mkc · 2004-01-01T05:28:44Z

Hi, I was going to file this bug just now myself, as this
seems like a really useful feature. For example, I've
several times wanted to split on '^' or '^(?=S)' (to split
up a data file into paragraphs that start with an initial
S). Instead I have to do something like '\n(?=S)', which is
rather more hideous.
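
For reference, a small sketch of the '\n(?=S)' form in use. The sample data is invented for illustration. Because the newline itself is part of the match, the match is not empty and `split()` does cut there, at the cost of dropping that newline from the pieces.

```python
import re

# Invented sample data: three paragraphs, each starting with "S".
data = "S one\nmore\nS two\nS three\n"

# The newline before each "S" is a non-empty match, so split() cuts there
# and removes that newline from the resulting pieces.
paragraphs = re.split(r'\n(?=S)', data)
print(paragraphs)  # ['S one\nmore', 'S two', 'S three\n']
```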

To answer tim_one's challenge, yes, I *do* expect splitting
by 'x*' to break a string into letters, now that I've
thought about it. To not do so is a bizarre and surprising
behavior, IMO. (Patient: Doctor, when I split on this
nonsense pattern I get nonsense! Doctor: Then don't do that.)

The fix should be near this line in _sre.c, I think.

        if (state.start == state.ptr) {

I could work on a patch if you'll take it...

Mike

jburgy · 2004-01-14T11:07:57Z

Since I really needed the functionality described above, I
came up with a work-around. It's a sufficient replacement,
maybe it belongs in some FAQ:

>>> import re
>>> re.sub('(?im)^$', '\f', 'foo\n\nbar').split('\f')
['foo\n', '\nbar']

Another "magic" byte could replace '\f'...
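
One way to guard the choice of "magic" byte is to refuse any sentinel that already occurs in the input. The following is a minimal sketch of that idea; the function name `split_via_sentinel` and its parameters are hypothetical, and it assumes the pattern contains no capture groups whose text should be kept.

```python
import re

def split_via_sentinel(pattern, text, flags=0, sentinel='\f'):
    """Split text by replacing each match with a sentinel, then str.split.

    Hypothetical helper: raises ValueError if the sentinel already occurs
    in the text, since the split would then cut in the wrong places.
    """
    if sentinel in text:
        raise ValueError('sentinel already occurs in the text')
    return re.sub(pattern, sentinel, text, flags=flags).split(sentinel)

print(split_via_sentinel(r'^$', 'foo\n\nbar', re.MULTILINE))  # ['foo\n', '\nbar']
```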

Regards, Jan

mkc · 2004-07-11T03:32:35Z

I made a patch that addresses this (bpo-988761).

smontanaro · 2007-09-01T17:42:19Z

Doc note checked in as r57878. Can we conclude based upon Tim's
and Fredrik's comments that this behavior is to be expected and
won't change? If so, I'll close this item.

mkc · 2007-09-03T21:22:28Z

Well, I think we can conclude that it's expected by *them*. :-) I
still find it surprising, and it somewhat lessens the utility of
re.split for my use cases. (I think re.finditer may also suffer from
the same problem, but I don't recall.)

If you look at the comments attached to the patch for this bug, it
looks like akuchling and rhettinger more or less saw this as being a bug
worth fixing, though there were questions about exactly what the
correct fix should be.

http://bugs.python.org/issue988761

One comment about your doc fix: You highlight a fairly useless
zero-character match (e.g., "x*") to demonstrate the behavior, which
might leave the user scratching his head. (I think this case was
originally mentioned as a corner case, not one that would be useful.)
It'd be nice to highlight a more useful case like '^(?=S)' or perhaps
a little more generically something like '^(?=HEADER)' or
'^(?=BEGIN)', which is the usage that tripped me up in the first place.

Thanks for working on this!

mkc · 2008-04-14T18:48:27Z

I'd feel better about this bug being 'wont fix'ed if I had a sense that
several people considered the patch and thought that it sucked. At the
moment, it seems more like it just fell off of the end without ever
being seriously contemplated. :-(

jburgy mannequin assigned freddrake Dec 2, 2003

jburgy mannequin added the topic-regex label Dec 2, 2003

smontanaro assigned smontanaro and unassigned freddrake Sep 1, 2007

smontanaro closed this as completed Apr 13, 2008

ezio-melotti transferred this issue from another repository Apr 9, 2022
