Issue 852532: ^$ won't split on empty line

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/39646

classification

Title:	^$ won't split on empty line
Type:
Stage:
Components:	Regular Expressions
Versions:	Python 2.3

process

Status:	closed
Resolution:	wont fix
Dependencies:
Superseder:
Assigned To:	skip.montanaro
Nosy List:	effbot, fdrake, jburgy, mkc, skip.montanaro, tim.peters
Priority:	normal
Keywords:

Created on 2003-12-02 11:01 by jburgy, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked
PR 4471	merged	serhiy.storchaka, 2017-11-19 23:36
PR 4678	closed	serhiy.storchaka, 2017-12-02 17:32

Messages (9)
msg19230	Author: Jan Burgy (jburgy)	Date: 2003-12-02 11:01

Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
>>> import re
>>> re.compile('^$', re.MULTILINE).split('foo\n\nbar')
['foo\n\nbar']

I expect ['foo\n', '\nbar'], since, according to the documentation, $ "in MULTILINE mode also matches before a newline".

Thanks,
Jan
msg19231	Author: Tim Peters (tim.peters) *	Date: 2003-12-02 15:20

Confirmed on Pythons 2.1.3, 2.2.3, 2.3.2, and current CVS.

More generally, split() doesn't appear to split on any empty (0-length) match. For example,

>>> pat = re.compile(r'\b')
>>> pat.split('(a b)')
['(a b)']
>>> pat.findall('(a b)') # but the pattern matches 4 places
['', '', '', '']
>>>

That's probably a design constraint, but isn't documented. For example, if you split "abc" by the pattern x*, what do you expect? The pattern matches (with length 0) at 4 places, but I bet most people would be surprised to get

['', 'a', 'b', 'c', '']

back instead of (as they do get)

['abc']
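
The zero-width matches described above can be listed explicitly. This is a minimal sketch, not part of the original report: it uses re.finditer rather than re.split, since how split treats empty matches is the point under discussion, and the helper name match_positions is made up for illustration.

```python
import re

def match_positions(pattern, text, flags=0):
    """Return the (start, end) span of every match, including empty ones."""
    return [m.span() for m in re.finditer(pattern, text, flags)]

# \b matches at the four word boundaries in '(a b)'; every match is empty,
# so start == end in each span.
boundary_spans = match_positions(r'\b', '(a b)')

# 'abc' contains no 'x', so x* matches the empty string at each position,
# including the end of the string.
star_spans = match_positions(r'x*', 'abc')

# In MULTILINE mode ^$ matches once, at the empty line between the newlines.
empty_line_spans = match_positions(r'^$', 'foo\n\nbar', re.MULTILINE)

print(boundary_spans)
print(star_spans)
print(empty_line_spans)
```

Each span has equal start and end offsets, which is what makes these the "empty (0-length)" matches that split() was observed to skip.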
msg19232	Author: Fredrik Lundh (effbot) *	Date: 2003-12-11 13:42

Split never splits on empty substrings; see Tim's answer for a brief discussion.

Fred, can you perhaps add something to the documentation?
msg19233	Author: Mike Coleman (mkc)	Date: 2004-01-01 05:28

Hi, I was going to file this bug just now myself, as this seems like a really useful feature. For example, I've several times wanted to split on '^' or '^(?=S)' (to split up a data file into paragraphs that start with an initial S). Instead I have to do something like '\n(?=S)', which is rather more hideous.

To answer tim_one's challenge, yes, I do expect splitting by 'x*' to break a string into letters, now that I've thought about it. To not do so is a bizarre and surprising behavior, IMO. (Patient: Doctor, when I split on this nonsense pattern I get nonsense! Doctor: Then don't do that.)

The fix should be near this line in _sre.c, I think.

    if (state.start == state.ptr) {

I could work on a patch if you'll take it...

Mike
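
The '\n(?=S)' alternative mentioned above can be sketched as follows. The sample text is hypothetical. Because the separator is a real newline character rather than an empty match, this form does not depend on how split() treats zero-width matches.

```python
import re

# Hypothetical sample data: two paragraphs, each starting with 'S'.
text = 'S one\nmore\nS two\nmore\n'

# The separator is the newline itself (a non-empty match); the lookahead
# only requires that the following line starts with 'S'.
paragraphs = re.split(r'\n(?=S)', text)

print(paragraphs)
```

One cost of this form is that the newline used as the separator is consumed, so the first paragraph loses its trailing newline, whereas a split on '^(?=S)' would keep it.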
msg19234	Author: Jan Burgy (jburgy)	Date: 2004-01-14 11:07

Since I really needed the functionality described above, I came up with a broke-around. It's a sufficient replacement, maybe it belongs in some FAQ:

>>> import re
>>> re.sub('(?im)^$', '\f', 'foo\n\nbar').split('\f')
['foo\n', '\nbar']

Another "magic" byte could replace '\f'...

Regards,
Jan
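
A variant of this work-around that needs no "magic" byte is sketched below. It assumes that cutting the text at every match reported by re.finditer is acceptable. The helper name split_at_matches is made up for illustration, and unlike re.split it ignores capture groups and has no maxsplit argument.

```python
import re

def split_at_matches(pattern, text, flags=0):
    """Cut text at every match of pattern, including zero-width matches."""
    pieces = []
    last = 0  # end offset of the previous match
    for m in re.finditer(pattern, text, flags):
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])  # remainder after the final match
    return pieces

# The case from the original report.
print(split_at_matches(r'^$', 'foo\n\nbar', re.MULTILINE))

# The paragraph use case: cut before each line that starts with 'S'.
print(split_at_matches(r'^(?=S)', 'Sa\nSb\n', re.MULTILINE))
```

Since nothing is substituted into the text, the result does not depend on the input being free of '\f' or any other sentinel character. A match at offset 0 produces a leading empty string, which callers may want to drop.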
msg19235	Author: Mike Coleman (mkc)	Date: 2004-07-11 03:32

I made a patch that addresses this (#988761).
msg55563	Author: Skip Montanaro (skip.montanaro) *	Date: 2007-09-01 17:42

Doc note checked in as r57878. Can we conclude based upon Tim's and Fredrik's comments that this behavior is to be expected and won't change? If so, I'll close this item.
msg55625	Author: Mike Coleman (mkc)	Date: 2007-09-03 21:22

Well, I think we can conclude that it's expected by them. :-) I still find it surprising, and it somewhat lessens the utility of re.split for my use cases. (I think re.finditer may also suffer from the same problem, but I don't recall.)

If you look at the comments attached to the patch for this bug, it looks like akuchling and rhettinger more or less saw this as being a bug worth fixing, though there were questions about exactly what the correct fix should be.

http://bugs.python.org/issue988761

One comment about your doc fix: You highlight a fairly useless zero-character match (e.g., "x*") to demonstrate the behavior, which might leave the user scratching his head. (I think this case was originally mentioned as a corner case, not one that would be useful.) It'd be nice to highlight a more useful case like '^(?=S)' or, perhaps a little more generically, something like '^(?=HEADER)' or '^(?=BEGIN)', which is a usage that tripped me up in the first place.

Thanks for working on this!
msg65475	Author: Mike Coleman (mkc)	Date: 2008-04-14 18:48

I'd feel better about this bug being 'wont fix'ed if I had a sense that several people considered the patch and thought that it sucked. At the moment, it seems more like it just fell off of the end without ever being seriously contemplated. :-(

History
Date	User	Action	Args
2022-04-11 14:56:01	admin	set	github: 39646
2017-12-02 17:32:37	serhiy.storchaka	set	pull_requests: + pull_request4588
2017-11-19 23:36:58	serhiy.storchaka	set	pull_requests: + pull_request4405
2008-04-14 18:48:27	mkc	set	messages: + msg65475
2008-04-13 03:29:45	skip.montanaro	set	status: pending -> closed
2007-09-03 21:22:29	mkc	set	messages: + msg55625
2007-09-01 17:42:19	skip.montanaro	set	status: open -> pending; assignee: fdrake -> skip.montanaro; resolution: postponed -> wont fix; messages: + msg55563; nosy: + skip.montanaro
2003-12-02 11:01:38	jburgy	create