^$ won't split on empty line #39646

Closed
jburgy mannequin opened this issue Dec 2, 2003 · 9 comments

@jburgy
Mannequin

jburgy mannequin commented Dec 2, 2003

BPO 852532
Nosy @tim-one, @freddrake, @smontanaro
PRs
  • bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. #4471
  • bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns (alternate version). #4678
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/smontanaro'
    closed_at = <Date 2008-04-13.03:29:45.158>
    created_at = <Date 2003-12-02.11:01:38.000>
    labels = ['expert-regex']
    title = "^$ won't split on empty line"
    updated_at = <Date 2017-12-02.17:32:37.093>
    user = 'https://bugs.python.org/jburgy'

    bugs.python.org fields:

    activity = <Date 2017-12-02.17:32:37.093>
    actor = 'serhiy.storchaka'
    assignee = 'skip.montanaro'
    closed = True
    closed_date = <Date 2008-04-13.03:29:45.158>
    closer = 'skip.montanaro'
    components = ['Regular Expressions']
    creation = <Date 2003-12-02.11:01:38.000>
    creator = 'jburgy'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 852532
    keywords = []
    message_count = 9.0
    messages = ['19230', '19231', '19232', '19233', '19234', '19235', '55563', '55625', '65475']
    nosy_count = 6.0
    nosy_names = ['tim.peters', 'fdrake', 'effbot', 'skip.montanaro', 'mkc', 'jburgy']
    pr_nums = ['4471', '4678']
    priority = 'normal'
    resolution = 'wont fix'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue852532'
    versions = ['Python 2.3']

    @jburgy
    Mannequin Author

    jburgy mannequin commented Dec 2, 2003

    Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32

    >>> import re
    >>> re.compile('^$', re.MULTILINE).split('foo\n\nbar')
    ['foo\n\nbar']

    I expect ['foo\n', '\nbar'], since, according to the
    documentation $ "in MULTILINE mode also matches
    before a newline".

    Thanks, Jan

    @jburgy jburgy mannequin assigned freddrake Dec 2, 2003
    @jburgy jburgy mannequin added the topic-regex label Dec 2, 2003
    @tim-one
    Member

    tim-one commented Dec 2, 2003

    Confirmed on Pythons 2.1.3, 2.2.3, 2.3.2, and current CVS.

    More generally, split() doesn't appear to split on any empty
    (0-length) match. For example,

    >>> pat = re.compile(r'\b')
    >>> pat.split('(a b)')
    ['(a b)']
    >>> pat.findall('(a b)')  # but the pattern matches 4 places
    ['', '', '', '']
    >>>

    That's probably a design constraint, but isn't documented.
    For example, if you split "abc" by the pattern x*, what do you
    expect? The pattern matches (with length 0) at 4 places,
    but I bet most people would be surprised to get

    ['', 'a', 'b', 'c', '']

    back instead of (as they do get)

    ['abc']
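
To see exactly where the zero-length matches described above fall, the match offsets can be listed with `re.finditer`, which reports empty matches regardless of how `split()` treats them. This is only a minimal sketch; the helper name `match_offsets` is made up for illustration and is not part of the `re` module.

```python
import re

def match_offsets(pattern, text, flags=0):
    """Return the (start, end) span of every match, zero-length ones included.

    Hypothetical helper for illustration only.
    """
    return [m.span() for m in re.finditer(pattern, text, flags)]

# r'\b' matches at the four word boundaries in '(a b)'.
print(match_offsets(r'\b', '(a b)'))  # [(1, 1), (2, 2), (3, 3), (4, 4)]

# 'x*' matches the empty string at each of the four positions in 'abc'.
print(match_offsets(r'x*', 'abc'))    # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Every span has equal start and end, which is what "0-length match" means here.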

    @effbot
    Mannequin

    effbot mannequin commented Dec 11, 2003

    Split never splits on empty substrings; see Tim's answer for a
    brief discussion.

    Fred, can you perhaps add something to the documentation?

    @mkc
    Mannequin

    mkc mannequin commented Jan 1, 2004

    Hi, I was going to file this bug just now myself, as this
    seems like a really useful feature. For example, I've
    several times wanted to split on '^' or '^(?=S)' (to split
    up a data file into paragraphs that start with an initial
    S). Instead I have to do something like '\n(?=S)', which is
    rather more hideous.

    To answer tim_one's challenge, yes, I *do* expect splitting
    by 'x*' to break a string into letters, now that I've
    thought about it. To not do so is a bizarre and surprising
    behavior, IMO. (Patient: Doctor, when I split on this
    nonsense pattern I get nonsense! Doctor: Then don't do that.)

    The fix should be near this line in _sre.c, I think.

            if (state.start == state.ptr) {

    I could work on a patch if you'll take it...

    Mike
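
The `'\n(?=S)'` workaround mentioned in the comment above can be sketched roughly as follows. Because the pattern consumes the newline, the match is never empty, so `re.split` accepts it. The function name `split_paragraphs`, the `marker` parameter, and the sample text are assumptions made for this illustration.

```python
import re

def split_paragraphs(text, marker='S'):
    """Split text before each line that starts with `marker`.

    Hypothetical helper: splits on a newline followed by the marker, so the
    separating newline is consumed and does not appear in the pieces.
    """
    return re.split(r'\n(?=%s)' % re.escape(marker), text)

sample = 'S one\n  cont\nS two\nS three\n  more'
print(split_paragraphs(sample))
# ['S one\n  cont', 'S two', 'S three\n  more']
```

One visible difference from a split on `'^(?=S)'`: the newline that precedes each paragraph is dropped from the result rather than kept at the end of the previous piece.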

    @jburgy
    Mannequin Author

    jburgy mannequin commented Jan 14, 2004

    Since I really needed the functionality described above, I
    came up with a workaround. It's a sufficient replacement;
    maybe it belongs in some FAQ:

    >>> import re
    >>> re.sub('(?im)^$', '\f', 'foo\n\nbar').split('\f')
    ['foo\n', '\nbar']

    Another "magic" byte could replace '\f'...

    Regards, Jan
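
A variant of the workaround above that avoids picking a "magic" byte is to cut the string at the offsets reported by `re.finditer`. This is a sketch under stated assumptions, not how `re.split` is implemented: `split_at_matches` is a hypothetical helper, and unlike `re.split` it ignores capture groups and has no `maxsplit` argument.

```python
import re

def split_at_matches(pattern, text, flags=0):
    """Split text at every match of pattern, zero-length matches included.

    Hypothetical helper for illustration; not part of the re module.
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, text, flags):
        pieces.append(text[last:m.start()])  # text before this match
        last = m.end()                       # resume after the match
    pieces.append(text[last:])               # trailing remainder
    return pieces

print(split_at_matches('^$', 'foo\n\nbar', re.MULTILINE))
# ['foo\n', '\nbar']

print(split_at_matches('x*', 'abc'))
# ['', 'a', 'b', 'c', '']
```

Since no sentinel character is inserted, the result cannot be corrupted by input that already contains `'\f'`.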

    @mkc
    Mannequin

    mkc mannequin commented Jul 11, 2004

    I made a patch that addresses this (bpo-988761).

    @smontanaro
    Contributor

    Doc note checked in as r57878. Can we conclude based upon Tim's
    and Fredrik's comments that this behavior is to be expected and
    won't change? If so, I'll close this item.

    @smontanaro smontanaro assigned smontanaro and unassigned freddrake Sep 1, 2007
    @mkc
    Mannequin

    mkc mannequin commented Sep 3, 2007

    Well, I think we can conclude that it's expected by *them*. :-) I
    still find it surprising, and it somewhat lessens the utility of
    re.split for my use cases. (I think re.finditer may also suffer from
    the same problem, but I don't recall.)

    If you look at the comments attached to the patch for this bug, it
    looks like akuchling and rhettinger more or less saw this as being a bug
    worth fixing, though there were questions about exactly what the
    correct fix should be.

    http://bugs.python.org/issue988761

    One comment about your doc fix: you highlight a fairly useless
    zero-character match (e.g., "x*") to demonstrate the behavior, which
    might leave the user scratching his head. (I think this case was
    originally mentioned as a corner case, not one that would be useful.)
    It'd be nice to highlight a more useful case like '^(?=S)', or perhaps
    a little more generically something like '^(?=HEADER)' or
    '^(?=BEGIN)', which is the usage that tripped me up in the first place.

    Thanks for working on this!

    @mkc
    Mannequin

    mkc mannequin commented Apr 14, 2008

    I'd feel better about this bug being 'wont fix'ed if I had a sense that
    several people considered the patch and thought that it sucked. At the
    moment, it seems more like it just fell off of the end without ever
    being seriously contemplated. :-(

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022