re.split emptyok flag (fix for #852532) #40540

Closed
mkc mannequin opened this issue Jul 11, 2004 · 12 comments
Labels
extension-modules C modules in the Modules dir

Comments

@mkc
Mannequin

mkc mannequin commented Jul 11, 2004

BPO 988761
Nosy @akuchling, @birkenfeld, @rhettinger, @gpshead
Files
  • emptyok.patch: emptyok patch against current CVS HEAD 2004-07-10
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2008-07-16.22:35:16.479>
    created_at = <Date 2004-07-11.03:25:40.000>
    labels = ['extension-modules']
    title = 're.split emptyok flag (fix for python/cpython#39646)'
    updated_at = <Date 2008-07-16.22:35:16.478>
    user = 'https://bugs.python.org/mkc'

    bugs.python.org fields:

    activity = <Date 2008-07-16.22:35:16.478>
    actor = 'georg.brandl'
    assignee = 'effbot'
    closed = True
    closed_date = <Date 2008-07-16.22:35:16.479>
    closer = 'georg.brandl'
    components = ['Extension Modules']
    creation = <Date 2004-07-11.03:25:40.000>
    creator = 'mkc'
    dependencies = []
    files = ['6089']
    hgrepos = []
    issue_num = 988761
    keywords = ['patch']
    message_count = 12.0
    messages = ['46352', '46353', '46354', '46355', '46356', '46357', '46358', '46359', '46360', '46361', '69372', '69851']
    nosy_count = 8.0
    nosy_names = ['effbot', 'akuchling', 'georg.brandl', 'rhettinger', 'gregory.p.smith', 'mkc', 'colander_man', 'filip']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue988761'
    versions = ['Python 2.6']

    @mkc
    Mannequin Author

    mkc mannequin commented Jul 11, 2004

    This patch addresses bug bpo-852532. The underlying
    problem is that re.split ignores any match it makes
    that has length zero, which blocks a number of useful
    possibilities. The attached patch implements a flag
    'emptyok', which if set to True, causes re.split to
    allow zero length matches.

    My preference would be to just change the behavior of
    re.split, rather than adding this flag. The old
    behavior isn't documented (though a couple of cases in
    test_re.py do depend on it). As a practical matter,
    though, I realize that there may be some code out there
    relying on this undocumented behavior. And I'm hoping
    that this useful feature can be added quickly. Perhaps
    this new behavior could be made the default in a future
    version of Python.

    (Linux 2.6.3 i686)

    @mkc mkc mannequin assigned effbot Jul 11, 2004
    @mkc mkc mannequin added the extension-modules C modules in the Modules dir label Jul 11, 2004
    @colanderman
    Mannequin

    colanderman mannequin commented Jul 21, 2004

    Logged In: YES
    user_id=573252

    Practical example where the current behaviour produces
    undesirable results (splitting on character transitions):

    >>> import re
    >>> re.split(r'(?<=[A-Z])(?=[a-z])', 'SOMEstring')
    ['SOMEstring']    # desired is ['SOME','string']
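
A minimal sketch of how the desired split can be obtained without relying on how `re.split` treats empty matches, by walking the matches with `re.finditer`. The helper name `split_on_all_matches` is hypothetical; it is not part of the `re` module and is not the attached patch. It assumes the lookahead is `(?=[a-z])`, which is what yields the split between `SOME` and `string`, and it does not add capture groups to the result the way `re.split` does.

```python
import re

def split_on_all_matches(pattern, string):
    """Split `string` at every match of `pattern`, zero-width ones included.

    Hypothetical helper for illustration only; capture groups are ignored.
    """
    pieces = []
    last = 0
    for match in re.finditer(pattern, string):
        # Keep the text between the previous match and this one.
        pieces.append(string[last:match.start()])
        last = match.end()
    # Keep whatever follows the final match.
    pieces.append(string[last:])
    return pieces

result = split_on_all_matches(r'(?<=[A-Z])(?=[a-z])', 'SOMEstring')
print(result)  # ['SOME', 'string']
```

From Python 3.7 onward, `re.split` itself splits on patterns that can match an empty string, so a helper like this should only be needed on older versions.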

    @akuchling
    Member

    Logged In: YES
    user_id=11375

    Overall I like the patch and wouldn't mind seeing the change
    become the default behaviour. However, I'm nervous about
    possibly not understanding the reason the prohibition on
    zero-length matches was added in the first place. Can you
    please do some research in the CVS logs and python-dev
    archives to figure out why the limitation was implemented?

    @mkc
    Mannequin Author

    mkc mannequin commented Jul 28, 2004

    Logged In: YES
    user_id=555

    I picked through CVS, python-dev and google and came up with
    this. The current behavior was present way back in the
    earliest regsub.py in CVS (dated Sep 1992); subsequent
    implementations seem to be mirroring this behavior. The CVS
    comment back in 1992 described split as modeled on nawk. A
    check of nawk(1) confirms that nawk only splits on non-null
    matches. Perl (circa 5.6) on the other hand, appears to
    split the way this patch does (though I wasn't aware of that
    when I wrote the patch) so that might argue in the other
    direction. I would note, too, that re.findall and
    re.finditer tend in this direction ("Empty matches are
    included in the result unless they touch the beginning of
    another match.").

    The python-dev archive doesn't seem to go back far enough to
    be relevant and I'm not sure how to search it. General
    googling (python "re.split" empty match) found a few hits.
    Probably the most relevant is Tim Peters saying "Python
    won't change here (IMO)" and giving the example that he also
    gives in a comment to bug bpo-852532 (which this patch
    addresses). He also wonders in his comment about the
    possibility of a "design constraint", but I think this patch
    addresses that concern.

    As far as I can tell, the current behavior was a design
    decision made over 10 years ago, between two alternatives
    that probably didn't matter much at the time. Skipping
    empty matches probably seemed harmless before
    lookahead/lookbehind assertions. Now, though, the current
    behavior seems like a significant hindrance. Furthermore,
    it ought to be pretty trivial to modify any existing
    patterns to get the old behavior, should that be desired
    (e.g., use 'x+' instead of 'x*').
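
As an illustration of that last point, a pattern that cannot match the empty string splits only on real runs of the separator, whichever way `re.split` treats empty matches. This is a small sketch; the sample string is borrowed from later in this thread.

```python
import re

text = 'abxxxcdefxxx'

# 'x+' needs at least one 'x', so it can never produce a zero-length match.
pieces = re.split('x+', text)
print(pieces)  # ['ab', 'cdef', '']
```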

    (I didn't notice that re.findall doc when I originally wrote
    this patch. Perhaps the doc in the patch should be slightly
    modified to help emphasize the similarity between how
    re.findall and re.split handle empty matches.)

    @mkc
    Mannequin Author

    mkc mannequin commented Sep 3, 2004

    Logged In: YES
    user_id=555

    Apparently this patch is stalled, but I'd like to get it in,
    in some form, for 2.4. The only question, as far as I know,
    is whether empty matches following non-empty matches "count"
    or not (they do in the original patch).

    If I make a patch with the "doesn't count" behavior, could
    we apply that right away? I'd rather get either behavior in
    for 2.4 than wait for 2.5...

    @rhettinger
    Contributor

    Logged In: YES
    user_id=80475

    Fred, what do you think of the proposal? Are the backwards
    compatibility issues worth it?

    @mkc
    Mannequin Author

    mkc mannequin commented Dec 9, 2005

    Logged In: YES
    user_id=555

    This patch seems to have been stalled now for over a year.
    Could it be applied? Or, alternatively, could someone
    provide some sort of reason why it shouldn't be? Thanks.

    @filip
    Mannequin

    filip mannequin commented Jan 16, 2006

    Logged In: YES
    user_id=308203

    I agree completely that splitting on zero-length matches should
    be supported - and that the default behavior should change
    at some point - but I don't think this patch quite covers
    it. Taking an example from the python-dev thread back in
    August of 2004
    (http://mail.python.org/pipermail/python-dev/2004-August/047272.html):

    >>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
    ['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']

    To me, this means there's an empty string, beginning and
    ending in pos 0, followed by a zero-width divider also
    beginning and ending in the same position, followed by an
    'a', etc. That seems awkward to me. I think a more intuitive
    result would be (I'm omitting the emptyok argument in the
    following examples):

    >>> re.split('x*', 'abxxxcdefxxx')
    ['a', 'b', 'c', 'd', 'e', 'f', '']

    That is, empty matches cause a split when they are not
    adjacent to a non-empty match and not at the beginning or
    the end of the string. Grouping parentheses would, of
    course, reveal the empty-string boundaries:

    >>> re.split('(x*)', 'abxxxcdefxxx')
    ['', 'a', '', 'b', 'xxx', '', 'c', '', 'd', '', 'e', '',
    'f', 'xxx', '']

    Using the same approach, these results would also seem
    perfectly reasonable to me:

    >>> re.split('(?m)$', 'foo\nbar\nbaz')
    ['foo', '\nbar', '\nbaz']
    >>> re.split('(?m)^', 'foo\nbar\nbaz')
    ['foo\n', 'bar\n', 'baz']

    Splitting a one-character string should be possible only if
    the pattern matches that character:

    >>> re.split('\w*', 'a')
    ['', '']
    >>> re.split('\d*', 'a')
    ['a']
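
The rule proposed in this comment can be sketched on top of `re.finditer`. This is only an illustration under stated assumptions: the helper name `split_interior_empty` is hypothetical, it is not the attached patch, and it leaves out capture groups, so it does not reproduce the `'(x*)'` example above. An empty match is skipped when it sits at the start or end of the string or touches a non-empty match.

```python
import re

def split_interior_empty(pattern, string):
    """Split on non-empty matches and on isolated interior empty matches.

    Hypothetical helper for illustration only; capture groups are ignored.
    """
    spans = [m.span() for m in re.finditer(pattern, string)]

    # Positions where some non-empty match starts or ends.
    nonempty_edges = set()
    for start, end in spans:
        if start != end:
            nonempty_edges.update((start, end))

    pieces = []
    last = 0
    for start, end in spans:
        if start == end and (
            start == 0
            or start == len(string)
            or start in nonempty_edges
        ):
            # Empty match at a string boundary or next to a real match.
            continue
        pieces.append(string[last:start])
        last = end
    pieces.append(string[last:])
    return pieces

print(split_interior_empty('x*', 'abxxxcdefxxx'))
# ['a', 'b', 'c', 'd', 'e', 'f', '']
print(split_interior_empty('(?m)$', 'foo\nbar\nbaz'))
# ['foo', '\nbar', '\nbaz']
```

Collecting the spans first makes it possible to check both neighbours of an empty match, which matters because newer Python versions can report an empty match directly before or after a non-empty one.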

    @mkc
    Mannequin Author

    mkc mannequin commented Jan 17, 2006

    Logged In: YES
    user_id=555

    I think I still agree with my original answer on this (see
    http://mail.python.org/pipermail/python-dev/2004-August/047321.html).

    I'm completely worn down on this, though, so I'd happily
    take any of these options as an improvement over the present
    situation.

    @mkc
    Mannequin Author

    mkc mannequin commented Feb 22, 2007

    Hello from 2004! This is your long-lost bug in re.split--how's it going? I'm still alive and well. I think everyone pretty much agrees that I really am a bug, and at least one guy still writes code just to work around me every few weeks or so. My attempt to keep a low profile is doing well--I'm not even documented in the library reference. This allows me to meet new Python users on a regular basis (whether they like it or not).

    Well, that's it for now. If I don't hear from you until then, I'll drop you another line in 2009. (Hey I'm a poet, too!)

    Regards,
    bug 852532/patch 988761

    @gpshead
    Member

    gpshead commented Jul 7, 2008

    Take a look at the patch being worked on in bpo-3262.

    @birkenfeld
    Member

    Closing as a duplicate of bpo-3262, which seems to be active.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022