Message 46355 - Python tracker

Author	mkc
Date	2004-07-28.16:23:32

Content
I picked through CVS, python-dev and Google and came up with
this.  The current behavior was present way back in the
earliest regsub.py in CVS (dated Sep 1992); subsequent
implementations seem to be mirroring this behavior.  The CVS
comment back in 1992 described split as modeled on nawk.  A
check of nawk(1) confirms that nawk only splits on non-null
matches.  Perl (circa 5.6), on the other hand, appears to
split the way this patch does (though I wasn't aware of that
when I wrote the patch), so that might argue in the other
direction.  I would note, too, that re.findall and
re.finditer tend in this direction ("Empty matches are
included in the result unless they touch the beginning of
another match.").
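
To illustrate the point about re.finditer and re.findall, here is a
minimal sketch showing that both report zero-width matches.  The
helper name empty_match_positions and the sample string are
made up for illustration and are not part of the patch.

```python
import re

def empty_match_positions(pattern, text):
    """Return the start offsets of every zero-width match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)
            if m.start() == m.end()]

# The lookahead matches the empty string just before each capital letter.
print(empty_match_positions(r'(?=[A-Z])', 'splitOnCaps'))  # [5, 7]

# re.findall reports the same two matches, each as an empty string.
print(re.findall(r'(?=[A-Z])', 'splitOnCaps'))  # ['', '']
```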

The python-dev archive doesn't seem to go back far enough to
be relevant, and I'm not sure how to search it.  General
googling (python "re.split" empty match) found a few hits.
Probably the most relevant is Tim Peters saying "Python
won't change here (IMO)" and giving the example that he also
gives in a comment to bug #852532 (which this patch
addresses).  He also wonders in his comment about the
possibility of a "design constraint", but I think this patch
addresses that concern.

As far as I can tell, the current behavior was a design
decision made over 10 years ago, between two alternatives
that probably didn't matter much at the time.  Skipping
empty matches probably seemed harmless before
lookahead/lookbehind assertions.  Now, though, the current
behavior seems like a significant hindrance.  Furthermore,
it ought to be pretty trivial to modify any existing
patterns to get the old behavior, should that be desired
(e.g., use 'x+' instead of 'x*').
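
As a rough sketch of why splitting on empty matches matters once
lookahead assertions exist, consider the following.  It assumes
Python 3.7 or later, where re.split does split on empty matches
(the behavior this patch argues for); on the interpreter discussed
in this message the first call would return the string unsplit.
The helper name split_before_capitals is hypothetical.

```python
import re

# Assumption: Python 3.7+, where re.split honors zero-width matches.
def split_before_capitals(text):
    """Split text at each position that is immediately followed by a capital."""
    return re.split(r'(?=[A-Z])', text)

print(split_before_capitals('splitOnCaps'))  # ['split', 'On', 'Caps']

# A pattern that cannot match the empty string, such as 'x+', splits only
# on real runs of 'x', which is the old behavior regardless of version.
print(re.split(r'x+', 'axxb'))  # ['a', 'b']
```

Swapping 'x*' for 'x+' is the kind of one-character change that
keeps an existing pattern splitting only on non-empty matches.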

(I didn't notice that re.findall doc when I originally wrote
this patch.  Perhaps the doc in the patch should be slightly
modified to help emphasize the similarity between how
re.findall and re.split handle empty matches.)

History
Date	User	Action	Args
2007-08-23 15:38:35	admin	link	issue988761 messages
2007-08-23 15:38:35	admin	create