classification
Title: incorrect pattern in the re module docs for conditional regex
Type: Stage: resolved
Components: Documentation, Regular Expressions Versions: Python 3.1, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, ezio.melotti, orsenthil, pitrou, python-dev, wesley.chun
Priority: normal Keywords:

Created on 2011-02-22 08:48 by wesley.chun, last changed 2011-03-12 03:46 by orsenthil. This issue is now closed.

Files
File name Uploaded Description
re.rst wesley.chun, 2011-02-22 08:48 patched re.rst file
Messages (7)
msg129041 - Author: wesley chun (wesley.chun) Date: 2011-02-22 08:48
In the re docs, it states the following for the conditional regular expression syntax:

(?(id/name)yes-pattern|no-pattern)
Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com'.

This regex is incomplete, because with match() it also accepts 'user@host.com>':

>>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)', '<user@host.com>'))
True
>>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)', 'user@host.com'))
True
>>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)', '<user@host.com'))
False
>>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)', 'user@host.com>'))
True

This error has existed since this feature was added in 2.4...
http://docs.python.org/release/2.4.4/lib/re-syntax.html

... through the 3.3 docs...
http://docs.python.org/dev/py3k/library/re.html#regular-expression-syntax

The fix is to add the end-of-string anchor '$' as the no-pattern, so that all four cases behave as expected:


>>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', '<user@host.com>'))
True
>>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', 'user@host.com'))
True
>>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', '<user@host.com'))
False
>>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', 'user@host.com>'))
False

If accepted, I propose this patch (also attached):

$ svn diff re.rst
Index: re.rst
===================================================================
--- re.rst      (revision 88499)
+++ re.rst      (working copy)
@@ -297,9 +297,9 @@
 ``(?(id/name)yes-pattern|no-pattern)``
    Will try to match with ``yes-pattern`` if the group with given *id* or *name*
    exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
-   can be omitted. For example,  ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
+   can be omitted. For example,  ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email
    matching pattern, which will match with ``'<user@host.com>'`` as well as
-   ``'user@host.com'``, but not with ``'<user@host.com'``.
+   ``'user@host.com'``, but not with ``'<user@host.com'`` nor ``'user@host.com>'`` .
msg129473 - Author: wesley chun (wesley.chun) Date: 2011-02-25 23:55
I wanted to add one more comment: it would be nice to have a regex
that also works with search() (in addition to match()), because such
an email address may appear in the middle of a line, say in a From:
or To: email header.

The fix of using '$' prevents this, so I'm not 100% satisfied with
the patch, although it does get the regex working with match().
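
To make the concern concrete, here is a minimal sketch of how the patched pattern behaves under search(). The header lines and the helper name found() are made up for illustration and are not part of the issue. It suggests that the bracketed form is still found mid-line, while the bare form is only found when it ends the string.

```python
import re

# Patched pattern from the proposed fix: '$' is the no-pattern branch.
PATCHED = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

def found(text):
    """Return the text matched by search(), or None if nothing matched."""
    m = PATCHED.search(text)
    return m.group(0) if m else None

# Hypothetical header lines, used only for illustration.
print(found('From: <user@host.com> (Some User)'))  # -> <user@host.com>
print(found('From: user@host.com (Some User)'))    # -> None
print(found('To: user@host.com'))                  # -> user@host.com
```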
msg129680 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-02-28 03:42
On Tue, Feb 22, 2011 at 08:48:20AM +0000, wesley chun wrote:
> 
> The fix is to add the end char '$' to the regex to get all 4 working:

A better option would be to use the whitespace class '\s', which
would achieve the same purpose and would also satisfy the other
requirement: it works with search(), including in the middle of a document.
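
One possible reading of this suggestion is sketched below. The lookahead `(?=\s|$)` in the no-pattern branch is an assumption and was not proposed in this thread. A bare `\s` would fail when the address ends the string, so the sketch accepts either whitespace or end of string. Note that under search() an unmatched leading '<' is simply skipped, so '<user@host.com' still yields a match for the bare address.

```python
import re

# Assumed variant: the no-pattern branch is a lookahead for whitespace or end of string.
VARIANT = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|(?=\s|$))')

def found(text):
    """Return the text matched by search(), or None if nothing matched."""
    m = VARIANT.search(text)
    return m.group(0) if m else None

# Hypothetical inputs, used only for illustration.
print(found('From: user@host.com (Some User)'))  # -> user@host.com
print(found('user@host.com>'))                   # -> None
print(found('<user@host.com'))                   # -> user@host.com ('<' is skipped)
```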
msg129686 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-02-28 09:31
Thinking about the regex pattern again: the example given is not really wrong. It does what it claims, that is, it matches '<user@example.com>' and 'user@example.com' and rejects strings like '<user@example.com'. Nothing is said about strings like 'user@example.com>'.

Also, this is not an example of validating an email address or of finding an email address pattern in text data. A good regex for that purpose would be more complex[1][2].

Having said that, as an example of a conditional regex, either the current one is sufficient (in which case no change is required), or a simpler one can be presented that does not look like email address matching and so carries no expectations about which patterns are valid.
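
As a rough illustration of what such a simpler example might look like, the sketch below matches a number that is optionally wrapped in parentheses. The pattern and the helper name is_wrapped_number() are hypothetical and do not come from the re documentation or from this issue.

```python
import re

# Hypothetical simpler conditional pattern: digits, optionally wrapped in
# parentheses. The trailing '$' rejects leftover characters under match().
NUMBER = re.compile(r'(\()?\d+(?(1)\))$')

def is_wrapped_number(text):
    """True if text is a number, bare or enclosed in a matching pair of parentheses."""
    return NUMBER.match(text) is not None

print(is_wrapped_number('(42)'))  # -> True
print(is_wrapped_number('42'))    # -> True
print(is_wrapped_number('(42'))   # -> False
print(is_wrapped_number('42)'))   # -> False
```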

Also, if we really think that rejecting 'user@example.com>' is a good idea in the example documentation, then having '$' as the no-pattern of the regex is good enough. There is no need to consider search() cases, for the reasons given above.


1: http://www.regular-expressions.info/email.html
2: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html
msg130654 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-03-12 02:49
New changeset 06cca90ff105 by orsenthil in branch 'default':
Fix issue11283 - Clarifying a re pattern in the re module docs for conditional regex
http://hg.python.org/cpython/rev/06cca90ff105
msg130657 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-03-12 03:44
New changeset d676601fee6f by Senthil Kumaran in branch '3.1':
Fix issue11283 - Clarifying a re pattern in the re module docs for conditional regex
http://hg.python.org/cpython/rev/d676601fee6f
msg130658 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-03-12 03:46
Okay, fixed in all relevant branches.
History
Date User Action Args
2011-03-12 03:46:46 orsenthil set status: open -> closed
nosy: orsenthil, pitrou, ezio.melotti, docs@python, wesley.chun, python-dev
messages: + msg130658
resolution: fixed
stage: patch review -> resolved
2011-03-12 03:44:23 python-dev set nosy: orsenthil, pitrou, ezio.melotti, docs@python, wesley.chun, python-dev
messages: + msg130657
2011-03-12 02:49:05 python-dev set nosy: + python-dev
messages: + msg130654
2011-02-28 09:31:05 orsenthil set nosy: orsenthil, pitrou, ezio.melotti, docs@python, wesley.chun
messages: + msg129686
2011-02-28 03:42:14 orsenthil set nosy: + orsenthil
messages: + msg129680
2011-02-25 23:55:51 wesley.chun set nosy: pitrou, ezio.melotti, docs@python, wesley.chun
messages: + msg129473
2011-02-25 19:47:41 terry.reedy set nosy: + pitrou
stage: patch review
versions: - Python 2.6, Python 2.5
2011-02-22 23:04:20 ezio.melotti set nosy: + ezio.melotti
2011-02-22 08:48:19 wesley.chun create