classification
Title: re.search not respecting anchor markers in or-ed construction
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.4
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Almer Tigelaar, ezio.melotti, mrabarnett, serhiy.storchaka
Priority: normal Keywords:

Created on 2015-07-15 10:13 by Almer Tigelaar, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (2)
msg246756 - Author: Almer Tigelaar (Almer Tigelaar) Date: 2015-07-15 10:13
According to the documentation, ^ should restrict matching in re.search to the beginning of the string, as described here: https://docs.python.org/3.4/library/re.html#search-vs-match

However, this doesn't always seem to work as the following example shows:

re.search("^([0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9]\\.[0-9]+)|([0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9])|([0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9])|([0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9])|([0-9]{4}-[01][0-9]-[0-3][0-9])|([0-9]{4}-[01][0-9])|([0-9]{4})$", "2015-AE-02T10:16:08.450904")

This should not match, since the expression places the or-ed patterns between the anchors ^ and $. Because of the "AE", no match should be returned, yet re.search returns one spanning positions 22 to 26, produced by the last pattern in the or-ed sequence: ([0-9]{4})
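
The reported span can be checked by inspecting the match object. The following is a minimal sketch that uses a reduced, two-alternative version of the pattern (the reduction is an assumption made for brevity and is not the reporter's exact expression); `span()` gives the matched positions and `lastindex` gives the number of the capturing group that matched.

```python
import re

text = "2015-AE-02T10:16:08.450904"

# Reduced form of the reported pattern (an assumption for illustration):
# only the year-month alternative and the year-only alternative are kept.
# As written, ^ is part of the first alternative and $ part of the last.
pattern = r"^([0-9]{4}-[01][0-9])|([0-9]{4})$"

m = re.search(pattern, text)

print(m.span())      # start and end offsets of the overall match
print(m.lastindex)   # number of the capturing group that matched
print(m.group(0))    # the matched text
```

Here the second group matches the final four digits of the string, because that alternative only requires four digits followed by the end of the string.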

This can be worked around by explicitly including the anchor markers in the last pattern as follows:

re.search("^([0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9]\\.[0-9]+)|([0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9])|([0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9])|([0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9])|([0-9]{4}-[01][0-9]-[0-3][0-9])|([0-9]{4}-[01][0-9])|(^[0-9]{4}$)$", "2015-AE-02T10:16:08.450904")

Notice: the last pattern now explicitly includes the anchors: (^[0-9]{4}$), which effectively duplicates the anchors that already exist at the beginning and end of the entire regular expression!

This workaround correctly produces no match (which is the behaviour I expected from the first pattern).
msg246757 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2015-07-15 10:35
The or-ed patterns aren't between the anchors. The ^ is at the start of the first alternative and the $ is at the end of the last alternative.
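
Since | binds more loosely than the anchors, one way to get the behaviour the reporter expected is to wrap all alternatives in a group so that ^ and $ apply to the group as a whole. The sketch below is one possible arrangement, not a fix taken from the thread: the names `ALTERNATIVES`, `ANCHORED` and `is_date_like` are hypothetical, and the per-alternative capturing groups of the original are dropped for brevity. `re.fullmatch` (available since Python 3.4) is shown as an alternative that needs no explicit anchors.

```python
import re

# The seven alternatives from the report, without their capturing groups.
ALTERNATIVES = "|".join([
    r"[0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9]\.[0-9]+",
    r"[0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9]",
    r"[0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]",
    r"[0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]",
    r"[0-9]{4}-[01][0-9]-[0-3][0-9]",
    r"[0-9]{4}-[01][0-9]",
    r"[0-9]{4}",
])

# The non-capturing group (?:...) makes ^ and $ apply to every alternative.
ANCHORED = re.compile(r"^(?:" + ALTERNATIVES + r")$")


def is_date_like(text):
    """Return True if the whole string matches one of the alternatives."""
    return ANCHORED.search(text) is not None


print(is_date_like("2015-AE-02T10:16:08.450904"))  # malformed month
print(is_date_like("2015-07-15T10:16:08.450904"))  # full timestamp
print(is_date_like("2015"))                        # year only

# re.fullmatch requires the entire string to match, so no anchors are needed.
print(re.fullmatch(ALTERNATIVES, "2015-AE-02T10:16:08.450904"))
```

One difference between the two approaches: $ also matches just before a trailing newline, whereas `re.fullmatch` does not accept one.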
History
Date                 User              Action  Args
2022-04-11 14:58:18  admin             set     github: 68824
2015-07-15 18:52:09  serhiy.storchaka  set     status: open -> closed; nosy: + serhiy.storchaka; resolution: not a bug; stage: resolved
2015-07-15 10:35:15  mrabarnett        set     messages: + msg246757
2015-07-15 10:13:24  Almer Tigelaar    create