classification
Title: re.Scanner groups
Type: behavior
Stage:
Components: Regular Expressions
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: dchron, ezio.melotti, mrabarnett, xtreak
Priority: normal
Keywords:

Created on 2020-04-12 07:52 by dchron, last changed 2020-04-12 14:06 by xtreak.

Files
File name        Uploaded                    Description
re.Scanner.txt   dchron, 2020-04-12 08:07
Messages (2)
msg366226 - Author: Karthikeyan Singaravelan (xtreak) (Python committer) Date: 2020-04-12 08:03
Please add a description of the issue you are facing, along with a simple script that reproduces the behavior.
msg366249 - Author: Karthikeyan Singaravelan (xtreak) (Python committer) Date: 2020-04-12 14:06
Copy-paste of the contents of the attached text file (re.Scanner.txt):

In the re module there is an experimental feature called Scanner.
Some unexpected behavior was found while working with it.
Here is an example:

>>> import re
>>> re.Scanner([(r'\w+=(\d+);', lambda s,g: s.match.group(1))]).scan('x=5;')
(['5;'], '')

The error is the trailing semicolon: capturing group 1 is (\d+), so the expected result is (['5'], '').

Adding a dummy rule at the beginning seems to solve that issue:

>>> re.Scanner([('z', None), (r'\w+=(\d+);', lambda s,g: s.match.group(1))]).scan('x=5;')
(['5'], '')

Adding a capturing group around \w+ also returns the correct answer:

>>> re.Scanner([('z', None), (r'(\w+)=(\d+);', lambda s,g: s.match.group(1))]).scan('x=5;')
(['x'], '')

But then, if I ask for the second group, the problem appears again:

>>> re.Scanner([('z', None), (r'(\w+)=(\d+);', lambda s,g: s.match.group(2))]).scan('x=5;')
(['5;'], '')
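
A possible workaround, sketched here under assumptions rather than taken from this report: keep capturing groups out of the pattern handed to re.Scanner, and extract the sub-part inside the action by re-matching the token with a separately compiled pattern. The names ASSIGNMENT and digits are hypothetical and exist only for illustration. This sidesteps group numbering inside the scanner entirely; it does not explain or fix the behavior described above.

```python
import re

# Standalone copy of the rule, compiled on its own so that its group
# numbers cannot be affected by anything re.Scanner adds around it.
# (ASSIGNMENT and digits are illustrative names, not part of the report.)
ASSIGNMENT = re.compile(r'\w+=(\d+);')

def digits(scanner, token):
    # token is the full text matched by the rule, e.g. 'x=5;'
    return ASSIGNMENT.fullmatch(token).group(1)

# The pattern given to Scanner contains no capturing groups of its own.
scanner = re.Scanner([(r'\w+=\d+;', digits)])
result = scanner.scan('x=5;')
print(result)  # (['5'], '')
```

The cost is that every token is matched twice, which may matter for large inputs.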
History
Date                 User    Action  Args
2020-04-12 14:06:10  xtreak  set     messages: + msg366249
2020-04-12 08:07:40  dchron  set     files: + re.Scanner.txt
2020-04-12 08:03:36  xtreak  set     nosy: + xtreak; messages: + msg366226
2020-04-12 07:52:00  dchron  create