This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Support Unicode line boundaries in regular expressions
Type: enhancement
Stage: needs patch
Components: Extension Modules, Regular Expressions
Versions: Python 3.5
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: LewisGaul, ZackerySpytz, ezio.melotti, mrabarnett, pitrou, serhiy.storchaka
Priority: normal
Keywords:

Created on 2014-09-25 06:56 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin.

Messages (4)
msg227508 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-25 06:56
Currently, regular expressions support only '\n' as a line boundary. To meet Unicode standard requirement RL1.6 [1], all Unicode line separators should be supported: '\n', '\r', '\v', '\f', '\x85', '\u2028', '\u2029' and the two-character sequence '\r\n'. It is also recommended that '.' in "dotall" mode match '\r\n' as a single unit. Support for the '\R' pattern, which matches all line separators, is strongly recommended as well (equivalent to '(?:\r\n|(?!\r\n)[\n\v\f\r\x85\u2028\u2029])').

>>> [m.start() for m in re.finditer('$', '\r\n\n\r', re.M)]
[1, 2, 4]  # should be [0, 2, 3, 4]
>>> [m.start() for m in re.finditer('^', '\r\n\n\r', re.M)]
[0, 2, 3]  # should be [0, 2, 3, 4]
>>> [m.group() for m in re.finditer('.', '\r\n\n\r', re.M|re.S)]
['\r', '\n', '\n', '\r']  # should be ['\r\n', '\n', '\r']
>>> [m.group() for m in re.finditer(r'\R', '\r\n\n\r')]
[]  # should be ['\r\n', '\n', '\r']

[1] http://www.unicode.org/reports/tr18/#RL1.6
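
Until the re module supports this natively, the requested behaviour can be approximated with ordinary lookarounds. The following is a minimal sketch, not a proposed patch: the names SEPS, LINE_BREAK, LINE_START, LINE_END and DOT_ALL are made up for illustration, and the anchor patterns are one possible reading of RL1.6 in which a position between '\r' and '\n' never counts as a line boundary.

```python
import re

# Line separators listed in this report (UTS #18, RL1.6).
SEPS = '\n\v\f\r\x85\u2028\u2029'

# Spelled-out equivalent of \R: CRLF as one unit, else a single separator.
LINE_BREAK = re.compile('(?:\r\n|(?!\r\n)[' + SEPS + '])')

# Guard that rejects the position between '\r' and '\n'.
NOT_INSIDE_CRLF = '(?!(?<=\r)\n)'

# Emulated multiline '^': start of string or just after a separator.
LINE_START = re.compile('(?:\\A|(?<=[' + SEPS + ']))' + NOT_INSIDE_CRLF)

# Emulated multiline '$': end of string or just before a separator.
LINE_END = re.compile('(?:\\Z|(?=[' + SEPS + ']))' + NOT_INSIDE_CRLF)

# Emulated dotall '.': consume CRLF as one unit, else any one character.
DOT_ALL = re.compile('\r\n|.', re.S)

sample = '\r\n\n\r'
ends = [m.start() for m in LINE_END.finditer(sample)]
starts = [m.start() for m in LINE_START.finditer(sample)]
dots = [m.group() for m in DOT_ALL.finditer(sample)]
breaks = [m.group() for m in LINE_BREAK.finditer(sample)]

print(ends)    # [0, 2, 3, 4]
print(starts)  # [0, 2, 3, 4]
print(dots)    # ['\r\n', '\n', '\r']
print(breaks)  # ['\r\n', '\n', '\r']
```

These results match the "should be" values in the examples above. The sketch only works around the limitation at the pattern level; it does not change how re.M or re.S behave.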
msg227523 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2014-09-25 11:04
For reference, the regex module normally considers the line ending to be '\n', but it has a WORD flag ('(?w)') that turns on the Unicode definition of a 'word' character as well as Unicode line separators.
msg348310 - Author: Zackery Spytz (ZackerySpytz) * (Python triager) Date: 2019-07-22 23:33
> To meet Unicode standard requirement RL1.6 [1] all Unicode line separators should be supported:

It seems that large portions of Modules/_sre.c would have to be rewritten in order to do this.
msg355473 - Author: Lewis Gaul (LewisGaul) * Date: 2019-10-27 15:32
Hi there, I'm running 'EnHackathon' in a couple of weeks, and was wondering if this could be a good issue for a small team of first-time contributors with experience in C to work on.

Would anyone be able to offer any guidance for where to start in Modules/_sre.c?
History
Date                 User              Action  Args
2022-04-11 14:58:08  admin             set     github: 66681
2019-10-27 15:32:36  LewisGaul         set     nosy: + LewisGaul; messages: + msg355473
2019-07-22 23:33:37  ZackerySpytz      set     nosy: + ZackerySpytz; messages: + msg348310
2014-09-25 11:04:06  mrabarnett        set     messages: + msg227523
2014-09-25 06:56:27  serhiy.storchaka  create