Issue 43222: Regular expression split fails on 3.6 and not 2.7 or 3.7+

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87388

classification

Title:	Regular expression split fails on 3.6 and not 2.7 or 3.7+
Type:	crash	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.6

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, mrabarnett, probinso, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2021-02-14 08:17 by probinso, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (2)
msg386942	Author: Philip (probinso)	Date: 2021-02-14 08:17
I am seeing unexpected behavior when using a regular expression to split a string. The error appears on `python 3.6` but not on `python 2.7` or `python 3.7+`. Below is a minimal example using `tox`.

`setup.py`
```
from setuptools import setup

setup(
    name='my-tox-tested-package',
    version='0.0.1',
    install_requires=["pytest"]
)
```

`tests/test_re.py`
```
import re
import pytest

_DIGIT_BOUNDARY_RE = re.compile(
    r'(?<=\D)(?=\d)|(?<=\d)(?=\D)'
)

def test():
    _DIGIT_BOUNDARY_RE.split("10.0.0")
```

`tox.ini`
```
[tox]
envlist = py27, py36, py37
requires=
    pytest

[testenv]
commands =
    pytest {posargs: tests}
```

Output:
```
============================================= FAILURES ================================
_____________________________________________ test ____________________________________

    def test():
>       _DIGIT_BOUNDARY_RE.split("10.0.0")
E       ValueError: split() requires a non-empty pattern match.

tests/test_god.py:9: ValueError
============================================ short test summary info ==================
...
============================================ test session starts ======================
platform linux -- Python 3.7.5, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/probinson/code
collected 1 item

tests/test_re.py .                                                              [100%]

============================================ 1 passed in 0.00s =========================
____________________________________________ summary ___________________________________
  py27: commands succeeded
ERROR:   py36: commands failed
  py37: commands succeeded
```
msg386943	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-02-14 09:19
There was a bug in the regular expression engine which caused re.split() to work incorrectly with zero-width patterns. Note that in your example _DIGIT_BOUNDARY_RE.split("10.0.0") returns ['10.0.0'] on Python 2.7, a result which you probably did not expect.

It was impossible to fix that bug without changing the behavior of other functions in corner cases and breaking existing code. So we first made re.split() raise an exception instead of returning a nonsensical result, and added warnings for some other cases to help users catch potential bugs in their code and avoid ambiguous patterns. This is what you see in 3.6. In 3.7 we fixed the underlying bug. That broke some user code, but it made regular expressions more consistent in the long run and made zero-width patterns more usable.

In your particular case, if you still need to support Python 2.7 and 3.6, try using re.split() with the pattern r'(\D+)' or r'(\d+)' (the parentheses are significant here, since a capturing group keeps the matched separators in the result). It gives almost the same result, except for possible leading and trailing empty strings.

History
Date	User	Action	Args
2022-04-11 14:59:41	admin	set	github: 87388
2021-02-15 18:10:58	probinso	set	status: open -> closed resolution: wont fix stage: resolved
2021-02-14 09:19:37	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg386943
2021-02-14 08:17:10	probinso	create