Issue 10703: Regex 0.1.20101210

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54912

classification

Title:	Regex 0.1.20101210
Type:	behavior	Stage:
Components:	Regular Expressions	Versions:	Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:	Adding a new regex module (compatible with re), issue 2636
Assigned To:		Nosy List:	belopolsky, mrabarnett, stiv
Priority:	normal	Keywords:

Created on 2010-12-14 17:41 by stiv, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg123961 - (view)	Author: Steve Moran (stiv)	Date: 2010-12-14 17:41
The regex package doesn't seem to implement the single-grapheme match "\X" (\P{M}\p{M}*) correctly before Python 3. I'm using the string "íi-te" (i, U+0301, i, -, t, e, where U+0301 is Unicode COMBINING ACUTE ACCENT), read in from a file to bypass Unicode copy-and-paste issues in the older IDLEs.

stiv@x$ python3.1
Python 3.1.2 (r312:79147, May 19 2010, 11:50:28)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> file = open("test_data", "rt", encoding="utf-8")
>>> s = file.readline()
>>> print (s)
íi-te
>>> print (g.findall(s))
['í', 'i', '-', 't', 'e']

Correct in 3.1: i + U+0301 is treated as one grapheme.

stiv@x$ python2.7
Python 2.7 (r27:82500, Oct 4 2010, 14:49:53)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> import regex
>>> file = codecs.open("test_data", "r", "utf-8")
>>> g = regex.compile("\X")
>>> s = file.readline()
>>> s
u'i\u0301i-te'
>>> print s.encode("utf-8")
íi-te
>>> print g.findall(s)
[u'i', u'\u0301', u'i', u'-', u't', u'e']

Not correct: the accent is treated as a separate character.

Thanks.
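For readers who want to see what result is expected here, the grouping can be approximated with the standard library alone. The following is only a rough sketch of the "base character followed by combining marks" idea, not the regex module's implementation of \X; the helper name split_graphemes is hypothetical, and the loop ignores the finer rules of full Unicode grapheme segmentation.

```python
import unicodedata


def split_graphemes(text):
    """Group each character with the combining marks that follow it.

    Hypothetical helper: a rough stand-in for the base-plus-marks pattern,
    not a full grapheme cluster segmenter.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks (\p{M}).
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            # Attach the mark to the cluster that precedes it.
            clusters[-1] += ch
        else:
            # Anything else (or a mark at the very start) begins a new cluster.
            clusters.append(ch)
    return clusters


# The test string from the report: i, U+0301, i, -, t, e
sample = "i\u0301i-te"
clusters = split_graphemes(sample)
print(clusters)
```

With the sample string this yields five clusters, the first being "i" plus U+0301, which matches the Python 3.1 output in the report.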
msg123965 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-12-14 18:09
Regex 0.1.20101210 is not part of the standard Python distribution, so this bug report is invalid.
msg123977 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2010-12-14 19:47
The regex module is intended to replace the re module, so its default behaviour is the same: in Python 2, regexes default to matching ASCII, and in Python 3, they default to matching Unicode. If you want to use a regex on a Unicode string in Python 2 then you need to set the Unicode flag, either by providing the UNICODE flag or by putting "(?u)" in the regex itself.
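The effect of a default character set versus an explicit flag can be illustrated with the standard re module on Python 3. This is an analogy rather than a reproduction of the report: on Python 3 the roles are reversed, so Unicode matching is the default and re.ASCII (or "(?a)") restricts it, whereas on Python 2 the UNICODE flag or "(?u)" widens it. The sample word and variable names are made up for illustration, and \w is used because re does not support \X.

```python
import re

# "café" written with an escape so the source stays ASCII-only.
text = "caf\u00e9"

# Python 3 default for str patterns: Unicode-aware \w.
unicode_default = re.findall(r"\w+", text)

# Restricting to ASCII mimics the Python 2 default described above.
ascii_by_flag = re.findall(r"\w+", text, re.ASCII)
ascii_inline = re.findall(r"(?a)\w+", text)

# The inline "(?u)" flag is still accepted for str patterns on Python 3,
# where it is redundant. On Python 2 this flag is what the comment above
# recommends adding.
unicode_inline = re.findall(r"(?u)\w+", text)

print(unicode_default, ascii_by_flag, ascii_inline, unicode_inline)
```

In ASCII mode the accented letter is not a word character, so the match stops after "caf". The same kind of narrowing is why the combining accent was split off in the Python 2.7 session.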
msg124013 - (view)	Author: Steve Moran (stiv)	Date: 2010-12-15 10:30
(Forehead slap.)

On Tue, 14 Dec 2010, Matthew Barnett wrote:
> The regex module is intended to replace the re module, so its default behaviour is the same: in Python 2, regexes default to matching ASCII, and in Python 3, they default to matching Unicode.
>
> If you want to use a regex on a Unicode string in Python 2 then you need to set the Unicode flag, either by providing the UNICODE flag or by putting "(?u)" in the regex itself.

History
Date	User	Action	Args
2022-04-11 14:57:10	admin	set	github: 54912
2010-12-15 10:30:14	stiv	set	nosy: belopolsky, mrabarnett, stiv messages: + msg124013
2010-12-14 23:50:48	r.david.murray	set	assignee: mark.dickinson -> nosy: - mark.dickinson
2010-12-14 19:47:12	mrabarnett	set	messages: + msg123977
2010-12-14 18:32:39	r.david.murray	set	assignee: mark.dickinson nosy: + mark.dickinson, mrabarnett
2010-12-14 18:09:57	belopolsky	set	status: open -> closed nosy: + belopolsky messages: + msg123965 superseder: Adding a new regex module (compatible with re) resolution: not a bug
2010-12-14 17:41:57	stiv	create