classification
Title: Regex 0.1.20101210
Type: behavior
Stage:
Components: Regular Expressions
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder: Adding a new regex module (compatible with re)
View: 2636
Assigned To:
Nosy List: belopolsky, mrabarnett, stiv
Priority: normal
Keywords:

Created on 2010-12-14 17:41 by stiv, last changed 2010-12-15 10:30 by stiv. This issue is now closed.

Messages (4)
msg123961 - Author: Steve Moran (stiv) Date: 2010-12-14 17:41
The regex package doesn't seem to correctly implement the single grapheme match "\X" (\P{M}\p{M}*) for pre-Python 3. I'm using the string "íi-te" (i, U+0301, i, -, t, e, where U+0301 is Unicode COMBINING ACUTE ACCENT), reading it in from a file to bypass Unicode copy-and-paste issues in the older IDLEs.


stiv@x$ python3.1
Python 3.1.2 (r312:79147, May 19 2010, 11:50:28) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> file = open("test_data", "rt", encoding="utf-8")
>>> g = regex.compile("\X")
>>> s = file.readline()
>>> print (s)
íi-te
>>> print (g.findall(s))
['í', 'i', '-', 't', 'e']

* Correct in 3.1 - i+U+0301 considered one grapheme.

stiv@x$ python2.7
Python 2.7 (r27:82500, Oct  4 2010, 14:49:53) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs                                
>>> import regex
>>> file = codecs.open("test_data", "r", "utf-8")
>>> g = regex.compile("\X")
>>> s = file.readline()
>>> s
u'i\u0301i-te'
>>> print s.encode("utf-8")
íi-te
>>> print g.findall(s)
[u'i', u'\u0301', u'i', u'-', u't', u'e']

*Not correct -- accent is treated as a separate character.
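
For reference, the simplified definition quoted above, \P{M}\p{M}* (one non-mark character followed by any number of combining marks), can be approximated with the standard library alone. This is only a sketch under stated assumptions: the function name simple_graphemes is made up for illustration, it is not how the regex module implements \X, and unlike the pattern it gives a leading combining mark its own cluster instead of skipping it.

```python
import unicodedata


def simple_graphemes(text):
    # Hypothetical helper: group each non-mark character with the
    # combining marks (general category M*) that follow it.
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            # Attach the mark to the preceding base character.
            clusters[-1] += ch
        else:
            # Start a new cluster (also used for a mark with no base before it).
            clusters.append(ch)
    return clusters


# i, COMBINING ACUTE ACCENT, i, -, t, e
print(simple_graphemes("i\u0301i-te"))
```

With the test string from this report, the accent stays attached to the first "i", which is the grouping shown in the Python 3.1 session.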

Thanks.
msg123965 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-14 18:09
Regex 0.1.20101210 is not part of the standard Python distribution, so this bug report is invalid.
msg123977 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2010-12-14 19:47
The regex module is intended to replace the re module, so its default behaviour is the same: in Python 2, regexes default to matching ASCII, and in Python 3, they default to matching Unicode.

If you want to use a regex on a Unicode string in Python 2 then you need to set the Unicode flag, either by providing the UNICODE flag or by putting "(?u)" in the regex itself.
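
The effect of an ASCII versus Unicode default can be illustrated with the standard re module on Python 3, where passing re.ASCII roughly reproduces the Python 2 default described above. This is a sketch for illustration only: it uses \w rather than \X (which re does not support), and the sample word is an assumption, not data from this report.

```python
import re

text = "caf\u00e9"  # "café", ending in a non-ASCII letter

# Python 3 default for str patterns: Unicode matching, so \w matches the accented letter.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_], similar to the Python 2 default.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)

# Per the message above, the Python 2 fix with the regex module would be either
#   regex.compile(ur"(?u)\X")            (inline flag)
# or
#   regex.compile(ur"\X", regex.UNICODE) (explicit flag)
# Neither is run here because it needs Python 2 and the third-party regex package.
```

The match stops at the accented letter once ASCII matching is in force, which mirrors the Python 2.7 session where the combining accent is not treated as part of the preceding character.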
msg124013 - Author: Steve Moran (stiv) Date: 2010-12-15 10:30
(Forehead slap.)

<http://bugs.python.org/issue10703>
History
Date                 User            Action  Args
2010-12-15 10:30:14  stiv            set     nosy: belopolsky, mrabarnett, stiv
                                             messages: + msg124013
2010-12-14 23:50:48  r.david.murray  set     assignee: mark.dickinson ->
                                             nosy: - mark.dickinson
2010-12-14 19:47:12  mrabarnett      set     messages: + msg123977
2010-12-14 18:32:39  r.david.murray  set     assignee: mark.dickinson
                                             nosy: + mark.dickinson, mrabarnett
2010-12-14 18:09:57  belopolsky      set     status: open -> closed
                                             nosy: + belopolsky
                                             messages: + msg123965
                                             superseder: Adding a new regex module (compatible with re)
                                             resolution: not a bug
2010-12-14 17:41:57  stiv            create