Author stiv
Recipients akitada, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jacques, jaylogan, jhalcrow, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, stiv, timehorse, vbr
Date 2010-11-23.15:58:00
Message-id <1290527883.3.0.611174517808.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
Forgive me if this is just a stupid oversight. 

I'm a linguist and use UTF-8 for "special" characters in linguistic data. This often includes sequences of several Unicode code points that are composed as one grapheme. For example, í̵ (if it's displaying correctly for you) is LATIN SMALL LETTER I WITH STROKE (\u0268) combined with COMBINING ACUTE ACCENT (\u0301). Here is a word I'm parsing:

jí̵-e-gɨ

I was pretty excited to find out that this regex library implements the grapheme match \X (equivalent to \P{M}\p{M}*). For the above example I needed to evaluate which sequences of characters can occur across syllable boundaries (here the hyphen "-"), so I'm aiming for:

í̵-e
e-g
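
To make the target concrete, here is a minimal sketch of what I expect, written without regex at all. It assumes the simplified grapheme definition above (one non-mark character followed by any combining marks) and uses only the standard unicodedata module. The helper names graphemes and boundary_pairs are made up for illustration and are not part of any library.

```python
import unicodedata


def graphemes(text):
    # Simplified clusters: a base character plus any following combining
    # marks (general category starting with "M"). This approximates
    # \P{M}\p{M}*; a mark with no preceding base is kept as its own
    # cluster, which the regex itself would not match.
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def boundary_pairs(text, boundary="-"):
    # Collect (left, boundary, right) for every boundary cluster that has
    # a neighbour on both sides. Assumption: neighbours are not checked,
    # so doubled boundaries are not treated specially.
    units = graphemes(text)
    return [
        (units[i - 1], units[i], units[i + 1])
        for i in range(1, len(units) - 1)
        if units[i] == boundary
    ]


# The word from above, spelled with explicit escapes: j, U+0268 U+0301, -, e, -, g, U+0268
word = "j\u0268\u0301-e-g\u0268"
pairs = boundary_pairs(word)
```

For this word, pairs holds exactly two triples, one per hyphen, which is the result I was hoping to get from the overlapped search.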

Just when I thought regex couldn't get any better, you awesome developers implemented an overlapped=True flag for findall and finditer.

Python 3.1.2 (r312:79147, May 19 2010, 11:50:28) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
>>> import regex
>>> s = "jí̵-e-gɨ"
>>> s
'jí̵-e-gɨ'
>>> m = regex.compile("(\X)(-)(\X)")
>>> m.findall(s, overlapped=False)
[('í̵', '-', 'e')]

But these results are weird to me:

>>> m.findall(s, overlapped=True)
[('í̵', '-', 'e'), ('í̵', '-', 'e'), ('e', '-', 'g'), ('e', '-', 'g'), ('e', '-', 'g')]

Why the extra matches? At first I figured this had something to do with the overlapping match of the grapheme, since it spans multiple characters. So I tried it without the grapheme match:

>>> m = regex.compile("(.)(-)(.)")
>>> s2 = "a-b-cd-e-f"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b'), ('d', '-', 'e')]

That's right. But with overlap...

>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c'), ('d', '-', 'e'), ('d', '-', 'e'), ('d', '-', 'e'), ('e', '-', 'f'), ('e', '-', 'f')]

Those 'extra' matches are confusing me. 2x b-c, 3x d-e, 2x e-f? Or even more simply:

>>> s2 = "a-b-c"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b')]
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c')]
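
For comparison, here is a minimal sketch of the output I expected from overlapped=True, with each match reported once. It uses the standard library re module and wraps the pattern in a lookahead so that successive matches may overlap. This is only a stand-in for illustration; I'm not claiming it is how regex implements the flag, and expected_overlapped is a made-up helper name.

```python
import re


def expected_overlapped(text):
    # The lookahead consumes no characters, so the scan advances one
    # position at a time and every starting position is tried once.
    # The three groups mirror the pattern "(.)(-)(.)" used above.
    return re.findall(r"(?=(.)(-)(.))", text)


short_result = expected_overlapped("a-b-c")
long_result = expected_overlapped("a-b-cd-e-f")
```

With this approach, short_result is [('a', '-', 'b'), ('b', '-', 'c')] and long_result contains a-b, b-c, d-e and e-f exactly once each.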

Thanks!