This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author stiv
Recipients akitada, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jacques, jaylogan, jhalcrow, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, stiv, timehorse, vbr
Date 2010-11-23.15:58:00
SpamBayes Score 2.3827613e-07
Marked as misclassified No
Message-id <1290527883.3.0.611174517808.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
Forgive me if this is just a stupid oversight. 

I'm a linguist and use UTF-8 for "special" characters for linguistics data. This often includes multi-byte Unicode character sequences that are composed as one grapheme. For example the í̵ (if it's displaying correctly for you) is a LATIN SMALL LETTER I WITH STROKE \u0268 combined with COMBINING ACUTE ACCENT \u0301. E.g. a word I'm parsing:

jí̵-e-gɨ

I was pretty excited to find out that this regex library implements the grapheme match \X (equivalent to \P{M}\p{M}*). For the above example I needed to evaluate which sequences of characters can occur across syllable boundaries (here the hyphen "-"), so I'm aiming for:

í̵-e
e-g

When regex couldn't get any better, you awesome developers implemented an overlapped=True flag with findall and finditer. 

Python 3.1.2 (r312:79147, May 19 2010, 11:50:28) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
>>> import regex
>>> s = "jí̵-e-gɨ"
>>> s
'jí̵-e-gɨ'
>>> m = regex.compile("(\X)(-)(\X)")
>>> m.findall(s, overlapped=False)
[('í̵', '-', 'e')]

But these results are weird to me:

>>> m.findall(s, overlapped=True)
[('í̵', '-', 'e'), ('í̵', '-', 'e'), ('e', '-', 'g'), ('e', '-', 'g'), ('e', '-', 'g')]

Why the extra matches? At first I figured this had something to do with the overlapping match of the grapheme, since it's multiple characters. So I tried it with with out the grapheme match:

>>> m = regex.compile("(.)(-)(.)")
>>> s2 = "a-b-cd-e-f"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b'), ('d', '-', 'e')]

That's right. But with overlap...

>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c'), ('d', '-', 'e'), ('d', '-', 'e'), ('d', '-', 'e'), ('e', '-', 'f'), ('e', '-', 'f')]

Those 'extra' matches are confusing me. 2x b-c, 3x d-e, 2x e-f? Or even more simply:

>>> s2 = "a-b-c"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b')]
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c')]

Thanks!
History
Date User Action Args
2010-11-23 15:58:03stivsetrecipients: + stiv, loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, jhalcrow
2010-11-23 15:58:03stivsetmessageid: <1290527883.3.0.611174517808.issue2636@psf.upfronthosting.co.za>
2010-11-23 15:58:01stivlinkissue2636 messages
2010-11-23 15:58:00stivcreate