Message122221
Forgive me if this is just a stupid oversight.
I'm a linguist and use UTF-8 for "special" characters for linguistics data. This often includes multi-byte Unicode character sequences that are composed as one grapheme. For example the í̵ (if it's displaying correctly for you) is a LATIN SMALL LETTER I WITH STROKE \u0268 combined with COMBINING ACUTE ACCENT \u0301. E.g. a word I'm parsing:
jí̵-e-gɨ
I was pretty excited to find out that this regex library implements the grapheme match \X (equivalent to \P{M}\p{M}*). For the above example I needed to evaluate which sequences of characters can occur across syllable boundaries (here the hyphen "-"), so I'm aiming for:
í̵-e
e-g
When regex couldn't get any better, you awesome developers implemented an overlapped=True flag with findall and finditer.
Python 3.1.2 (r312:79147, May 19 2010, 11:50:28)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
>>> import regex
>>> s = "jí̵-e-gɨ"
>>> s
'jí̵-e-gɨ'
>>> m = regex.compile("(\X)(-)(\X)")
>>> m.findall(s, overlapped=False)
[('í̵', '-', 'e')]
But these results are weird to me:
>>> m.findall(s, overlapped=True)
[('í̵', '-', 'e'), ('í̵', '-', 'e'), ('e', '-', 'g'), ('e', '-', 'g'), ('e', '-', 'g')]
Why the extra matches? At first I figured this had something to do with the overlapping match of the grapheme, since it's multiple characters. So I tried it with with out the grapheme match:
>>> m = regex.compile("(.)(-)(.)")
>>> s2 = "a-b-cd-e-f"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b'), ('d', '-', 'e')]
That's right. But with overlap...
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c'), ('d', '-', 'e'), ('d', '-', 'e'), ('d', '-', 'e'), ('e', '-', 'f'), ('e', '-', 'f')]
Those 'extra' matches are confusing me. 2x b-c, 3x d-e, 2x e-f? Or even more simply:
>>> s2 = "a-b-c"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b')]
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c')]
Thanks! |
|
Date |
User |
Action |
Args |
2010-11-23 15:58:03 | stiv | set | recipients:
+ stiv, loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, jhalcrow |
2010-11-23 15:58:03 | stiv | set | messageid: <1290527883.3.0.611174517808.issue2636@psf.upfronthosting.co.za> |
2010-11-23 15:58:01 | stiv | link | issue2636 messages |
2010-11-23 15:58:00 | stiv | create | |
|