
Author vbr
Recipients akitada, akuchling, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, gregory.p.smith, jaylogan, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, timehorse, vbr
Date 2010-07-05.21:42:28
Message-id <1278366150.56.0.330158956411.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
I just noticed some rather strange behaviour when matching character sets or alternations that contain some more "advanced" Unicode characters together with some "simpler" ones in the same search pattern. The former seem to be ignored and are not matched, whereas the original re engine matches all of them. (Win XPh SP3 Czech, Python 2.7; regex issue2636-20100414)

>>> print u"".join(regex.findall(u".", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(regex.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(regex.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(re.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(re.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëēěė

Even stranger, if the pattern contains only these "higher" Unicode characters, everything works OK:
>>> print u"".join(regex.findall(u"ē|ě|ė", u"eèéêëēěė"))
ēěė
>>> print u"".join(regex.findall(u"[ēěė]", u"eèéêëēěė"))
ēěė
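
The comparison above can be turned into a small repeatable check. The following is only a sketch: the helper name `check` is hypothetical, it is written for Python 3, and it is exercised here with the standard `re` module as the reference engine. The assumption is that `regex.findall` can be passed in the same way once the regex module is installed. The test string is spelled with the escapes shown in the findall output in this message.

```python
import re

# The test string from this report, written with escapes.
TEXT = u"e\xe8\xe9\xea\xeb\u0113\u011b\u0117"


def check(findall, wanted, text=TEXT):
    """Check that a character set and an alternation built from `wanted`
    both return every character of `text` that occurs in `wanted`.

    `findall` is any callable with the signature findall(pattern, text).
    `wanted` must not contain regex metacharacters (true for these letters).
    """
    class_pattern = u"[" + wanted + u"]"
    alt_pattern = u"|".join(wanted)
    expected = [ch for ch in text if ch in wanted]
    return {
        "class": findall(class_pattern, text) == expected,
        "alternation": findall(alt_pattern, text) == expected,
    }


# Reference behaviour with the standard library engine.
print(check(re.findall, TEXT))       # all eight letters in the pattern
print(check(re.findall, TEXT[5:]))   # only the three "higher" letters
```

Passing `regex.findall` instead of `re.findall` should show whether a given regex build still drops characters from the mixed patterns.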


The characters in question are accented Latin letters (listed here in roughly ascending codepoint order), but other scripts can be affected as well.
>>> print regex.findall(u".", u"eèéêëēěė")
[u'e', u'\xe8', u'\xe9', u'\xea', u'\xeb', u'\u0113', u'\u011b', u'\u0117']

The threshold isn't obvious to me. At first I thought that the characters represented as Unicode escapes (\uXXXX) were problematic, whereas those shown with hexadecimal escapes (\xXX) were OK; however, ē (u'\u0113') seems to be OK too.
(Python 3.1 behaves identically:
>>> regex.findall("[eèéêëēěė]", "eèéêëēěė")
['e', 'è', 'é', 'ê', 'ë', 'ē']
>>> regex.findall("[ēěė]", "eèéêëēěė")
['ē', 'ě', 'ė']
)
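
To make it easier to look for the threshold, the codepoints and Unicode names of the test characters can be listed with the standard `unicodedata` module. This is a minimal sketch for Python 3, and the helper name `describe` is hypothetical. Going by the outputs above, the two characters that were not matched in the mixed patterns are U+011B and U+0117, while U+0113 was matched.

```python
import unicodedata

# The test string from this report, written with escapes.
TEXT = u"e\xe8\xe9\xea\xeb\u0113\u011b\u0117"


def describe(text):
    """Return a (codepoint, Unicode name) pair for each character."""
    return [(u"U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


for code, name in describe(TEXT):
    print(code, name)
```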

vbr