Message 109358 - Python tracker


Author	vbr
Recipients	akitada, akuchling, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, gregory.p.smith, jaylogan, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, timehorse, vbr
Date	2010-07-05.21:42:28
Message-id	<1278366150.56.0.330158956411.issue2636@psf.upfronthosting.co.za>

Content
I just noticed somewhat strange behaviour when matching character sets or alternations that contain some more "advanced" Unicode characters together with some "simpler" ones in the same search pattern. The former seem to be ignored and are not matched, whereas the original re engine matches all of them. (win XPh SP3 Czech, Python 2.7; regex issue2636-20100414)

>>> print u"".join(regex.findall(u".", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(regex.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(regex.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(re.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(re.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëēěė

Even stranger, if the pattern contains only these "higher" Unicode characters, everything works OK:
>>> print u"".join(regex.findall(u"ē|ě|ė", u"eèéêëēěė"))
ēěė
>>> print u"".join(regex.findall(u"[ēěė]", u"eèéêëēěė"))
ēěė


The characters in question are some accented Latin letters (here in roughly ascending codepoint order), but the same can happen with other scripts as well.
>>> print regex.findall(u".", u"eèéêëēěė")
[u'e', u'\xe8', u'\xe9', u'\xea', u'\xeb', u'\u0113', u'\u011b', u'\u0117']
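To make the code points behind that repr easier to read, here is a minimal sketch in Python 3 (an assumption; the transcripts above are Python 2.7) that uses only the standard library. The helper name `describe` is made up for illustration. It shows that the first five characters lie below U+0100, the last three lie above it, and that ě (U+011B) precedes ė (U+0117) in the test string.

```python
import unicodedata

# Test string from the report above.
TEXT = "eèéêëēěė"

def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, ord(ch), unicodedata.name(ch)) for ch in text]

# Print only ASCII (code point and name) so the output is console-safe.
for ch, cp, name in describe(TEXT):
    print("U+{:04X} {}".format(cp, name))
```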

The threshold isn't obvious to me. At first I thought that the characters shown as unicode escapes (\u....) are the problematic ones, whereas those shown as hexadecimal escapes (\x..) are OK; however, ē (u'\u0113') seems to be OK too.
(Python 3.1 behaves identically:
>>> regex.findall("[eèéêëēěė]", "eèéêëēěė")
['e', 'è', 'é', 'ê', 'ë', 'ē']
>>> regex.findall("[ēěė]", "eèéêëēěė")
['ē', 'ě', 'ė']
)
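The comparisons above could be automated along these lines. This is only a sketch in Python 3, and the names `missed` and `check` are hypothetical helpers, not part of `re` or `regex`. It takes any `findall`-style callable and reports which characters that occur literally in the pattern were not returned; the standard `re` engine serves as the baseline.

```python
import re

TEXT = "eèéêëēěė"
PATTERNS = ["[eèéêëēěė]", "e|è|é|ê|ë|ē|ě|ė", "[ēěė]", "ē|ě|ė"]

def missed(findall, pattern, text=TEXT):
    """Characters that occur literally in both pattern and text,
    but that findall(pattern, text) did not return."""
    found = set(findall(pattern, text))
    return [ch for ch in text if ch in pattern and ch not in found]

def check(findall, patterns=PATTERNS):
    """Map each pattern to the list of characters the engine missed."""
    return {p: missed(findall, p) for p in patterns}

# Baseline: the standard re engine is expected to miss nothing.
baseline = check(re.findall)
print(all(not chars for chars in baseline.values()))
```

With the regex module installed, `check(regex.findall)` would run the same patterns against the engine under test; judging from the transcripts above, the first two patterns would then presumably report ě and ė as missed.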

vbr
