Author vbr
Recipients akitada, akuchling, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, gregory.p.smith, jaylogan, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, timehorse, vbr
Date 2010-03-03.23:48:22
Message-id <1267660105.9.0.96148230136.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
I just noticed a corner case with the newly introduced grapheme matcher \X when it is used inside a character set:

>>> regex.findall(r"\X", "abc")
['a', 'b', 'c']
>>> regex.findall(r"[\X]", "abc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 218, in findall
  File "regex.pyc", line 1435, in _compile
  File "regex.pyc", line 2351, in optimise
  File "regex.pyc", line 2705, in optimise
  File "regex.pyc", line 2798, in optimise
  File "regex.pyc", line 2268, in __hash__
AttributeError: '_Sequence' object has no attribute '_key'

It obviously doesn't make much sense to use this universal matcher inside a character class (the same goes for "." in its metacharacter role), and http://www.regular-expressions.info/refunicode.html doesn't mention this possibility either. Still, the error message could be more descriptive, or the pattern could match "X" or "\" and "\X" (?)
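
For comparison only, here is a small sketch of how the standard-library re module (not the new regex module discussed here) treats these two cases in current Python 3: "." loses its metacharacter role inside a class, and an unknown letter escape such as \X inside a class is rejected with a descriptive error. The variable names are mine and purely illustrative.

```python
import re

# Inside a character class, "." is just a literal dot.
dots = re.findall(r"[.]", "a.b.c")
print(dots)

# The standard-library re has no \X escape; in current Python 3 an unknown
# letter escape inside a class raises re.error at compile time instead of
# failing later with an unrelated AttributeError.
try:
    re.compile(r"[\X]")
    rejected = False
except re.error as exc:
    rejected = True
    print("re.error:", exc)
```

Whether the new module should raise a similar error or fall back to a literal match is of course a separate design decision.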

I was originally thinking about the possibility of combining positive and negative character classes, where e.g. \X would serve as a kind of base. I am not aware of any regex engine supporting this, but I eventually found a Unicode guideline for regular expressions that also covers it:

http://unicode.org/reports/tr18/#Subtraction_and_Intersection
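
To illustrate what subtraction and intersection of classes would mean in practice, here is a minimal sketch that emulates both with lookaheads in the standard-library re module. This is not the set syntax described in the guideline, only a common workaround; the helper names subtract and intersect are hypothetical.

```python
import re

def subtract(base_class, excluded_class):
    """Pattern for one character in base_class that is not in excluded_class."""
    # The negative lookahead vetoes excluded characters before base_class consumes one.
    return "(?!%s)%s" % (excluded_class, base_class)

def intersect(first_class, second_class):
    """Pattern for one character that is in both classes."""
    # The positive lookahead requires membership in first_class without consuming.
    return "(?=%s)%s" % (first_class, second_class)

# Lowercase ASCII letters minus the vowels.
consonants = re.findall(subtract("[a-z]", "[aeiou]"), "regex")
print(consonants)

# Characters that are in both a-f and d-z.
overlap = re.findall(intersect("[a-f]", "[d-z]"), "abcdefgh")
print(overlap)
```

Native set operations inside the brackets would be more readable than this, which is presumably why the guideline asks for them.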

It is also a bit surprising that all of these are included in
Basic Unicode Support: Level 1 (even arbitrary unions, intersections, differences ...); this suggests that, AFAIK, there is probably no implementation that conforms to this guideline even at this basic level.

Among the other features at this level, the section
http://unicode.org/reports/tr18/#Supplementary_Characters
seems useful, especially the handling of characters beyond \uffff as single characters, including when they are stored as surrogate pairs.

This might be useful on narrow Python builds, but it is possible that it would be incompatible with the way "narrow" Python itself handles such data.
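
For reference, a short sketch of the standard UTF-16 arithmetic behind surrogate pairs, i.e. how one character beyond \uffff ends up as two code units on a narrow build. The function names are hypothetical; the arithmetic itself is the usual UTF-16 encoding rule.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = to_surrogate_pair(0x10400)
print([hex(unit) for unit in pair])
print(hex(from_surrogate_pair(*pair)))
```

A matcher that treats such a pair as a single character would have to apply the second function before testing class membership, which is where a mismatch with the "narrow" string indexing could show up.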

Just some suggestions or rather remarks, as you already implemented many advanced features and are also considering some different approaches ...:-)

vbr