Message 100359 - Python tracker


Author	vbr
Recipients	akitada, akuchling, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, gregory.p.smith, jaylogan, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, timehorse, vbr
Date	2010-03-03.23:48:22
Message-id	<1267660105.9.0.96148230136.issue2636@psf.upfronthosting.co.za>
In-reply-to

Content

I just noticed a corner case with the newly introduced grapheme matcher \X when it is used in a character set:

>>> regex.findall("\X", "abc")
['a', 'b', 'c']
>>> regex.findall("[\X]", "abc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 218, in findall
  File "regex.pyc", line 1435, in _compile
  File "regex.pyc", line 2351, in optimise
  File "regex.pyc", line 2705, in optimise
  File "regex.pyc", line 2798, in optimise
  File "regex.pyc", line 2268, in __hash__
AttributeError: '_Sequence' object has no attribute '_key'

It obviously doesn't make much sense to use this universal literal in a character class (the same goes for "." in its metacharacter role), and http://www.regular-expressions.info/refunicode.html doesn't mention this possibility either; but the error message could be more descriptive, or the pattern might match "X" or "\" and "\X" (?)
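For readers unfamiliar with what \X is meant to match, here is a rough sketch using only the standard library. It is an approximation, not how the regex module implements \X: it simply attaches characters with a non-zero canonical combining class to the preceding character, whereas the full grapheme cluster rules cover more cases. The function name simple_graphemes is made up for illustration.

```python
import unicodedata

def simple_graphemes(text):
    """Approximate grapheme clusters: a base character plus following combining marks.

    Hypothetical helper, much simpler than the real Unicode segmentation rules.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns 0 for non-combining characters.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

print(simple_graphemes("abc"))        # ['a', 'b', 'c']
print(simple_graphemes("e\u0301a"))   # 'e' + COMBINING ACUTE ACCENT stay together
```

This also shows why \X does not fit naturally inside a character class: a cluster can span several code points, while a class matches exactly one.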

I was originally thinking about the possibility of combining positive and negative character classes, where e.g. \X would be a kind of base; I am not aware of any re engine supporting this, but I eventually found a Unicode guideline for regular expressions that also covers it:

http://unicode.org/reports/tr18/#Subtraction_and_Intersection
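As an illustration of what subtraction and intersection mean here, the following sketch emulates them in the standard re module with lookaheads. This is only a workaround under the assumption that both operands are single-character classes; it is not native set-operation syntax, and the example sets are chosen arbitrarily.

```python
import re

# Subtraction: [a-z] minus [aeiou].
# The negative lookahead rejects the characters to be subtracted.
consonants = re.compile(r"(?![aeiou])[a-z]")

# Intersection: [a-m] and [h-z], i.e. effectively [h-m].
# The positive lookahead requires membership in the second set as well.
both = re.compile(r"(?=[h-z])[a-m]")

print(consonants.findall("regex"))   # ['r', 'g', 'x']
print(both.findall("abchijxyz"))     # ['h', 'i', 'j']
```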

It is also a bit surprising that these are all included in
Basic Unicode Support: Level 1 (even with arbitrary unions, intersections, differences ...); this suggests that, according to this guideline, there is probably no implementation available (AFAIK) that conforms even at this basic level.

Among other features on this level, the section
http://unicode.org/reports/tr18/#Supplementary_Characters
seems useful, especially the handling of characters beyond \uffff, including surrogate pairs treated as single characters.

This might be useful on narrow Python builds, but it is possible that there would be an incompatibility with the handling of such data in "narrow" Python itself.
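To make the surrogate pair issue concrete, here is a small sketch of how a code point beyond \uffff is split into two UTF-16 code units, which is what a narrow build stores internally. It is written for Python 3 (it relies on int.from_bytes); the helper names are invented for this example, and U+1D11E is just an arbitrary supplementary character.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

def utf16_units(text):
    """Return the UTF-16 code units of text, as a narrow build would see them."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))                              # 0xd834 0xdd1e
print([hex(u) for u in utf16_units("a\U0001D11E")])     # ['0x61', '0xd834', '0xdd1e']
```

A regex engine that matches per code unit would treat the two surrogates as separate characters, so "." or a character class would match only half of the pair.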

Just some suggestions, or rather remarks, as you have already implemented many advanced features and are also considering some different approaches ... :-)

vbr

History
Date	User	Action	Args
2010-03-03 23:48:26	vbr	set	recipients: + vbr, loewis, akuchling, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, pitrou, nneonneo, rsc, timehorse, mark, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray
2010-03-03 23:48:25	vbr	set	messageid: <1267660105.9.0.96148230136.issue2636@psf.upfronthosting.co.za>
2010-03-03 23:48:24	vbr	link	issue2636 messages
2010-03-03 23:48:23	vbr	create