Message 94856 - Python tracker



Author	RegEx4All
Recipients	RegEx4All
Date	2009-11-03.04:21:59
Message-id	<1257222123.85.0.715983159645.issue7255@psf.upfronthosting.co.za>
In-reply-to

Content

Regarding UTS #18 (Unicode Regular Expressions), which can be found at:
http://www.unicode.org/reports/tr18/

Is there a plan or commitment for Python to implement at least "default
word boundaries" (a Level 2 feature), rather than the current "simple
word boundaries"?  I don't believe that the algorithm for this is a
whole lot more complicated, but it certainly makes a huge difference for
processing non-Roman text.

For example, suppose you want to match the whole word રત without matching
the word રતા (which has an additional vowel sign at the end, the vertical
line). With "default word boundary" recognition, you could use the pattern
\bરત\b. With Python's current "simple word boundary" recognition, however,
the \b assertion is pretty much useless here: the vowel sign is a combining
mark rather than a letter, so \b reports a boundary in the middle of રતા
and the pattern matches anyway. I have yet to see a decent zero-width
pattern that can take its place.
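
To make the problem concrete, here is a small sketch using only the standard library. The first part shows \b matching inside the longer word. The second part is a rough stopgap, not the UAX #29 default word boundary algorithm: the helper names is_word_char and find_whole_word are made up for illustration, and the rule "letters, marks, numbers, and connector punctuation count as word characters" is my own simplifying assumption.

```python
import re
import unicodedata

WORD = "\u0ab0\u0aa4"        # રત
LONGER = WORD + "\u0abe"     # રતા (same letters plus GUJARATI VOWEL SIGN AA)

# The trailing vowel sign is a spacing combining mark, not a letter.
mark_category = unicodedata.category("\u0abe")

# Simple word boundaries: \b sees a boundary between the last letter and
# the vowel sign, so the short word is found inside the longer one.
simple_hit = re.search(r"\b" + WORD + r"\b", LONGER)


def is_word_char(ch):
    """Hypothetical helper: treat letters, marks, numbers, and connector
    punctuation as word characters (an assumption, not UAX #29)."""
    cat = unicodedata.category(ch)
    return cat[0] in "LMN" or cat == "Pc"


def find_whole_word(word, text):
    """Hypothetical helper: return the start offsets where `word` occurs
    with no word character (as defined above) on either side."""
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(word, start + 1)
    return hits


print(mark_category)
print(simple_hit is not None)
print(find_whole_word(WORD, LONGER))
print(find_whole_word(WORD, WORD + " " + LONGER))
```

This only covers the case where marks should stick to the preceding letter. It does nothing about the other default word boundary rules (apostrophes, numeric separators, and so on), which is why proper support in the engine would be much better.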

BTW, the ICU regex libraries do provide this level of Unicode support:
http://userguide.icu-project.org/strings/regexp
It seems to work perfectly on Indic text, based on the tests I've done.

Since ICU is open source, it may be a helpful reference for the algorithm needed.

Dan

History
Date	User	Action	Args
2009-11-03 04:22:04	RegEx4All	set	recipients: + RegEx4All
2009-11-03 04:22:03	RegEx4All	set	messageid: <1257222123.85.0.715983159645.issue7255@psf.upfronthosting.co.za>
2009-11-03 04:22:01	RegEx4All	link	issue7255 messages
2009-11-03 04:22:00	RegEx4All	create