Issue 7255: "Default" word boundaries for Unicode data?

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51504

classification

Title:	"Default" word boundaries for Unicode data?
Type:	enhancement	Stage:
Components:	Regular Expressions	Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	works for me
Dependencies:		Superseder:
Assigned To:		Nosy List:	RegEx4All, amaury.forgeotdarc, ezio.melotti, loewis, mrabarnett
Priority:	normal	Keywords:

Created on 2009-11-03 04:22 by RegEx4All, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg94856 - (view)	Author: daniel mccloy (RegEx4All)	Date: 2009-11-03 04:21
Regarding UTS #18 (Unicode Regular Expressions), which can be found at: http://www.unicode.org/reports/tr18/

Is there a plan or commitment for Python to implement at least "default word boundaries" (a Level 2 feature), rather than the current "simple word boundaries"? I don't believe that the algorithm for this is a whole lot more complicated, but it certainly makes a huge difference for processing non-Roman text.

For example, to match the whole word રત without matching the word રતા (which has an additional vowel sign at the end, the vertical line), with "default word boundary" recognition you could use the pattern \bરત\b. With Python's current "simple word boundary" recognition, however, the \b assertion is pretty much useless here, because it reports a boundary between the consonant and the vowel sign, and I have yet to see a decent zero-width pattern that can take its place.

BTW, the ICU regex libraries do provide this level of Unicode support: http://userguide.icu-project.org/strings/regexp

It seems to work perfectly on Indic text, based on the tests I've done. Being open-source, it may be a helpful reference for the algorithm needed.

Dan
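For readers who want to reproduce the behavior described above, here is a minimal sketch using only the standard library. It shows the simple-boundary false match, then a rough workaround that also counts combining marks (general categories Mn, Mc, Me) as word characters. The helper names `is_word_char` and `find_whole_word` are hypothetical, and the workaround is only an approximation, not the full default word boundary algorithm from UAX #29.

```python
import re
import unicodedata

WORD = "\u0ab0\u0aa4"            # રત
LONGER = "\u0ab0\u0aa4\u0abe"    # રતા (same letters plus the vowel sign U+0ABE)

# Simple word boundaries: the vowel sign is a spacing mark (category Mc),
# which the stdlib `re` module does not treat as \w, so \b "sees" a
# boundary inside the longer word and the pattern matches.
simple_match = re.search(r"\b" + WORD + r"\b", LONGER)


def is_word_char(ch):
    """Approximation (an assumption of this sketch): letters, numbers,
    underscore, and combining marks all count as word characters."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N", "M")


def find_whole_word(word, text):
    """Return start offsets where `word` occurs with no word character
    (as defined above) directly before or after it."""
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(word, start + 1)
    return hits


if __name__ == "__main__":
    print("simple \\b matches inside longer word:", simple_match is not None)
    print("mark-aware search in longer word:", find_whole_word(WORD, LONGER))
    print("mark-aware search in mixed text:", find_whole_word(WORD, WORD + " " + LONGER))
```

The mark-aware check rejects the occurrence inside રતા because the character after ત is a combining mark, while still accepting the standalone word.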
msg94857 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-11-03 06:12
> Is there a plan or commitment for Python to implement at least "default
> word boundaries" (a Level 2 feature), rather than the current "simple
> word boundaries"?

No such plan exists at this time. Contributions are welcome.
msg113928 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2010-08-14 20:27
These have been added to the new 'regex' module. See issue #2636 or PyPI at: http://pypi.python.org/pypi/regex
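As a hedged illustration of the message above: the third-party 'regex' module exposes a WORD flag (inline form `(?w)`) that switches \b to Unicode default word boundaries. The sketch below assumes the module has been installed from PyPI; the helper name `matches_as_whole_word` is hypothetical, and the function returns None when the module is not available.

```python
try:
    import regex  # third-party module from PyPI, not part of the stdlib
except ImportError:
    regex = None


def matches_as_whole_word(word, text):
    """Return True/False for a whole-word match using default word
    boundaries, or None if the 'regex' module is not installed."""
    if regex is None:
        return None
    pattern = r"\b" + regex.escape(word) + r"\b"
    # regex.WORD asks for Unicode default word boundaries for \b.
    return regex.search(pattern, text, flags=regex.WORD) is not None


if __name__ == "__main__":
    word = "\u0ab0\u0aa4"          # રત
    longer = "\u0ab0\u0aa4\u0abe"  # રતા
    print("standalone word:", matches_as_whole_word(word, word))
    print("inside longer word:", matches_as_whole_word(word, longer))
```

With default word boundaries, the second call is expected not to match, since the trailing vowel sign extends the word, but this depends on the installed version of the module.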
msg113979 - (view)	Author: daniel mccloy (RegEx4All)	Date: 2010-08-15 17:51
Woo-HOOO! Am very excited to hear this! Thanks, Matthew! This and also the related \w \W handling (#1693050) should be extremely useful for processing Indic text. I'm a Python newbie, so I will need to find some help on how to compile/install/use this source-file download, but if I can figure that out, I'd be very happy to test this against texts in a variety of Indic scripts. Way to go!
msg113993 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2010-08-15 18:47
If you're on Windows (x86, 32-bit) then compilation isn't necessary - just use the appropriate _regex.pyd.
msg231607 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2014-11-24 16:16
Closing this old issue: either use the 'regex' module, or wait for issue2636.

History
Date	User	Action	Args
2022-04-11 14:56:54	admin	set	github: 51504
2014-11-24 16:16:58	amaury.forgeotdarc	set	status: open -> closed nosy: + amaury.forgeotdarc messages: + msg231607 resolution: works for me
2010-08-15 18:47:44	mrabarnett	set	messages: + msg113993
2010-08-15 17:51:07	RegEx4All	set	messages: + msg113979
2010-08-14 20:27:37	mrabarnett	set	nosy: + mrabarnett messages: + msg113928
2009-11-03 14:34:24	ezio.melotti	set	priority: normal nosy: + ezio.melotti versions: + Python 2.7
2009-11-03 06:12:46	loewis	set	nosy: + loewis messages: + msg94857
2009-11-03 04:22:02	RegEx4All	create