Issue 32198: \b reports false-positives in Indic strings involving combining marks


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/76379

classification

Title:	\b reports false-positives in Indic strings involving combining marks
Type:	behavior	Stage:
Components:	Regular Expressions	Versions:	Python 3.7, Python 3.6, Python 2.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	ezio.melotti, jamadagni, mrabarnett, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2017-12-02 13:28 by jamadagni, last changed 2022-04-11 14:58 by admin.

Messages (2)
msg307430	Author: Shriramana Sharma (jamadagni)	Date: 2017-12-02 13:28
Code:

import re
cons_taml = "[கஙசஞடணதநபமயரலவழளறன]"
print(re.findall("\\b" + cons_taml + "ை|ஐ", "ஐவர் பையன் இசை சிவிகை இல்லை இவ்ஐ"))
cons_deva = "[कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह]"
print(re.findall("\\b" + cons_deva + "ै|ऐ", "ऐषमः तैलम् ईडै समीशै ईक्षै ईक्ऐ"))

Specs:

Kubuntu Xenial 64 bit
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux

Actual Output:

['ஐ', 'பை', 'கை', 'லை', 'ஐ']
['ऐ', 'तै', 'शै', 'षै', 'ऐ']

Expected Output:

['ஐ', 'பை']
['ऐ', 'तै']

Rationale:

The RE is intended to identify words starting with the vowel /ai/ (\u0BC8 ை in Tamil script and \u0948 ै in Devanagari as vowel sign, or \u0B90 ஐ and \u0910 ऐ as independent vowel). ஐவர் பையன் and ऐषमः तैलम् are the only words fitting this criterion. \b is defined to mark a word boundary and is applied here at the beginning of the RE.

Observation:

There seems to be an assumption that only GC=Lo characters constitute words. Hence the false positives at ச ி வ ி (க ை) and स म ी (श ै), where the ி and ी are vowel signs, and at இ ல ் (ல ை) and ई क ् (ष ै), where the ் and ् are virama characters (vowel-cancelling signs). In Indic scripts, such GC=Mc and GC=Mn characters are inalienable parts of words. They should be identified as parts of words, and no word boundary matching \b should be generated at their positions.
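
The observation above can be checked with the standard library alone. The following is a minimal sketch, not part of the original report: the helper name `describe` is hypothetical, and it simply lists each character's code point, its Unicode general category, and whether `re`'s `\w` matches it. On the interpreters discussed in this issue, the vowel signs and viramas (GC=Mc or GC=Mn) are expected to show up as not matching `\w`, which is where `\b` then finds a boundary.

```python
import re
import unicodedata

def describe(text):
    """Return (code point, general category, matched by re's \\w?) for each character."""
    rows = []
    for ch in text:
        rows.append((
            "U+%04X" % ord(ch),
            unicodedata.category(ch),
            re.fullmatch(r"\w", ch) is not None,
        ))
    return rows

# Words from the report where a word boundary was found before the last syllable.
for word in ("சிவிகை", "இல்லை", "समीशै", "ईक्षै"):
    print(word)
    for row in describe(word):
        print("   ", row)
```

Any row whose category starts with "M" and whose last column is False marks a position after which `\b` can match inside the word.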
msg307446	Author: Serhiy Storchaka (serhiy.storchaka)	Date: 2017-12-02 18:28
This is a known issue. See also issue1693050, issue12731, issue25743. I hope it will be solved in 3.7, and the solution may be backported to 2.7 and 3.6 (but not to 3.5, which takes only security fixes). As a workaround, I suggest using the third-party regex module. It is a mature module, mostly compatible with re, but with better Unicode support and additional features.
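
For cases where a third-party dependency is not an option, a standard-library-only workaround can be sketched as follows. This is an illustration under stated assumptions, not the fix planned for `re`: the helpers `is_word_char` and `findall_at_word_start` are hypothetical names, and the check treats any character whose general category starts with "M" as part of a word. The pattern is passed without `\b`, and the word-start test is applied to each whole match, which corresponds to grouping the alternatives as `\b(?:...|...)` rather than attaching `\b` only to the first alternative.

```python
import re
import unicodedata

def is_word_char(ch):
    """Assumption: alphanumerics, underscore and combining marks (GC=M*) belong to words."""
    return ch == "_" or ch.isalnum() or unicodedata.category(ch).startswith("M")

def findall_at_word_start(pattern, text):
    """Return matches of `pattern` (written without \\b) that begin at the start of a word."""
    found = []
    for m in re.finditer(pattern, text):
        # A match is at a word start if nothing precedes it, or if the
        # preceding character is not a word character in the sense above.
        if m.start() == 0 or not is_word_char(text[m.start() - 1]):
            found.append(m.group())
    return found

cons_taml = "[கஙசஞடணதநபமயரலவழளறன]"
taml = findall_at_word_start(cons_taml + "ை|ஐ", "ஐவர் பையன் இசை சிவிகை இல்லை இவ்ஐ")
print(taml)

cons_deva = "[कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह]"
deva = findall_at_word_start(cons_deva + "ै|ऐ", "ऐषमः तैलम् ईडै समीशै ईक्षै ईक्ऐ")
print(deva)
```

Because `re.finditer` yields non-overlapping matches, a rejected match could in principle consume text that a later valid match would need. That does not happen with the short patterns from the report, but it is a limitation of this post-filtering approach.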

History
Date	User	Action	Args
2022-04-11 14:58:55	admin	set	github: 76379
2017-12-02 18:28:37	serhiy.storchaka	set	versions: + Python 2.7, Python 3.6, Python 3.7, - Python 3.5 nosy: + serhiy.storchaka messages: + msg307446 assignee: serhiy.storchaka
2017-12-02 13:28:09	jamadagni	create