Message 307430 - Python tracker

Author	jamadagni
Recipients	ezio.melotti, jamadagni, mrabarnett
Date	2017-12-02.13:28:09
Message-id	<1512221289.58.0.213398074469.issue32198@psf.upfronthosting.co.za>

Content
Code:

import re

# The non-capturing group makes \b apply to both alternatives:
# consonant + vowel sign, or the independent vowel.
cons_taml = "[கஙசஞடணதநபமயரலவழளறன]"
print(re.findall(r"\b(?:" + cons_taml + "ை|ஐ)", "ஐவர் பையன் இசை சிவிகை இல்லை இவ்ஐ"))
cons_deva = "[कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह]"
print(re.findall(r"\b(?:" + cons_deva + "ै|ऐ)", "ऐषमः तैलम् ईडै समीशै ईक्षै ईक्ऐ"))

Specs:
Kubuntu Xenial 64 bit
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux

Actual Output:
['ஐ', 'பை', 'கை', 'லை', 'ஐ']
['ऐ', 'तै', 'शै', 'षै', 'ऐ']

Expected Output:
['ஐ', 'பை']
['ऐ', 'तै']

Rationale:

The regular expression is meant to identify words *starting* with the vowel /ai/, written either as a vowel sign after a consonant (\u0BC8 ை in Tamil script, \u0948 ै in Devanagari) or as an independent vowel (\u0B90 ஐ in Tamil, \u0910 ऐ in Devanagari). ஐவர் and பையன் in the Tamil string, and ऐषमः and तैलम् in the Devanagari string, are the only words fitting this criterion. \b is defined to mark a word boundary and is applied here at the beginning of the RE.
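Since | has the lowest precedence in a pattern, a leading \b anchors only the first alternative unless the alternation is grouped. A minimal ASCII sketch of that point (the pattern and sample string are illustrative and not taken from the report):

```python
import re

# Without a group, \b belongs only to the first branch ("ab").
ungrouped = re.findall(r"\bab|c", "ab xc c")

# A non-capturing group makes \b apply to both branches.
grouped = re.findall(r"\b(?:ab|c)", "ab xc c")

print(ungrouped)  # ['ab', 'c', 'c']
print(grouped)    # ['ab', 'c']
```

The "c" inside "xc" is found only by the ungrouped pattern, because no boundary is required before it there.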

Observation:

\b appears to assume that only GC=Lo characters (General Category "Letter, other") constitute words. Hence the false positives at ச ி வ ி (க ை) and स म ी (श ै), where ி and ी are vowel signs, and at இ ல ் (ல ை) and ई क ् (ष ै), where ் and ् are virama characters (vowel-cancelling signs).
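The general categories involved can be checked with the standard unicodedata module. The sketch below is only illustrative: the helper name describe is hypothetical, and the third field simply reports whether re's \w class matches the character on the interpreter in use. If the assumption above holds, that field would be False for the vowel signs and viramas.

```python
import re
import unicodedata

# Characters taken from the false positives in this report.
samples = ["க", "ி", "ை", "்", "श", "ी", "ै", "्"]

def describe(ch):
    """Return (code point, general category, matched by the word class)."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        bool(re.fullmatch(r"\w", ch)),
    )

for ch in samples:
    print(describe(ch))
```

The consonants are reported as Lo, while the vowel signs and viramas are reported as Mc or Mn.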

In Indic scripts, such GC=Mc and GC=Mn characters are integral parts of words. They should be recognised as word characters, and \b should not report a word boundary at their positions.
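Until \b behaves this way, the intended word-start test can be approximated by hand. The following is a sketch under stated assumptions, not a fix for re itself: the names marks_in_block, MARKS, WORD_START and words_starting_with_ai are hypothetical, and only the Devanagari and Tamil blocks are covered. It replaces \b with a negative lookbehind that rejects a preceding \w character or combining mark.

```python
import re
import unicodedata

def marks_in_block(first, last):
    """Collect the GC=Mn and GC=Mc characters in a code point range."""
    return "".join(
        chr(cp)
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) in ("Mn", "Mc")
    )

# Combining marks of the Devanagari (U+0900..U+097F) and Tamil (U+0B80..U+0BFF) blocks.
MARKS = marks_in_block(0x0900, 0x097F) + marks_in_block(0x0B80, 0x0BFF)

# Hypothetical stand-in for \b at the start of a word: the preceding
# character must be neither a \w character nor one of the marks above.
WORD_START = r"(?<![\w" + MARKS + "])"

def words_starting_with_ai(consonants, sign, vowel, text):
    """Find consonant + vowel sign, or the independent vowel, at a word start."""
    pattern = WORD_START + "(?:" + consonants + sign + "|" + vowel + ")"
    return re.findall(pattern, text)

cons_taml = "[கஙசஞடணதநபமயரலவழளறன]"
cons_deva = "[कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह]"

tamil = words_starting_with_ai(cons_taml, "ை", "ஐ", "ஐவர் பையன் இசை சிவிகை இல்லை இவ்ஐ")
deva = words_starting_with_ai(cons_deva, "ै", "ऐ", "ऐषमः तैलम् ईडै समीशै ईक्षै ईक्ऐ")
print(tamil)  # ['ஐ', 'பை']
print(deva)   # ['ऐ', 'तै']
```

The lookbehind only covers the start-of-word case needed here; a general replacement for \b would also have to handle the end of a word.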
