Author jamadagni
Recipients ezio.melotti, jamadagni, mrabarnett
Date 2017-12-02.13:28:09
Message-id <1512221289.58.0.213398074469.issue32198@psf.upfronthosting.co.za>
Content
Code:

import re
cons_taml = "[கஙசஞடணதநபமயரலவழளறன]"
print(re.findall("\\b" + cons_taml + "ை|ஐ", "ஐவர் பையன் இசை சிவிகை இல்லை இவ்ஐ"))
cons_deva = "[कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह]"
print(re.findall("\\b" + cons_deva + "ै|ऐ", "ऐषमः तैलम् ईडै समीशै ईक्षै ईक्ऐ"))

Specs:
Kubuntu Xenial 64 bit
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux

Actual Output:
['ஐ', 'பை', 'கை', 'லை', 'ஐ']
['ऐ', 'तै', 'शै', 'षै', 'ऐ']

Expected Output:
['ஐ', 'பை']
['ऐ', 'तै']

Rationale:

The regular expression is meant to find words *starting* with the vowel /ai/, written either as a vowel sign (\u0BC8 ை in Tamil script, \u0948 ै in Devanagari) or as an independent vowel (\u0B90 ஐ, \u0910 ऐ). ஐவர் பையன் and ऐषमः तैलम् are the only words that fit this criterion. \b is defined to mark a word boundary and is applied here at the beginning of the RE.
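
One detail of the pattern may be worth making explicit: in Python's re syntax, | has the lowest precedence, so "\b" + consonant + "ை|ஐ" parses as "(\b consonant ை) or (ஐ)", and \b does not apply to the independent-vowel alternative. The minimal sketch below uses a single consonant க and an illustrative two-letter string (both chosen here, not taken from the report) to show the difference a non-capturing group makes. It only illustrates the precedence rule and is not a fix for the behaviour reported.

```python
import re

# Illustrative string: two letters with no word boundary between them.
sample = "கஐ"

# Ungrouped: \b binds only to the first alternative, so a bare ஐ matches anywhere.
ungrouped = re.findall("\\bகை|ஐ", sample)

# Grouped: \b now applies to both alternatives.
grouped = re.findall("\\b(?:கை|ஐ)", sample)

print(ungrouped)
print(grouped)
```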

Observation:

The implementation seems to assume that only GC=Lo characters constitute words. That would explain the false positives at ச ி வ ி (க ை) and स म ी (श ै), where ி and ी are vowel signs, and at இ ல ் (ல ை) and ई क ् (ष ै), where ் and ् are virama characters (vowel-cancelling signs).
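
The observation can be checked with the standard unicodedata module. The sketch below prints the General Category of the four characters that precede the false positives, together with the result of str.isalnum(). Using isalnum() as a proxy for what the re module treats as a word character is an assumption of this sketch; the helper name describe and the labels are illustrative.

```python
import unicodedata

# Characters that precede the false positives in the report.
marks = {
    "\u0BBF": "Tamil vowel sign i",
    "\u0BCD": "Tamil virama",
    "\u0940": "Devanagari vowel sign ii",
    "\u094D": "Devanagari virama",
}

def describe(ch):
    """Return (general category, str.isalnum() result) for one character."""
    return unicodedata.category(ch), ch.isalnum()

for ch, label in marks.items():
    category, alnum = describe(ch)
    print("U+{:04X} {:<26} GC={} isalnum={}".format(ord(ch), label, category, alnum))
```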

In Indic scripts, such GC=Mc and GC=Mn characters are inseparable parts of words. They should be recognized as word characters, and \b should not report a word boundary at their positions.
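
Until \b behaves this way, one possible workaround is to drop \b from the pattern and filter the matches by the General Category of the preceding character. The sketch below is only an illustration: the helpers is_word_char and findall_at_word_start are hypothetical names, and treating every L*, M* and N* character (plus the underscore) as a word character is an assumption made here, not a definition taken from the re module.

```python
import re
import unicodedata

def is_word_char(ch):
    # Assumption of this sketch: letters (L*), marks (M*), numbers (N*)
    # and the underscore all count as word characters.
    return ch == "_" or unicodedata.category(ch)[0] in "LMN"

def findall_at_word_start(pattern, text):
    """Return matches of pattern that are not preceded by a word character."""
    return [
        m.group()
        for m in re.finditer(pattern, text)
        if m.start() == 0 or not is_word_char(text[m.start() - 1])
    ]

cons_taml = "[கஙசஞடணதநபமயரலவழளறன]"
cons_deva = "[कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह]"

# Same patterns as in the report, without the leading \b.
taml = findall_at_word_start(cons_taml + "ை|ஐ", "ஐவர் பையன் இசை சிவிகை இல்லை இவ்ஐ")
deva = findall_at_word_start(cons_deva + "ै|ऐ", "ऐषमः तैलम् ईडै समीशै ईक्षै ईक्ऐ")
print(taml)
print(deva)
```

With the test strings from the report, this filter keeps only the matches listed under Expected Output.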