Message 31688 - Python tracker

Author	nathanlmiles
Date	2007-04-02.15:27:11

Content
When I use r"\w+(?u)" to find words in Unicode Devanagari text, the words get chopped into small pieces. I think this is likely because vowel signs such as U+093E are not considered to match \w.
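
To make the diagnosis above easier to check, here is a minimal sketch that lists the Unicode general category of each character in a sample Devanagari word. It assumes a current Python 3 interpreter rather than the Python 2.4.4 of this report, and the helper name `describe` and the sample word are made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        ("%04x" % ord(ch), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # Sample word: consonants interleaved with vowel signs and a virama.
    for row in describe("\u0939\u093f\u0928\u094d\u0926\u0940"):
        print(*row)
```

The vowel signs and the virama come back with mark categories (Mc, Mn) rather than a letter category, which is consistent with \w splitting the word at those positions.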

I think that if you wish \w to be useful for Indic scripts, \w will need to be expanded to include the Unicode character categories Mc, Mn and Me.
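
As a rough sketch of what the proposed behaviour could look like, the following splits text into words without the re module, treating letters, numbers, underscore and the three mark categories as word characters. This is only an illustration under assumptions: it targets Python 3, the names `is_word_char` and `find_words` are hypothetical, and it is not how the re module is implemented.

```python
import unicodedata

# Mark categories that the report proposes to count as word characters.
WORD_MARKS = {"Mn", "Mc", "Me"}


def is_word_char(ch):
    """True for letters, numbers, underscore, and Mn/Mc/Me marks."""
    cat = unicodedata.category(ch)
    return ch == "_" or cat[0] in ("L", "N") or cat in WORD_MARKS


def find_words(text):
    """Return maximal runs of word characters, in order of appearance."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    sample = "\u0939\u093f\u0928\u094d\u0926\u0940 \u092d\u093e\u0937\u093e"
    print(find_words(sample))
```

With the marks included, each of the two sample words stays in one piece instead of being split at every vowel sign.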

I am using Python 2.4.4 on Windows XP SP2.

I ran the following script to see the characters which I think ought to match \w but do not:

import re
import unicodedata

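# Build a unicode string of the Devanagari code points under test.
text = u""
for i in range(0x901,0x939): text += unichr(i)
for i in range(0x93c,0x93d): text += unichr(i)
for i in range(0x93e,0x94d): text += unichr(i)
for i in range(0x950,0x954): text += unichr(i)
for i in range(0x958,0x963): text += unichr(i)

# Print every character that \W matches, i.e. that \w rejects,
# together with its Unicode general category.
parts = re.findall(r"\W(?u)", text)
for ch in parts:
    print "%04x" % ord(ch), unicodedata.category(ch)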

The odd character here is U+0904. Its categorization seems to imply that you are using the Unicode 3.0 database, but perhaps later versions of Python are using the current 5.0 database.
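
One way to check the database question is to ask the interpreter directly. The sketch below assumes Python 3, where `unicodedata.unidata_version` reports the bundled database version and `unicodedata.ucd_3_2_0` offers a frozen Unicode 3.2.0 view. The function name `short_a_report` is made up for illustration. As far as I know, U+0904 was only assigned in Unicode 4.0, so an older database should report it as unassigned (Cn).

```python
import unicodedata


def short_a_report():
    """Compare how U+0904 is categorized by the current and the 3.2.0 database."""
    ch = "\u0904"
    return {
        "current_version": unicodedata.unidata_version,
        "current_category": unicodedata.category(ch),
        "ucd_3_2_0_version": unicodedata.ucd_3_2_0.unidata_version,
        "ucd_3_2_0_category": unicodedata.ucd_3_2_0.category(ch),
    }


if __name__ == "__main__":
    for key, value in short_a_report().items():
        print(key, value)
```

If the two categories differ, the odd result for U+0904 is explained by the database version rather than by the regular expression engine.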

History
Date	User	Action	Args
2007-08-23 14:52:54	admin	link	issue1693050 messages
2007-08-23 14:52:54	admin	create