classification
Title: \w not helpful for non-Roman scripts
Type:
Stage:
Components: Regular Expressions
Versions: Python 3.1, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, haypo, l0nwlf, lemburg, loewis, mrabarnett, nathanlmiles, rsc, terry.reedy, timehorse
Priority: normal
Keywords:

Created on 2007-04-02 15:27 by nathanlmiles, last changed 2010-03-31 01:29 by l0nwlf.

Messages (5)
msg31688 - Author: nlmiles (nathanlmiles) Date: 2007-04-02 15:27
When I use r"\w+(?u)" to find words in Unicode Devanagari text, the words get chopped into small pieces. I think this is likely because vowel signs such as U+093E are not considered to match \w.

For \w to be useful for Indic scripts, it will need to be expanded to include the Unicode character categories Mc, Mn, and Me.

I am using Python 2.4.4 on Windows XP SP2.

I ran the following script to see the characters which I think ought to match \w but don't:

import re
import unicodedata

# Build a string of the Devanagari code points that ought to match \w.
text = ""
for i in range(0x901, 0x939): text += unichr(i)
for i in range(0x93c, 0x93d): text += unichr(i)
for i in range(0x93e, 0x94d): text += unichr(i)
for i in range(0x950, 0x954): text += unichr(i)
for i in range(0x958, 0x963): text += unichr(i)

# With the (?u) flag, \W finds the characters not treated as word characters.
parts = re.findall(r"\W(?u)", text)
for ch in parts:
    print "%04x" % ord(ch), unicodedata.category(ch)

The odd character here is 0904. Its categorization seems to imply that you are using the Unicode 3.0 database, but perhaps later versions of Python are using the current 5.0 database.
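
For anyone repeating this survey on Python 3, a minimal sketch follows. It does not depend on how re treats \w; it only asks unicodedata which code points in the ranges above are combining marks (Mc, Mn, Me). The names DEVANAGARI_RANGES, MARK_CATEGORIES, and mark_code_points are illustrative and not part of any library, and the exact list printed depends on the Unicode database bundled with the interpreter.

```python
import unicodedata

# Half-open ranges copied from the script above (hypothetical constant name).
DEVANAGARI_RANGES = [
    (0x901, 0x939),
    (0x93C, 0x93D),
    (0x93E, 0x94D),
    (0x950, 0x954),
    (0x958, 0x963),
]

# The three mark categories the report proposes adding to \w.
MARK_CATEGORIES = {"Mc", "Mn", "Me"}


def mark_code_points(ranges=DEVANAGARI_RANGES):
    """Return (code point, category) pairs for combining marks in the ranges."""
    found = []
    for start, stop in ranges:
        for cp in range(start, stop):
            category = unicodedata.category(chr(cp))
            if category in MARK_CATEGORIES:
                found.append((cp, category))
    return found


if __name__ == "__main__":
    for cp, category in mark_code_points():
        print("%04x" % cp, category)
```

The vowel sign 093e mentioned above shows up in this list as Mc, while ordinary letters such as 0915 do not appear.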
msg31689 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2007-04-02 15:38
Python 2.4 is using Unicode 3.2. Python 2.5 ships with Unicode 4.1.

We're likely to ship Unicode 5.x with Python 2.6 or a later release.

Regarding the char classes: I don't think Mc, Mn and Me should be considered parts of a word. Those are marks which usually separate words.
msg76556 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2008-11-28 21:14
Vowel 'marks' are condensed vowel characters and are very much part of
words and do not separate words.  Python3 properly includes Mn and Mc as
identifier characters.

http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords

For instance, the word 'hindi' has 3 consonants 'h', 'n', 'd', 2 vowels
'i' and 'ii' (long i) following 'h' and 'd', and a null vowel (virama)
after 'n'. [The null vowel is needed because no vowel mark indicates the
default vowel short a.  So without it, the word would be hinadii.]
The difference between the Devanagari vowel characters, used at the
beginning of words, and the vowel marks, used thereafter, is purely
graphical and not phonological.  In short, in the Sanskrit family,
word = syllable+
syllable = vowel | consonant + vowel mark
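
The grammar above can be sketched as a regular expression over explicit Devanagari ranges. This is a rough Python 3 illustration, not a full model of the script: the code point ranges, the names VOWEL, CONSONANT, VOWEL_MARK, SYLLABLE, WORD, and is_word, and the choice to make the vowel mark optional (so that a bare consonant carries the default short a) are all assumptions, and nukta, nasalization signs, and other marks are ignored.

```python
import re

# Assumed ranges from the Devanagari block (illustrative, not exhaustive).
VOWEL = "[\u0904-\u0914]"        # independent vowel letters
CONSONANT = "[\u0915-\u0939]"    # consonant letters
VOWEL_MARK = "[\u093E-\u094D]"   # dependent vowel signs plus virama

# syllable = vowel | consonant + vowel mark (mark made optional here)
SYLLABLE = "(?:%s|%s%s?)" % (VOWEL, CONSONANT, VOWEL_MARK)
# word = syllable+
WORD = re.compile("(?:%s)+" % SYLLABLE)


def is_word(text):
    """Return True if the whole string is a sequence of syllables."""
    return WORD.fullmatch(text) is not None
```

Because the character classes are spelled out explicitly, this pattern behaves the same regardless of how \w is defined.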

From a comp.lang.python (clp) post asking why re does not see hindi as a word:

हिन्दी
     ह DEVANAGARI LETTER HA (Lo)
     ि DEVANAGARI VOWEL SIGN I (Mc)
     न DEVANAGARI LETTER NA (Lo)
     ् DEVANAGARI SIGN VIRAMA (Mn)
     द DEVANAGARI LETTER DA (Lo)
     ी DEVANAGARI VOWEL SIGN II (Mc)

.isalpha and possibly other unicode methods need fixing also.
>>> 'हिन्दी'.isalpha()  # 2.x and 3.0
False
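
The breakdown above can be regenerated with unicodedata. A small Python 3 sketch, in which the constant HINDI and the helper describe are hypothetical names chosen for illustration:

```python
import unicodedata

# The word from the breakdown above, spelled as escapes to keep it unambiguous.
HINDI = "\u0939\u093f\u0928\u094d\u0926\u0940"


def describe(text):
    """Return (character name, general category) for each code point."""
    return [(unicodedata.name(ch), unicodedata.category(ch)) for ch in text]


if __name__ == "__main__":
    for name, category in describe(HINDI):
        print(name, "(%s)" % category)
    # str.isalpha() requires every character to be a letter,
    # so the Mc and Mn marks make it return False.
    print(HINDI.isalpha())
```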
msg76557 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-11-28 21:33
Unicode TR#18 defines \w as a shorthand for

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}

which would include all marks. We should recursively check whether we
follow the recommendation (e.g. \p{alpha} refers to all characters having
the Alphabetic derived core property, which is Lu+Ll+Lt+Lm+Lo+Nl +
Other_Alphabetic, where Other_Alphabetic is a selected list of
additional characters, all from Mn/Mc).
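
As a rough illustration of that recommendation, the Python 3 sketch below approximates the four classes with general categories from unicodedata. The unicodedata module does not expose the Alphabetic derived property, so \p{alpha} is approximated here by the letter categories plus Nl, with Other_Alphabetic assumed to be covered by the marks. The helper names is_word_char and find_words are hypothetical.

```python
import itertools
import unicodedata


def is_word_char(ch):
    """Approximate the TR#18 \\w classes using general categories only."""
    category = unicodedata.category(ch)
    return (
        category.startswith("L")      # letters, approximating \p{alpha}
        or category == "Nl"           # letter numbers, also Alphabetic
        or category.startswith("M")   # \p{gc=Mark}: Mn, Mc, Me
        or category == "Nd"           # \p{digit}
        or category == "Pc"           # \p{gc=Connector_Punctuation}
    )


def find_words(text):
    """Split text into maximal runs of word characters."""
    return [
        "".join(group)
        for is_word, group in itertools.groupby(text, key=is_word_char)
        if is_word
    ]
```

With marks counted as word characters, a Devanagari word stays in one piece instead of being split at each vowel sign.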
msg81221 - Author: Matthew Barnett (mrabarnett) * Date: 2009-02-05 19:51
In issue #2636 I'm using the following:

Alpha is Ll, Lo, Lt, Lu.
Digit is Nd.
Word is Ll, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc.

These are what are specified at
http://www.regular-expressions.info/posixbrackets.html
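
Written out as Python 3 sets, the three definitions above look roughly like this. Only the category lists come from the message; the names ALPHA, DIGIT, WORD, and all_in are illustrative and are not taken from the issue #2636 code.

```python
import unicodedata

# General-category sets as listed in the message above.
ALPHA = frozenset({"Ll", "Lo", "Lt", "Lu"})
DIGIT = frozenset({"Nd"})
WORD = ALPHA | DIGIT | frozenset({"Mc", "Me", "Mn", "Nl", "No", "Pc"})


def all_in(text, categories):
    """Return True if every character's general category is in the set."""
    return all(unicodedata.category(ch) in categories for ch in text)
```

Under these sets the word from msg76556 qualifies as Word but not as Alpha, because it contains Mc and Mn characters.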
History
Date User Action Args
2010-03-31 01:29:17 l0nwlf set nosy: + l0nwlf
2010-03-05 15:37:50 haypo set nosy: + haypo
2009-05-12 14:41:55 ezio.melotti set nosy: + ezio.melotti
2009-02-05 19:51:20 mrabarnett set nosy: + mrabarnett; messages: + msg81221
2008-11-28 21:33:40 loewis set nosy: + loewis; messages: + msg76557
2008-11-28 21:14:55 terry.reedy set nosy: + terry.reedy; messages: + msg76556; versions: + Python 3.1
2008-09-28 19:20:16 timehorse set nosy: + timehorse; versions: + Python 2.7, - Python 2.4
2008-04-24 21:07:01 rsc set nosy: + rsc
2007-04-02 15:27:11 nathanlmiles create