
Author tchrist
Recipients Arfrever, ezio.melotti, gvanrossum, loewis, tchrist, terry.reedy, vstinner
Date 2011-09-30.12:37:56
Message-id <26418.1317386261@chthon>
In-reply-to <4E859BAF.2050505@v.loewis.de>
Content
> Martin v. Löwis <martin@v.loewis.de> added the comment:

> "Split S into words. Change the first letter in a word to upper-case,

Except that I think you actually mean that the first "letter" is 
changed into titlecase not uppercase.  
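
To make the titlecase/uppercase distinction concrete, here is a minimal Python sketch. It assumes a Python 3 interpreter whose `str.upper()` and `str.title()` apply the Unicode case mappings; the choice of U+01C6 as the sample character is mine, not something from the quoted proposal.

```python
# U+01C6 LATIN SMALL LETTER DZ WITH CARON is a digraph whose uppercase
# and titlecase forms are different code points.
dz = "\u01c6"              # ǆ

upper_form = dz.upper()    # uppercase mapping of the digraph
title_form = dz.title()    # titlecase mapping of the digraph

print(f"lower U+{ord(dz):04X}  upper U+{ord(upper_form):04X}  "
      f"title U+{ord(title_form):04X}")
```

For this character the two mappings land on different code points, which is why "upper-case the first letter" and "titlecase the first letter" are not interchangeable.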

One might also say *try* to change for all of these, because not
all cased code points in Unicode have casemaps that differ
from themselves.  For example, a superscript lowercase a or b has
no distinct uppercase mapping, the way the non-superscript versions do:

    % (echo xyz; echo ab AB | unisupers) | uc
    XYZ
    ᵃᵇ ᴬᴮ
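
The same observation can be sketched in Python, assuming Python 3's built-in `str.upper()`. The helper name `has_distinct_upper` is hypothetical, and U+1D43 and U+1D47 are the modifier letters I take the superscript a and b above to be.

```python
def has_distinct_upper(ch):
    """Return True if uppercasing ch yields something other than ch."""
    return ch.upper() != ch

plain = "ab"
superscript = "\u1d43\u1d47"   # ᵃᵇ, MODIFIER LETTER SMALL A and SMALL B

for ch in plain + superscript:
    print(f"U+{ord(ch):04X} distinct uppercase: {has_distinct_upper(ch)}")
```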

> and all subsequent letters to lower case. A word is a sequence that
> starts with a letter, followed by letter-related characters."

I don't like the way you have defined letters and letter-related
characters.  The first already has a definition, which is not the
one you are using.  Word characters also has a definition in Unicode,
and it is not the one you are using.  I strongly advise against
redefining standard Unicode properties.  Choose other, unused terms 
if you must.  It is very confusing otherwise.

> Letters are all characters from the "Alphabetic" category, i.e.
> Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.

Except that is exactly the definition of the Unicode Alphabetic property,
not the Unicode Letter property.  It is a mistake to equate
Letter=Alphabetic, and very confusing too.

I agree that this is probably what you want, though.  I just don't think you
should use "letter-related characters" when there is an existing formal
definition that works, or that you should redefine Letter.
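
As a rough illustration of why Letter and Alphabetic should not be equated, the following Python sketch tests the general category only. The standard `unicodedata` module does not expose the Alphabetic property, so the fact that the two non-letter samples are Alphabetic is taken from the discussion here rather than computed; `is_letter` is a hypothetical helper.

```python
import unicodedata

# General categories that make up the Unicode Letter (L) property.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_letter(ch):
    """True if ch has a Letter general category (not the same as Alphabetic)."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

# "a" is a Letter; U+2160 is a letter number (Nl); U+0345 is a mark (Mn).
samples = ["a", "\u2160", "\u0345"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"GC={unicodedata.category(ch)} letter={is_letter(ch)}")
```

The roman numeral and the iota subscript are both outside Letter, yet both belong to Alphabetic as described in this message.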

> "letter-related" characters are letters + marks (Mn, Mc, Me).

That isn't quite right.  

 * Letters are Lu+Ll+Lt+Lm+Lo.

 * Alphabetic is Letters + the letter numbers (Nl) + Other_Alphabetic.

 * Other_Alphabetic is certain marks (like the iota subscript), as well
   as a few symbols.

 * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

I think what you are looking for here is Word characters without 
Nd + Pc, so just Alphabetic + Mn+Mc+Me.  

Is that right?
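
Here is a minimal sketch of that rule in Python, under a loud assumption: since the standard library has no Alphabetic property, the code approximates it by general category (Letters plus Nl), which misses the few Other_Alphabetic symbols. The function name `titlecase_words` and the two category sets are hypothetical, not an existing API.

```python
import unicodedata

# Approximation of Alphabetic: Letters plus letter numbers (Nl).
# Assumption: the Other_Alphabetic symbols are ignored; the marks are
# covered by MARKS below anyway.
WORD_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
MARKS = {"Mn", "Mc", "Me"}

def titlecase_words(s):
    """Titlecase the first character of each word, lowercase the rest.

    A word starts at a WORD_START character and continues through
    WORD_START characters and marks; anything else ends the word.
    """
    out = []
    in_word = False
    for ch in s:
        cat = unicodedata.category(ch)
        if in_word and (cat in WORD_START or cat in MARKS):
            out.append(ch.lower())
        elif cat in WORD_START:
            out.append(ch.title())   # titlecase, not uppercase
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)

print(titlecase_words("hello wORLD"))
print(titlecase_words("e\u0301cole"))   # combining acute stays inside the word
```

Because marks continue a word, a combining accent after the first letter does not start a new word, while a digit (Nd) or connector punctuation (Pc) does end one.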

--tom

PS: You can do union/intersection stuff with properties to see what
    the resulting sets look like using the unichars command-line tool.

    This is everything that is both alphabetic and also a mark:

    % unichars -gs '\p{Alphabetic}' '\pM'
    ‭ ○ͅ  U+0345 GC=Mn SC=Inherited    COMBINING GREEK YPOGEGRAMMENI
    ‭ ○ְ  U+05B0 GC=Mn SC=Hebrew       HEBREW POINT SHEVA
    ‭ ○ֱ  U+05B1 GC=Mn SC=Hebrew       HEBREW POINT HATAF SEGOL
    ‭ ○ֲ  U+05B2 GC=Mn SC=Hebrew       HEBREW POINT HATAF PATAH
    ‭ ○ֳ  U+05B3 GC=Mn SC=Hebrew       HEBREW POINT HATAF QAMATS
    ...
    ‭ ○ं  U+0902 GC=Mn SC=Devanagari   DEVANAGARI SIGN ANUSVARA
    ‭ ः  U+0903 GC=Mc SC=Devanagari   DEVANAGARI SIGN VISARGA
    ‭ ा  U+093E GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN AA
    ‭ ि  U+093F GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN I
    ‭ ी  U+0940 GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN II
    ‭ ○ु  U+0941 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN U
    ‭ ○ू  U+0942 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN UU
    ‭ ○ृ  U+0943 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN VOCALIC R
    ‭ ○ॄ  U+0944 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN VOCALIC RR
    ...

    These, by contrast, are the NON-alphabetic marks, which are of
    course still Word characters:

    % unichars -gs '\P{Alphabetic}' '\pM'
    ‭ ○̀  U+0300 GC=Mn SC=Inherited    COMBINING GRAVE ACCENT
    ‭ ○́  U+0301 GC=Mn SC=Inherited    COMBINING ACUTE ACCENT
    ‭ ○̂  U+0302 GC=Mn SC=Inherited    COMBINING CIRCUMFLEX ACCENT
    ‭ ○̃  U+0303 GC=Mn SC=Inherited    COMBINING TILDE
    ‭ ○̄  U+0304 GC=Mn SC=Inherited    COMBINING MACRON
    ‭ ○̅  U+0305 GC=Mn SC=Inherited    COMBINING OVERLINE
    ‭ ○̆  U+0306 GC=Mn SC=Inherited    COMBINING BREVE
    ‭ ○̇  U+0307 GC=Mn SC=Inherited    COMBINING DOT ABOVE
    ‭ ○̈  U+0308 GC=Mn SC=Inherited    COMBINING DIAERESIS
    ‭ ○̉  U+0309 GC=Mn SC=Inherited    COMBINING HOOK ABOVE
    ‭ ○̊  U+030A GC=Mn SC=Inherited    COMBINING RING ABOVE
    ‭ ○̋  U+030B GC=Mn SC=Inherited    COMBINING DOUBLE ACUTE ACCENT
    ‭ ○̌  U+030C GC=Mn SC=Inherited    COMBINING CARON
    ...

    And here are the Cased code points that do not change when 
    upper-, title-, or lowercased:

    % unichars -gs '\p{Cased}' '[^\p{CWU}\p{CWT}\p{CWL}]'
    ‭ ª  U+00AA GC=Ll SC=Latin        FEMININE ORDINAL INDICATOR
    ‭ º  U+00BA GC=Ll SC=Latin        MASCULINE ORDINAL INDICATOR
    ‭ ĸ  U+0138 GC=Ll SC=Latin        LATIN SMALL LETTER KRA
    ‭ ƍ  U+018D GC=Ll SC=Latin        LATIN SMALL LETTER TURNED DELTA
    ‭ ƛ  U+019B GC=Ll SC=Latin        LATIN SMALL LETTER LAMBDA WITH STROKE
    ‭ ƪ  U+01AA GC=Ll SC=Latin        LATIN LETTER REVERSED ESH LOOP
    ‭ ƫ  U+01AB GC=Ll SC=Latin        LATIN SMALL LETTER T WITH PALATAL HOOK
    ‭ ƺ  U+01BA GC=Ll SC=Latin        LATIN SMALL LETTER EZH WITH TAIL
    ‭ ƾ  U+01BE GC=Ll SC=Latin        LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
    ‭ ȡ  U+0221 GC=Ll SC=Latin        LATIN SMALL LETTER D WITH CURL
    ‭ ȴ  U+0234 GC=Ll SC=Latin        LATIN SMALL LETTER L WITH CURL
    ‭ ȵ  U+0235 GC=Ll SC=Latin        LATIN SMALL LETTER N WITH CURL
    ‭ ȶ  U+0236 GC=Ll SC=Latin        LATIN SMALL LETTER T WITH CURL
    ‭ ȷ  U+0237 GC=Ll SC=Latin        LATIN SMALL LETTER DOTLESS J
    ‭ ȸ  U+0238 GC=Ll SC=Latin        LATIN SMALL LETTER DB DIGRAPH
    ‭ ȹ  U+0239 GC=Ll SC=Latin        LATIN SMALL LETTER QP DIGRAPH
    ‭ ɕ  U+0255 GC=Ll SC=Latin        LATIN SMALL LETTER C WITH CURL
    ‭ ɘ  U+0258 GC=Ll SC=Latin        LATIN SMALL LETTER REVERSED E
    ‭ ɚ  U+025A GC=Ll SC=Latin        LATIN SMALL LETTER SCHWA WITH HOOK
    ‭ ɜ  U+025C GC=Ll SC=Latin        LATIN SMALL LETTER REVERSED OPEN E
    ‭ ɝ  U+025D GC=Ll SC=Latin        LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
    ‭ ɞ  U+025E GC=Ll SC=Latin        LATIN SMALL LETTER CLOSED REVERSED OPEN E
    ‭ ɟ  U+025F GC=Ll SC=Latin        LATIN SMALL LETTER DOTLESS J WITH STROKE
    ‭ ɡ  U+0261 GC=Ll SC=Latin        LATIN SMALL LETTER SCRIPT G
    ‭ ɢ  U+0262 GC=Ll SC=Latin        LATIN LETTER SMALL CAPITAL G
    ‭ ɤ  U+0264 GC=Ll SC=Latin        LATIN SMALL LETTER RAMS HORN
    ‭ ɥ  U+0265 GC=Ll SC=Latin        LATIN SMALL LETTER TURNED H
    ‭ ɦ  U+0266 GC=Ll SC=Latin        LATIN SMALL LETTER H WITH HOOK
    ...
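
A rough Python counterpart to that last query, offered only as a sketch: the standard library exposes neither Cased nor the CWU/CWT/CWL properties, so this assumes general category Ll as a stand-in for Cased and compares each character with its own case mappings. The helper name `caseless_lowercase_letters` is hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def caseless_lowercase_letters(start, stop):
    """Return Ll characters in [start, stop) that every case mapping leaves unchanged."""
    found = []
    for cp in range(start, stop):
        ch = chr(cp)
        # Assumption: Ll approximates Cased; Lu, Lt and the Other_* cased
        # code points are not examined here.
        if unicodedata.category(ch) != "Ll":
            continue
        if ch.upper() == ch and ch.title() == ch and ch.lower() == ch:
            found.append(ch)
    return found

# Scan Latin Extended-A, which contains U+0138 from the listing above.
for ch in caseless_lowercase_letters(0x0100, 0x0180):
    print(f"U+{ord(ch):04X} GC=Ll {unicodedata.name(ch)}")
```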

    You can get unichars from http://training.perl.com/scripts/unichars
    where you might also care to get uniprops and perhaps uninames to go
    with it.  There are other Unicode tools there (the directory is
    100% Unicode tools, not general scripts as its name suggests), but
    those are the important ones, I reckon.
History
Date User Action Args
2011-09-30 12:37:58  tchrist  set  recipients: + tchrist, gvanrossum, loewis, terry.reedy, vstinner, ezio.melotti, Arfrever
2011-09-30 12:37:57  tchrist  link  issue12737 messages
2011-09-30 12:37:56  tchrist  create