Message144688
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> "Split S into words. Change the first letter in a word to upper-case,
Except that I think you actually mean that the first "letter" is
changed into titlecase not uppercase.
One might also say *try* to change in all these cases, since not
all cased code points in Unicode have case mappings that differ
from themselves. For example, a superscript lowercase a or b has
no distinct uppercase mapping, the way the non-superscript versions do:
% (echo xyz; echo ab AB | unisupers) | uc
XYZ
ᵃᵇ ᴬᴮ
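
The same behaviour can be checked from Python, as a rough sketch. It assumes that Python's str.upper(), str.lower() and str.title() follow the Unicode case mappings, and the variable names are only illustrative. The superscripts are written as escapes so the example does not depend on fonts.

```python
# Modifier (superscript) letters, written as escapes:
# U+1D43 small a, U+1D47 small b, U+1D2C capital A, U+1D2E capital B.
supers = "\u1d43\u1d47 \u1d2c\u1d2e"

plain_upper = "xyz".upper()     # ordinary letters have uppercase mappings
supers_upper = supers.upper()   # no case mappings exist for these code points
supers_lower = supers.lower()
supers_title = supers.title()

print(plain_upper)
print(supers_upper == supers, supers_lower == supers, supers_title == supers)
```

So a rule that says "change to uppercase" can only ever mean "apply the mapping, if there is one".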
> and all subsequent letters to lower case. A word is a sequence that
> starts with a letter, followed by letter-related characters."
I don't like the way you have defined letters and letter-related
characters. The first already has a definition, which is not the
one you are using. Word characters also have a definition in Unicode,
and it is not the one you are using. I strongly advise against
redefining standard Unicode properties. Choose other, unused terms
if you must. It is very confusing otherwise.
> Letters are all characters from the "Alphabetic" category, i.e.
> Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.
Except that is exactly the definition of the Unicode Alphabetic property,
not the Unicode Letter property. It is a mistake to equate
Letter=Alphabetic, and very confusing too.
I agree that this is probably what you want, though. I just don't think you
should use "letter-related characters" when there is an existing formal
definition that works, or that you should redefine Letter.
> "letter-related" characters are letters + marks (Mn, Mc, Me).
That isn't quite right.
* Letters are Lu+Ll+Lt+Lm+Lo.
* Alphabetic is Letters + Other_Alphabetic.
* Other_Alphabetic is certain marks (like the iota subscript) and the
letter numbers (Nl), as well as a few symbols.
* Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.
I think what you are looking for here are Word characters without
Nd + Pc, so just Alphabetic + Mn+Mc+Me.
Is that right?
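
Here is a minimal Python sketch of those distinctions, using only the standard unicodedata module. That module exposes the General_Category but not the Alphabetic property, so the last helper is an approximation I am assuming for illustration: it treats L* + Nl + M* as "Alphabetic + Mn+Mc+Me", which misses the few symbols that are Other_Alphabetic. The function names are hypothetical.

```python
import unicodedata

def is_letter(ch):
    # Unicode Letter: general categories Lu, Ll, Lt, Lm, Lo.
    return unicodedata.category(ch).startswith("L")

def is_mark(ch):
    # Unicode Mark: general categories Mn, Mc, Me.
    return unicodedata.category(ch).startswith("M")

def is_title_word_char(ch):
    # Approximation of "Alphabetic + Mn+Mc+Me": letters, letter numbers (Nl),
    # and all marks. Decimal digits (Nd) and connector punctuation (Pc),
    # which full Word characters include, are deliberately left out.
    cat = unicodedata.category(ch)
    return cat[0] in "LM" or cat == "Nl"

for sample in ["a", "\u0301", "\u2160", "5", "_"]:
    print(f"U+{ord(sample):04X}", unicodedata.category(sample),
          is_title_word_char(sample))
```

An exact test would need the real Alphabetic property, which the standard library does not provide.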
--tom
PS: You can do union/intersection stuff with properties to see what
the resulting sets look like using the unichars command-line tool.
This is everything that is both alphabetic and also a mark:
% unichars -gs '\p{Alphabetic}' '\pM'
○ͅ U+0345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
○ְ U+05B0 GC=Mn SC=Hebrew HEBREW POINT SHEVA
○ֱ U+05B1 GC=Mn SC=Hebrew HEBREW POINT HATAF SEGOL
○ֲ U+05B2 GC=Mn SC=Hebrew HEBREW POINT HATAF PATAH
○ֳ U+05B3 GC=Mn SC=Hebrew HEBREW POINT HATAF QAMATS
...
○ं U+0902 GC=Mn SC=Devanagari DEVANAGARI SIGN ANUSVARA
ः U+0903 GC=Mc SC=Devanagari DEVANAGARI SIGN VISARGA
ा U+093E GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN AA
ि U+093F GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN I
ी U+0940 GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN II
○ु U+0941 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN U
○ू U+0942 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN UU
○ृ U+0943 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC R
○ॄ U+0944 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC RR
...
These, by contrast, are the NON-alphabetic marks, which are of course
still Word characters:
% unichars -gs '\P{Alphabetic}' '\pM'
○̀ U+0300 GC=Mn SC=Inherited COMBINING GRAVE ACCENT
○́ U+0301 GC=Mn SC=Inherited COMBINING ACUTE ACCENT
○̂ U+0302 GC=Mn SC=Inherited COMBINING CIRCUMFLEX ACCENT
○̃ U+0303 GC=Mn SC=Inherited COMBINING TILDE
○̄ U+0304 GC=Mn SC=Inherited COMBINING MACRON
○̅ U+0305 GC=Mn SC=Inherited COMBINING OVERLINE
○̆ U+0306 GC=Mn SC=Inherited COMBINING BREVE
○̇ U+0307 GC=Mn SC=Inherited COMBINING DOT ABOVE
○̈ U+0308 GC=Mn SC=Inherited COMBINING DIAERESIS
○̉ U+0309 GC=Mn SC=Inherited COMBINING HOOK ABOVE
○̊ U+030A GC=Mn SC=Inherited COMBINING RING ABOVE
○̋ U+030B GC=Mn SC=Inherited COMBINING DOUBLE ACUTE ACCENT
○̌ U+030C GC=Mn SC=Inherited COMBINING CARON
...
And here are the Cased code points that do not change when
upper-, title-, or lowercased:
% unichars -gs '\p{Cased}' '[^\p{CWU}\p{CWT}\p{CWL}]'
ª U+00AA GC=Ll SC=Latin FEMININE ORDINAL INDICATOR
º U+00BA GC=Ll SC=Latin MASCULINE ORDINAL INDICATOR
ĸ U+0138 GC=Ll SC=Latin LATIN SMALL LETTER KRA
ƍ U+018D GC=Ll SC=Latin LATIN SMALL LETTER TURNED DELTA
ƛ U+019B GC=Ll SC=Latin LATIN SMALL LETTER LAMBDA WITH STROKE
ƪ U+01AA GC=Ll SC=Latin LATIN LETTER REVERSED ESH LOOP
ƫ U+01AB GC=Ll SC=Latin LATIN SMALL LETTER T WITH PALATAL HOOK
ƺ U+01BA GC=Ll SC=Latin LATIN SMALL LETTER EZH WITH TAIL
ƾ U+01BE GC=Ll SC=Latin LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
ȡ U+0221 GC=Ll SC=Latin LATIN SMALL LETTER D WITH CURL
ȴ U+0234 GC=Ll SC=Latin LATIN SMALL LETTER L WITH CURL
ȵ U+0235 GC=Ll SC=Latin LATIN SMALL LETTER N WITH CURL
ȶ U+0236 GC=Ll SC=Latin LATIN SMALL LETTER T WITH CURL
ȷ U+0237 GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J
ȸ U+0238 GC=Ll SC=Latin LATIN SMALL LETTER DB DIGRAPH
ȹ U+0239 GC=Ll SC=Latin LATIN SMALL LETTER QP DIGRAPH
ɕ U+0255 GC=Ll SC=Latin LATIN SMALL LETTER C WITH CURL
ɘ U+0258 GC=Ll SC=Latin LATIN SMALL LETTER REVERSED E
ɚ U+025A GC=Ll SC=Latin LATIN SMALL LETTER SCHWA WITH HOOK
ɜ U+025C GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E
ɝ U+025D GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
ɞ U+025E GC=Ll SC=Latin LATIN SMALL LETTER CLOSED REVERSED OPEN E
ɟ U+025F GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J WITH STROKE
ɡ U+0261 GC=Ll SC=Latin LATIN SMALL LETTER SCRIPT G
ɢ U+0262 GC=Ll SC=Latin LATIN LETTER SMALL CAPITAL G
ɤ U+0264 GC=Ll SC=Latin LATIN SMALL LETTER RAMS HORN
ɥ U+0265 GC=Ll SC=Latin LATIN SMALL LETTER TURNED H
ɦ U+0266 GC=Ll SC=Latin LATIN SMALL LETTER H WITH HOOK
...
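
A rough Python counterpart to that last query, as a sketch under stated assumptions: the standard library has no Cased, CWU, CWT or CWL properties, so this approximates Cased by the general categories Lu, Ll and Lt, and tests "does not change" by comparing each character with its own upper(), lower() and title() results. The function name and the default range are hypothetical, and the output depends on the Unicode version your Python was built with (for instance, later Unicode versions reclassify U+00AA and U+00BA as Lo, so they would drop out).

```python
import unicodedata

def unchanging_cased(start=0, stop=0x250):
    """Yield characters in [start, stop) that look cased but have no case change."""
    for cp in range(start, stop):
        ch = chr(cp)
        # Approximate the Cased property by general category.
        if unicodedata.category(ch) not in ("Lu", "Ll", "Lt"):
            continue
        # Keep only characters that map to themselves under all three operations.
        if ch.upper() == ch and ch.lower() == ch and ch.title() == ch:
            yield ch

found = list(unchanging_cased())
for ch in found:
    print(ch, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

Widening the range to all of Unicode works the same way, it just takes longer to run.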
You can get unichars from http://training.perl.com/scripts/unichars
where you might also care to get uniprops and perhaps uninames to go
with it. There are other Unicode tools there (the directory is
100% Unicode tools, not general scripts as its name suggests), but
those are the important ones, I reckon.

Date                | User    | Action | Args
2011-09-30 12:37:58 | tchrist | set    | recipients: + tchrist, gvanrossum, loewis, terry.reedy, vstinner, ezio.melotti, Arfrever
2011-09-30 12:37:57 | tchrist | link   | issue12737 messages
2011-09-30 12:37:56 | tchrist | create |