Message 144802 - Python tracker

Author	tchrist
Recipients	ezio.melotti, gvanrossum, lemburg, loewis, mrabarnett, tchrist, terry.reedy
Date	2011-10-03.02:25:21
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<9072.1317605488@chthon>
In-reply-to	<4E88DE0D.2020800@udel.edu>

Content

>> Really?  White space makes things harder to read?  I thought Pythonistas
>> believed the opposite of that.

> I was surprised at that too ;-). One person's opinion in a specific 
> context. Don't generalize.

The example I initially showed probably wasn't the best for that.
Mostly I was just trying to demonstrate how useful it is to have
user-defined properties.  But I have not asked for that (I have asked for
properties, though).

>> English titling rules
>> only capitalize the first word in hyphenated words, which is why it's
>> Anti‐intellectual not Anti-Intellectual.

> Except that I can imagine someone using the latter as a noun to make the 
> work more officious or something. 

If Good-Looking looks more officious than Good-looking, I bet GOOD-LOOKING
is better still. :)

> There are no official English titling rules and as you noted,
> publishers vary. 

If there aren't any rules, then how come all book and movie titles always
look the same?  :)  I don't think anyone would argue with these two:

 1. Capitalize the first word, the last word, and the word right after a
    colon (or semicolon).

 2. Capitalize all intervening words except for articles (a, an, the)
    and short prepositions.

Those are the basic rules.  The main problem is that "short" isn't
well defined--and indeed, there are even places where "preposition" 
isn't well defined either.  
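
As a rough illustration only, the two rules above can be sketched as
follows.  The helper name and the word lists are hypothetical, and the
set of "short" prepositions is exactly the part that style guides
disagree on, so treat it as an assumption:

```python
# Hypothetical helper; not part of str or of any library.
ARTICLES = {"a", "an", "the"}
# Assumption: which prepositions count as "short" varies by style guide.
SHORT_PREPOSITIONS = {"at", "by", "for", "in", "of", "on", "to", "up"}


def headline_case(title):
    """Apply the two basic headline-casing rules to a whitespace-split title."""
    words = title.split()
    small_words = ARTICLES | SHORT_PREPOSITIONS
    out = []
    force_next = True  # rule 1: the first word is always capitalized
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if force_next or is_last or word.lower() not in small_words:
            # Only the first letter changes, so a hyphenated word
            # keeps its second element as written.
            out.append(word[:1].upper() + word[1:])
        else:
            out.append(word.lower())  # rule 2: articles, short prepositions
        # rule 1: the word right after a colon or semicolon is capitalized
        force_next = word.endswith((":", ";"))
    return " ".join(out)


print(headline_case("the lord of the rings"))
print(headline_case("python: a love story"))
```

Nothing here handles punctuation attached to a small word, longer
prepositions, or the exceptions a real style guide lists.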

English has sentence casing (only the first word) and headline casing (most of them).
It's problematic that computer people call capitalizing each word titlecasing,
since in English, this is never correct.

    http://www.chicagomanualofstyle.org/CMS_FAQ/CapitalizationTitles/CapitalizationTitles23.html

     Although Chicago style lowercases prepositions (but see CMOS 8.157
     for exceptions), some style guides uppercase them. Ask your editor
     for a style guide.

I myself usually fall back to the Chicago Manual of Style or the Oxford
Guide to Style.  I don't think I do anything that neither of them says to do.

But I completely agree that this should *not* be in the titlecase()
function.  I think the docs for the function might perhaps say something
about how it does not mean correct English headline case when it says
titlecase, but that's largely just nitpicking.

> I agree that str.title should do something sensible
> based on Unicode, with the improvements you mentioned.

One of the goals of Unicode is that casing not be language dependent.  And
they almost got there, too.  The Turkic I is the most notable exception.
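
A small sketch of why the Turkic I breaks language independence,
assuming Python 3 string methods, which apply the default
(language-neutral) mappings and know nothing about Turkish:

```python
# Assumption: Python 3 str methods, which use Unicode's default
# case mappings and take no locale or language argument.
dotless_i = "\N{LATIN SMALL LETTER DOTLESS I}"

upper = dotless_i.upper()    # maps to the plain capital I
round_trip = upper.lower()   # the default mapping gives dotted i back

print(upper == "I")
print(round_trip == dotless_i)
```

The round trip does not return the original character, because the
correct lowercase of I depends on whether the text is Turkic.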

Did you know there is a problem with all the case stuff in Python?  It
was clearly put in before they had realized that they needed to have
things other than Lu/Lt/Ll have casing properties.  That's why there is
a difference between GC=Ll and the Lowercase property.

    str.islower()

    Return true if all cased characters in the string are lowercase and
    there is at least one cased character, false otherwise. Cased
    characters are those with general category property being one of
    “Lu”, “Ll”, or “Lt” and lowercase characters are those with general
    category property “Ll”.

    http://docs.python.org/release/3.2/library/stdtypes.html

That really isn't right.  A cased character is one with the Unicode "Cased"
property, and a lowercase character is one with the Unicode "Lowercase"
property.  The General Category is actually immaterial here.
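
To make the gap concrete, here is a minimal sketch using only the
standard unicodedata module.  The function name is hypothetical; it
models the wording of the 3.2 documentation quoted above, not CPython's
actual source:

```python
import unicodedata


def islower_by_general_category(s):
    """Model the quoted 3.2 wording: cased = Lu/Ll/Lt, lowercase = Ll."""
    categories = [unicodedata.category(ch) for ch in s]
    cased = [c for c in categories if c in ("Lu", "Ll", "Lt")]
    return bool(cased) and all(c == "Ll" for c in cased)


# Characters taken from the test data in this message.
samples = [
    "\N{MODIFIER LETTER SMALL C}",
    "\N{CIRCLED LATIN SMALL LETTER K}",
    "\N{SMALL ROMAN NUMERAL EIGHT}",
    "\N{COMBINING GREEK YPOGEGRAMMENI}",
]

for ch in samples:
    print(unicodedata.name(ch),
          unicodedata.category(ch),
          islower_by_general_category(ch))
```

None of these characters has General Category Ll, so a definition built
on the General Category treats them as uncased, whatever their
Lowercase property says.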

I've spent all bloody day trying to model Python's islower, isupper, and istitle
functions, but I get all kinds of errors, both in the definitions and in the
models of the definitions.    Under both 2.7 and 3.2, I get all these bugs:

    ᶜ not islower() but has at least one cased character with all cased characters lowercase!
    ᴰ not islower() but has at least one cased character with all cased characters lowercase!
    ⓚ not islower() but has at least one cased character with all cased characters lowercase!
    ͅ not islower() but has at least one cased character with all cased characters lowercase!
    Ⅷ not isupper() but has at least one cased character with all cased characters uppercase!
    Ⅷ not istitle() but should be
    ⅷ not islower() but has at least one cased character with all cased characters lowercase!
    2ⁿᵈ not islower() but has at least one cased character with all cased characters lowercase!
    2ᴺᴰ not islower() but has at least one cased character with all cased characters lowercase!
    Ὰͅ isupper() but fails to have at least one cased character with all cased characters uppercase!
    ThisIsInTitleCaseYouKnow not istitle() but should be
    Mᶜ isupper() but fails to have at least one cased character with all cased characters uppercase!
    ᶜM isupper() but fails to have at least one cased character with all cased characters uppercase!
    ᶜM istitle() but should not be
    MᶜKINLEY isupper() but fails to have at least one cased character with all cased characters uppercase!

I really don't understand.  BTW, I feel that MᶜKinley is titlecase in that lowercase
always follows uppercase and uppercase never follows itself.  And Python agrees with me.
That same definition should also accept ThisIsInTitleCaseYouKnow, but Python disagrees.
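
One possible reading, offered as a sketch rather than as a description
of CPython's source: if the documented istitle() rule ("uppercase
characters may only follow uncased characters and lowercase characters
only cased ones") is applied with the General Category definitions, the
results above follow.  The function name is hypothetical:

```python
import unicodedata


def istitle_by_general_category(s):
    """Model the documented istitle() rule with GC-based casing (assumption)."""
    previous_cased = False
    seen_upper = False
    for ch in s:
        category = unicodedata.category(ch)
        if category in ("Lu", "Lt"):
            if previous_cased:
                return False  # uppercase may only follow an uncased character
            previous_cased = True
            seen_upper = True
        elif category == "Ll":
            if not previous_cased:
                return False  # lowercase may only follow a cased character
            previous_cased = True
        else:
            previous_cased = False  # anything else counts as uncased
    return seen_upper


for s in ["ThisIsInTitleCaseYouKnow",
          "M\N{MODIFIER LETTER SMALL C}Kinley",
          "\N{MODIFIER LETTER SMALL C}M"]:
    print(s, istitle_by_general_category(s))
```

Under this model the modifier letter is uncased, so it acts as a word
break: the K in MᶜKinley follows an uncased character and is accepted,
while the I in ThisIs follows the cased s and is rejected.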

I really don't understand any of these functions.  I'm very sad.  I think they are
wrong, but maybe I am.  It is extremely confusing.

Shall I file a separate bug report?

--tom

from __future__ import unicode_literals
from __future__ import print_function

import regex

VERBOSE = 0 

data = [

  # first test the problem cases just one at a time
    "\N{MODIFIER LETTER SMALL C}",
    "\N{SUPERSCRIPT LATIN SMALL LETTER N}",
    "\N{MODIFIER LETTER CAPITAL D}", 
    "\N{CIRCLED LATIN SMALL LETTER K}",
    "\N{COMBINING GREEK YPOGEGRAMMENI}",
    "\N{ROMAN NUMERAL EIGHT}",
    "\N{SMALL ROMAN NUMERAL EIGHT}",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}",
    "\N{LATIN LETTER SMALL CAPITAL R}",

  # test superscripts
    "2\N{SUPERSCRIPT LATIN SMALL LETTER N}\N{MODIFIER LETTER SMALL D}", 
    "2\N{MODIFIER LETTER CAPITAL N}\N{MODIFIER LETTER CAPITAL D}",
    "2\N{FEMININE ORDINAL INDICATOR}", # as in "segunda"

  # test romans
    "ROMAN NUMERAL EIGHT IS \N{ROMAN NUMERAL EIGHT}",
    "roman numeral eight is \N{SMALL ROMAN NUMERAL EIGHT}",

  # test small caps
    "\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL A}\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL E}",

  # test cased combining mark (this is in titlecase)
    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}",
    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}",

  # test cased symbols
    "circle  \N{CIRCLED LATIN SMALL LETTER K}",
    "CIRCLE  \N{CIRCLED LATIN CAPITAL LETTER K}",

  # test titlecased code point 3-way
    "\N{LATIN CAPITAL LETTER DZ}",
    "\N{LATIN CAPITAL LETTER DZ}UR",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}ur",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}",
    "\N{LATIN SMALL LETTER DZ}ur",
    "\N{LATIN SMALL LETTER DZ}",

  # test titlecase

    "FBI", "F B I", "F.B.I",
    "HP Company", "H.P. Company",
    "ThisIsInTitleCaseYouKnow",

    "M\N{MODIFIER LETTER SMALL C}",
    "\N{MODIFIER LETTER SMALL C}M",

    "M\N{MODIFIER LETTER SMALL C}Kinley",  # titlecase
    "M\N{MODIFIER LETTER SMALL C}KINLEY",  # uppercase
    "m\N{MODIFIER LETTER SMALL C}kinley",  # lowercase

    # Return true if the string is a titlecased string and there
    # is at least one character, for example uppercase characters may
    # only follow uncased characters and lowercase characters only
    # cased ones. Return false otherwise.

    # Return true if all cased characters in the string are lowercase and there is at least one cased character,
]

for s in data:

  # "Return true if all cased characters in the string are lowercase 
  #  and there is at least one cased character"

    if s.islower():
        if not (        regex.search(r'\p{cased}', s) 
                and not regex.search(r'(?=\p{cased})\P{LOWERCASE}', s)):
            print(s+" islower() but fails to have at least one cased character with all cased characters lowercase!")
    else:
        if (        regex.search(r'\p{cased}', s) 
            and not regex.search(r'(?=\p{cased})\P{LOWERCASE}', s)):
            print(s+" not islower() but has at least one cased character with all cased characters lowercase!")

  # "Return true if all cased characters in the string are uppercase 
  #  and there is at least one cased character"

    if s.isupper():
        if not (        regex.search(r'\p{cased}', s) 
                and not regex.search(r'(?=\p{cased})\P{UPPERCASE}', s)):
            print(s+" isupper() but fails to have at least one cased character with all cased characters uppercase!")
    else:
        if (        regex.search(r'\p{cased}', s) 
            and not regex.search(r'(?=\p{cased})\P{UPPERCASE}', s)):
            print(s+" not isupper() but has at least one cased character with all cased characters uppercase!")

  # "Return true if the string is a titlecased string and there is at
  # least one character, for example uppercase characters may only
  # follow uncased characters and lowercase characters only cased ones."

    has_it  = s.istitle()
    want_it1 = (  
          # at least one title/uppercase
                regex.search(r'[\p{Lt}\p{uppercase}]', s) 
                  and not 
          # plus no title/uppercase follows cased character
               regex.search(r'(?<=\p{cased})[\p{Lt}\p{uppercase}]', s)
                  and not 
          # plus no lowercase follows uncased character
               regex.search(r'(?<=\P{CASED})\p{lowercase}', s)
              )

    want_it  = regex.search(r'''(?x) 
        ^ 
            (?:
                \P{CASED} * 
                [\p{Lt}\p{uppercase}] 
                (?! [\p{Lt}\p{uppercase}] )
                    \p{lowercase} *
            ) +
            \P{CASED} * 
        $
    ''', s)

    if VERBOSE:
        if has_it and want_it:
            print( s + " istitle() and should be (OK)")
        if not has_it and not want_it:
            print( s + " not istitle() and should not be (OK)")

    if has_it and not want_it:
        print( s + " istitle() but should not be")

    if want_it and not has_it:
        print( s + " not istitle() but should be")

History
Date	User	Action	Args
2011-10-03 02:25:23	tchrist	set	recipients: + tchrist, lemburg, gvanrossum, loewis, terry.reedy, ezio.melotti, mrabarnett
2011-10-03 02:25:23	tchrist	link	issue12753 messages
2011-10-03 02:25:21	tchrist	create