Author tchrist
Recipients ezio.melotti, gvanrossum, lemburg, loewis, mrabarnett, tchrist, terry.reedy
Date 2011-10-02.05:33:36
Message-id <6829.1317533598@chthon>
In-reply-to <4E872F1E.6050604@v.loewis.de>
Content
>> Perl does not provide the old 1.0 names at all.  We don't have a Unicode
>> 1.0 legacy to support, which makes this cleaner.  However, we do provide
>> for the names of the C0 and C1 Control Codes, because apart from Unicode
>> 1.0, they don't condescend to name the ASCII or Latin1 control codes.

> If there would be a reasonably official source for these names, and one
> that guarantees that there is no collision with UCD names, I could
> accept doing so for Python as well.

The C0 and C1 control code names don't change.  There is/was one stability
issue where they screwed up, because they ended up having a UAX (required)
and a UTS (not required) fighting because of the dumb stuff they did with
the Emoji names. They neglected to prefix them with "Emoji ..." or some
such, the way things like "GREEK ... LETTER ..." or "MATHEMATICAL ..." or
"MUSICAL ..." did.  The problem is they stole BELL without calling it EMOJI
BELL.  This is the C0 name for Control-G.  Dimwits.

The problem with official names is that they have things in them that you
would not expect in names.  Do you really and truly mean to tell me you
think it is somehow **good** that people are forced to write

    \N{LINE FEED (LF)}

rather than the more obvious pair of

    \N{LINE FEED}
    \N{LF}

??

If so, then I don't understand that.  Nobody in their right 
mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED}"'
    U+000A
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LF}"'
    U+000A
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED (LF)}"'
    U+000A

    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE}"'
    U+0085
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEL}"'
    U+0085
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE (NEL)}"'
    U+0085

>> We also provide for certain well known aliases from the Names file:
>> anything that says "* commonly abbreviated as ...", so things like LRO
>> and ZWJ and such.

> -1. Readability counts, writability not so much (I know this is
> different for Perl :-). 

I actually very strongly resent and rebuff that entire mindset in the most
extreme way possible.  Well-written Perl code is perfectly readable by
people who speak that language.  If you find Perl code that isn't readable,
it is by definition not well-written.

*PLEASE* don't start.  

Yes, I just got done driving 16 hours and am overtired, but it's 
something I've been fighting against all of my professional career.
It's a "leyenda negra".

> If there is too much aliasing, people will
> wonder what these codes actually mean.

There are 15 "commonly abbreviated as" aliases in the NamesList.txt file.

    * commonly abbreviated as NBSP
    * commonly abbreviated as SHY
    * commonly abbreviated as CGJ
    * commonly abbreviated ZWSP
    * commonly abbreviated ZWNJ
    * commonly abbreviated ZWJ
    * commonly abbreviated LRM
    * commonly abbreviated RLM
    * commonly abbreviated LRE
    * commonly abbreviated RLE
    * commonly abbreviated PDF
    * commonly abbreviated LRO
    * commonly abbreviated RLO
    * commonly abbreviated NNBSP
    * commonly abbreviated WJ

All of the standards documents *talk* about things like LRO and ZWNJ.
I guess the standards aren't "readable" then, right? :)
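
For Python readers, here is a minimal sketch of how such abbreviations
could be layered on top of the official names.  The ABBREVIATIONS table and
the lookup_with_abbrev helper are hypothetical names invented for this
illustration, and only a handful of the 15 entries are shown.  The one real
library call is unicodedata.lookup.

```python
import unicodedata

# Hypothetical table: a few of the "commonly abbreviated as" entries,
# mapped to their official Unicode names.
ABBREVIATIONS = {
    "NBSP": "NO-BREAK SPACE",
    "SHY": "SOFT HYPHEN",
    "ZWSP": "ZERO WIDTH SPACE",
    "ZWNJ": "ZERO WIDTH NON-JOINER",
    "ZWJ": "ZERO WIDTH JOINER",
    "LRO": "LEFT-TO-RIGHT OVERRIDE",
    "WJ": "WORD JOINER",
}

def lookup_with_abbrev(name):
    """Translate an abbreviation to its official name, then look that up."""
    return unicodedata.lookup(ABBREVIATIONS.get(name, name))

print(f"U+{ord(lookup_with_abbrev('ZWNJ')):04X}")  # U+200C
print(f"U+{ord(lookup_with_abbrev('NBSP')):04X}")  # U+00A0
```

Names that are not in the table fall through to the official lookup
unchanged, so the abbreviations add to the official names and never replace
them.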

Here is an excerpt from the charnames manpage, which shows that we really
don't just make these up as we feel like (although we could; see below).
They're all from this or that standard:

    ALIASES
       A few aliases have been defined for convenience: instead
       of having to use the official names

           LINE FEED (LF)
           FORM FEED (FF)
           CARRIAGE RETURN (CR)
           NEXT LINE (NEL)

       (yes, with parentheses), one can use

           LINE FEED
           FORM FEED
           CARRIAGE RETURN
           NEXT LINE
           LF
           FF
           CR
           NEL

       All the other standard abbreviations for the controls,
       such as "ACK" for "ACKNOWLEDGE" also can be used.

       One can also use

           BYTE ORDER MARK
           BOM

       and these abbreviations

           Abbreviation        Full Name

           CGJ                 COMBINING GRAPHEME JOINER
           FVS1                MONGOLIAN FREE VARIATION SELECTOR ONE
           FVS2                MONGOLIAN FREE VARIATION SELECTOR TWO
           FVS3                MONGOLIAN FREE VARIATION SELECTOR THREE
           LRE                 LEFT-TO-RIGHT EMBEDDING
           LRM                 LEFT-TO-RIGHT MARK
           LRO                 LEFT-TO-RIGHT OVERRIDE
           MMSP                MEDIUM MATHEMATICAL SPACE
           MVS                 MONGOLIAN VOWEL SEPARATOR
           NBSP                NO-BREAK SPACE
           NNBSP               NARROW NO-BREAK SPACE
           PDF                 POP DIRECTIONAL FORMATTING
           RLE                 RIGHT-TO-LEFT EMBEDDING
           RLM                 RIGHT-TO-LEFT MARK
           RLO                 RIGHT-TO-LEFT OVERRIDE
           SHY                 SOFT HYPHEN
           VS1                 VARIATION SELECTOR-1
           .
           .
           .
           VS256               VARIATION SELECTOR-256
           WJ                  WORD JOINER
           ZWJ                 ZERO WIDTH JOINER
           ZWNJ                ZERO WIDTH NON-JOINER
           ZWSP                ZERO WIDTH SPACE

       For backward compatibility one can use the old names for
       certain C0 and C1 controls

           old                         new

           FILE SEPARATOR              INFORMATION SEPARATOR FOUR
           GROUP SEPARATOR             INFORMATION SEPARATOR THREE
           HORIZONTAL TABULATION       CHARACTER TABULATION
           HORIZONTAL TABULATION SET   CHARACTER TABULATION SET
           HORIZONTAL TABULATION WITH JUSTIFICATION    CHARACTER TABULATION
                                                       WITH JUSTIFICATION
           PARTIAL LINE DOWN           PARTIAL LINE FORWARD
           PARTIAL LINE UP             PARTIAL LINE BACKWARD
           RECORD SEPARATOR            INFORMATION SEPARATOR TWO
           REVERSE INDEX               REVERSE LINE FEED
           UNIT SEPARATOR              INFORMATION SEPARATOR ONE
           VERTICAL TABULATION         LINE TABULATION
           VERTICAL TABULATION SET     LINE TABULATION SET

       but the old names in addition to giving the character will
       also give a warning about being deprecated.

       And finally, certain published variants are usable,
       including some for controls that have no Unicode names:

           name                                   character

           END OF PROTECTED AREA                  END OF GUARDED AREA, U+0097
           HIGH OCTET PRESET                      U+0081
           HOP                                    U+0081
           IND                                    U+0084
           INDEX                                  U+0084
           PAD                                    U+0080
           PADDING CHARACTER                      U+0080
           PRIVATE USE 1                          PRIVATE USE ONE, U+0091
           PRIVATE USE 2                          PRIVATE USE TWO, U+0092
           SGC                                    U+0099
           SINGLE GRAPHIC CHARACTER INTRODUCER    U+0099
           SINGLE-SHIFT 2                         SINGLE SHIFT TWO, U+008E
           SINGLE-SHIFT 3                         SINGLE SHIFT THREE, U+008F
           START OF PROTECTED AREA                START OF GUARDED AREA, U+0096

    perl v5.14.0                2011-05-07

Those are the defaults.  They are overridable.  That's because we feel that
people should be able to name their character constants however they feel
makes sense for them.  If they get tired of typing 

    \N{LATIN SMALL LETTER U WITH DIAERESIS}

let alone

    \N{LATIN CAPITAL LETTER THORN WITH STROKE THROUGH DESCENDER}

then they can stop, because there is a mechanism for making aliases:

    use charnames ":full", ":alias" => {
	U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
	u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
    };

That way you can do 

    s/\N{U_uml}/UE/;
    s/\N{u_uml}/ue/;

This is probably not as persuasive as the private-use case described below.
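
As a rough point of comparison, the same idea can be approximated in Python
with an ordinary dictionary.  This is only a sketch: ALIASES, named, and
transliterate are hypothetical names made up for this example, and the
lookup happens at run time at module level, not at compile time inside a
lexical scope as it does with charnames.

```python
import re
import unicodedata

# Hypothetical user-defined aliases, mirroring the charnames example above.
ALIASES = {
    "U_uml": "LATIN CAPITAL LETTER U WITH DIAERESIS",
    "u_uml": "LATIN SMALL LETTER U WITH DIAERESIS",
}

def named(alias):
    """Return the character for a user alias, or for an official name."""
    return unicodedata.lookup(ALIASES.get(alias, alias))

def transliterate(text):
    """Replace the first U/u with diaeresis by UE/ue.

    count=1 mirrors s/// without the /g flag.
    """
    text = re.sub(re.escape(named("U_uml")), "UE", text, count=1)
    return re.sub(re.escape(named("u_uml")), "ue", text, count=1)

print(transliterate("\u00dcber M\u00fcller"))  # UEber Mueller
```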

It is important to remember that all charname bindings in Perl are attached
to a *lexically-scoped* declaration.  A binding is completely constrained to
operate only within that lexical scope, and the compiler resolves each name
to its code point at compile time.  Take this:

    use charnames ":full", ":alias" => {
	U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
	u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
    };

    my $find_u_uml = qr/\N{u_uml}/i;

    print "Search pattern is: $find_u_uml\n";

Which dutifully prints out:

    Search pattern is: (?^ui:\N{U+FC})

So charname bindings are never "hard to read" because the effect is
completely lexically constrained, and can never leak outside of the scope.

I realize (or at least, believe) that Python has no notion of nested
lexical scopes, and like many things, this sort of thing can therefore
never work there because of that.

The most persuasive use-case for user-defined names is for private-use
area code points.  These will never have an official name.  But it is 
just fine to use them.  Don't they deserve a better name, one that makes
sense within your own program that uses them?  Of course they do.

For example, Apple has a bunch of private-use glyphs they use all the time.
In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate
logo/glyph thingie of an apple with a bite taken out of it.  (Microsoft
also has a bunch of these.)  If you upgrade MacRoman to Unicode, you will
find that that 0xF0 maps to code point U+F8FF using the regular converter.
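
That mapping is easy to check from Python, whose standard library ships a
"mac_roman" codec.  A minimal sketch (the variable name apple_logo is just
for illustration):

```python
import unicodedata

# Decode the single MacRoman byte 0xF0 with the standard "mac_roman" codec.
apple_logo = bytes([0xF0]).decode("mac_roman")

print(f"U+{ord(apple_logo):04X}")        # U+F8FF
print(unicodedata.category(apple_logo))  # Co, i.e. private use
```

The general category Co confirms that the code point is in the private-use
area, so it has no official character name to look up.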

Now what are you supposed to do in your program when you want a named character
there?  You certainly do not want to make users put an opaque magic number
as a Unicode escape.  That is always really lame, because the whole reason 
we have \N{...} escapes is so we don't have to put mysterious unreadable magic
numbers in our code!!

So all you do is 

    use charnames ":alias" => {
        "APPLE LOGO" => 0xF8FF,
    };

and now you can use \N{APPLE LOGO} anywhere within that lexical scope.  The
compiler will dutifully resolve it to U+F8FF, since all name lookups happen
at compile-time.  And it cannot leak out of the scope.

I assert that this facility makes your program more readable, and its
absence  makes your program less readable.

Private use characters are important in Asian texts, but they are also
important for other things.  For example, Unicode intends to get around
to allocating Tengwar up in the SMP.  However, lots of stupid old code
can't use full Unicode, being constrained to UCS-2 only.  So many Tengwar
fonts start at a different base, and put it in the private use area instead
of the SMP.  Here are two constants:

    use constant {
        TB_CONSCRIPT_UNICODE_REGISTRY    => 0x00_E000,  # private use
        TB_UNICODE_CONSORTIUM            => 0x01_6080,  # where it will really go
    };

I have an entire Tengwar module that makes heavy use of named 
private-use characters.  All I do is this:

    use constant TENGWAR_BASE => TB_CONSCRIPT_UNICODE_REGISTRY;

    use charnames ":alias" => { 
      reverse (
        (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
        (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
        (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
        (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
        (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
        ....
      )
    };

Now you can write \N{TENGWAR LETTER TINCO} etc.  See how slick that is?
Consider the alternative.  Magic numbers.  Worse, magic numbers with funny
calculations in them.  That is just so wrong that it completely justifies
letting people name things how they want to, so long as they don't make
other people do the same.  What people do in the privacy of their own
lexical scope is their own business.
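
For readers who think in Python, here is a rough sketch of the same
base-plus-offset table.  It assumes only the five letters listed above,
since the rest of the table is elided there.  TENGWAR_NAMES and tengwar are
hypothetical names for this illustration, and the lookup is a plain
run-time dictionary access, not a compile-time \N{...} escape.

```python
TB_CONSCRIPT_UNICODE_REGISTRY = 0xE000  # private use, as in the constant above
TENGWAR_BASE = TB_CONSCRIPT_UNICODE_REGISTRY

# Only the five letters shown above, in offset order.
_LETTERS = ["TINCO", "PARMA", "CALMA", "QUESSE", "ANDO"]

# Map each name to its character: base code point plus position in the list.
TENGWAR_NAMES = {
    f"TENGWAR LETTER {letter}": chr(TENGWAR_BASE + offset)
    for offset, letter in enumerate(_LETTERS)
}

def tengwar(name):
    """Return the private-use character registered under a Tengwar name."""
    return TENGWAR_NAMES[name]

print(f"U+{ord(tengwar('TENGWAR LETTER QUESSE')):04X}")  # U+E003
```

Changing TENGWAR_BASE moves the whole table at once, which is the point of
keeping the offsets relative to a single constant.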

It gets better.  Perl lets you define your own character properties, too.
Therefore I can write things like \p{Is_Tengwar_Decimal} and such.
Right now I have these properties:

    In_Tengwar, Is_Tengwar
    In_Tengwar_Alphanumerics
    In_Tengwar_Consonants, In_Tengwar_Vowels, In_Tengwar_Alphabetics
    In_Tengwar_Numerals, Is_Tengwar_Decimal, Is_Tengwar_Duodecimal
    In_Tengwar_Punctuation
    In_Tengwar_Marks 

So I have code in my Tengwar module that does stuff like this, using
my own named characters (which again, are compile-time resolved and 
work only within this lexical scope):

     chr( $1 + ord("\N{TENGWAR DIGIT ZERO}") )

Not to mention this using my own properties:

    $TENGWAR_GRAPHEME_RX = qr/(?:(?=\p{In_Tengwar})\P{In_Tengwar_Marks}\p{In_Tengwar_Marks}*)|\p{In_Tengwar_Marks}/x;

Actually, I'm fibbing.  I *never* write regexes all on one line like
that: they are abhorrent to me.  The pattern really looks like this in
the code:

    $TENGWAR_GRAPHEME_RX = qr{
        (?:
            (?= \p{In_Tengwar} ) \P{In_Tengwar_Marks}   # Either one basechar...
            \p{In_Tengwar_Marks} *                      # ... plus 0 or more marks
        ) | 
            \p{In_Tengwar_Marks}                        # or else a naked unpaired mark.
    }x;

People who write patterns without whitespace for cognitive chunking (plus
comments for explanation) are wicked wicked wicked.  Frankly I'm surprised 
Python doesn't require it. :)/2
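
For what it is worth, Python's re module has no user-defined \p{...}
properties, but the same pattern can be sketched with re.VERBOSE and
explicit character classes.  Both ranges below are assumptions made for
this illustration: IN_TENGWAR takes the block to be U+E000 through U+E07F,
and IN_TENGWAR_MARKS is a purely hypothetical range for the combining
marks.  Neither comes from the module described here.

```python
import re

IN_TENGWAR = "\uE000-\uE07F"        # assumed extent of the private-use block
IN_TENGWAR_MARKS = "\uE040-\uE05F"  # hypothetical range for the marks

TENGWAR_GRAPHEME_RX = re.compile(rf"""
    (?:
        (?= [{IN_TENGWAR}] ) [^{IN_TENGWAR_MARKS}]   # Either one basechar...
        [{IN_TENGWAR_MARKS}]*                        # ... plus 0 or more marks
    ) |
        [{IN_TENGWAR_MARKS}]                         # or else a naked unpaired mark.
""", re.VERBOSE)

# A naked mark, then a base with one mark, then a bare base.
sample = "\uE042\uE000\uE040\uE001"
graphemes = TENGWAR_GRAPHEME_RX.findall(sample)
print([[f"U+{ord(ch):04X}" for ch in g] for g in graphemes])
```

The layout follows the Perl original, with one alternative per visual
block and a comment on each line.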

Anyway, do you see how much better that is than opaque unreadable magic
numbers?  Can you just imagine the sheer horror of writing that sort of
code without the ability to define your own named characters *and* your 
own character properties?  It's beautiful, simple, clean, and readable.
I'll even go so far as to call it intuitive.

No, I don't expect Python to do this sort of thing.  You don't have proper
scoping, so you can't ever do it cleanly the way Perl can.

I just wanted to give a concrete example where flexibility leads to a 
much more readable program than inflexibility ever can.  

--tom

    "We hates magic numberses.  We hates them forevers!"
        --Sméagol the Hacker