Message 144758 - Python tracker


Author	ezio.melotti
Recipients	ezio.melotti, gvanrossum, lemburg, loewis, mrabarnett, tchrist, terry.reedy
Date	2011-10-02.06:46:25
Message-id	<1317537986.58.0.797304932484.issue12753@psf.upfronthosting.co.za>

Content

> The problem with official names is that they have things in them that 
> you are not expected in names.  Do you really and truly mean to tell 
> me you think it is somehow **good** that people are forced to write
>    \N{LINE FEED (LF)}
> Rather than the more obvious pair of 
>    \N{LINE FEED}
>    \N{LF}
> ??

Actually, Python doesn't seem to support \N{LINE FEED (LF)}, most likely because that's a Unicode 1 name; nowadays these code points are simply marked as '<control>'.
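A quick way to check this is the sketch below, which asks the unicodedata module what it knows about U+000A. The helper names describe() and resolves() are made up for illustration, and which candidate names resolve depends on the Python version and the Unicode database it ships with, so the loop prints the result rather than assuming it.

```python
import unicodedata


def describe(ch):
    """Return the official name of ch, or a placeholder if it has none."""
    return unicodedata.name(ch, '<no name>')


def resolves(name):
    """Report whether unicodedata.lookup() accepts the given name."""
    try:
        unicodedata.lookup(name)
    except KeyError:
        return False
    return True


# Control characters have no official name, so the placeholder is returned.
print(describe('\n'))
# An ordinary letter does have one.
print(describe('\xe9'))

# Which of these spellings are accepted varies between Python versions.
for candidate in ('LINE FEED (LF)', 'LINE FEED', 'LF'):
    print(candidate, '->', resolves(candidate))
```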

> If so, then I don't understand that.  Nobody in their right 
> mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

They probably don't, but they just write \n anyway.  I don't think we need to support any of these aliases, especially if they are not defined in the Unicode standard.

I'm also not sure humans use \N{...}: you don't want to write
  'R\N{LATIN SMALL LETTER E WITH ACUTE}sum\N{LATIN SMALL LETTER E WITH ACUTE}'
and you would need to look up the exact name somewhere before using it anyway (unless you know it by heart).
If 'R\xe9sum\xe9' or 'R\u00e9sum\u00e9' are too obscure and/or magic, you can always print() them and get 'Résumé' (or just write 'Résumé' directly in the source).
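To make the comparison concrete, here is a small sketch showing that the three escaped spellings produce the same string, and how the exact names can be looked up with unicodedata.name() instead of being memorized. The variable names are only illustrative.

```python
import unicodedata

named = 'R\N{LATIN SMALL LETTER E WITH ACUTE}sum\N{LATIN SMALL LETTER E WITH ACUTE}'
hex_escaped = 'R\xe9sum\xe9'
u_escaped = 'R\u00e9sum\u00e9'

# All three spellings compile to the same six-character string.
same = named == hex_escaped == u_escaped
print(same)

# Printing shows the readable form, whichever escape was used in the source.
print(named)

# Going the other way: find the exact name of each character.
names = [unicodedata.name(ch) for ch in named]
for ch, name in zip(named, names):
    print('U+%04X' % ord(ch), name)
```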

> All of the standards documents *talk* about things like LRO and ZWNJ.
> I guess the standards aren't "readable" then, right? :)

Right, I had to read down to the table with the meanings before figuring out what they were (and I have already forgotten them).

> The most persuasive use-case for user-defined names is for private-use
> area code points.  These will never have an official name.  But it is
> just fine to use them.  Don't they deserve a better name, one that 
> makes sense within your own program that uses them?  Of course they do.
>
> For example, Apple has a bunch of private-use glyphs they use all the time.
> In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate
> logo/glyph thingie of an apple with a bite taken out of it.  (Microsoft
> also has a bunch of these.)  If you upgrade MacRoman to Unicode, you will
> find that that 0xF0 maps to code point U+F8FF using the regular converter.
>
> Now what are you supposed to do in your program when you want a named character
> there?  You certainly do not want to make users put an opaque magic number
> as a Unicode escape.  That is always really lame, because the whole reason 
> we have \N{...} escapes is so we don't have to put mysterious unreadable magic
> numbers in our code!!
>
> So all you do is 
>    use charnames ":alias" => {
>        "APPLE LOGO" => 0xF8FF,
>    };
>
> and now you can use \N{APPLE LOGO} anywhere within that lexical scope.  The
> compiler will dutifully resolve it to U+F8FF, since all name lookups happen
> at compile-time.  And it cannot leak out of the scope.

This is actually a good use case for \N{...}.

One way to solve that problem is:
    apples = {
        'APPLE': '\uF8FF',
        'GREEN APPLE': '\U0001F34F',
        'RED APPLE': '\U0001F34E',
    }
and then:
    print('I like {GREEN APPLE} and {RED APPLE}, but not {APPLE}.'.format(**apples))

This requires a format call for each string and it's a workaround, but at least it is readable (I hope you don't have too many apples in your strings).
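If the per-string format call gets tedious, it can be hidden behind a small helper. This is only a sketch: APPLES and expand() are hypothetical names for this example, and str.format_map() is used so the dict does not have to be unpacked on every call.

```python
# Application-level table: these names belong to the program, not to Unicode.
APPLES = {
    'APPLE': '\uF8FF',            # private-use code point Apple uses for its logo
    'GREEN APPLE': '\U0001F34F',
    'RED APPLE': '\U0001F34E',
}


def expand(template, names=APPLES):
    """Substitute {NAME} placeholders in template using the given mapping."""
    return template.format_map(names)


text = expand('I like {GREEN APPLE} and {RED APPLE}, but not {APPLE}.')
```

An unknown placeholder raises KeyError, so a typo in a name fails loudly instead of passing through silently.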

I guess we could add some way to define a global list of names, and that would probably be enough for most applications.  Making it per-module would be more complicated and maybe not too elegant.
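Just to illustrate what a global list could look like from user code, here is a rough sketch. Nothing in it is an existing or proposed Python API: register_name() and expand_names() are invented for the example, the expansion happens at run time on raw strings rather than at compile time, and letting custom names win over official ones is an arbitrary choice.

```python
import re
import unicodedata

# Hypothetical process-wide registry of user-defined character names.
_CUSTOM_NAMES = {}

# Matches a literal backslash-N-{NAME} sequence in already-parsed text.
_NAME_ESCAPE = re.compile(r'\\N\{([^}]+)\}')


def register_name(name, char):
    """Add a user-defined name to the global registry."""
    _CUSTOM_NAMES[name.upper()] = char


def expand_names(text):
    """Expand \\N{...} sequences, trying custom names before official ones."""
    def replace(match):
        name = match.group(1).upper()
        if name in _CUSTOM_NAMES:
            return _CUSTOM_NAMES[name]
        # Falls back to the Unicode database; raises KeyError if unknown.
        return unicodedata.lookup(name)
    return _NAME_ESCAPE.sub(replace, text)


register_name('APPLE LOGO', '\uF8FF')

# A raw string keeps the \N{...} sequences intact for expand_names() to see.
result = expand_names(
    r'\N{APPLE LOGO} r\N{LATIN SMALL LETTER E WITH ACUTE}sum'
    r'\N{LATIN SMALL LETTER E WITH ACUTE}'
)
```

Because the registry is a module-level dict, every caller in the process sees the same names, which is exactly the "global, not per-module" trade-off.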

> People who write patterns without whitespace for cognitive chunking (plus
> comments for explanation) are wicked wicked wicked.  Frankly I'm surprised 
> Python doesn't require it. :)/2

I actually find those *less* readable.  If there's something fancy in the regex, a comment *before* it is welcome, but having to read a regex split across several lines, mentally stripping meaningless whitespace and redundant comments, just makes the parsing more difficult for me.
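For reference, these are the two styles being compared, sketched with a made-up date pattern. Both compile to patterns that match the same strings; the difference is only in how they read in the source.

```python
import re

# Compact form, with one comment before it: an ISO date, YYYY-MM-DD.
compact = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# Verbose form: whitespace and inline comments are ignored under re.VERBOSE.
verbose = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
''', re.VERBOSE)

sample = '2011-10-02'
compact_groups = compact.fullmatch(sample).groups()
verbose_groups = verbose.fullmatch(sample).groups()
print(compact_groups == verbose_groups)
```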

History
Date	User	Action	Args
2011-10-02 06:46:26	ezio.melotti	set	recipients: + ezio.melotti, lemburg, gvanrossum, loewis, terry.reedy, mrabarnett, tchrist
2011-10-02 06:46:26	ezio.melotti	set	messageid: <1317537986.58.0.797304932484.issue12753@psf.upfronthosting.co.za>
2011-10-02 06:46:26	ezio.melotti	link	issue12753 messages
2011-10-02 06:46:25	ezio.melotti	create