Author tchrist
Recipients ezio.melotti, mrabarnett, tchrist, terry.reedy
Date 2011-08-19.23:26:17
Message-id <27236.1313796370@chthon>
In-reply-to <1313794258.42.0.875776753601.issue12753@psf.upfronthosting.co.za>
Content
"Terry J. Reedy" <report@bugs.python.org> wrote
   on Fri, 19 Aug 2011 22:50:58 -0000: 

> My current opinion is that adding the aliases might be done in current
> releases. It certainly would serve any user who does not know to
> misspell 'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars.

Yes, I think the 11 aliases pose no problem.  It's amazing the trouble
you get into from having a fat-fingered amanuensis typing your laws
into indelible stone tablets.

> Adding named sequences is definitely a feature request. The definition
> of .lookup(name) would be enlarged to "Look up character by name,
> alias, or named sequence" with reference to the specific files. The
> meaning of \N{} would also have to be enlarged.

But these do.  The problem is bracketed character classes.
Yes, if you got the named reference into the regex compiler as a raw
string, it could in theory rewrite

    [abc\N{seq}] 

as 

    (?:[abc]|\N{seq})

but that doesn't help if the sequence had already been expanded as a string
escape before the regex compiler ever saw it.  At which point you have
different behavior in the two lookalike cases.
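
A rough sketch of that rewrite in Python, for illustration only.  The
names rewrite_classes and NAMED_SEQUENCES are made up here, the table is
a one-entry stand-in for a real named-sequence list, and the class
parser is deliberately simplistic (no negated or nested classes):

```python
import re

# Hypothetical stand-in for a named-sequence table: each name maps to a
# string of more than one code point.
NAMED_SEQUENCES = {
    "LATIN SMALL LETTER R WITH TILDE": "r\u0303",
}

# A non-negated bracketed class with no nested brackets (simplistic).
CLASS_RE = re.compile(r"\[(?!\^)((?:\\N\{[^}]*\}|\\.|[^\]\\])+)\]")
NAMED_RE = re.compile(r"\\N\{([^}]*)\}")


def rewrite_classes(pattern, sequences=NAMED_SEQUENCES):
    """Turn [abc\\N{seq}] into (?:[abc]|seq) for multi-character names."""

    def fix_class(match):
        body = match.group(1)
        alternatives = []

        def pull_out(named):
            seq = sequences.get(named.group(1))
            if seq is None:
                # Not a known sequence: leave the \N{...} inside the class.
                return named.group(0)
            alternatives.append(re.escape(seq))
            return ""

        rest = NAMED_RE.sub(pull_out, body)
        if not alternatives:
            return match.group(0)
        parts = (["[" + rest + "]"] if rest else []) + alternatives
        return "(?:" + "|".join(parts) + ")"

    return CLASS_RE.sub(fix_class, pattern)


# Raw-string case: the rewrite sees the name and can match the sequence.
rewritten = rewrite_classes(r"[abc\N{LATIN SMALL LETTER R WITH TILDE}]")
matches_sequence = re.fullmatch(rewritten, "r\u0303") is not None

# String-escape case: the sequence was already spliced into the class,
# so the class matches "r" or the tilde singly, never the pair.
spliced = "[abc" + "r\u0303" + "]"
spliced_matches_sequence = re.fullmatch(spliced, "r\u0303") is not None
```

Here matches_sequence comes out true and spliced_matches_sequence false,
which is the lookalike divergence described above.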

If you ask how we do this in Perl, the answer is "poorly".  It really only
works well in strings, not charclasses, although there is a proposal to do
a rewrite during compilation like I've spelled out above.  Seems messy for
something that might(?) not get much use.  But it would be nice for \N{} to
work to access the whole namespace without prejudice.  I have a feeling
this may be a case of trying to keep one's cake and eat it too, as
the two goals seem to rule each other out.

>> "If you look at the ICU UCharacter class, you can see that they provide a more"

> More what ;-)

More expressive set of lookup functions where it is clear which thing
you are getting.  I believe the ICU regexes only support one-char returns
for \N{...}, not multis per the sequences.  But I may not be looking
at the right docs for ICU; not sure.
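
To make "clear which thing you are getting" concrete, here is a small
Python sketch.  It is not ICU's API; classify_lookup is a made-up name,
and it assumes a Python (3.3 or later) whose unicodedata.lookup already
accepts aliases and named sequences:

```python
import unicodedata


def classify_lookup(name):
    """Return (kind, text), where kind is 'name', 'alias' or 'sequence'.

    Raises KeyError, as unicodedata.lookup does, for unknown names.
    """
    text = unicodedata.lookup(name)
    if len(text) > 1:
        # More than one code point came back: a named sequence.
        return ("sequence", text)
    if unicodedata.name(text, "") != name.upper():
        # Resolved to a character whose official name differs: an alias.
        return ("alias", text)
    return ("name", text)


plain = classify_lookup("LATIN SMALL LETTER A")
corrected = classify_lookup(
    "BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS")
multi = classify_lookup("LATIN SMALL LETTER R WITH TILDE")
```

A caller that can only cope with a single character, such as a bracketed
character class, could then reject the "sequence" kind explicitly instead
of silently getting a longer string.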

> I presume ICU = International Components for Unicode, icu-project.org/
> "Offers a portable set of C/C++ and Java libraries for Unicode support,
> software internationalization (I18N) and globalization (G11N)." [appears
> to be free, open source, and possibly usable within Python]

Well, there are some Python bindings for ICU that I was eager to try out,
because I wanted to see whether I could get at full/real Unicode collation
that way, but I had trouble getting the Python bindings to compile.  Not
sure why.  The documentation for the Python bindings isn't very um wordy,
and it isn't clear how tightly integrated it all is: there's talk about C++
strings that kind of scares me. :)

Hm, and maybe they are only for Python 2 not Python 3, which I try to do
all my Python stuff in because it seems like it has a better Unicode model.

--tom
History
Date User Action Args
2011-08-19 23:26:18 tchrist set recipients: + tchrist, terry.reedy, ezio.melotti, mrabarnett
2011-08-19 23:26:18 tchrist link issue12753 messages
2011-08-19 23:26:17 tchrist create