Message 73711 - Python tracker


Author	timehorse
Recipients	mrabarnett, pitrou, timehorse
Date	2008-09-24.13:15:02
Message-id	<1222262104.35.0.651075836206.issue3511@psf.upfronthosting.co.za>
In-reply-to

Content

I think this is even more complicated when you consider that
localization may be an issue.  Consider "Á": is this grammatically before
 "A" or after "a"?  From a character set point of view, it is typically
after "a" but when Locale is taken into account, all that is done is
there is a change to relative ordering, so Á appears somewhere between A
and B.  But when this is done, does that mean that [9-Á] is going to
cover ALL uppercase and ALL lowercase and ALL characters with ord from
91 to 96 and 123 to 127 and all kinds of other UNICODE symbols?  And how
will this affect case-insensitivity?
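
To make the question concrete, here is a minimal sketch of what [9-Á]
covers when the range is compared purely by code point, which is how
Python's re module treats str patterns without any flags.  The variable
names are only illustrative.

```python
import re
import string

# A range that starts at a digit and ends at an accented letter.
pattern = re.compile("[9-Á]")

# The endpoints as code points: '9' is 57 and 'Á' is 193.
low, high = ord("9"), ord("Á")

# Every character in the first 256 code points that the class accepts.
covered = [chr(cp) for cp in range(256) if pattern.fullmatch(chr(cp))]

# All ASCII letters fall inside the range, upper and lower case alike.
all_letters_covered = all(ch in covered for ch in string.ascii_letters)

# So do the punctuation blocks 58-64, 91-96 and 123-127.
punctuation_covered = all(ch in covered for ch in "@[`{~")

print(low, high, len(covered))   # 57 193 137
print(all_letters_covered, punctuation_covered)
```

Under a locale-aware ordering the same class could cover a very
different set, which is the ambiguity raised above.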

In a sense, I think it may only be safe to say that character class
ranges are ONLY appropriate over Alphabetic character ranges or numeric
character ranges, since the order of the ASCII symbols between 0 and 47,
58 and 64, 91 and 96, and 123 and 127, though well-defined, is
nonetheless implementation dependent.  When we bring UNICODE into this, things
get even more befuddled with some Latin characters in Latin-1, some in
Latin-2, Cyrillic, Hebrew, Arabic, Chinese, Japanese and Korean
character sets just to name a few of the most common!  And how does a
total ordering of characters apply to them?
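
As a small illustration of the ordering problem, the sketch below sorts
a handful of characters by code point only.  The sample characters are
my own choice; a locale-aware collation would generally place Á next to
A rather than after the whole ASCII alphabet.

```python
# Characters from several scripts: Latin, Latin-1 supplement,
# Cyrillic and Hebrew.
samples = ["a", "A", "Á", "B", "Я", "א"]

# Python compares str values by code point, so this is a total order,
# but not an alphabetic one in any particular language.
by_code_point = sorted(samples)

# Show each character with its code point in hexadecimal.
for ch in by_code_point:
    print(ch, hex(ord(ch)))

print(by_code_point)   # ['A', 'B', 'a', 'Á', 'Я', 'א']
```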

In the end, I think it's just dangerous to define character group ranges
that span the gap BETWEEN numbers and alphabetics.  Instead, I think a
better solution is simply to implement Emacs / Perl style named
character classes as in issue 2636 sub-item 8.
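
For illustration only, named classes can be roughly emulated today by
rewriting the pattern before it is compiled.  The helper
expand_named_classes and the POSIX_CLASSES table below are hypothetical,
are not part of the re module, and cover ASCII only; this is a sketch of
the idea, not the proposal in issue 2636.

```python
import re

# Hypothetical, ASCII-only table of named classes.
POSIX_CLASSES = {
    "alpha": "A-Za-z",
    "digit": "0-9",
    "alnum": "A-Za-z0-9",
    "upper": "A-Z",
    "lower": "a-z",
}

def expand_named_classes(pattern):
    """Replace [:name:] with explicit ranges (hypothetical helper).

    Unknown names raise KeyError rather than being silently ignored.
    """
    def substitute(match):
        return POSIX_CLASSES[match.group(1)]
    return re.sub(r"\[:([a-z]+):\]", substitute, pattern)

# "[[:alnum:]]+" becomes "[A-Za-z0-9]+", with no range spanning the
# gap between digits and letters.
expanded = expand_named_classes("[[:alnum:]]+")
print(expanded)
print(bool(re.fullmatch(expanded, "abc123")))
print(bool(re.fullmatch(expanded, "abc_123")))
```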

I do agree this is a problem, but as I see it, the solution may not be
that simple, especially in a UNICODE world.

History
Date	User	Action	Args
2008-09-24 13:15:04	timehorse	set	recipients: + timehorse, pitrou, mrabarnett
2008-09-24 13:15:04	timehorse	set	messageid: <1222262104.35.0.651075836206.issue3511@psf.upfronthosting.co.za>
2008-09-24 13:15:03	timehorse	link	issue3511 messages
2008-09-24 13:15:02	timehorse	create