Issue1528154
Created on 2006-07-25 04:44 by gmarketer, last changed 2011-03-09 03:36 by terry.reedy.
| Messages (11) | |||
|---|---|---|---|
| msg54861 - (view) | Author: gmarketer (gmarketer) | Date: 2006-07-25 04:44 | |
Special sequences consisting of "\" and another character need to be added to the RE syntax to simplify matching several Unicode classes, such as: * All uppercase letters * All lowercase letters |
|||
| msg54862 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2006-07-25 07:45 | |
Logged In: YES user_id=38388 Could you make your request a little more specific? We already have categories in the re module, so adding a few more would be possible (patches are welcome!). However, we do need to know why you need them and whether there are other RE implementations that already have such special matching characters, e.g. the Perl RE implementation. |
|||
| msg54863 - (view) | Author: gmarketer (gmarketer) | Date: 2006-07-26 02:06 | |
Logged In: YES
user_id=1334865
We need to process several UTF-8 strings and use regular
expressions to match patterns, for example:
r"[ANY_LANGUAGE_UPPERCASE_LETTER,0-9ANY_LANGUAGE_LOWERCASE_LETTER]+|NOT_ANY_LANGUAGE_CURRENCY"
We don't know how to implement this logic by hand.
Also, I found this logic implemented in Microsoft .NET
regular expressions:
\p{name} Matches any character in the named character class 'name'. Supported names are Unicode groups and block ranges. For example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc (currency).
\P{name} Matches text not included in the named character class 'name'.
We need the same logic in regular expressions.
|
|||
| msg54864 - (view) | Author: Martin v. Löwis (loewis) | Date: 2006-09-10 10:36 | |
Logged In: YES user_id=21627 If anything, I think Python should implement Unicode TR#18: http://www.unicode.org/unicode/reports/tr18/ This does include the \p notation for property expressions, e.g. \p{Ll} or \p{East Asian Width:Narrow}. We currently don't include the Script property, so \p{Greek} could not be implemented (we can, of course, add support for the script property). I can't find anything in the report that makes \p{IsGreek} valid, so we shouldn't support it. |
|||
| msg54865 - (view) | Author: Fredrik Lundh (effbot) | Date: 2006-12-04 09:27 | |
note that posix uses a special set syntax, [:name:], for this purpose: [:alnum:] [:cntrl:] [:lower:] [:space:] [:alpha:] [:digit:] [:print:] [:upper:] [:blank:] [:graph:] [:punct:] [:xdigit:] adding a new character escape will probably break more existing expressions, but no matter what syntax we choose, this is (micro-)PEP territory. |
|||
| msg54866 - (view) | Author: Fredrik Lundh (effbot) | Date: 2006-12-04 09:28 | |
note that posix uses a special set syntax, [:name:], for this purpose: [:alnum:] [:cntrl:] [:lower:] [:space:] [:alpha:] [:digit:] [:print:] [:upper:] [:blank:] [:graph:] [:punct:] [:xdigit:] adding a new character escape will probably break more existing expressions, but no matter what syntax we choose, this is (micro-)PEP territory. |
|||
| msg84503 - (view) | Author: Daniel Diniz (ajaksu2) | Date: 2009-03-30 04:49 | |
Has this been addressed for 2.6/3.0? Do the LOCALE and UNICODE constants cover this? |
|||
| msg84532 - (view) | Author: Martin v. Löwis (loewis) | Date: 2009-03-30 09:48 | |
No progress has been made. I still maintain that TR18 should be implemented. I'm not so sure whether the POSIX special groups should be provided. My understanding is that they originally were meant to integrate with the locale support, and change with locale. For Unicode, Annex C of TR18 makes a recommendation on how to provide the POSIX properties, and offers two alternative definitions: Standard Recommendation and POSIX Compatible. That alone tells me that it is best not to provide support for them: refuse the temptation to guess. |
|||
| msg84544 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-03-30 13:32 | |
I implemented \p, \P and [:...:] for the simple categories (e.g. "Lu" and "upper", but not "IsGreek") in the work I did for issue #2636. |
|||
| msg100129 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-25 23:49 | |
\p{name} is supported for Unicode properties, scripts and blocks in my regex module (see issue #2636).
It also supports the POSIX set syntax, although I'm not sure that we really need to have two ways of doing it, e.g. \p{Alpha} and [[:Alpha:]].
|
|||
| msg130412 - (view) | Author: Terry J. Reedy (terry.reedy) | Date: 2011-03-09 03:36 | |
Is there a practical issue left here? Matthew says his regex module does as requested, but adding that to the stdlib is a separate issue. Martin would like an implementation of Unicode TR18, but that is also another issue. |
|||
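For readers who need the behaviour requested in msg54863 with only the standard library, the following is a minimal sketch, not the approach taken in this issue or in the regex module. It assumes Python 3.3 or later (where `re` accepts `\UXXXXXXXX` escapes in str patterns) and builds a character-class body from `unicodedata.category`. The helper names `_ranges_by_category` and `class_body` are hypothetical, and the result depends on the Unicode database bundled with the interpreter. It covers general categories only (such as Lu, Ll, Sc), not scripts or blocks, and does not implement TR18.

```python
import re
import sys
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def _ranges_by_category():
    """Scan every code point once; group contiguous runs by general category.

    Hypothetical helper: returns {"Lu": [[lo, hi], ...], "Ll": [...], ...}.
    """
    runs = {}
    for cp in range(sys.maxunicode + 1):
        cat_runs = runs.setdefault(unicodedata.category(chr(cp)), [])
        if cat_runs and cat_runs[-1][1] == cp - 1:
            cat_runs[-1][1] = cp          # extend the current run
        else:
            cat_runs.append([cp, cp])     # start a new run
    return runs


def class_body(*categories):
    """Return the inside of a regex character class (no brackets) matching
    every code point whose general category starts with one of `categories`,
    e.g. class_body("Lu") or class_body("L") for all letters.
    """
    runs = sorted(
        run
        for cat, cat_runs in _ranges_by_category().items()
        if cat.startswith(categories)
        for run in cat_runs
    )
    if not runs:
        raise ValueError("no code points in categories %r" % (categories,))
    return "".join(
        "\\U%08x" % lo if lo == hi else "\\U%08x-\\U%08x" % (lo, hi)
        for lo, hi in runs
    )


# Rough equivalent of the pattern sketched in msg54863:
# upper- or lowercase letters of any language plus 0-9, or a non-currency character.
LETTERS_AND_DIGITS = "[" + class_body("Lu", "Ll") + "0-9]+"
NOT_CURRENCY = "[^" + class_body("Sc") + "]"
pattern = re.compile(LETTERS_AND_DIGITS + "|" + NOT_CURRENCY)
```

The first call scans the whole code point range, so it is noticeably slow; later calls reuse the cached table. Where a third-party dependency is acceptable, msg100129 states that the regex module supports `\p{name}` directly, which avoids building these classes by hand.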
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2011-03-09 03:36:39 | terry.reedy | set | nosy: + terry.reedy messages: + msg130412 versions: + Python 3.3, - Python 3.1, Python 2.7 |
| 2010-12-07 20:13:29 | admin | set | messages: + msg54862 |
| 2010-12-07 20:12:56 | admin | set | messages: + msg54865, - msg54862 |
| 2010-12-07 19:25:16 | loewis | set | messages: - msg54865 |
| 2010-02-25 23:49:47 | mrabarnett | set | messages: + msg100129 |
| 2009-03-30 13:32:50 | mrabarnett | set | nosy: + mrabarnett messages: + msg84544 |
| 2009-03-30 09:48:11 | loewis | set | priority: high -> normal messages: + msg84532 |
| 2009-03-30 04:49:54 | ajaksu2 | set | versions: + Python 3.1, Python 2.7 nosy: + ajaksu2 messages: + msg84503 components: + Unicode stage: test needed |
| 2006-07-25 04:44:06 | gmarketer | create | |
