Title: New sequences for Unicode groups and block ranges needed
Type: enhancement
Stage: test needed
Components: Regular Expressions, Unicode
Versions: Python 3.4
Status: open
Resolution:
Dependencies: 2636
Superseder:
Assigned To:
Nosy List: ajaksu2, effbot, ezio.melotti, gmarketer, lemburg, loewis, mrabarnett, terry.reedy
Priority: normal
Keywords:

Created on 2006-07-25 04:44 by gmarketer, last changed 2019-03-15 23:59 by BreamoreBoy.

Messages (15)
msg54861 - (view) Author: gmarketer (gmarketer) Date: 2006-07-25 04:44
Special sequences consisting of "\" and another
character need to be added to the RE syntax to simplify
finding several Unicode classes, such as:
 * All uppercase letters
 * All lowercase letters
msg54862 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-07-25 07:45
Could you make your request a little more specific ?

We already have categories in the re module, so adding a
few more would be possible (patches are welcome !). However,
we do need to know why you need them and whether there are
other RE implementations that already have such special
matching characters, e.g. the Perl RE implementation.
msg54863 - (view) Author: gmarketer (gmarketer) Date: 2006-07-26 02:06
We need to process several strings in UTF-8 and need to use
regular expressions to match patterns, for ex.:

We don't know how to implement this logic by hand.

Also, I found this logic implemented in Microsoft .NET
regular expressions:

\p{name}        Matches any character in the named character
class 'name'. Supported names are Unicode groups and block
ranges. For example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc

\P{name}        Matches text not included in the named
character class 'name'. 

We need the same logic in Python regular expressions.
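
A minimal sketch of how the \p{name} / \P{name} behaviour could be approximated today for general categories only, using the standard unicodedata module instead of the re module. The helper names in_category and find_runs are hypothetical, and block names such as IsGreek are not covered:

```python
import unicodedata


def in_category(ch, name):
    """Return True if the general category of ch starts with name.

    name may be a full category such as 'Lu' or a one-letter group such as 'L'.
    """
    return unicodedata.category(ch).startswith(name)


def find_runs(text, name, negate=False):
    """Yield maximal runs of characters inside the category.

    With negate=True the test is inverted, roughly like \\P{name}.
    """
    run = []
    for ch in text:
        if in_category(ch, name) != negate:
            run.append(ch)
        elif run:
            yield "".join(run)
            run = []
    if run:
        yield "".join(run)
```

For example, list(find_runs(text, "Lu")) collects the runs of uppercase letters in text, whatever script they belong to.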
msg54864 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-09-10 10:36
If anything, I think Python should implement Unicode TR#18:

This does include the \p notation for property expressions,
e.g. \p{Ll} or \p{East Asian Width:Narrow}.

We currently don't include the Script property, so \p{Greek}
could not be implemented (we can, of course, add support for
the script property). I can't find anything in the report
that makes \p{IsGreek} valid, so we shouldn't support it.
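
As an illustration of what \p{Ll} would stand for, the sketch below expands a general category into an explicit character class that the existing re module can compile. It assumes only the data exposed by unicodedata, so it cannot express the Script property; category_class is a hypothetical helper, not a proposed API:

```python
import re
import sys
import unicodedata


def category_class(prefix):
    """Build a character class matching every code point whose general
    category starts with prefix (for example 'Ll' or 'N')."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    if not ranges:
        raise ValueError("no code points in category %r" % prefix)

    def esc(cp):
        # \UXXXXXXXX escapes are accepted by re in str patterns.
        return "\\U%08x" % cp

    parts = [esc(lo) if lo == hi else esc(lo) + "-" + esc(hi)
             for lo, hi in ranges]
    return "[" + "".join(parts) + "]"


# Stand-in for \p{Ll}+ ; scanning the whole code space takes a moment.
LOWERCASE_RUN = re.compile(category_class("Ll") + "+")
```

The resulting pattern is tied to the Unicode database version of the running interpreter, which is one reason built-in support would be preferable to this kind of expansion.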
msg54865 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2006-12-04 09:27
note that posix uses a special set syntax, [:name:], for this purpose:

[:alnum:]   [:cntrl:]   [:lower:]   [:space:]
[:alpha:]   [:digit:]   [:print:]   [:upper:]
[:blank:]   [:graph:]   [:punct:]   [:xdigit:]

adding a new character escape will probably break more existing expressions, but no matter what syntax we choose, this is (micro-)PEP territory.
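
a rough sketch of what the set syntax means, assuming the ASCII-only ("C" locale) definitions of a few of the names. expand_posix and POSIX_ASCII are hypothetical names; the rewrite is purely textual, only handles [:name:] written inside an enclosing set, and does not try to cope with escaped brackets:

```python
import re

# ASCII-only definitions for a subset of the POSIX class names.
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}


def expand_posix(pattern):
    """Replace each [:name:] in pattern with its ASCII range text.

    Unknown names raise KeyError.
    """
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_ASCII[m.group(1)],
                  pattern)
```

with this, expand_posix("[[:alpha:]_][[:alnum:]_]*") gives a pattern that the current re module accepts.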
msg84503 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2009-03-30 04:49
Has this been addressed for 2.6/3.0? Do the LOCALE and UNICODE constants
cover this?
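
For reference, a small sketch of what the flags do change, written against Python 3 where str patterns are Unicode-aware by default and re.ASCII switches that off. The flags alter the meaning of the existing escapes such as \d and \w, but give no way to name an arbitrary category or block. find_digits is a hypothetical helper:

```python
import re


def find_digits(text, ascii_only=False):
    """Return the runs matched by \\d+, optionally restricted to ASCII."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\d+", text, flags)


# ASCII digits followed by ARABIC-INDIC DIGIT ONE, TWO, THREE.
SAMPLE = "123 \u0661\u0662\u0663"
```

Calling find_digits(SAMPLE) picks up both runs, while find_digits(SAMPLE, ascii_only=True) keeps only the ASCII one.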
msg84532 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-03-30 09:48
No progress has been made. I still maintain that TR18 should be implemented.

I'm not so sure whether the POSIX special groups should be provided. My 
understanding is that they originally were meant to integrate with the 
locale support, and change with locale. For Unicode, Annex C of TR18 makes 
a recommendation on how to provide the POSIX properties, and offers two 
alternative definitions: Standard Recommendation and POSIX Compatible. 
That alone tells me that it is best not to provide support for them: 
refuse the temptation to guess.
msg84544 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-03-30 13:32
I implemented \p, \P and [:...:] for the simple categories (eg "Lu" and
"upper", but not "IsGreek") in the work I did for issue #2636.
msg100129 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2010-02-25 23:49
\p{name} is supported for Unicode properties, scripts and blocks in my regex module (see issue #2636).

It also supports the POSIX set syntax, although I'm not sure that we really need to have 2 ways of doing it, eg \p{Alpha} and [[:Alpha:]].
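
A short usage sketch, assuming the third-party regex module is installed (it is not part of the stdlib, so the import is guarded). find_upper_and_greek is a hypothetical helper name:

```python
try:
    import regex  # third-party module, not the stdlib re
except ImportError:
    regex = None


def find_upper_and_greek(text):
    """Return (uppercase runs, Greek-script runs), or None without regex."""
    if regex is None:
        return None
    upper = regex.findall(r"\p{Lu}+", text)
    greek = regex.findall(r"\p{Script=Greek}+", text)
    return upper, greek
```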
msg130412 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-03-09 03:36
Is there a practical issue left here? Matthew says his regex module does as requested, but adding that to the stdlib is a separate issue. Martin would like an implementation of Unicode TR18, but that is also another issue.
msg185757 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-04-01 18:57
I am trying to decide if this issue still serves a purpose. It seems to be a request to add something to the existing re module. Fredrik semi-rejected the idea without a (micro)-pep. A python-ideas discussion is now another option. Matthew's regex implementation already has the feature, so this issue would be moot if it were ever part of the stdlib. But the fate of #2636 is unclear. Rereading, it now seems that implementing the feature in the current re module using the TR18 syntax would be this issue, if someone were to do it. So I will not close yet.
msg185759 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-01 18:59
We should really just include "regex" in 3.4.
msg221632 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-26 19:02
Is there an easy way to find out how many other issues have #2636 as a dependency?
msg221761 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-06-28 01:39
This seems to be the only one currently.
Other issues might have been closed in favor of #2636 though.
Date                 User          Action  Args
2019-03-15 23:59:53  BreamoreBoy   set     nosy: - BreamoreBoy
2014-06-28 01:39:50  ezio.melotti  set     messages: + msg221761
2014-06-26 19:02:49  BreamoreBoy   set     nosy: + BreamoreBoy
                                           messages: + msg221632
2013-04-01 18:59:56  ezio.melotti  set     nosy: + ezio.melotti
                                           messages: + msg185759
2013-04-01 18:57:16  terry.reedy   set     dependencies: + Adding a new regex module (compatible with re)
                                           messages: + msg185757
                                           versions: + Python 3.4, - Python 3.3
2011-03-09 03:36:39  terry.reedy   set     nosy: + terry.reedy
                                           messages: + msg130412
                                           versions: + Python 3.3, - Python 3.1, Python 2.7
2010-12-07 20:13:29  admin         set     messages: + msg54862
2010-12-07 20:12:56  admin         set     messages: + msg54865, - msg54862
2010-12-07 19:25:16  loewis        set     messages: - msg54865
2010-02-25 23:49:47  mrabarnett    set     messages: + msg100129
2009-03-30 13:32:50  mrabarnett    set     nosy: + mrabarnett
                                           messages: + msg84544
2009-03-30 09:48:11  loewis        set     priority: high -> normal
                                           messages: + msg84532
2009-03-30 04:49:54  ajaksu2       set     versions: + Python 3.1, Python 2.7
                                           nosy: + ajaksu2
                                           messages: + msg84503
                                           components: + Unicode
                                           stage: test needed
2006-07-25 04:44:06  gmarketer     create