Issue 12734: Request for property support in Python re lib

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56943

classification

Title:	Request for property support in Python re lib
Type:	enhancement	Stage:
Components:	Regular Expressions	Versions:	Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Arfrever, Socob, ezio.melotti, gvanrossum, jayvdb, mrabarnett, r.david.murray, tchrist
Priority:	normal	Keywords:

Created on 2011-08-11 20:14 by tchrist, last changed 2022-04-11 14:57 by admin.

Messages (7)
msg141925 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-11 20:14
Python supports no Unicode properties in its re library, making it unsuitable for work with Unicode. This is therefore a formal request for the Python re library to support Unicode properties.

The eleven properties required by Unicode Technical Report #18's RL1.2 are the bare minimum which must be added to make it possible to use Python regular expressions on Unicode. The proposed RL2.7 on Full Properties is even better. That is found at http://unicode.org/reports/tr18/proposed.html#Full_Properties, although by the time you read this, it will have been made an official part of tr18.

Matthew Barnett's replacement library for re, called regex, supports 67 Unicode properties at last count, including the strongly recommended loose matching.

The standard re library needs to be spiffed up to make it suitable for Unicode processing; it is not currently usable for that due to this missing functionality. I quote from the Level 1 conformance requirement of tr18:

"Level 1: This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level are independent of country or language. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing."

    pass         RL1.1   Hex Notation
    fail         RL1.2   Properties
    fail         RL1.2a  Compatibility Properties
    fail         RL1.3   Subtraction and Intersection
    fail         RL1.4   Simple Word Boundaries
    fail         RL1.5   Simple Loose Matches
    fail         RL1.6   Line Boundaries
    fail         RL1.7   Supplementary Code Points

    (withdrawn)  RL2.1   Canonical Equivalents
    fail         RL2.2   Extended Grapheme Clusters
    fail         RL2.3   Default Word Boundaries
    fail         RL2.4   Default Case Conversion
    pass         RL2.5   Name Properties
    fail         RL2.6   Wildcards in Property Values
    fail         RL2.7   Full Properties

I won’t even talk about Level 3.

ICU, Perl, and Java7 all meet Level One conformance requirements with several Level 2 requirements also met. It is important for Python to meet the Unicode Standard in this so that people can use Python for regex matching Unicode text. They currently cannot usefully do so per the requirements of tr18.
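To make the RL1.2 gap concrete, here is a minimal sketch of what a user has to do by hand when the regex engine has no property escapes. It leans on the standard unicodedata module to select characters by General_Category. The helper names find_by_category and re_supports_property_escapes are hypothetical, introduced only for this illustration, and the probe assumes that an unsupported \p{...} escape is rejected with re.error, which may not hold on every Python version.

```python
import re
import unicodedata


def find_by_category(text, prefix):
    # Hypothetical helper: keep the characters whose General_Category
    # starts with `prefix` ("L" for letters, "Lu" for uppercase letters,
    # "Nd" for decimal digits). A rough stand-in for a property escape.
    return [ch for ch in text if unicodedata.category(ch).startswith(prefix)]


def re_supports_property_escapes():
    # Hypothetical probe: try to compile a property escape and report
    # whether the re module accepted it. Assumes rejection surfaces as
    # re.error; older interpreters may treat the escape differently.
    try:
        re.compile(r"\p{Lu}")
    except re.error:
        return False
    return True


if __name__ == "__main__":
    sample = "a\u03a91"  # LATIN SMALL LETTER A, GREEK CAPITAL LETTER OMEGA, DIGIT ONE
    print(find_by_category(sample, "L"))
    print(find_by_category(sample, "Lu"))
    print(re_supports_property_escapes())
```

This only covers General_Category and works one character at a time; it cannot be embedded inside a larger pattern, which is the functionality the request is about.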
msg141931 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2011-08-12 00:15
I think the only way re is going to get "spiffed up" is by replacing it with Matthew's library. This is a goal, but I'm not sure where exactly we are in the process. The more Matthew's code gets tested (especially for compatibility with the current re API), the closer we will be to that goal.
msg141934 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-12 01:46
I've been doing a lot of testing of Matthew's regex library against UTS#18 issues, but only somewhat incidentally testing re. To use regex, one has to accept that certain things will work differently than they work in re, because he is following Unicode definitions for things like casefolding. But I doubt that is the sort of difference you are talking about. One of the things that Java, Go, and Perl all do is run regression tests against the whole Unicode Character Database to make sure nothing gets hosed, missed, or otherwise out of sync. That might be a sort of regression test you might like to add.
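A regression test of the kind described above could look roughly like the following sketch. It compares what re's \d matches against the Nd category reported by unicodedata, one code point at a time. The function name digit_mismatches is hypothetical, and checking \d against Nd is just one illustrative pairing, not a statement of what the other languages' test suites actually cover.

```python
import re
import sys
import unicodedata

DIGIT = re.compile(r"\d")


def digit_mismatches(code_points):
    # Hypothetical check: collect every code point where re's \d and the
    # database's General_Category == "Nd" disagree with each other.
    bad = []
    for cp in code_points:
        ch = chr(cp)
        matched_by_re = DIGIT.fullmatch(ch) is not None
        is_nd = unicodedata.category(ch) == "Nd"
        if matched_by_re != is_nd:
            bad.append(cp)
    return bad


if __name__ == "__main__":
    # Sweep every code point the interpreter knows about; this takes a
    # moment because it walks more than a million code points.
    print(digit_mismatches(range(sys.maxunicode + 1)))
```

Passing a smaller range, such as range(0x80), makes the same check cheap enough to run on every commit, with the full sweep reserved for when the bundled Unicode data is updated.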
msg142113 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-15 10:43
This indeed should be "fixed" by replacing 're' with 'regex'. So I would suggest focusing your tests on 'regex' and reporting them there so that possible bugs get fixed and tested before we include the module in the stdlib.
msg142141 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-15 19:05
Sorry I didn't include a test case. Hope this makes up for it. If not, please tell me how to write better test cases. :( Yeah ok, so I'm a bit persnickety or even unorthodox about my vertical alignment, but it really helps to make what is different from one line to the next stand out if the parts that are the same from line to line are at the same column every time.
msg142142 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-15 19:06
Oh whoops, that was the long ticket. Shall I reupload to the right number?
msg143042 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2011-08-26 21:25
+1 on adding the feature to 3.3 in whichever way makes sense.

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 56943
2021-01-26 09:56:36	jayvdb	set	nosy: + jayvdb
2017-07-24 02:20:19	Socob	set	nosy: + Socob
2016-04-25 06:08:17	serhiy.storchaka	link	issue24194 dependencies
2013-07-10 19:10:17	terry.reedy	set	versions: + Python 3.4, - Python 3.3
2011-08-26 21:25:18	gvanrossum	set	nosy: + gvanrossum messages: + msg143042
2011-08-15 19:51:59	tchrist	set	files: - nametests.py
2011-08-15 19:06:20	tchrist	set	messages: + msg142142
2011-08-15 19:05:16	tchrist	set	files: + nametests.py messages: + msg142141
2011-08-15 10:43:44	ezio.melotti	set	messages: + msg142113
2011-08-13 00:57:58	mrabarnett	set	nosy: + mrabarnett
2011-08-12 18:06:11	eric.araujo	set	versions: + Python 3.3, - Python 3.2
2011-08-12 18:04:58	Arfrever	set	nosy: + Arfrever
2011-08-12 01:46:34	tchrist	set	messages: + msg141934
2011-08-12 00:18:12	ezio.melotti	set	nosy: + ezio.melotti
2011-08-12 00:15:57	r.david.murray	set	nosy: + r.david.murray messages: + msg141931
2011-08-11 20:14:13	tchrist	create