classification
Title: Support \u and \U escapes in regexes
Type: enhancement
Stage: resolved
Components: Library (Lib), Regular Expressions, Unicode
Versions: Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: pitrou
Nosy List: eric.araujo, ezio.melotti, georg.brandl, ishimoto, mrabarnett, pitrou, python-dev, serhiy.storchaka, timehorse
Priority: critical
Keywords: needs review, patch

Created on 2008-08-24 20:33 by georg.brandl, last changed 2012-06-23 11:48 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
re_unicode_escapes.diff georg.brandl, 2008-08-24 20:33
3665.patch ishimoto, 2010-07-11 05:09
re_unicode_escapes.diff serhiy.storchaka, 2012-06-01 06:43 Regenerate georg.brandl's patch for review
3665.patch serhiy.storchaka, 2012-06-01 06:44 Regenerate ishimoto's patch for review
re_unicode_escapes-2.patch serhiy.storchaka, 2012-06-17 12:48 + PEP 393, + cleanup, + tests
re_unicode_escapes-3.patch serhiy.storchaka, 2012-06-18 08:02 + byte patterns, + tests, + docs
Messages (14)
msg71861 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-08-24 20:33
Since \u and \U aren't interpolated in raw strings anymore, the re
module should support those escapes in addition to the \x and octal ones
it already does.  Attached patch.
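To illustrate the request, here is a minimal sketch, assuming Python 3.3 or later (where this issue was eventually fixed). A raw string leaves the backslash sequence untouched, so it is the re module that has to decode it. The variable names are illustrative only.

```python
import re

# In a raw string the escape is NOT interpolated: six characters, not one.
raw_pattern = r"\u20ac"
assert len(raw_pattern) == 6

# The re module decodes \uXXXX and \UXXXXXXXX itself (Python 3.3+ assumed).
euro_match = re.match(raw_pattern, "\u20ac")
astral_match = re.match(r"\U0001F600", "\U0001F600")

# The \x and octal escapes mentioned above were already supported.
hex_octal_match = re.match(r"\x41\101", "AA")
```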
msg71864 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-24 20:49
- Check that it also works for chars > 0xFFFF (even in UCS2 builds, at
least when the chars are not part of [character range])
- What happens with e.g. [\U00010000-\U00010001] on a UCS build?
msg71865 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-24 20:49
(in the last sentence, I meant UCS2. Sorry)
msg71868 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-08-24 20:58
These concerns indeed must be handled: On narrow unicode builds, chars >
0xffff must be converted to surrogates. In ranges, they should raise an
error.

Additionally, this should at least raise an error too:

>>> re.compile("[\U00100000]").match("\U00100000").group()
'\udbc0'
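For comparison, a short sketch of how the same cases behave on Python 3.3 or later, where the narrow/wide build distinction no longer exists. This is an assumption about the interpreter version, not a description of the patch attached at this point in the discussion.

```python
import re

# A non-BMP character inside a character class matches as one character.
single = re.compile("[\U00100000]").match("\U00100000")

# A range with non-BMP endpoints, written with re-level escapes in a raw string.
astral_range = re.compile(r"[\U00010000-\U00010001]")
in_range = astral_range.match("\U00010001")
out_of_range = astral_range.match("\U00010002")
```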
msg109961 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2010-07-11 05:09
Here's an updated patch for the py3k branch.
As per Georg's comment, I added a check for code points in character
ranges and conversion to surrogate pairs. I also added a check that raises
an exception if the code point is > 0x10ffff.
I would like English speakers to fix the error messages in the patch.
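The range check described above can be observed from the outside on Python 3.3 or later. The sketch below uses a hypothetical helper, `compiles`, which is not part of the patch; it only reports whether `re.compile` accepts a pattern.

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if re rejects it (hypothetical helper)."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

valid = compiles(r"\U0010FFFF")      # highest valid code point
too_large = compiles(r"\U00110000")  # beyond 0x10ffff
truncated = compiles(r"\u12")        # \u needs exactly four hex digits
```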
msg138219 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-06-12 20:30
FYI,
+                raise error("bogus escape: %s" % repr(escape))

can be written simply as

+                raise error("bogus escape: %r" % escape)
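A quick check that the two spellings produce the same message when `escape` is a string. The sample value is an assumption chosen for illustration.

```python
escape = "\\u12"  # sample value; in the patch this is the offending escape text

with_repr = "bogus escape: %s" % repr(escape)
with_r = "bogus escape: %r" % escape
```

One caveat: if the right-hand operand could ever be a tuple, `%` would unpack it, so `"%r" % (escape,)` is the safer spelling in that case.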
msg162052 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-01 06:25
I don't think it is worth targeting this for 2.7 and 3.2 (it's a new feature, not a bugfix), but for 3.3 it will be very useful.

Since PEP 393, conversion to surrogate pairs is no longer relevant.
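A small sketch of why surrogate handling drops out, assuming Python 3.3 or later: with PEP 393 every str is a sequence of code points, so a non-BMP character has length 1 and surrogates only appear when encoding to UTF-16.

```python
import sys

astral = "\U00010000"

max_code_point = sys.maxunicode            # 0x10FFFF on every 3.3+ build
astral_length = len(astral)                # one code point, not two code units
utf16_bytes = astral.encode("utf-16-le")   # the surrogate pair exists only here
```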
msg162830 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-14 21:23
Georg, Atsuo, how are you?
msg163065 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-17 12:48
Here is an updated patch (conforming to PEP 393). In addition, the handling of octal and hexadecimal escapes was cleaned up, and the incorrect error message for hexadecimal escapes was fixed. New tests for octal and hexadecimal escapes were added.
msg163094 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-18 08:02
I forgot about byte patterns. Here is an updated patch.
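For byte patterns, a hedged sketch: `\x` escapes on the encoded bytes work in a bytes pattern, while the treatment of `\u` there is version-dependent (current re documentation says `\u` and `\U` are recognized only in Unicode patterns). The helper name below is hypothetical and merely reports what the running interpreter does.

```python
import re

# Match the UTF-8 encoding of U+20AC with \x escapes in a bytes pattern.
euro_bytes = "\u20ac".encode("utf-8")
bytes_match = re.match(rb"\xe2\x82\xac", euro_bytes)

def bytes_pattern_accepts_u_escape():
    """Report whether this interpreter compiles \\u in a bytes pattern (hypothetical helper)."""
    try:
        re.compile(rb"\u20ac")
    except re.error:
        return False
    return True
```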
msg163580 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 11:23
Any chance to commit the patch today and to get this feature in Python 3.3?
msg163584 - (view) Author: Roundup Robot (python-dev) Date: 2012-06-23 11:32
New changeset b1dbd8827e79 by Antoine Pitrou in branch 'default':
Issue #3665: \u and \U escapes are now supported in unicode regular expressions.
http://hg.python.org/cpython/rev/b1dbd8827e79
msg163585 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-23 11:33
> Any chance to commit the patch today and to get this feature in Python 
> 3.3?

Thanks for reminding us! It's now in 3.3.
msg163590 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 11:48
Thank you for the quick response.
History
Date User Action Args
2012-06-23 11:48:13 serhiy.storchaka set messages: + msg163590
2012-06-23 11:33:41 pitrou set status: open -> closed; resolution: fixed; messages: + msg163585; stage: commit review -> resolved
2012-06-23 11:32:54 python-dev set nosy: + python-dev; messages: + msg163584
2012-06-23 11:28:36 pitrou set assignee: pitrou; stage: patch review -> commit review
2012-06-23 11:23:04 serhiy.storchaka set messages: + msg163580
2012-06-18 08:02:24 serhiy.storchaka set files: + re_unicode_escapes-3.patch; messages: + msg163094
2012-06-17 12:48:06 serhiy.storchaka set files: + re_unicode_escapes-2.patch; messages: + msg163065
2012-06-14 21:23:59 serhiy.storchaka set messages: + msg162830
2012-06-01 06:44:38 serhiy.storchaka set files: + 3665.patch
2012-06-01 06:43:52 serhiy.storchaka set files: + re_unicode_escapes.diff
2012-06-01 06:37:02 serhiy.storchaka set files: - 3665.patch
2012-06-01 06:36:47 serhiy.storchaka set files: - re_unicode_escapes.diff
2012-06-01 06:36:02 serhiy.storchaka set files: + 3665.patch
2012-06-01 06:35:08 serhiy.storchaka set files: + re_unicode_escapes.diff
2012-06-01 06:25:29 serhiy.storchaka set versions: - Python 2.7, Python 3.2; nosy: + serhiy.storchaka; messages: + msg162052; components: + Regular Expressions, Unicode; type: behavior -> enhancement
2011-11-29 06:16:10 ezio.melotti set nosy: + mrabarnett
2011-07-21 05:14:12 ezio.melotti set keywords: + needs review; stage: patch review
2011-06-12 20:30:55 eric.araujo set nosy: + eric.araujo; messages: + msg138219
2011-06-12 18:32:20 terry.reedy set versions: + Python 3.2, Python 3.3, - Python 3.1
2010-08-04 14:38:30 ezio.melotti set nosy: + ezio.melotti
2010-07-11 05:09:51 ishimoto set files: + 3665.patch; nosy: + ishimoto; messages: + msg109961
2008-09-27 14:27:18 timehorse set versions: + Python 3.1, Python 2.7, - Python 3.0
2008-09-27 14:20:42 timehorse set nosy: + timehorse
2008-08-24 20:58:27 georg.brandl set messages: + msg71868
2008-08-24 20:49:33 pitrou set messages: + msg71865
2008-08-24 20:49:11 pitrou set nosy: + pitrou; messages: + msg71864
2008-08-24 20:33:51 georg.brandl create