Message 141917 - Python tracker


Author	tchrist
Recipients	tchrist
Date	2011-08-11.19:03:54
SpamBayes Score	1.9692692e-11
Marked as misclassified	No
Message-id	<1313089435.8.0.838915767835.issue12729@psf.upfronthosting.co.za>
In-reply-to

Content
Python is in flagrant violation of the most basic premises of Unicode Technical Report #18 on Regular Expressions, which requires that a regex engine support Unicode characters as "basic logical units independent of serialization like UTF‑*". Because you must sometimes specify ".." to match a single Unicode character -- whenever that character's code point lies above the BMP and you are on a narrow build -- Python regexes cannot be reliably used for Unicode text.

 % python3.2
 Python 3.2 (r32:88445, Jul 21 2011, 14:44:19)
 [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import re
 >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
 >>> print(g)
 ᾲ
 >>> print(re.search(r'\w', g))
 <_sre.SRE_Match object at 0x10051f988>
 >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
 >>> print(p)
 𝒫
 >>> print(re.search(r'\w', p))
 None
 >>> print(re.search(r'..', p))   # ← 𝙏𝙃𝙄𝙎 𝙄𝙎 𝙏𝙃𝙀 𝙑𝙄𝙊𝙇𝘼𝙏𝙄𝙊𝙉 𝙍𝙄𝙂𝙃𝙏 𝙃𝙀𝙍𝙀
 <_sre.SRE_Match object at 0x10051f988>
 >>> print(len(chr(0x1D4AB)))
 2

That is illegal in Unicode regular expressions.
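
To show where the length of 2 comes from, here is a minimal sketch that splits the same character into its UTF-16 code units. It assumes Python 3.3 or later, where strings are indexed by code point, so the behaviour differs from the narrow 3.2 build in the session above. The helper name utf16_code_units is hypothetical and used only for illustration.

```python
import re
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    Hypothetical helper: encodes to little-endian UTF-16 and unpacks
    each 2-byte unit, which is what a narrow build stored per "character".
    """
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"  # U+1D4AB, above the BMP

# A narrow build exposed these two surrogates as separate string elements.
units = utf16_code_units(p)

# Assumption: Python 3.3+, where the string holds one code point.
code_point_len = len(p)

# On such a build a single "." covers the whole character.
single_dot = re.fullmatch(r".", p)
word_match = re.search(r"\w", p)

print([hex(u) for u in units])
print(code_point_len, single_dot is not None, word_match is not None)
```

The two code units are the high and low surrogates for U+1D4AB. A narrow build treated each as its own string element, which is why ".." was needed there and why "\w" found nothing to match.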

History
Date	User	Action	Args
2011-08-11 19:03:55	tchrist	set	recipients: + tchrist
2011-08-11 19:03:55	tchrist	set	messageid: <1313089435.8.0.838915767835.issue12729@psf.upfronthosting.co.za>
2011-08-11 19:03:55	tchrist	link	issue12729 messages
2011-08-11 19:03:54	tchrist	create