Message141917
Python is in flagrant violation of the very most basic premises of Unicode Technical Report #18 on Regular Expressions, which requires that a regex engine support Unicode characters as "basic logical units independent of serialization like UTFβ*". Because sometimes you must specify ".." to match a single Unicode character -- whenever those code points are above the BMP and you are on a narrow build -- Python regexes cannot be reliably used for Unicode text.
% python3.2
Python 3.2 (r32:88445, Jul 21 2011, 14:44:19)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
>>> print(g)
αΎ²
>>> print(re.search(r'\w', g))
<_sre.SRE_Match object at 0x10051f988>
>>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
>>> print(p)
π«
>>> print(re.search(r'\w', p))
None
>>> print(re.search(r'..', p)) # β ππππΒ ππΒ πππΒ πππππΌππππΒ πππππΒ ππππ
<_sre.SRE_Match object at 0x10051f988>
>>> print(len(chr(0x1D4AB)))
2
That is illegal in Unicode regular expressions. |
|
Date |
User |
Action |
Args |
2011-08-11 19:03:55 | tchrist | set | recipients:
+ tchrist |
2011-08-11 19:03:55 | tchrist | set | messageid: <1313089435.8.0.838915767835.issue12729@psf.upfronthosting.co.za> |
2011-08-11 19:03:55 | tchrist | link | issue12729 messages |
2011-08-11 19:03:54 | tchrist | create | |
|