Message 142058 - Python tracker


Author	tchrist
Recipients	Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date	2011-08-14.15:47:37
Message-id	<1313336859.29.0.142959598943.issue12749@psf.upfronthosting.co.za>

Content
On neither narrow nor wide builds does this UTF-8-encoded snippet run without raising an exception:

   import re

   if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE):
       print("match 1 passed")
   else:
       print("match 1 failed")

The best you can possibly do is to use both a wide build *and* symbolic literals, in which case it will pass. But remove either or both of those conditions and it fails. This is too restrictive for full Unicode use.
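
For reference, the "symbolic literals" variant could look roughly like the sketch below. It assumes that "symbolic" means `\U` escape sequences and that 𝒜, 𝒵, and 𝒞 are U+1D49C, U+1D4B5, and U+1D49E; the variable names are illustrative only. In Python 3 source the escaped and literal spellings denote the same string, and Python 3.3 and later no longer distinguish narrow from wide builds, so this sketch is expected to match there. It does not reproduce the 2011 behaviour reported here.

```python
import re

# Escape-sequence spellings of the three characters (assumed code points).
SCRIPT_A = "\U0001D49C"  # MATHEMATICAL SCRIPT CAPITAL A
SCRIPT_Z = "\U0001D4B5"  # MATHEMATICAL SCRIPT CAPITAL Z
SCRIPT_C = "\U0001D49E"  # MATHEMATICAL SCRIPT CAPITAL C

# Build the same character class as in the report: [A-Z] in script capitals.
pattern = "[" + SCRIPT_A + "-" + SCRIPT_Z + "]"

matched = re.search(pattern, SCRIPT_C, re.UNICODE) is not None
print("match passed" if matched else "match failed")
```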

There should never be any situation where [a-z] fails to match c when a < c < z and neither a nor z is something special in a character class. There is, or perhaps should be, no difference at all between "[a-z]" and "[𝒜-𝒵]", just as there is, or at least should be, no difference between "c" and "𝒞". You can't have second-class citizens like this that can't be used.
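
The ordering condition a < c < z can be checked directly on code points. This is a minimal sketch, assuming Python 3.3 or later (where `ord()` accepts any single code point) and the same assumed code points for 𝒜, 𝒞, and 𝒵; the names `a`, `c`, `z`, and `in_range` are illustrative.

```python
# Assumed code points for the three script capitals in the example.
a = "\U0001D49C"  # 𝒜
c = "\U0001D49E"  # 𝒞
z = "\U0001D4B5"  # 𝒵

# Compare by code point, which is how a character-class range is defined.
code_points = [ord(ch) for ch in (a, c, z)]
in_range = ord(a) < ord(c) < ord(z)

print([hex(cp) for cp in code_points], in_range)
```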

And no, this one is *not* fixed by Matthew Barnett's regex library. There is some dumb UCS-2 assumption lurking deep in Python somewhere that makes this break, even on wide builds, which is incomprehensible to me.
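
One plausible reading of the "UCS-2 assumption", offered as an illustration rather than a confirmed diagnosis, is that the pattern is processed as 16-bit code units instead of code points. The sketch below lists the UTF-16 code units of the character class; `utf16_code_units` is a hypothetical helper, not part of any library. Read unit by unit, the range runs from a low surrogate down to a high surrogate, which is a reversed range. This would account for an error on narrow builds only; it does not explain the wide-build failure described above.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# Code units of the class [𝒜-𝒵]; each script capital becomes a surrogate pair.
units = utf16_code_units("[\U0001D49C-\U0001D4B5]")
print([hex(u) for u in units])

# Unit-wise, the "range" is units[2]-units[4]: low surrogate to high surrogate.
reversed_range = units[2] > units[4]
print(reversed_range)
```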
