Message 226959 - Python tracker


Author	serhiy.storchaka
Recipients	ezio.melotti, mrabarnett, pitrou, serhiy.storchaka, vstinner
Date	2014-09-16.16:11:02
Message-id	<1517987.ehv9QctU5f@raxxla>
In-reply-to	<1410870992.9.0.200595296146.issue22407@psf.upfronthosting.co.za>

Content

Yes, one solution is to deprecate re.LOCALE for Unicode strings and then 
make it incompatible with Unicode strings. But I think it would be good to 
implement locale-aware matching.

Example.

>>> import re
>>> for a in 'Ii\u0130\u0131':
...     for b in 'Ii\u0130\u0131':
...         if a != b and re.match(a, b, re.I): print(a, '~', b)
... 
I ~ i
I ~ İ
i ~ I
i ~ İ
İ ~ I
İ ~ i

This is an incorrect result for Turkish. Capital dotless "I" matches capital "İ" 
with dot above, and small dotless "ı" doesn't match anything.
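
For context, Python's built-in str case conversions are locale-independent as well. The sketch below only illustrates the default Unicode mappings for these four characters; it is not a description of how the re module folds case internally, and the dictionary names are made up for illustration.

```python
# Default (locale-independent) Unicode case mappings for the four characters.
# DEFAULT_LOWER and DEFAULT_UPPER are illustrative names, not part of any API.
CHARS = "Ii\u0130\u0131"  # I, i, İ (dotted capital), ı (dotless small)

DEFAULT_LOWER = {ch: ch.lower() for ch in CHARS}
DEFAULT_UPPER = {ch: ch.upper() for ch in CHARS}

# Capital "I" lowercases to dotted "i", which is the English rule, not the Turkish one.
assert DEFAULT_LOWER["I"] == "i"
# Capital "İ" lowercases to "i" followed by COMBINING DOT ABOVE (two code points).
assert DEFAULT_LOWER["\u0130"] == "i\u0307"
# Small dotless "ı" uppercases to plain "I".
assert DEFAULT_UPPER["\u0131"] == "I"
# Small "i" uppercases to plain "I" rather than Turkish "İ".
assert DEFAULT_UPPER["i"] == "I"
```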

The regex module produces more relevant output, which includes matches for 
both Turkish and English:

I ~ i
I ~ ı
i ~ I
i ~ İ
İ ~ i
ı ~ I

With locale tr_TR.utf8 (with the patch):

>>> for a in 'Ii\u0130\u0131':
...     for b in 'Ii\u0130\u0131':
...         if a != b and re.match(a, b, re.I|re.L): print(a, '~', b)
... 
I ~ ı
i ~ İ
İ ~ i
ı ~ I

This is the correct result for Turkish.
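
The Turkish pairing can be sketched without any locale support or patch, assuming a hand-written lowercase table. The names TURKISH_LOWER, turkish_lower and turkish_pairs are hypothetical helpers for illustration only; they are not part of the re module or of the patch.

```python
# Hypothetical helpers: a hand-written Turkish lowercase rule for the
# dotted/dotless i pairs (I <-> ı and İ <-> i). Not part of re or the patch.
TURKISH_LOWER = {"I": "\u0131", "\u0130": "i"}


def turkish_lower(ch):
    """Lowercase one character using the Turkish dotted/dotless i rules."""
    return TURKISH_LOWER.get(ch, ch.lower())


def turkish_pairs(chars="Ii\u0130\u0131"):
    """Return the (a, b) pairs, a != b, that are case-equivalent in Turkish."""
    return [(a, b) for a in chars for b in chars
            if a != b and turkish_lower(a) == turkish_lower(b)]


for a, b in turkish_pairs():
    print(a, "~", b)
```

Under these assumptions the loop prints the same four pairs as the patched session above.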

Therefore there is a use case for this feature.

History
Date	User	Action	Args
2014-09-16 16:11:02	serhiy.storchaka	set	recipients: + serhiy.storchaka, pitrou, vstinner, ezio.melotti, mrabarnett
2014-09-16 16:11:02	serhiy.storchaka	link	issue22407 messages
2014-09-16 16:11:02	serhiy.storchaka	create