Message 183705 - Python tracker



Author	acdha
Recipients	acdha, ezio.melotti, mrabarnett
Date	2013-03-07.20:52:32
Message-id	<1362689552.47.0.00880933015098.issue17381@psf.upfronthosting.co.za>

Content

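I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text. With re.IGNORECASE set, the \u0400-\u0527 range no longer matches the Cyrillic letters, while the same pattern without the flag removes them as expected: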

>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''

The same failure occurs in Python 2.7, although you need to use ur'' patterns for the \u escapes to be expanded:

>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|re.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'


In contrast, the regex module behaves as expected:

>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
u''

(Transcript maintained at https://gist.github.com/acdha/5111687)
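
A possible workaround, sketched here under the assumption that case-insensitive matching is not actually required for this check: the range U+0400-U+0527 already contains both the upper- and lowercase Cyrillic letters, so the pattern can be compiled without re.IGNORECASE. The helper name non_cyrillic_remainder and the constant CYRILLIC_OR_SPACE are hypothetical and not part of the original report; the sketch assumes Python 3.3 or later, where re understands \u escapes in patterns.

```python
import re

# Hypothetical helper, not from the original report.
# Assumption: U+0400-U+0527 spans both cases of the Cyrillic letters,
# so re.IGNORECASE (the flag that triggers the failure) is left out.
CYRILLIC_OR_SPACE = re.compile(r'[\s\u0400-\u0527]+')


def non_cyrillic_remainder(text):
    """Return what is left after removing whitespace and Cyrillic characters."""
    return CYRILLIC_OR_SPACE.sub('', text)


# An empty result means the text contained only Cyrillic and whitespace.
print(repr(non_cyrillic_remainder('Архангельская губерния')))
# Anything left over is non-Cyrillic content that needs a closer look.
print(repr(non_cyrillic_remainder('Архангельская gubernia')))
```

This only sidesteps the problem for ranges that already cover both cases; it does not help when a range is deliberately combined with re.IGNORECASE.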
