Message 226871 - Python tracker


Author	serhiy.storchaka
Recipients	ezio.melotti, mrabarnett, pitrou, serhiy.storchaka
Date	2014-09-14.15:43:15
Message-id	<1410709398.9.0.134852560063.issue22407@psf.upfronthosting.co.za>

Content
The current implementation of re.LOCALE support for Unicode strings is nonsensical. It works correctly only in Latin1 locales, because a Unicode string is interpreted as a Latin1-decoded bytes string and all characters outside the UCS1 range are considered non-word characters. In other locales it gives strange and useless results.

>>> import re, locale
>>> locale.setlocale(locale.LC_CTYPE, 'ru_RU.cp1251')
'ru_RU.cp1251'
>>> re.match(br'\w', 'µ'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb5'>
>>> re.match(r'\w', 'µ', re.L)
<_sre.SRE_Match object; span=(0, 1), match='µ'>
>>> re.match(br'\w', 'ё'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb8'>
>>> re.match(r'\w', 'ё', re.L)
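
The session above can be explained with a small model of the behaviour described in this message. This is only an illustrative sketch, not the actual _sre code: the helper name `old_locale_view` is hypothetical, and it assumes that a code point below 256 is classified as if it were the byte with the same value in the locale encoding, while anything above that is treated as a non-word character.

```python
def old_locale_view(ch, encoding="cp1251"):
    """Return the character whose classification `ch` would borrow, or None.

    Hypothetical model of the behaviour described in the message: a code
    point that fits in one byte is looked up as that byte in the locale
    encoding; any other code point is simply treated as a non-word.
    """
    code = ord(ch)
    if code > 0xFF:
        return None  # outside the UCS1 range: never matches \w
    return bytes([code]).decode(encoding, errors="replace")


# U+00B5 happens to be byte 0xB5 in cp1251 too, so the result looks right.
print(old_locale_view("µ"))
# U+0451 does not fit in one byte, although cp1251 encodes it as b'\xb8'.
print(old_locale_view("ё"), "ё".encode("cp1251"))
# U+00E8 is classified as whatever byte 0xE8 means in cp1251.
print(old_locale_view("è"))
```

Under this model 'µ' matches only by coincidence, 'ё' can never match, and a Latin1 letter such as 'è' is classified as the unrelated Cyrillic letter that shares its byte value in cp1251.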

The proposed patch fixes re.LOCALE support for Unicode strings. It uses the wide-character equivalents of the C character functions (towlower(), iswalpha(), etc.).

The problem is that these functions do not exist in C89; they were introduced only in C99. GCC understands them, but we should check other compilers. However, these functions are already used on FreeBSD and MacOS.
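
For anyone who wants to poke at the wide-character functions named above without building the patch, they can be called from Python through ctypes. This is a rough sketch under several assumptions: a POSIX system where `ctypes.CDLL(None)` exposes the C library, a `wint_t` that is an unsigned int, and a 4-byte `wchar_t` so that `ord()` values can be passed directly. The helper names `is_locale_alpha` and `locale_lower` are hypothetical and are not part of the patch.

```python
import ctypes

# Assumption: on POSIX systems CDLL(None) gives access to the symbols of
# the C library already loaded into the process. This fails on Windows.
libc = ctypes.CDLL(None)

# Assumption: wint_t is an unsigned int, as on common Linux ABIs.
libc.iswalpha.argtypes = [ctypes.c_uint]
libc.iswalpha.restype = ctypes.c_int
libc.towlower.argtypes = [ctypes.c_uint]
libc.towlower.restype = ctypes.c_uint


def is_locale_alpha(ch):
    """True if iswalpha() reports `ch` as alphabetic in the current LC_CTYPE."""
    return libc.iswalpha(ord(ch)) != 0


def locale_lower(ch):
    """Lowercase `ch` with towlower() according to the current LC_CTYPE."""
    return chr(libc.towlower(ord(ch)))


print(is_locale_alpha("a"), is_locale_alpha("1"), locale_lower("A"))
```

Results for non-ASCII characters depend on the active LC_CTYPE locale, so after locale.setlocale(locale.LC_CTYPE, 'ru_RU.cp1251') one would expect 'ё' to be reported as alphabetic, provided that locale is installed.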

History
Date	User	Action	Args
2014-09-14 15:43:19	serhiy.storchaka	set	recipients: + serhiy.storchaka, pitrou, ezio.melotti, mrabarnett
2014-09-14 15:43:18	serhiy.storchaka	set	messageid: <1410709398.9.0.134852560063.issue22407@psf.upfronthosting.co.za>
2014-09-14 15:43:18	serhiy.storchaka	link	issue22407 messages
2014-09-14 15:43:18	serhiy.storchaka	create