Issue 22407: re.LOCALE is nonsensical for Unicode

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/66597

classification

Title:	re.LOCALE is nonsensical for Unicode
Type:	enhancement	Stage:	resolved
Components:	Extension Modules, Library (Lib), Regular Expressions, Unicode	Versions:	Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:	22838	Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	Arfrever, ezio.melotti, martin.panter, mrabarnett, pitrou, python-dev, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2014-09-14 15:43 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
re_unicode_locale.patch	serhiy.storchaka, 2014-09-14 15:43	
re_deprecate_unicode_locale.patch	serhiy.storchaka, 2014-10-09 15:10	

Messages (9)
msg226871 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-09-14 15:43
The current implementation of re.LOCALE support for Unicode strings is nonsensical. It works correctly only in Latin1 locales (because a Unicode string is interpreted as a Latin1-decoded bytes string, and all characters outside the UCS1 range are considered non-word characters); in other locales it produces strange and useless results.

>>> import re, locale
>>> locale.setlocale(locale.LC_CTYPE, 'ru_RU.cp1251')
'ru_RU.cp1251'
>>> re.match(br'\w', 'µ'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb5'>
>>> re.match(r'\w', 'µ', re.L)
<_sre.SRE_Match object; span=(0, 1), match='µ'>
>>> re.match(br'\w', 'ё'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb8'>
>>> re.match(r'\w', 'ё', re.L)

The proposed patch fixes re.LOCALE support for Unicode strings. It uses the wide-character equivalents of the C character functions (towlower(), iswalpha(), etc). The problem is that these functions do not exist in C89; they were introduced only in C99. GCC understands them, but we should check other compilers. However, these functions are already used on FreeBSD and MacOS.
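The Latin1 interpretation described in this message can be illustrated without changing the process locale. The following is only a sketch of the reasoning, not the actual C implementation in the re module; the helper name latin1_view is hypothetical. It shows that 'µ' has a code point below 256 that happens to coincide with its cp1251 byte, while 'ё' lies outside the Latin1 range even though its cp1251 byte is a single byte.

```python
def latin1_view(ch):
    """Summarize how one character looks to a byte-oriented classifier.

    Hypothetical helper for illustration only; it does not call into re.
    """
    code_point = ord(ch)
    return {
        "char": ch,
        "code_point": hex(code_point),
        # The byte that the bytes-pattern examples in the message match against.
        "cp1251_byte": ch.encode("cp1251"),
        # Only code points below 256 can be treated as if they were
        # Latin1 bytes; anything above is outside the UCS1 range.
        "fits_in_latin1": code_point < 256,
    }


rows = [latin1_view(ch) for ch in "µё"]
for row in rows:
    print(row)
```

Under the behaviour the message describes, 'µ' (U+00B5) matches only because its code point equals its cp1251 byte 0xB5, whereas 'ё' (U+0451) is treated as a non-word character although its cp1251 byte 0xB8 is matched by the bytes pattern.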
msg226949 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-09-16 12:36
I don't think we should fix this in 2.x: some people may rely on the old behaviour, and it will be difficult for them to debug. In 3.x, I simply propose we deprecate re.LOCALE for unicode strings and make it a no-op.
msg226959 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-09-16 16:11
Yes, one solution is to deprecate re.LOCALE for unicode strings and then make it incompatible with unicode strings. But I think it would be good to implement locale-aware matching. Example:

>>> for a in 'Ii\u0130\u0131':
...     for b in 'Ii\u0130\u0131':
...         if a != b and re.match(a, b, re.I): print(a, '~', b)
...
I ~ i
I ~ İ
i ~ I
i ~ İ
İ ~ I
İ ~ i

This is an incorrect result in Turkish. Capital dotless "I" matches capital "İ" with dot above, and small dotless "ı" doesn't match anything. Regex produces more relevant output, which includes matches for both Turkish and English:

I ~ i
I ~ ı
i ~ I
i ~ İ
İ ~ i
ı ~ I

With locale tr_TR.utf8 (with the patch):

>>> for a in 'Ii\u0130\u0131':
...     for b in 'Ii\u0130\u0131':
...         if a != b and re.match(a, b, re.I|re.L): print(a, '~', b)
...
I ~ ı
i ~ İ
İ ~ i
ı ~ I

This is the correct result in Turkish. Therefore there is a use case for this feature.
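The loop from this message can be wrapped in a small function so the same comparison is easy to rerun on a given interpreter. This is a sketch; the names case_insensitive_pairs and TURKISH_I_FORMS are hypothetical, and the pairs involving "İ" and "ı" may differ between Python versions, so no output is listed here.

```python
import re


def case_insensitive_pairs(chars, flags=re.IGNORECASE):
    """Return (pattern, text) pairs of distinct characters that match.

    Each character is used both as a one-character pattern and as the
    text to match, exactly as in the interactive loop in the message.
    """
    pairs = []
    for a in chars:
        for b in chars:
            if a != b and re.match(a, b, flags):
                pairs.append((a, b))
    return pairs


# The four forms discussed: I, i, İ (U+0130), ı (U+0131).
TURKISH_I_FORMS = "Ii\u0130\u0131"

pairs = case_insensitive_pairs(TURKISH_I_FORMS)
for a, b in pairs:
    print(a, "~", b)
```

The only pairs that can be relied on everywhere are the ASCII ones, "I" ~ "i" and "i" ~ "I"; the rest is what the discussion is about.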
msg226960 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-09-16 16:12
Ha, I always forget about the Turkish locale case...
msg228876 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-10-09 15:10
Here is a simple patch which just deprecates use of the re.LOCALE flag with str patterns. It also deprecates use of the re.LOCALE flag together with the re.ASCII flag (with bytes patterns) and adds some re.LOCALE-related tests.
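One way to see how a given interpreter treats the combinations named in this message is to compile a pattern and record what happens. This is a sketch with a hypothetical helper, locale_flag_outcome. It assumes that an interpreter with this patch emits a DeprecationWarning, and that later releases, as far as the re documentation states (from 3.6 on), raise ValueError instead. The helper accepts either outcome.

```python
import re
import warnings


def locale_flag_outcome(pattern, flags):
    """Classify how re.compile() treats the given pattern/flags combination.

    Returns "rejected" if compilation raises ValueError, "deprecated" if it
    succeeds but emits a DeprecationWarning, and "accepted" otherwise.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Make sure warnings are recorded rather than suppressed.
        warnings.simplefilter("always")
        try:
            re.compile(pattern, flags)
        except ValueError:
            return "rejected"
    if any(issubclass(w.category, DeprecationWarning) for w in caught):
        return "deprecated"
    return "accepted"


print("str + LOCALE:          ", locale_flag_outcome(r"\w", re.LOCALE))
print("bytes + LOCALE:        ", locale_flag_outcome(br"\w", re.LOCALE))
print("bytes + LOCALE | ASCII:", locale_flag_outcome(br"\w", re.LOCALE | re.ASCII))
```

Only the bytes pattern with re.LOCALE alone is expected to be accepted without complaint.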
msg231022 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-11 11:06
If there are no objections I'll commit the re_deprecate_unicode_locale.patch patch. But it would be good if someone would review the doc changes.
msg231924 - (view)	Author: Martin Panter (martin.panter) *	Date: 2014-12-01 10:38
Looks like revision 561d1d0de518 was meant to fix this issue, but the NEWS entry has the wrong reference number.
msg231927 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-12-01 10:49
Indeed. Thank you Martin.
msg231931 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-12-01 11:16
New changeset abc7fe393016 by Serhiy Storchaka in branch 'default': Fixed issue number in Misc/NEWS for issue #22407. https://hg.python.org/cpython/rev/abc7fe393016

History
Date	User	Action	Args
2022-04-11 14:58:08	admin	set	github: 66597
2014-12-01 11:16:44	python-dev	set	nosy: + python-dev messages: + msg231931
2014-12-01 10:49:06	serhiy.storchaka	set	messages: + msg231927
2014-12-01 10:38:05	martin.panter	set	nosy: + martin.panter messages: + msg231924
2014-12-01 09:53:48	serhiy.storchaka	set	status: open -> closed type: behavior -> enhancement resolution: fixed stage: patch review -> resolved
2014-11-11 11:06:22	serhiy.storchaka	set	messages: + msg231022
2014-11-11 11:02:03	serhiy.storchaka	set	dependencies: + Convert re tests to unittest
2014-11-11 10:58:37	serhiy.storchaka	set	assignee: serhiy.storchaka
2014-10-09 15:10:21	serhiy.storchaka	set	files: + re_deprecate_unicode_locale.patch messages: + msg228876 versions: - Python 2.7, Python 3.4
2014-09-21 09:27:33	Arfrever	set	nosy: + Arfrever
2014-09-16 16:12:50	pitrou	set	messages: + msg226960
2014-09-16 16:11:02	serhiy.storchaka	set	messages: + msg226959
2014-09-16 12:36:32	pitrou	set	messages: + msg226949
2014-09-16 12:11:17	vstinner	set	components: + Unicode
2014-09-16 12:11:10	vstinner	set	nosy: + vstinner
2014-09-14 15:43:18	serhiy.storchaka	create