classification
Title: Locale dependent regexps on different locales
Type: behavior Stage: resolved
Components: Extension Modules, Library (Lib), Regular Expressions Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, mrabarnett, pitrou, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2014-09-14 16:23 by serhiy.storchaka, last changed 2017-04-30 05:38 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
re_locale_caching_demo.py serhiy.storchaka, 2014-09-14 16:23 Demo
re_locale_caching3.patch serhiy.storchaka, 2014-09-19 09:31
Messages (15)
msg226874 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-14 16:23
Locale-specific case-insensitive regular expression matching works only when the pattern was compiled under the same locale as the one used for matching. Because of caching, this can cause unexpected results.

The attached script demonstrates this (it requires two locales: ru_RU.koi8-r and ru_RU.cp1251). The output is:

locale ru_RU.koi8-r
  b'1\xa3' ('1ё') matches b'1\xb3' ('1Ё')
  b'1\xa3' ('1ё') doesn't match b'1\xbc' ('1╪')
locale ru_RU.cp1251
  b'1\xa3' ('1Ј') doesn't match b'1\xb3' ('1і')
  b'1\xa3' ('1Ј') matches b'1\xbc' ('1ј')
locale ru_RU.cp1251
  b'2\xa3' ('2Ј') doesn't match b'2\xb3' ('2і')
  b'2\xa3' ('2Ј') matches b'2\xbc' ('2ј')
locale ru_RU.koi8-r
  b'2\xa3' ('2ё') doesn't match b'2\xb3' ('2Ё')
  b'2\xa3' ('2ё') matches b'2\xbc' ('2╪')

On the KOI8-R locale, b'\xa3' matches b'\xb3' if the pattern was compiled on the KOI8-R locale, and matches b'\xbc' if the pattern was compiled on the CP1251 locale.

I see three possible ways to solve this issue:

1. Avoid caching locale-dependent case-insensitive patterns. This will definitely decrease the performance of locale-dependent case-insensitive regexps (if the user doesn't do their own caching) and may slightly decrease the performance of other regexps.

2. Clear the precompiled regexp cache on every locale change. This may look simpler, but it is vulnerable to locale changes made from extensions.

3. Do not lowercase characters at compile time (in locale-dependent case-insensitive patterns). This requires introducing a new opcode for case-insensitive matching, or at least rewriting the implementation of the current opcodes (less efficient). On the other hand, this is a more correct implementation than the current one. The problem is that it is incompatible with distributions that update only the Python library but not the statically linked binary (e.g. Vim with Python support). Maybe there are some workarounds.
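
For illustration, option 2 could be approximated in user code with a small wrapper. This is only a sketch: setlocale_and_purge is a hypothetical helper name, not part of the locale or re modules, and it shares the weakness noted above, since a locale change made by an extension (or by any code that calls locale.setlocale() directly) bypasses it.

```python
import locale
import re


def setlocale_and_purge(category, value=None):
    """Hypothetical helper: change the locale, then clear re's pattern cache.

    With value=None this only queries the current setting, exactly like
    locale.setlocale(), and leaves the cache alone.
    """
    result = locale.setlocale(category, value)
    if value is not None:
        # Drop patterns cached by module-level functions such as re.match().
        re.purge()
    return result


# Usage: switch LC_CTYPE to the portable "C" locale and start with an empty cache.
setlocale_and_purge(locale.LC_CTYPE, "C")
```

Patterns that the caller compiled explicitly with re.compile() and kept around are not affected by re.purge(), so this does not help with those.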
msg226878 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2014-09-14 16:59
The support for locales in the re module is limited to those with 1 byte per character, and only for a few properties (those provided by the underlying C library), so maybe it could do the following:

If the LOCALE flag is set, then read the current locale and build a table of its properties.

Let the compiled pattern refer to the property table.

When matching, use the property table referred to by the pattern.
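
A rough sketch of what such a table could look like for the lowercase property. Note the assumptions: build_lower_table is a hypothetical name, and the table is derived from a Python codec plus str.lower() rather than from the C library's tolower() under the current locale, so it only approximates what the re module would see.

```python
def build_lower_table(encoding):
    """Return a 256-byte table mapping each byte to its lowercase byte.

    Approximation: uses the named Python codec and str.lower(), not the
    C library's locale tables.
    """
    table = bytearray(range(256))
    for byte in range(256):
        try:
            char = bytes([byte]).decode(encoding)
            lowered = char.lower().encode(encoding)
        except UnicodeError:
            # Byte undefined in this codec, or result not encodable: keep as is.
            continue
        if len(lowered) == 1:
            table[byte] = lowered[0]
    return bytes(table)


koi8r = build_lower_table("koi8-r")
cp1251 = build_lower_table("cp1251")

# The same byte values fold differently, as in the demo output above:
# in KOI8-R, 0xb3 (Ё) lowers to 0xa3 (ё); in CP1251, 0xa3 (Ј) lowers to 0xbc (ј).
print(hex(koi8r[0xB3]), hex(cp1251[0xA3]))
```

A compiled pattern holding a reference to one of these tables would keep folding case the same way regardless of later locale changes.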
msg227033 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-18 09:54
Yes, it is possible to build a full property table for bytes regexps at regexp compile time. But it is impossible for Unicode regexps (issue22407). And in any case this doesn't solve the original problem: re.match(pattern, string, re.L|re.I) can return an unexpected result if the same pattern was already used with a different locale.
msg227038 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2014-09-18 11:05
When you look up the pattern in the cache, include the current locale as part of the key if the pattern is locale-sensitive (you can let it be None if the pattern is not locale-sensitive).
msg227044 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-18 13:52
Here is a patch which implements Matthew's suggestion. It significantly slows down the use of locale-sensitive regular expressions, there is a possible race condition between compiling and matching, and it doesn't solve the issue for explicitly cached expressions. Also, I would prefer that matching depend on the locale at the time of matching, not at the time of compiling.

This patch can be considered an imperfect solution for 3.4 and 2.7, but for 3.5 I'll try to implement a better one.

Microbenchmark:
$ ./python -m timeit -s 'import re' -- 're.match(br"\w+", b"abc", re.L)'

Before patch: 100000 loops, best of 3: 10.4 usec per loop

After patch: 10000 loops, best of 3: 37.5 usec per loop
msg227045 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-09-18 13:54
Rather than introduce a perf regression in 2.7 and 3.4, I would suggest simply fixing the issue in 3.5.
msg227046 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2014-09-18 14:11
@Serhiy: You're overlooking that the LOCALE flag could be inline, e.g. r'(?L)\w+'.

Basically, if you've seen the pattern before, you know whether it has an inline LOCALE flag; if you haven't seen the pattern before, you'll need to parse it anyway, and then you'll discover whether it has an inline LOCALE flag.
msg227050 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-18 15:38
Good catch, Matthew!

After fixing this and one more bug (LC_CTYPE should be used instead of LC_ALL), and adding more optimizations, performance has improved. The microbenchmark above now takes 18.5 usec per loop.
msg227087 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-19 09:31
Moved the import to the top level as Antoine suggested.
msg229920 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-24 12:51
If there are no objections I'll commit the patch.
msg230302 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-30 21:57
Patch looks good to me.
msg230306 - (view) Author: Roundup Robot (python-dev) Date: 2014-10-30 23:04
New changeset 6d2788f9b20a by Serhiy Storchaka in branch '2.7':
Issue #22410: Module level functions in the re module now cache compiled
https://hg.python.org/cpython/rev/6d2788f9b20a

New changeset cbdc658b7797 by Serhiy Storchaka in branch '3.4':
Issue #22410: Module level functions in the re module now cache compiled
https://hg.python.org/cpython/rev/cbdc658b7797

New changeset df9c1caf3654 by Serhiy Storchaka in branch 'default':
Issue #22410: Module level functions in the re module now cache compiled
https://hg.python.org/cpython/rev/df9c1caf3654
msg230310 - (view) Author: Roundup Robot (python-dev) Date: 2014-10-30 23:39
New changeset d565dbf576f9 by Serhiy Storchaka in branch '2.7':
Fixed compile error in issue #22410. The _locale module is optional.
https://hg.python.org/cpython/rev/d565dbf576f9

New changeset 0c016fa378db by Serhiy Storchaka in branch '3.4':
Fixed compile error in issue #22410. The _locale module is optional.
https://hg.python.org/cpython/rev/0c016fa378db

New changeset 1d87ac92b041 by Serhiy Storchaka in branch 'default':
Fixed compile error in issue #22410. The _locale module is optional.
https://hg.python.org/cpython/rev/1d87ac92b041
msg230314 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-31 00:07
Thank you for your review, Antoine.

The committed patch fixes only part of the problem: it doesn't fix the problem for explicitly compiled patterns. A better solution requires changes to the _sre module.
msg292619 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-30 05:38
Opened issue30215 for a more comprehensive solution.
History
Date                 User              Action  Args
2017-04-30 05:38:39  serhiy.storchaka  set     status: open -> closed; messages: + msg292619; stage: resolved
2014-10-31 00:07:39  serhiy.storchaka  set     resolution: fixed; stage: patch review -> (no value); messages: + msg230314; versions: - Python 2.7, Python 3.4
2014-10-30 23:39:49  python-dev        set     messages: + msg230310
2014-10-30 23:04:07  python-dev        set     nosy: + python-dev; messages: + msg230306
2014-10-30 21:57:01  pitrou            set     messages: + msg230302
2014-10-24 12:51:18  serhiy.storchaka  set     assignee: serhiy.storchaka; messages: + msg229920
2014-09-19 09:35:51  serhiy.storchaka  set     files: - re_locale_caching2.patch
2014-09-19 09:31:48  serhiy.storchaka  set     files: + re_locale_caching3.patch; messages: + msg227087
2014-09-18 16:15:57  serhiy.storchaka  set     files: - re_locale_caching.patch
2014-09-18 15:38:06  serhiy.storchaka  set     files: + re_locale_caching2.patch; messages: + msg227050
2014-09-18 14:11:51  mrabarnett        set     messages: + msg227046
2014-09-18 13:54:04  pitrou            set     messages: + msg227045
2014-09-18 13:52:49  serhiy.storchaka  set     files: + re_locale_caching.patch; keywords: + patch; messages: + msg227044; stage: patch review
2014-09-18 11:05:05  mrabarnett        set     messages: + msg227038
2014-09-18 09:54:25  serhiy.storchaka  set     messages: + msg227033
2014-09-14 16:59:32  mrabarnett        set     messages: + msg226878
2014-09-14 16:23:23  serhiy.storchaka  create