Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locale dependent regexps on different locales #66600

Closed
serhiy-storchaka opened this issue Sep 14, 2014 · 15 comments
Closed

Locale dependent regexps on different locales #66600

serhiy-storchaka opened this issue Sep 14, 2014 · 15 comments
Assignees
Labels
extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir topic-regex type-bug An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Copy link
Member

BPO 22410
Nosy @pitrou, @ezio-melotti, @serhiy-storchaka
Files
  • re_locale_caching_demo.py: Demo
  • re_locale_caching3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2017-04-30.05:38:39.070>
    created_at = <Date 2014-09-14.16:23:23.633>
    labels = ['extension-modules', 'expert-regex', 'type-bug', 'library']
    title = 'Locale dependent regexps on different locales'
    updated_at = <Date 2017-04-30.05:38:39.054>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2017-04-30.05:38:39.054>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2017-04-30.05:38:39.070>
    closer = 'serhiy.storchaka'
    components = ['Extension Modules', 'Library (Lib)', 'Regular Expressions']
    creation = <Date 2014-09-14.16:23:23.633>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['36616', '36659']
    hgrepos = []
    issue_num = 22410
    keywords = ['patch']
    message_count = 15.0
    messages = ['226874', '226878', '227033', '227038', '227044', '227045', '227046', '227050', '227087', '229920', '230302', '230306', '230310', '230314', '292619']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'ezio.melotti', 'mrabarnett', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue22410'
    versions = ['Python 3.5']

    @serhiy-storchaka
    Copy link
    Member Author

    Locale-specific case-insensitive regular expression matching works only when the pattern was compiled on the same locale as used for matching. Due to caching this can cause unexpected result.

    Attached script demonstrates this (it requires two locales: ru_RU.koi8-r and ru_RU.cp1251). The output is:

    locale ru_RU.koi8-r
    b'1\xa3' ('1ё') matches b'1\xb3' ('1Ё')
    b'1\xa3' ('1ё') doesn't match b'1\xbc' ('1╪')
    locale ru_RU.cp1251
    b'1\xa3' ('1Ј') doesn't match b'1\xb3' ('1і')
    b'1\xa3' ('1Ј') matches b'1\xbc' ('1ј')
    locale ru_RU.cp1251
    b'2\xa3' ('2Ј') doesn't match b'2\xb3' ('2і')
    b'2\xa3' ('2Ј') matches b'2\xbc' ('2ј')
    locale ru_RU.koi8-r
    b'2\xa3' ('2ё') doesn't match b'2\xb3' ('2Ё')
    b'2\xa3' ('2ё') matches b'2\xbc' ('2╪')

    b'\xa3' matches b'\xb3' on KOI8-R locale if the pattern was compiled on KOI8-R locale and matches b'\xb3' if the pattern was compiled on CP1251 locale.

    I see three possible ways to solve this issue:

    1. Avoid caching of locale-depending case-insensitive patterns. This definitely will decrease performance of the use of locale-depending case-insensitive regexps (if user don't use own caching) and may be slightly decrease performance of the use of other regexps.

    2. Clear precompiled regexps cache on every locale change. This can look simpler, but is vulnerable to locale changes from extensions.

    3. Do not lowercase characters at compile time (in locale-depending case-insensitive patterns). This needs to introduce new opcode for case-insensitivity matching or at least rewriting implementation of current opcodes (less efficient). On other way, this is more correct implementation than current one. The problem is that this is incompatible with those distributions which updates only Python library but not statically linked binary (e.g. Vim with Python support). May be there are some workarounds.

    @serhiy-storchaka serhiy-storchaka added extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir topic-regex type-bug An unexpected behavior, bug, or error labels Sep 14, 2014
    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Sep 14, 2014

    The support for locales in the re module is limited to those with 1 byte per character, and only for a few properties (those provided by the underlying C library), so maybe it could do the following:

    If the LOCALE flag is set, then read the current locale and build a table of its properties.

    Let the compiled pattern refer to the property table.

    When matching, use the property table referred to by the pattern.

    @serhiy-storchaka
    Copy link
    Member Author

    Yes, it is possible to build full property table for bytes regexps at regexp compile time. But it is impossible for unicode regexps (bpo-22407). And in any case this doesn't solve original problem: re.match(pattern, string, re.L|re.I) can return unexpected result if the same pattern already was used with different locale.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Sep 18, 2014

    When you lookup the pattern in the cache, include the current locale as part of the key if the pattern is locale-sensitive (you can let it be None if the pattern is not locale-sensitive).

    @serhiy-storchaka
    Copy link
    Member Author

    Here is a patch which implements Matthew's suggestion. It significant slow down the use of locale-sensitive regular expressions, there is a possibility for race condition between compiling and matching, and it doesn't solve the issue for explicitly cached expressions. Also I prefer that matching depends on locale at the time of matching, not at the time of compiling.

    This patch can be considered as nonperfect solution for 3.4 and 2.7. But for 3.5 I'll try to implement better solution.

    Microbenchmark:
    $ ./python -m timeit -s 'import re' -- 're.match(br"\w+", b"abc", re.L)'

    Before patch: 100000 loops, best of 3: 10.4 usec per loop

    After patch: 10000 loops, best of 3: 37.5 usec per loop

    @pitrou
    Copy link
    Member

    pitrou commented Sep 18, 2014

    Rather than introduce a perf regression in 2.7 and 3.4, I would suggest to simply fix the issue in 3.5.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Sep 18, 2014

    @serhiy: You're overlooking that the LOCALE flag could be inline, e.g. r'(?L)\w+'.

    Basically, if you've seen the pattern before, you know whether it has an inline LOCALE flag; if you haven't seen the pattern before, you'll need to parse it anyway, and then you'll discover whether it has an inline LOCALE flag.

    @serhiy-storchaka
    Copy link
    Member Author

    Good catch Matthew!

    After fixing this and yet one bug (LC_CTYPE should be used instead of LC_ALL), and adding more optimizations, the performance is increased. Now the result of above microbenchmark is 18.5 usec per loop.

    @serhiy-storchaka
    Copy link
    Member Author

    Moved the import to the top level as Antoine suggested.

    @serhiy-storchaka
    Copy link
    Member Author

    If there are no objections I'll commit the patch.

    @serhiy-storchaka serhiy-storchaka self-assigned this Oct 24, 2014
    @pitrou
    Copy link
    Member

    pitrou commented Oct 30, 2014

    Patch looks good to me.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 30, 2014

    New changeset 6d2788f9b20a by Serhiy Storchaka in branch '2.7':
    Issue bpo-22410: Module level functions in the re module now cache compiled
    https://hg.python.org/cpython/rev/6d2788f9b20a

    New changeset cbdc658b7797 by Serhiy Storchaka in branch '3.4':
    Issue bpo-22410: Module level functions in the re module now cache compiled
    https://hg.python.org/cpython/rev/cbdc658b7797

    New changeset df9c1caf3654 by Serhiy Storchaka in branch 'default':
    Issue bpo-22410: Module level functions in the re module now cache compiled
    https://hg.python.org/cpython/rev/df9c1caf3654

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 30, 2014

    New changeset d565dbf576f9 by Serhiy Storchaka in branch '2.7':
    Fixed compile error in issue bpo-22410. The _locale module is optional.
    https://hg.python.org/cpython/rev/d565dbf576f9

    New changeset 0c016fa378db by Serhiy Storchaka in branch '3.4':
    Fixed compile error in issue bpo-22410. The _locale module is optional.
    https://hg.python.org/cpython/rev/0c016fa378db

    New changeset 1d87ac92b041 by Serhiy Storchaka in branch 'default':
    Fixed compile error in issue bpo-22410. The _locale module is optional.
    https://hg.python.org/cpython/rev/1d87ac92b041

    @serhiy-storchaka
    Copy link
    Member Author

    Thank you for your review Antoine.

    Committed patch has fixed only part of the problem. It doesn't fix the problem of explicitly compiled patterns. Better solution requires changes to the _sre module.

    @serhiy-storchaka
    Copy link
    Member Author

    Opened bpo-30215 for more comprehensive solution.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir topic-regex type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants