classification
Title: Backreferences make case-insensitive regex fail on non-ASCII strings.
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.4, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Arfrever, ezio.melotti, georg.brandl, haypo, mrabarnett, pitrou, pyos, python-dev, serhiy.storchaka
Priority: normal Keywords: easy, patch

Created on 2012-12-14 22:19 by pyos, last changed 2012-12-30 09:22 by georg.brandl. This issue is now closed.

Files
File name Uploaded Description Edit
issue16688.patch mrabarnett, 2012-12-15 20:00 review
issue16688#2.patch mrabarnett, 2012-12-16 01:04
issue16688#3.patch mrabarnett, 2012-12-16 18:48
Messages (17)
msg177518 - (view) Author: (pyos) Date: 2012-12-14 22:19
The title says it all: if a regular expression that makes use of backreferences is compiled with `re.I` flag, it will always fail when matched against a string that contains characters outside of U+0000-U+00FF range. I've been unable to further narrow the bug down.

A simple example:

    >>> import re
    >>> r = re.compile(r'(a)\1', re.I)  # should match "aa", "aA", "Aa", or "AA"
    >>> r.findall('aa')  # works as expected
    ['a']
    >>> r.findall('aa bcd')  # still works
    ['a']
    >>> r.findall('aa Ā')  # ord('Ā') == 0x0100
    []

The same code works as expected in Python 3.2:

    >>> r.findall('aa Ā')
    ['a']
msg177519 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-12-14 22:30
It works on 2.7 too, and fails on 3.3/3.x.
Maybe it's related to PEP 393?
msg177523 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-15 00:41
In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0))
            RETURN_FAILURE;
        p += state->charsize;
        ctx->ptr += state->charsize;
    }

However, the code for SRE_OP_GROUPREF_IGNORE (line 1316) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            state->lower(SRE_CHARGET(state, ctx->ptr, 0)) != state->lower(*p))
            RETURN_FAILURE;
        p++;
        ctx->ptr += state->charsize;
    }

(In both cases 'p' is of type 'char*'.)

The problem appears to be that the latter is still using '*p' and 'p++' and is thus always working with chars (it gets and advances 1 byte at a time instead of 1, 2 or 4 bytes for Unicode).
msg177532 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-15 08:37
Good analysis, Matthew. Are you want to submit a patch?
msg177556 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-15 20:00
OK, here's a patch.
msg177572 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-12-15 23:01
Can someone check if there is no other similar regression (introduced
by the PEP 393)?

2012/12/15 Serhiy Storchaka <report@bugs.python.org>:
>
> Changes by Serhiy Storchaka <storchaka@gmail.com>:
>
>
> ----------
> stage: needs patch -> patch review
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue16688>
> _______________________________________
msg177573 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-16 00:24
I found another bug while looking through the source.

On line 495 in function SRE_COUNT:

    if (maxcount < end - ptr && maxcount != 65535)
        end = ptr + maxcount*state->charsize;

where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the length in _bytes_, not characters.

If the byte after the end of the string is 0 then you get this:

>>> # Good:
>>> re.search(r"\x00{1,3}", "a\x00\x00").span()
(1, 3)
>>> # Bad:
>>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
(1, 4)

I'll keep looking before submitting a patch.
msg177574 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-16 00:33
I found another bug while looking through the source.

On line 495 in function SRE_COUNT:

    if (maxcount < end - ptr && maxcount != 65535)
        end = ptr + maxcount*state->charsize;

where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the length in _bytes_, not characters.

If the byte after the end of the string is 0 then you get this:

>>> # Good:
>>> re.search(r"\x00{1,3}", "a\x00\x00").span()
(1, 3)
>>> # Bad:
>>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
(1, 4)

I'll keep looking before submitting a patch.
msg177576 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-16 01:04
I haven't found any other issues, so here's the second patch.
msg177580 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-16 08:29
The patches LGTM. How about adding a test?
msg177614 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-16 17:33
Here are some tests for the issue.
msg177616 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-16 18:08
The second test pass on unpatched Python.
msg177618 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-16 18:48
Oops! :-( Now corrected.
msg177620 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-16 19:02
LGTM.

Matthew, can you please submit a contributor form?

http://python.org/psf/contrib/contrib-form/
http://python.org/psf/contrib/
msg178540 - (view) Author: Roundup Robot (python-dev) Date: 2012-12-29 21:45
New changeset 44a4f9289faa by Serhiy Storchaka in branch '3.3':
Issue #16688: Fix backreferences did make case-insensitive regex fail on non-ASCII strings.
http://hg.python.org/cpython/rev/44a4f9289faa

New changeset c59ee1ff6f27 by Serhiy Storchaka in branch 'default':
Issue #16688: Fix backreferences did make case-insensitive regex fail on non-ASCII strings.
http://hg.python.org/cpython/rev/c59ee1ff6f27
msg178541 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-29 21:50
Fixed. Thank you for a patch, Matthew. I hope to see more your patches.
msg178562 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-12-30 09:22
I think you will, Matthew being MRAB on the mailing lists :)
History
Date User Action Args
2012-12-30 09:22:27georg.brandlsetnosy: + georg.brandl
messages: + msg178562
2012-12-29 21:50:41serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg178541

stage: commit review -> resolved
2012-12-29 21:45:31python-devsetnosy: + python-dev
messages: + msg178540
2012-12-29 21:11:41serhiy.storchakasetassignee: serhiy.storchaka
2012-12-16 19:02:31serhiy.storchakasetmessages: + msg177620
stage: patch review -> commit review
2012-12-16 18:48:59mrabarnettsetfiles: + issue16688#3.patch

messages: + msg177618
2012-12-16 18:48:07mrabarnettsetfiles: - issue16688#3.patch
2012-12-16 18:08:15serhiy.storchakasetmessages: + msg177616
2012-12-16 17:33:38mrabarnettsetfiles: + issue16688#3.patch

messages: + msg177614
2012-12-16 08:29:53serhiy.storchakasetmessages: + msg177580
2012-12-16 01:04:57mrabarnettsetfiles: + issue16688#2.patch

messages: + msg177576
2012-12-16 00:33:54mrabarnettsetmessages: + msg177574
2012-12-16 00:24:21mrabarnettsetmessages: + msg177573
2012-12-15 23:01:50hayposetmessages: + msg177572
2012-12-15 21:41:42serhiy.storchakasetstage: needs patch -> patch review
2012-12-15 20:00:26mrabarnettsetfiles: + issue16688.patch
keywords: + patch
messages: + msg177556
2012-12-15 08:37:26serhiy.storchakasetkeywords: + easy

messages: + msg177532
stage: needs patch
2012-12-15 00:41:26mrabarnettsetmessages: + msg177523
2012-12-14 23:47:59Arfreversetnosy: + Arfrever
2012-12-14 22:30:04ezio.melottisetmessages: + msg177519
versions: + Python 3.4
2012-12-14 22:21:17hayposetnosy: + serhiy.storchaka
2012-12-14 22:20:52hayposetnosy: + haypo
2012-12-14 22:19:33pyoscreate