classification
Title: TODO in tailmatch(): it does not support backward in all cases
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.3, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: loewis, python-dev, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2012-10-19 00:51 by vstinner, last changed 2013-01-03 13:38 by vstinner. This issue is now closed.

Messages (9)
msg173301 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-19 00:51
Oh oh, it looks like the implementation of tailmatch() was not finished:

        /* If both are of the same kind, memcmp is sufficient */
        if (kind_self == kind_sub) {
            return ...;
        }
        /* otherwise we have to compare each character by first accesing it */
        else {
            /* We do not need to compare 0 and len(substring)-1 because
               the if statement above ensured already that they are equal
               when we end up here. */
            /* TODO: honor direction and do a forward or backwards search */
            for (i = 1; i < end_sub; ++i) {
                if (PyUnicode_READ(kind_self, data_self, offset + i) !=
                    PyUnicode_READ(kind_sub, data_sub, i))
                    return 0;
            }
            return 1;
        }
msg174441 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-01 19:08
The result does not depend on the direction of comparison. This only affects speed. But who can to say in which direction comparison will be faster?

Here I see a one obvious opportunity for optimization:

    if (kind_self < kind_sub)
        return 0;

After that and after processing the case (kind_self == kind_sub) only 3 special cases left: UCS1 in UCS2, UCS1 in UCS4, and UCS2 in UCS4.  Get rid of slow PyUnicode_READ() for this cases will speed up the code.  Also note that comparing first and last characters before memcmp can be a slowdown (because PyUnicode_READ() is slow).  Try to compare first and last bytes.
msg174847 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-04 23:51
Oh, PyUnicode_Tailmatch() documentation doesn't mention that the function can fail.
msg174884 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-05 09:22
> Oh, PyUnicode_Tailmatch() documentation doesn't mention that the function
> can fail.

But it does.

.. c:function:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
                        Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.
msg174893 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-05 11:36
>> Oh, PyUnicode_Tailmatch() documentation doesn't mention that the function
>> can fail.
>
> But it does.
>
> .. c:function:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
>                         Py_ssize_t start, Py_ssize_t end, int direction)
>
>    Return 1 if *substr* matches ``str[start:end]`` at the given tail end
>    (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
>    0 otherwise. Return ``-1`` if an error occurred.

Oh, I read the "documentation" in unicodeobject.h:

/* Return 1 if substr matches str[start:end] at the given tail end, 0
   otherwise. */

The problem is that tailmatch() returns 0 if PyUnicode_READY() failed.
It is a bug, even if it cannot occur because PyUnicode_Tailmatch()
checks that the both strings are ready.
msg178901 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-01-03 02:20
New changeset 49eb2488145d by Victor Stinner in branch 'default':
Close #16281: handle tailmatch() failure and remove useless comment
http://hg.python.org/cpython/rev/49eb2488145d
msg178902 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-01-03 02:21
"Here I see a one obvious opportunity for optimization: ..."

@Serhiy: Can you please open a new issue for this? I consider the issue as fixed: I just removed the TODO (for the reason explained in the changeset).
msg178941 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-03 12:43
Shouldn't this be applied to 3.3?

As for optimization, I made some benchmarks and didn't saw any significant difference. Usually this function used to check short ASCII heads and tails and any optimization will not be seen even under a microscope.
msg178944 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-01-03 13:38
> Shouldn't this be applied to 3.3?

It's just a cleanup, it doesn't fix any real bug. I prefer to not
pollute old versions with cleanup.

> As for optimization, I made some benchmarks and didn't saw any significant difference. Usually this function used to check short ASCII heads and tails and any optimization will not be seen even under a microscope.

Ok, agreed.
History
Date User Action Args
2013-01-03 13:38:55vstinnersetmessages: + msg178944
2013-01-03 12:43:01serhiy.storchakasetmessages: + msg178941
2013-01-03 02:21:46vstinnersetmessages: + msg178902
2013-01-03 02:20:27python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg178901

resolution: fixed
stage: needs patch -> resolved
2012-11-15 20:29:14serhiy.storchakasetcomponents: + Interpreter Core
stage: needs patch
2012-11-05 11:36:08vstinnersetmessages: + msg174893
2012-11-05 09:22:00serhiy.storchakasetmessages: + msg174884
2012-11-04 23:51:09vstinnersetmessages: + msg174847
2012-11-01 19:08:04serhiy.storchakasetmessages: + msg174441
2012-10-19 00:51:23vstinnercreate