Message 171413 - Python tracker

Author	vstinner
Recipients	BreamoreBoy, ezio.melotti, serhiy.storchaka, thomaslee, vstinner
Date	2012-09-28.08:28:00
Message-id	<CAMpsgwah4_2G9nr6VYpjTarTqj0zwv6DUYU7gXSBd4c9Nuo8Lw@mail.gmail.com>
In-reply-to	<1348814388.88.0.637801900455.issue16061@psf.upfronthosting.co.za>

Content
Python 3.3 is 2x faster than Python 3.2 at replacing one character
with another when the string contains that character only 3 times.
This is not acceptable, Python 3.3 must be as slow as Python 3.2!

$ python3.2 -m timeit "ch='é'; sp=' '*1000; s = ch+sp+ch+sp+ch; after='à'; s.replace(ch, after)"
100000 loops, best of 3: 3.62 usec per loop
$ python3.3 -m timeit "ch='é'; sp=' '*1000; s = ch+sp+ch+sp+ch; after='à'; s.replace(ch, after)"
1000000 loops, best of 3: 1.36 usec per loop

$ python3.2 -m timeit "ch='€'; sp=' '*1000; s = ch+sp+ch+sp+ch; after='Ł'; s.replace(ch, after)"
100000 loops, best of 3: 3.15 usec per loop
$ python3.3 -m timeit "ch='€'; sp=' '*1000; s = ch+sp+ch+sp+ch; after='Ł'; s.replace(ch, after)"
1000000 loops, best of 3: 1.91 usec per loop
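
The same measurement can be scripted with the timeit module, which makes
it easier to run under several interpreters. This is only a sketch: the
bench() helper and its parameters are hypothetical names, it needs
Python 3.5+ for the globals argument, and unlike the command lines above
it builds the string outside the timed statement, so absolute numbers
will differ from the ones quoted here.

```python
import timeit


def bench(ch, after, number=100_000, repeat=3):
    """Return the best time per call, in microseconds, of s.replace(ch, after).

    The test string has the same shape as in the command lines:
    the character, 1000 spaces, the character, 1000 spaces, the character.
    """
    sp = " " * 1000
    s = ch + sp + ch + sp + ch
    timer = timeit.Timer(
        "s.replace(ch, after)",
        globals={"s": s, "ch": ch, "after": after},
    )
    # timeit reports the minimum of the repeats ("best of 3").
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


if __name__ == "__main__":
    for ch, after in (("é", "à"), ("€", "Ł")):
        print("%r -> %r: %.2f usec per loop" % (ch, after, bench(ch, after)))
```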

More seriously, I changed the algorithm of str.replace(before, after)
for the case where before and after are both a single character:
changeset c802bfc8acfc. The code now uses the heavily optimized
findchar() function. PyUnicode_READ() is slow and should be avoided
when possible: the PyUnicode_READ() macro expands to 2 "if" tests,
whereas findchar() directly uses a pointer of the right type
(Py_UCS1*, Py_UCS2* or Py_UCS4*).

In Python 3.2, the code looks like:

            for (i = 0; i < u->length; i++) {
                if (u->str[i] == u1) {
                    if (--maxcount < 0)
                        break;
                    u->str[i] = u2;
                }
            }

In Python 3.3, the code looks like:

            pos = findchar(sbuf, PyUnicode_KIND(self), slen, u1, 1);
            if (pos < 0)
                goto nothing;
            ...
            while (--maxcount)
            {
                pos++;
                src += pos * PyUnicode_KIND(self);
                slen -= pos;
                index += pos;
                pos = findchar(src, PyUnicode_KIND(self), slen, u1, 1);
                if (pos < 0)
                    break;
                PyUnicode_WRITE(rkind, PyUnicode_DATA(u), index + pos, u2);
            }
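
The difference between the two loops can be illustrated at the Python
level. This is a sketch of the idea only, not the CPython code:
replace_per_index() and replace_with_find() are hypothetical names, and
str.find() stands in for findchar() as the "fast search that jumps to
the next match". The first function tests every position, the second
only touches the positions where a match was found.

```python
def replace_per_index(s, old, new, maxcount=-1):
    """3.2-style loop: compare every position against the old character."""
    if maxcount < 0:
        maxcount = len(s)  # negative count means "replace all"
    out = list(s)
    for i, c in enumerate(out):
        if c == old:
            if maxcount == 0:
                break
            maxcount -= 1
            out[i] = new
    return "".join(out)


def replace_with_find(s, old, new, maxcount=-1):
    """3.3-style loop: let a fast search jump from one match to the next."""
    if maxcount < 0:
        maxcount = len(s)
    pos = s.find(old)
    if pos < 0 or maxcount == 0:
        return s  # nothing to replace
    out = list(s)
    while pos >= 0 and maxcount > 0:
        out[pos] = new
        maxcount -= 1
        # Resume the search just after the previous match.
        pos = s.find(old, pos + 1)
    return "".join(out)
```

With a string like the one in the benchmark (3 matches among about 2000
characters), the second variant runs its loop body 3 times and leaves
the scanning to the search routine, which is where the speedup comes
from.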

History
Date	User	Action	Args
2012-09-28 08:28:01	vstinner	set	recipients: + vstinner, thomaslee, ezio.melotti, BreamoreBoy, serhiy.storchaka
2012-09-28 08:28:01	vstinner	link	issue16061 messages
2012-09-28 08:28:00	vstinner	create