Message215283
String translation functions, at least in other languages, are typically a performance optimization for when you *really* need the speed. In exchange for the memory cost and the one-time work of building a translation table, you get much faster one-to-one mapping and deletion of characters.
In Python 3, str.translate is usually a performance pessimization, not an optimization. It does do a few things that similar functions (e.g. str.replace) can't: it can perform a translation where the inputs and outputs overlap ('abc' -> 'bca' wouldn't work with chained str.replace calls), and it handles the full Unicode range ('\u0f01\u0f02' -> '\u0f03\u0f04' can't be done by encoding, using bytes.translate, then decoding again). But whenever another approach is available, str.translate is almost always the slowest.
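To make the overlap point concrete, here is a small illustration (the exact strings are just an example I picked): chained str.replace calls clobber characters produced by the earlier calls, while str.translate maps every character in a single pass.

s = 'abcabc'
# Chained str.replace: 'a'->'b', then 'b'->'c', then 'c'->'a' corrupts the
# characters that the earlier replace calls just produced.
print(s.replace('a', 'b').replace('b', 'c').replace('c', 'a'))  # 'aaaaaa'
# str.translate performs the rotation correctly in one pass.
print(s.translate(str.maketrans('abc', 'bca')))                 # 'bcabca'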
While it may not be practical to make the general case for str.translate match the performance of bytes.translate, it should at least be possible to improve upon it in many common cases.
My timing data (see attached file for how it was produced):
$ python.exe translate_timing.py
Testing 1-1 translation
bytes.translate 0.05638575777521804
str.translate 2.9639258423152874
str.translate from bytes trans 1.6307581351818357
str.encode->bytes.translate->bytes.decode 0.09472254504689914
str.replace 0.2507745649580393
Testing deletion
bytes.translate 0.05367317344084199
str.translate 2.360142494240577
str.encode->bytes.translate->bytes.decode 0.0629345277274993
str.replace 0.31173443413690904
Testing enlarging translations
str.translate 2.33631417640283
str.replace 0.41248187325160757
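The attached file isn't reproduced here; the following is a minimal sketch of the kind of timeit harness that could produce numbers like the above. The test string, the table contents, and the repeat count are assumptions made for illustration, not the values used for the figures.

import timeit

setup = """
s = 'abcdefghijklmnopqrstuvwxyz' * 1000
b = s.encode('ascii')
str_table = str.maketrans('abc', 'xyz')
bytes_table = bytes.maketrans(b'abc', b'xyz')
"""

tests = [
    ('bytes.translate', 'b.translate(bytes_table)'),
    ('str.translate', 's.translate(str_table)'),
    ('str.translate from bytes trans', 's.translate(bytes_table)'),
    ('str.encode->bytes.translate->bytes.decode',
     "s.encode('ascii').translate(bytes_table).decode('ascii')"),
    ('str.replace',
     "s.replace('a', 'x').replace('b', 'y').replace('c', 'z')"),
]

for name, stmt in tests:
    print(name, timeit.timeit(stmt, setup=setup, number=1000))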
Obviously, this is somewhat contrived, but it illustrates my point that the common cases (ASCII or latin-1 strings) are far too slow. In every case, str.translate loses to every competitor, including itself when passed a translation table produced by bytes.maketrans, and including a pure Python function that calls str.replace over and over.
A large part of the problem is clearly the use of a dict to perform lookups; as seen in the 1-1 test case, str.translate using the result of str.maketrans takes nearly twice as long as when it uses the result of an equivalent bytes.maketrans. While I have not profiled the code, I suspect the rest comes from a combination of the overhead of converting each Py_UCS4 to a PyLong for the general-purpose lookup, handling the multiple ways a translation target can be expressed (a Unicode ordinal or a str object), and the code that checks for and resizes the output buffer on each character.
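The difference in table representation is easy to see interactively: str.maketrans builds a dict keyed by code points, while bytes.maketrans builds a flat 256-byte table that str.translate will also accept (for ordinals below 256), since indexing a bytes object returns a valid ordinal.

str_table = str.maketrans('abc', 'xyz')
print(str_table)         # {97: 120, 98: 121, 99: 122} -- a dict lookup per character
bytes_table = bytes.maketrans(b'abc', b'xyz')
print(len(bytes_table))  # 256 -- direct indexing, no hashing
print('abc'.translate(bytes_table))  # 'xyz'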
I have ideas for a solution involving a special-purpose translation object to be produced by str.maketrans for cases where the translation can only produce an equal or smaller string (every target value is None or a string of length 0 or 1) and the source ordinals all fall in a small enough range (say, 1024) to avoid excessive memory usage from large translation table objects. But I'd like to know whether such a change (including modifying str.maketrans to produce an unspecified mapping object instead of the current dict) is desired, acceptable, what have you.
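To show the shape of what I mean, here is a rough pure-Python sketch of such a table: a flat list indexed by code point instead of a dict. The class name, the _UNMAPPED sentinel, and the 1024 cutoff constant are hypothetical; the real change would live in C behind str.maketrans and str.translate.

_UNMAPPED = object()  # distinguishes "leave alone" from "delete" (None)

class FlatTranslationTable:
    MAX_ORDINAL = 1024  # cutoff mentioned above to bound table memory

    def __init__(self, mapping):
        # mapping: code point -> code point, str of length <= 1, or None (delete)
        if max(mapping) >= self.MAX_ORDINAL:
            raise ValueError('source ordinal too large for a flat table')
        self._table = [_UNMAPPED] * (max(mapping) + 1)
        for src, dst in mapping.items():
            self._table[src] = dst

    def __getitem__(self, ordinal):
        # IndexError is a LookupError, so str.translate leaves out-of-range
        # characters untouched; do the same for unmapped in-range ordinals.
        value = self._table[ordinal]
        if value is _UNMAPPED:
            raise LookupError(ordinal)
        return value

table = FlatTranslationTable(str.maketrans('ab', 'xy', 'c'))
print('abcdef'.translate(table))  # 'xydef' -- 'c' deleted, 'def' untouched

At the Python level an object like this is of course slower than the dict it wraps; the point is only to sketch the flat layout a C implementation could index directly, skipping the PyLong boxing and dict hashing.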