classification
Title: Optimize str.translate() for replacement with substrings and non-ASCII strings
Type: performance Stage: patch review
Components: Versions: Python 3.5
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, josh.r, serhiy.storchaka, sir-sigurd, vstinner
Priority: normal Keywords:

Created on 2014-04-07 09:10 by vstinner, last changed 2018-10-01 10:26 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
translate_script.py vstinner, 2014-04-07 09:10
Messages (6)
msg215677 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-07 09:10
In issue #21118, I optimized str.translate() in Python 3.5 for ASCII 1:1 mapping and ASCII deletion. My optimization is not used if a character is replaced with a string (ex: "abc".translate({ord('a'): "xxx"})) and for non-ASCII strings.

translate_script.py is a simple benchmark for 1:1 mapping. It should be enhanced to benchmark also replacement strings.
msg215681 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-07 09:24
codecs.charmap_build() (PyUnicode_BuildEncodingMap()) creates a C array ("a three-level trie") for fast lookup. It is used with codecs.charmap_encode() for 8-bit encodings. We may reuse it.
msg218233 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-05-10 19:43
Aren't there similar benchmarks in the benchmarks repo?
If not, would it be reasonable to add this there?
msg253002 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-14 16:20
This issue was more a reminder for myself, and I'm not more interested to implement this optimization. The method is already fast enough.
msg253026 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-10-15 02:06
I actually have a patch (still requires a little cleanup) that makes translations for non-ASCII and 1-n translations substantially faster. I've been delaying posting it largely because it makes significant changes to str.maketrans so it returns a special mapping that can be used far more efficiently than Python dicts. The effects of this are:

1. str.maketrans takes a little longer to run (when mappings are defined outside the latin-1 range, it takes about 6x as much time), and technically, the runtime is unbounded. I'm using "Perfect Hashing" to make a chaining free lookup table, but this involves randomly generating the parameters until they produce a collision free set of mappings; the number of rounds of generation is probabilistically very small (IIRC, for pathological cases, you'd still have a >50% chance of success for any random set of parameters, so the odds of failing to map after more than a dozen or so attempts is infinitesimal)
2. The resulting object, while it obeys the contract for collections.abc.Mapping, is not a dict, nor is it mutable, which would be a backwards incompatible change.

Under the current design, the mapping uses ~2x the space as the old dict (largely because it actually stores the dict internally to preserve references and simplify basic lookups).

In exchange for the longer time to do str.maketrans and the slightly higher memory, it provides:

1. Improved runtime for ASCII->Unicode (and vice-versa) of roughly 15-20x
2. Similar improvements for 1-n translations (regardless of whether non-ASCII is involved)
3. In general, much more consistent translation performance; the variance based on the contents of the mapping and the contents of the string is much lower, making it behave more like the old Py2 str.translate (and Py3 bytes.translate); translation is almost always faster than any other approach, instead of being a pessimization.

I don't know how to float changes that would make fairly substantial changes to existing APIs though, so I'm not sure how to proceed. I'd like translation to be beneficial (the optimization made in #21118 didn't actually improve my use case of stripping diacritics to convert to ASCII equivalent characters from latin-1 and related characters), but I have no good solutions that don't mess around with the API (I'd considered trying to internally cache "compiled" translation tables like the re module does, but the tables are mutable dicts, so caching can't be based on identity, and can't use the dicts as keys, which makes it difficult).
msg326788 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-10-01 10:26
> I actually have a patch (...)

Please open a new issue, since that one is closed. You can reference this issue from your new issue.
History
Date User Action Args
2018-10-01 10:26:58vstinnersetmessages: + msg326788
2018-09-30 17:30:19sir-sigurdsetnosy: + sir-sigurd
2015-10-15 02:06:04josh.rsetmessages: + msg253026
2015-10-14 16:20:20vstinnersetstatus: open -> closed
resolution: out of date
messages: + msg253002
2014-05-10 19:43:26ezio.melottisetnosy: + ezio.melotti

messages: + msg218233
stage: patch review
2014-04-07 21:00:01josh.rsetnosy: + josh.r
2014-04-07 09:24:43vstinnersetmessages: + msg215681
2014-04-07 09:10:44vstinnercreate