Message 253026 - Python tracker



Author	josh.r
Recipients	ezio.melotti, josh.r, serhiy.storchaka, vstinner
Date	2015-10-15.02:06:01
Message-id	<1444874764.17.0.265663055298.issue21165@psf.upfronthosting.co.za>
In-reply-to

Content

I actually have a patch (still requires a little cleanup) that makes translations for non-ASCII and 1-n translations substantially faster. I've been delaying posting it largely because it makes significant changes to str.maketrans so it returns a special mapping that can be used far more efficiently than Python dicts. The effects of this are:

1. str.maketrans takes longer to run (when mappings are defined outside the latin-1 range, it takes about 6x as much time), and technically, the runtime is unbounded. I'm using "Perfect Hashing" to make a chaining-free lookup table, but this involves randomly generating the parameters until they produce a collision-free set of mappings; the expected number of rounds of generation is very small (IIRC, for pathological cases, you'd still have a >50% chance of success for any random set of parameters, so the odds of failing to map after more than a dozen or so attempts are infinitesimal).
2. The resulting object, while it obeys the contract for collections.abc.Mapping, is not a dict, nor is it mutable, which would be a backwards-incompatible change.
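
To make the retry loop in point 1 concrete, here is a minimal sketch of one textbook way to build a collision-free table. It is not the patch itself: the hash family ((a*k + b) mod p) mod m, the quadratic table size, and the names build_perfect_table and lookup are all assumptions chosen for illustration. With a table of n*n slots for n keys, each random (a, b) pair succeeds with probability above 1/2, so a dozen consecutive failures would happen less than once in 4096 builds.

```python
import random

# Mersenne prime, comfortably larger than the highest code point (0x10FFFF).
_PRIME = 2**31 - 1
_MISSING = object()


def build_perfect_table(mapping, rng=None, max_rounds=64):
    """Retry random hash parameters until no two keys share a slot.

    Illustrative only: uses n*n slots for n keys, which trades memory
    for a >50% chance that any single attempt is collision-free.
    """
    rng = rng or random.Random()
    keys = list(mapping)
    size = max(1, len(keys) ** 2)
    for rounds in range(1, max_rounds + 1):
        a = rng.randrange(1, _PRIME)
        b = rng.randrange(0, _PRIME)
        slots = [_MISSING] * size
        for key in keys:
            idx = ((a * key + b) % _PRIME) % size
            if slots[idx] is not _MISSING:
                break  # collision: discard these parameters and retry
            slots[idx] = (key, mapping[key])
        else:
            return a, b, slots, rounds
    raise RuntimeError("no collision-free parameters found")


def lookup(table, codepoint, default=None):
    """Single probe, no chaining: hash, then confirm the stored key."""
    a, b, slots, _ = table
    entry = slots[((a * codepoint + b) % _PRIME) % len(slots)]
    if entry is not _MISSING and entry[0] == codepoint:
        return entry[1]
    return default


# str.maketrans converts single-character keys to integer ordinals.
table = build_perfect_table(str.maketrans({"é": "e", "ß": "ss", "ø": "o"}))
print(lookup(table, ord("ß")))       # ss
print(lookup(table, ord("x"), "x"))  # x (unmapped, falls back to default)
```

The max_rounds cap is only a safety net for the sketch; the unbounded runtime mentioned above corresponds to looping without it.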

Under the current design, the mapping uses ~2x the space of the old dict (largely because it actually stores the dict internally to preserve references and simplify basic lookups).

In exchange for the longer time to do str.maketrans and the higher memory use, it provides:

1. Roughly 15-20x faster translation for ASCII->Unicode (and vice versa)
2. Similar improvements for 1-n translations (regardless of whether non-ASCII is involved)
3. In general, much more consistent translation performance; the variance based on the contents of the mapping and the contents of the string is much lower, making it behave more like the old Py2 str.translate (and Py3 bytes.translate); translation is almost always faster than any other approach, instead of being a pessimization.
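
For readers unfamiliar with the two cases in the list above, this small example shows them with the existing API (nothing here depends on the patch). The sample strings and table contents are made up for illustration.

```python
# 1-n translation: a single character maps to a multi-character string.
one_to_n = str.maketrans({"ß": "ss", "æ": "ae", "é": "e"})
print("Straße café".translate(one_to_n))  # Strasse cafe

# ASCII -> non-ASCII translation via the two-argument form.
ascii_to_unicode = str.maketrans("ae", "äé")
print("banane".translate(ascii_to_unicode))  # bänäné

# In current CPython the table is a plain, mutable dict keyed by ordinals.
print(type(one_to_n) is dict)  # True
```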

I don't know how to propose a patch that makes fairly substantial changes to existing APIs though, so I'm not sure how to proceed. I'd like translation to be beneficial (the optimization made in #21118 didn't actually improve my use case of stripping diacritics to convert latin-1 and related characters to their ASCII equivalents), but I have no good solutions that don't mess around with the API (I'd considered trying to internally cache "compiled" translation tables like the re module does, but the tables are mutable dicts, so caching can't be based on identity, and the dicts can't be used as cache keys, which makes it difficult).
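
The caching problem can be seen in a few lines. This is a sketch only: the compiled() helper and its identity-keyed cache are hypothetical, and copying the dict stands in for whatever expensive "compiled" form an implementation might build.

```python
table = str.maketrans({"é": "e"})

# A dict is unhashable, so it cannot serve as a cache key directly.
try:
    hash(table)
    hashable = True
except TypeError:
    hashable = False
print(hashable)  # False

# Hypothetical identity-keyed cache of "compiled" tables.
_cache = {}


def compiled(tbl):
    key = id(tbl)
    if key not in _cache:
        _cache[key] = dict(tbl)  # stand-in for an expensive compile step
    return _cache[key]


compiled(table)          # first use populates the cache
table[ord("ü")] = "u"    # caller mutates the same dict afterwards

# Same identity, so the cache returns the stale snapshot without "ü".
print(ord("ü") in table)            # True
print(ord("ü") in compiled(table))  # False
```

The re module avoids this because pattern strings are immutable and hashable, which is exactly what the dict-based tables lack.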

History
Date	User	Action	Args
2015-10-15 02:06:04	josh.r	set	recipients: + josh.r, vstinner, ezio.melotti, serhiy.storchaka
2015-10-15 02:06:04	josh.r	set	messageid: <1444874764.17.0.265663055298.issue21165@psf.upfronthosting.co.za>
2015-10-15 02:06:04	josh.r	link	issue21165 messages
2015-10-15 02:06:01	josh.r	create