Message 215301 - Python tracker


Author	vstinner
Recipients	ezio.melotti, josh.r, serhiy.storchaka, vstinner
Date	2014-04-01.08:23:34
Message-id	<1396340617.89.0.519156771668.issue21118@psf.upfronthosting.co.za>
In-reply-to

Content
str.translate() currently allocates a buffer of UCS4 characters.

translate_writer.patch:
- modify _PyUnicode_TranslateCharmap() to use the _PyUnicodeWriter API
- drop optimizations for error handlers other than "ignore" because there are no unit tests for them, and str.translate() uses "ignore". It's safer to drop untested optimizations.
- also clean up the code: charmaptranslate_output() is now responsible for handling the charmaptranslate_lookup() result (to decrement the reference counter)

str.translate() may be a little bit faster when translating ASCII to ASCII for large strings, but not by much.

bytes.translate() is much faster because it builds a C array of 256 items for fast table lookup, whereas str.translate() requires a Python dict lookup for each character, which is much slower.
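
To illustrate the difference in table types at the Python level (a minimal sketch, not part of the patch; the sample strings are made up for the example): bytes.maketrans() returns a flat 256-byte table, while str.maketrans() returns a dict keyed by code point.

```python
# bytes: the translation table is a flat bytes object of exactly 256 entries,
# indexed directly by byte value.
byte_table = bytes.maketrans(b"abc", b"xyz")
assert len(byte_table) == 256
result_bytes = b"aabbcc".translate(byte_table)

# str: the translation table is a dict mapping code points to replacements,
# so each character requires a mapping lookup.
str_table = str.maketrans("abc", "xyz")
assert isinstance(str_table, dict)
result_str = "aabbcc".translate(str_table)

print(result_bytes)  # b'xxyyzz'
print(result_str)    # xxyyzz
```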

codecs.charmap_build() (PyUnicode_BuildEncodingMap()) creates a C array ("a three-level trie") for fast lookup. It is used with codecs.charmap_encode() for 8-bit encodings. We may reuse it for simple cases, like translating ASCII to ASCII.
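
For reference, here is roughly how the existing encoding map is used from Python today (a sketch only: the swap-case table is an invented example, and this shows charmap encoding to bytes, not the proposed reuse inside str.translate()). The table passed to codecs.charmap_build() is a decoding table, so the resulting map sends each character to the index at which it appears.

```python
import codecs

# Hypothetical 256-entry decoding table: byte i decodes to chr(i) with the
# case swapped for ASCII, identity for everything else.
decoding_table = "".join(
    chr(i).swapcase() if i < 128 else chr(i) for i in range(256)
)

# Build the character -> byte lookup structure from the decoding table.
encoding_table = codecs.charmap_build(decoding_table)

# charmap_encode() returns (encoded bytes, number of characters consumed).
encoded, consumed = codecs.charmap_encode("Hello", "strict", encoding_table)
print(encoded, consumed)  # b'hELLO' 5
```

Note that the result is a bytes object, so reusing this structure for str.translate() would still need a path that produces a str.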

History
Date	User	Action	Args
2014-04-01 08:23:37	vstinner	set	recipients: + vstinner, ezio.melotti, serhiy.storchaka, josh.r
2014-04-01 08:23:37	vstinner	set	messageid: <1396340617.89.0.519156771668.issue21118@psf.upfronthosting.co.za>
2014-04-01 08:23:37	vstinner	link	issue21118 messages
2014-04-01 08:23:34	vstinner	create