Title: str.translate() unexpectedly duplicates characters
msg261049 - (view) Author: Ben Knight (ben.knight) Date: 2016-03-01 13:51
Python 3.5.1 x86-64, Windows 10

I created a translation map that translated some characters to None and others to strings and found that in some cases str.translate() will duplicate one of the untranslated characters in the returned string.

How to reproduce:

table = str.maketrans({'a': None, 'b': 'cd'})

Expected result:


Actual result:


Mapping 'a' to '' instead of None will produce the desired effect.
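
One way to compare the two spellings of deletion is to build both tables and translate the same string. This is only a minimal sketch: the sample string "axb" and the helper name translate_both are assumptions for illustration and do not come from the report. On an interpreter without the bug, both tables should give the same result.

```python
# Two tables that should behave identically: 'a' is deleted, 'b' becomes 'cd'.
delete_with_none = str.maketrans({"a": None, "b": "cd"})
delete_with_empty = str.maketrans({"a": "", "b": "cd"})


def translate_both(text):
    """Translate text with both tables and return the two results."""
    return text.translate(delete_with_none), text.translate(delete_with_empty)


# Hypothetical sample input, chosen so that a deleted character comes first
# and a one-to-many replacement comes later.
with_none, with_empty = translate_both("axb")
print(repr(with_none), repr(with_empty))
# On an unaffected interpreter both values are 'xcd'.
```

If the two values differ, the None table is hitting the behaviour described in this report.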
msg261059 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-03-01 16:31
It duplicates translated characters as well. For example:

    >>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
    >>> 'aaaaaamnopqrb'.translate(table)

3.4 returns the correct result:

    >>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
    >>> 'aaaaaamnopqrb'.translate(table)

The problem is the new fast path for one-to-one ASCII mapping (unicode_fast_translate in Objects/unicodeobject.c) doesn't have a way to return the current input position in order to resume processing the translation. _PyUnicode_TranslateCharmap assumes it's the same as the current writer position, which is wrong when input characters have been deleted.
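
The mix-up between input and writer positions can be modelled in a few lines of pure Python. This is a simplified sketch and not the C code: fast_path, slow_path and translate_model are hypothetical names, and the resume_at_writer_pos flag exists only to switch between the wrong and the right resume position.

```python
def fast_path(text, table):
    """Model of an ASCII-only fast path.

    Handles deletions and one-to-one ASCII replacements, and stops at the
    first character it cannot handle. Returns the output so far and the
    number of input characters consumed.
    """
    out = []
    consumed = 0
    for ch in text:
        target = table.get(ord(ch), ch)
        if target is None:  # deletion: input advances, output does not
            consumed += 1
            continue
        if isinstance(target, int):
            target = chr(target)
        if ord(ch) >= 128 or len(target) != 1 or ord(target) >= 128:
            break  # needs the general (slow) path
        out.append(target)
        consumed += 1
    return out, consumed


def slow_path(text, table, start, out):
    """Model of the general path, resuming at input index start."""
    for ch in text[start:]:
        target = table.get(ord(ch), ch)
        if target is None:
            continue
        out.append(chr(target) if isinstance(target, int) else target)
    return "".join(out)


def translate_model(text, table, resume_at_writer_pos):
    out, consumed = fast_path(text, table)
    # Resuming at len(out) (the writer position) is only correct when
    # nothing was deleted; the input position is what is actually needed.
    start = len(out) if resume_at_writer_pos else consumed
    return slow_path(text, table, start, out)


table = str.maketrans({"a": None, "b": "cd"})
print(translate_model("axb", table, resume_at_writer_pos=True))   # xxcd
print(translate_model("axb", table, resume_at_writer_pos=False))  # xcd
```

In this model, every character deleted by the fast path moves the resume point one character too early, so characters that were already written get processed a second time.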
msg261064 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-01 19:54
Oh... I see. It's a bug introduced by the ASCII fast path, unicode_fast_translate(), which handles replacing one ASCII character with another ASCII character or deleting a character. See change cca6b056236a of issue #21118.

There is confusion in the code between the input and output positions. "i = writer.pos;" is used in the caller to continue when unicode_fast_translate() was interrupted (because a translation uses a non-ASCII character or a string longer than 1 character), but writer.pos is the position in the *output* string, not in the *input* string :-/

I see that I added unit tests on translate, but they lack a test for fast translation that starts with an ignored (deleted) character and then switches to regular translation.

Attached patch should fix the issue. It adds unit tests.
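
A regression test for this case could look roughly like the sketch below. It is not the test from the attached patch: the class and method names are hypothetical, and the inputs are the tables quoted earlier in this issue plus an assumed sample string "axb". Each case starts with deleted characters and then reaches a mapping that the fast path cannot handle.

```python
import unittest


class TranslateDeleteThenSlowPathTest(unittest.TestCase):
    """Hypothetical test case, not the one added by the patch."""

    def test_delete_then_multichar_replacement(self):
        # 'a' is deleted first, then 'b' maps to a two-character string.
        table = str.maketrans({"a": None, "b": "cd"})
        self.assertEqual("axb".translate(table), "xcd")

    def test_delete_then_non_ascii_replacement(self):
        # Leading 'a' characters are deleted, then 'b' maps to U+0100.
        table = str.maketrans("mnopqrb", "rqponm\u0100", "a")
        self.assertEqual("aaaaaamnopqrb".translate(table), "rqponm\u0100")
```

Saved to a file, this can be run with "python -m unittest" followed by the file name.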
msg261065 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-01 19:55
> See change cca6b056236a of issue #21118.

The bug was introduced in Python v3.5.0a1.
msg261069 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-03-01 20:24
msg261070 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-03-01 20:31
New changeset 27ba9ba5deb1 by Victor Stinner in branch '3.5':
Fix str.translate()
msg261071 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-01 20:33

Thanks for the review. I pushed my fix.

Sorry for the regression, I hate being responsible for a regression in a core feature :-/

It may even deserve a release, but Python doesn't have the habit of "release often" yet :-(
msg261072 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-03-01 21:08
New changeset 6643c5cc9797 by Victor Stinner in branch '3.5':
Issue #26464: Fix unicode_fast_translate() again
