classification
Title: str.translate() unexpectedly duplicates characters
Type: behavior Stage: commit review
Components: Interpreter Core Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: vstinner Nosy List: ben.knight, eryksun, larry, python-dev, serhiy.storchaka, vstinner
Priority: release blocker Keywords: patch

Created on 2016-03-01 13:51 by ben.knight, last changed 2016-03-01 21:08 by python-dev. This issue is now closed.

Files
File name Uploaded Description
unicode_fast_translate.patch vstinner, 2016-03-01 19:54
Messages (8)
msg261049 - Author: Ben Knight (ben.knight) Date: 2016-03-01 13:51
Python 3.5.1 x86-64, Windows 10

I created a translation map that translated some characters to None and others to strings and found that in some cases str.translate() will duplicate one of the untranslated characters in the returned string.

How to reproduce:

    table = str.maketrans({'a': None, 'b': 'cd'})
    'axb'.translate(table)

Expected result:

'xcd'

Actual result:

'xxcd'

Mapping 'a' to '' instead of None will produce the desired effect.
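
The workaround in the last sentence can be applied mechanically to an existing table. The following is a minimal sketch rather than anything from the report itself: `without_none` is a hypothetical helper name, and it assumes the table is a plain dict such as the one returned by str.maketrans().

```python
def without_none(table):
    """Return a copy of a translation table with None replaced by ''.

    Hypothetical helper (not part of the standard library). Both None and
    '' delete the character; the report above states that '' gives the
    expected result on the affected version.
    """
    return {key: ('' if value is None else value)
            for key, value in table.items()}


table = str.maketrans({'a': None, 'b': 'cd'})
result = 'axb'.translate(without_none(table))
print(result)  # xcd
```

The helper leaves every other entry untouched, so it should be safe to apply to a table that mixes deletions, single characters, and longer strings.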
msg261059 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-03-01 16:31
It duplicates translated characters as well. For example:

    >>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
    >>> 'aaaaaamnopqrb'.translate(table)
    'rqponmrqponmĀ'

3.4 returns the correct result:

    >>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
    >>> 'aaaaaamnopqrb'.translate(table)
    'rqponmĀ'

The problem is the new fast path for one-to-one ASCII mapping (unicode_fast_translate in Objects/unicodeobject.c) doesn't have a way to return the current input position in order to resume processing the translation. _PyUnicode_TranslateCharmap assumes it's the same as the current writer position, which is wrong when input characters have been deleted.
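
The mix-up between writer position and input position can be illustrated with a toy model in pure Python. This is only a sketch under simplifying assumptions: `fast_path` and `translate` are hypothetical names, the real code is C in Objects/unicodeobject.c, and the model only mimics the behaviour described above (deletions and one-to-one ASCII replacements are handled by the fast path, anything else falls back to the general loop).

```python
def fast_path(text, table, out):
    """Toy stand-in for the fast path.

    Handles deletions and one-to-one ASCII replacements, appending to
    `out`. Returns the input index at which it had to stop, or len(text)
    if it consumed the whole input.
    """
    for i, ch in enumerate(text):
        repl = table.get(ord(ch), ch)
        if repl is None:            # deleted character: nothing is written
            continue
        if isinstance(repl, int):   # str.maketrans() may store ordinals
            repl = chr(repl)
        if len(repl) != 1 or ord(repl) > 127:
            return i                # needs the general path
        out.append(repl)
    return len(text)


def translate(text, table, resume_from_writer):
    """Translate `text`; optionally resume at the wrong (writer) index."""
    out = []
    stopped_at = fast_path(text, table, out)
    # Buggy variant: treat the number of characters written so far as the
    # input position. The two differ once any character has been deleted.
    start = len(out) if resume_from_writer else stopped_at
    for ch in text[start:]:
        repl = table.get(ord(ch), ch)
        if repl is None:
            continue
        out.append(chr(repl) if isinstance(repl, int) else repl)
    return ''.join(out)


table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
buggy = translate('aaaaaamnopqrb', table, resume_from_writer=True)
fixed = translate('aaaaaamnopqrb', table, resume_from_writer=False)
# buggy == 'rqponmrqponm\u0100', fixed == 'rqponm\u0100'
```

In the model, six characters are deleted before the fast path stops at 'b', so the writer index (6) lags the input index (12) and the general loop reprocesses 'mnopqr'.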
msg261064 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-01 19:54
Oh... I see. It's a bug introduced by the ASCII optimization that replaces one character with another ASCII character or deletes a character: unicode_fast_translate(). See change cca6b056236a of issue #21118.

The code confuses the input and output positions. "i = writer.pos;" is used in the caller to continue when unicode_fast_translate() was interrupted (because a translation uses a non-ASCII character or a string longer than 1 character), but writer.pos is the position in the *output* string, not in the *input* string :-/

I see that I added unit tests on translate, but they lack a test that exercises the fast translation, starting with an ignored (deleted) character and then switching to the regular translation.

Attached patch should fix the issue. It adds unit tests.
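
A regression test for the missing case could look roughly like the sketch below. It is not taken from the attached patch: the class, method, and `run_tests` names are hypothetical, and the inputs are simply the two examples reported earlier in this issue.

```python
import unittest


class TranslateFastPathTest(unittest.TestCase):
    """Hypothetical regression test; names are illustrative only."""

    def test_deletion_then_multichar_replacement(self):
        # Starts with a deleted character, then hits a replacement
        # longer than one character.
        table = str.maketrans({'a': None, 'b': 'cd'})
        self.assertEqual('axb'.translate(table), 'xcd')

    def test_deletion_then_non_ascii_replacement(self):
        # Starts with deleted characters, then hits a non-ASCII
        # replacement.
        table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
        self.assertEqual('aaaaaamnopqrb'.translate(table),
                         'rqponm\u0100')


def run_tests():
    """Run the sketch above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        TranslateFastPathTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Both cases begin with deletions so that the output position falls behind the input position before the translation leaves the fast path.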
msg261065 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-01 19:55
> See change cca6b056236a of issue #21118.

The bug was introduced in Python v3.5.0a1.
msg261069 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-03-01 20:24
LGTM.
msg261070 - Author: Roundup Robot (python-dev) (Python triager) Date: 2016-03-01 20:31
New changeset 27ba9ba5deb1 by Victor Stinner in branch '3.5':
Fix str.translate()
https://hg.python.org/cpython/rev/27ba9ba5deb1
msg261071 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-01 20:33
> LGTM.

Thanks for the review. I pushed my fix.

Sorry for the regression, I hate being responsible for a regression in a core feature :-/

It may even deserve a release, but Python doesn't have the habit of "release often" yet :-(
msg261072 - Author: Roundup Robot (python-dev) (Python triager) Date: 2016-03-01 21:08
New changeset 6643c5cc9797 by Victor Stinner in branch '3.5':
Issue #26464: Fix unicode_fast_translate() again
https://hg.python.org/cpython/rev/6643c5cc9797
History
Date User Action Args
2016-03-01 21:08:10 python-dev set messages: + msg261072
2016-03-01 20:33:24 vstinner set status: open -> closed; priority: high -> release blocker; nosy: + larry; messages: + msg261071; resolution: fixed
2016-03-01 20:31:27 python-dev set nosy: + python-dev; messages: + msg261070
2016-03-01 20:24:47 serhiy.storchaka set assignee: serhiy.storchaka -> vstinner; messages: + msg261069; stage: needs patch -> commit review
2016-03-01 19:55:30 vstinner set messages: + msg261065
2016-03-01 19:54:12 vstinner set files: + unicode_fast_translate.patch; keywords: + patch; messages: + msg261064
2016-03-01 16:36:44 serhiy.storchaka set nosy: + vstinner
2016-03-01 16:31:34 eryksun set versions: + Python 3.6
2016-03-01 16:31:17 eryksun set nosy: + eryksun; messages: + msg261059; title: str.translate() unexpectedly duplicates untranslated characters -> str.translate() unexpectedly duplicates characters
2016-03-01 16:26:28 serhiy.storchaka set nosy: + serhiy.storchaka; priority: normal -> high; assignee: serhiy.storchaka; components: + Interpreter Core; stage: needs patch
2016-03-01 13:51:44 ben.knight create