str.translate() unexpectedly duplicates characters #70651

Closed
benknight mannequin opened this issue Mar 1, 2016 · 8 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-bug An unexpected behavior, bug, or error

Comments

@benknight
Mannequin

benknight mannequin commented Mar 1, 2016

BPO 26464
Nosy @vstinner, @larryhastings, @serhiy-storchaka, @eryksun
Files
  • unicode_fast_translate.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2016-03-01.20:33:24.805>
    created_at = <Date 2016-03-01.13:51:44.015>
    labels = ['interpreter-core', 'type-bug', 'release-blocker']
    title = 'str.translate() unexpectedly duplicates characters'
    updated_at = <Date 2016-03-01.21:08:10.173>
    user = 'https://bugs.python.org/benknight'

    bugs.python.org fields:

    activity = <Date 2016-03-01.21:08:10.173>
    actor = 'python-dev'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2016-03-01.20:33:24.805>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2016-03-01.13:51:44.015>
    creator = 'ben.knight'
    dependencies = []
    files = ['42056']
    hgrepos = []
    issue_num = 26464
    keywords = ['patch']
    message_count = 8.0
    messages = ['261049', '261059', '261064', '261065', '261069', '261070', '261071', '261072']
    nosy_count = 6.0
    nosy_names = ['vstinner', 'larry', 'python-dev', 'serhiy.storchaka', 'eryksun', 'ben.knight']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue26464'
    versions = ['Python 3.5', 'Python 3.6']

    @benknight
    Mannequin Author

    benknight mannequin commented Mar 1, 2016

    Python 3.5.1 x86-64, Windows 10

    I created a translation map that translated some characters to None and others to strings and found that in some cases str.translate() will duplicate one of the untranslated characters in the returned string.

    How to reproduce:

    table = str.maketrans({'a': None, 'b': 'cd'})
    'axb'.translate(table)

    Expected result:

    'xcd'

    Actual result:

    'xxcd'

    Mapping 'a' to '' instead of None will produce the desired effect.
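A minimal sketch of the workaround above. The variable names are illustrative and not part of the report. On an affected 3.5 build, per the report, the `None` table yields 'xxcd'; on 3.4 and on fixed builds both tables are expected to give 'xcd'.

```python
# Deletion expressed with None: the form that triggers the bug on affected builds.
table_with_none = str.maketrans({'a': None, 'b': 'cd'})

# Workaround from the report: express the deletion as an empty string.
table_with_empty = str.maketrans({'a': '', 'b': 'cd'})

result_none = 'axb'.translate(table_with_none)
result_empty = 'axb'.translate(table_with_empty)

print(repr(result_none))   # 'xxcd' on an affected build, per the report
print(repr(result_empty))  # the workaround result
```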

    @benknight benknight mannequin added the type-bug An unexpected behavior, bug, or error label Mar 1, 2016
    @serhiy-storchaka serhiy-storchaka added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Mar 1, 2016
    @serhiy-storchaka serhiy-storchaka self-assigned this Mar 1, 2016
    @eryksun
    Contributor

    eryksun commented Mar 1, 2016

    It duplicates translated characters as well. For example:

        >>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
        >>> 'aaaaaamnopqrb'.translate(table)
        'rqponmrqponmĀ'

    3.4 returns the correct result:

        >>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
        >>> 'aaaaaamnopqrb'.translate(table)
        'rqponmĀ'

    The problem is the new fast path for one-to-one ASCII mapping (unicode_fast_translate in Objects/unicodeobject.c) doesn't have a way to return the current input position in order to resume processing the translation. _PyUnicode_TranslateCharmap assumes it's the same as the current writer position, which is wrong when input characters have been deleted.
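The mechanism described above can be modelled in pure Python. This is only a rough sketch, not the C code in Objects/unicodeobject.c: `lookup`, `fast_translate`, `translate` and the `buggy` flag are hypothetical names, and the output list's length stands in for the writer position.

```python
def lookup(table, ch):
    """Return the replacement for ch: None (delete), or a string."""
    repl = table.get(ord(ch), ch)
    return chr(repl) if isinstance(repl, int) else repl


def fast_translate(text, table, out):
    """Model of the fast path: deletions and 1:1 ASCII mappings only.

    Appends to out and returns the input index at which it stopped.
    """
    i = 0
    while i < len(text):
        repl = lookup(table, text[i])
        if repl is None:
            pass  # deletion: input advances, output does not
        elif len(repl) == 1 and ord(repl) < 128:
            out.append(repl)
        else:
            break  # needs the general path
        i += 1
    return i


def translate(text, table, buggy):
    out = []
    stopped_at = fast_translate(text, table, out)
    # Buggy resume: treat the output length as the input position.
    i = len(out) if buggy else stopped_at
    while i < len(text):
        repl = lookup(table, text[i])
        if repl is not None:
            out.append(repl)
        i += 1
    return "".join(out)


table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
print(translate('aaaaaamnopqrb', table, buggy=True))
print(translate('aaaaaamnopqrb', table, buggy=False))
```

With `buggy=True` the model resumes six characters too early, because six input characters were deleted, so the already-translated run is emitted a second time.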

    @eryksun eryksun changed the title str.translate() unexpectedly duplicates untranslated characters str.translate() unexpectedly duplicates characters Mar 1, 2016
    @vstinner
    Member

    vstinner commented Mar 1, 2016

    Oh... I see. It's a bug introduced by unicode_fast_translate(), the ASCII optimization that handles replacing one character with another ASCII character or deleting a character. See change cca6b056236a of issue bpo-21118.

    There is a confusion in the code between the input and output positions. "i = writer.pos;" is used in the caller to continue when unicode_fast_translate() was interrupted (because a translation uses a non-ASCII character or a string longer than 1 character), but writer.pos is the position in the *output* string, not in the *input* string :-/

    I see that I added unit tests on translate, but they lack a test for fast translation that starts with ignored (deleted) characters and then switches to regular translation.

    Attached patch should fix the issue. It adds unit tests.
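A regression test for the missing case might look like the sketch below. It is not taken from the attached patch: the class and method names are hypothetical, and the inputs are the two examples already given in this issue.

```python
import unittest


class TranslateFastPathTest(unittest.TestCase):
    """Fast path starts with deletions, then falls back to regular translation."""

    def test_delete_then_multichar_replacement(self):
        # 'a' is deleted, 'x' is left alone, and 'b' forces the general path.
        table = str.maketrans({'a': None, 'b': 'cd'})
        self.assertEqual('axb'.translate(table), 'xcd')

    def test_delete_then_non_ascii_replacement(self):
        # Six deletions, six ASCII swaps, then a non-ASCII replacement.
        table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
        self.assertEqual('aaaaaamnopqrb'.translate(table), 'rqponm\u0100')
```

Saved to a file, this can be run with `python -m unittest`. Both tests would be expected to fail on a build that still has the bug.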

    @vstinner
    Member

    vstinner commented Mar 1, 2016

    See change cca6b056236a of issue bpo-21118.

    The bug was introduced in Python v3.5.0a1.

    @serhiy-storchaka
    Member

    LGTM.

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 1, 2016

    New changeset 27ba9ba5deb1 by Victor Stinner in branch '3.5':
    Fix str.translate()
    https://hg.python.org/cpython/rev/27ba9ba5deb1

    @vstinner
    Member

    vstinner commented Mar 1, 2016

    LGTM.

    Thanks for the review. I pushed my fix.

    Sorry for the regression, I hate being responsible for a regression in a core feature :-/

    It may even deserve a release, but Python doesn't have the habit of "release often" yet :-(

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 1, 2016

    New changeset 6643c5cc9797 by Victor Stinner in branch '3.5':
    Issue bpo-26464: Fix unicode_fast_translate() again
    https://hg.python.org/cpython/rev/6643c5cc9797

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022