Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

input() with Unicode prompt produces mojibake on Windows #72520

Closed
Drekin mannequin opened this issue Oct 1, 2016 · 13 comments
Closed

input() with Unicode prompt produces mojibake on Windows #72520

Drekin mannequin opened this issue Oct 1, 2016 · 13 comments
Assignees
Labels
3.7 (EOL) end of life OS-windows topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@Drekin
Copy link
Mannequin

Drekin mannequin commented Oct 1, 2016

BPO 28333
Nosy @terryjreedy, @pfmoore, @vstinner, @tjguk, @ned-deily, @ezio-melotti, @zware, @eryksun, @zooba
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
  • Files
  • issue_28333_01.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/zooba'
    closed_at = <Date 2016-10-25.18:52:52.861>
    created_at = <Date 2016-10-01.15:20:25.517>
    labels = ['type-bug', '3.7', 'expert-unicode', 'OS-windows']
    title = 'input() with Unicode prompt produces mojibake on Windows'
    updated_at = <Date 2017-03-31.16:36:19.198>
    user = 'https://bugs.python.org/Drekin'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:19.198>
    actor = 'dstufft'
    assignee = 'steve.dower'
    closed = True
    closed_date = <Date 2016-10-25.18:52:52.861>
    closer = 'steve.dower'
    components = ['Unicode', 'Windows']
    creation = <Date 2016-10-01.15:20:25.517>
    creator = 'Drekin'
    dependencies = []
    files = ['45008']
    hgrepos = []
    issue_num = 28333
    keywords = ['patch', '3.5regression']
    message_count = 13.0
    messages = ['277821', '278274', '278275', '278277', '278281', '278284', '278317', '278318', '278319', '278624', '279427', '279435', '279445']
    nosy_count = 11.0
    nosy_names = ['terry.reedy', 'paul.moore', 'vstinner', 'tim.golden', 'ned.deily', 'ezio.melotti', 'python-dev', 'zach.ware', 'eryksun', 'Drekin', 'steve.dower']
    pr_nums = ['552']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue28333'
    versions = ['Python 3.6', 'Python 3.7']

    @Drekin
    Copy link
    Mannequin Author

    Drekin mannequin commented Oct 1, 2016

    In my setting (Python 3.6b1 on Windows), trying to prompt a non-ASCII character via input() results in mojibake. This is related to the recent fix of bpo-1602 and so is Windows-specific.

    >>> input("α")
    ╬▒

    The result corresponds to print("α".encode("utf-8").decode("cp852")). That cp852 the default terminal encoding in my locale.

    @Drekin Drekin mannequin added topic-unicode OS-windows type-bug An unexpected behavior, bug, or error labels Oct 1, 2016
    @terryjreedy
    Copy link
    Member

    Same output with cp437.

    @terryjreedy
    Copy link
    Member

    This is a regression from 3.5.2, where input("α") displays "α".

    @zooba
    Copy link
    Member

    zooba commented Oct 7, 2016

    This may force bpo-17620 into 3.6 - we really ought to be getting and using sys.stdin and sys.stderr in PyOS_StdioReadline() rather than going directly to the raw streams.

    The problem here is that we're still using fprintf to output the prompt, even though we know (assume) the input is utf-8.

    I haven't looked closely at how safely we can use Python objects from this code, except to see that it's not obviously safe, but we should really figure out how to deal in Python str rather than C char* for the default readline implementation (and then only fall back on the GNU protocol when someone asks for it).

    The faster fix here would be to decode the prompt from utf-8 to utf-16-le in PyOS_StdioReadline and then write it using a wide-char output function.

    @zooba zooba added the 3.7 (EOL) end of life label Oct 7, 2016
    @eryksun
    Copy link
    Contributor

    eryksun commented Oct 7, 2016

    When I pointed this issue out in code reviews, I assumed you would add the relatively simple fix to decode the prompt and call WriteConsoleW. The long-term fix in bpo-17620 has to be worked out with cross-platform support, and ISTM that it can wait for 3.7.

    Off topic: I just noticed that you're not calling PyOS_InputHook in the new PyOS_StdioReadline code. Tkinter registers this function pointer to call its EventHook. Do you want a separate issue for this, or is there a reason its was omitted?

    @eryksun
    Copy link
    Contributor

    eryksun commented Oct 8, 2016

    I'm sure Steve already has this covered, but FWIW here's a patch to call WriteConsoleW. Here's the result with the patch applied:

        >>> sys.ps1 = '»»» '
        »»» input("αβψδ: ")
        αβψδ: spam
        'spam'

    and with interactive stdin and stdout/stderr redirected to a file:

    >set PYTHONIOENCODING=utf-8
    >amd64\python_d.exe >out.txt 2>&1
    input("αβψδ: ")
    spam
    ^Z
    
    >chcp 65001
    Active code page: 65001
    
        >type out.txt
        Python 3.6.0b1+ (default, Oct  7 2016, 23:47:58)
        [MSC v.1900 64 bit (AMD64)] on win32
        Type "help", "copyright", "credits" or "license" for more information.
        >>> αβψδ: 'spam'
        >>>

    If it can't write the prompt for some reason (e.g. out of memory, decoding fails, WriteConsole fails), it doesn't fall back on fprintf to write the prompt. Should it?

    This should also get a test that calls ReadConsoleOutputCharacter to verify that the correct prompt is written.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 8, 2016

    New changeset faf5493e6f61 by Steve Dower in branch '3.6':
    Issue bpo-28333: Enables Unicode for ps1/ps2 and input() prompts. (Patch by Eryk Sun)
    https://hg.python.org/cpython/rev/faf5493e6f61

    New changeset cb62e921bd06 by Steve Dower in branch 'default':
    Issue bpo-28333: Enables Unicode for ps1/ps2 and input() prompts. (Patch by Eryk Sun)
    https://hg.python.org/cpython/rev/cb62e921bd06

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 8, 2016

    New changeset 63ceadf8410f by Steve Dower in branch '3.6':
    Issue bpo-28333: Remove unnecessary increment.
    https://hg.python.org/cpython/rev/63ceadf8410f

    New changeset d76c8f9ea787 by Steve Dower in branch 'default':
    Issue bpo-28333: Remove unnecessary increment.
    https://hg.python.org/cpython/rev/d76c8f9ea787

    @zooba
    Copy link
    Member

    zooba commented Oct 8, 2016

    I made some minor tweaks to the patch (no need for strlen() - passing -1 works equivalently), but otherwise it's exactly what I would have done so I committed it.

    We currently have no tests to check which characters are written to a console output buffer. bpo-28217 was tracking those, but considering how little code we have on top of output I don't think it's worth blocking anything on automating those tests.

    @eryksun
    Copy link
    Contributor

    eryksun commented Oct 13, 2016

    MultibyteToWideChar includes the trailing NUL when it gets the string length, so the WriteConsoleW call needs to use (wlen - 1).

    @zooba
    Copy link
    Member

    zooba commented Oct 25, 2016

    Not sure how I missed it originally, but that extra 1 char is actually very important:

    Python 3.6.0b2 (v3.6.0b2:b9fadc7d1c3f, Oct 10 2016, 20:36:51) [MSC v.1900 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>>  import sys
    >>>  sys.ps1='> '
    >  sys

    The extra space is because of that. Really ought to fix this before the next beta.

    @zooba zooba self-assigned this Oct 25, 2016
    @eryksun
    Copy link
    Contributor

    eryksun commented Oct 25, 2016

    I forgot to include the link to the python-list thread where this came up:

    https://mail.python.org/pipermail/python-list/2016-October/715428.html

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 25, 2016

    New changeset 6b46c3deea2c by Steve Dower in branch '3.6':
    Issue bpo-28333: Fixes off-by-one error that was adding an extra space.
    https://hg.python.org/cpython/rev/6b46c3deea2c

    New changeset 44d15ba67d2e by Steve Dower in branch 'default':
    Issue bpo-28333: Fixes off-by-one error that was adding an extra space.
    https://hg.python.org/cpython/rev/44d15ba67d2e

    @zooba zooba closed this as completed Oct 25, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life OS-windows topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants