windows console doesn't print or input Unicode #45943

Closed
mark-summerfield mannequin opened this issue Dec 12, 2007 · 148 comments
Assignees
Labels
OS-windows topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@mark-summerfield
Mannequin

mark-summerfield mannequin commented Dec 12, 2007

BPO 1602
Nosy @malemburg, @mhammond, @terryjreedy, @pfmoore, @amauryfa, @ncoghlan, @pitrou, @giampaolo, @tjguk, @mark-summerfield, @ned-deily, @ezio-melotti, @florentx, @4kir4, @lilydjwg, @berkerpeksag, @vadmium, @eryksun, @zooba, @davispuh
Superseder
  • bpo-28217: Add interactive console tests
Files
  • sys_write_stdout.patch
  • unicode2.py
  • doc-patch.diff: Proposed changes to user-visible documentation
  • unicode3.py
  • win_console.patch
  • test_win_console.py
  • streams.py
  • wincontest.py: Example io.TextIOWrapper subclass using WideCharToMultiByte
  • winconsoleio.diff
  • 1602_2.patch
  • 1602_3.patch
  • 1602_4.patch
  • 1602_5.patch
  • 1602_6.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/zooba'
    closed_at = <Date 2016-09-09.16:42:53.594>
    created_at = <Date 2007-12-12.09:56:30.846>
    labels = ['type-bug', 'expert-unicode', 'OS-windows']
    title = "windows console doesn't print or input Unicode"
    updated_at = <Date 2016-10-22.10:46:13.515>
    user = 'https://github.com/mark-summerfield'

    bugs.python.org fields:

    activity = <Date 2016-10-22.10:46:13.515>
    actor = 'THRlWiTi'
    assignee = 'steve.dower'
    closed = True
    closed_date = <Date 2016-09-09.16:42:53.594>
    closer = 'steve.dower'
    components = ['Unicode', 'Windows']
    creation = <Date 2007-12-12.09:56:30.846>
    creator = 'mark'
    dependencies = []
    files = ['19493', '20320', '20363', '23461', '23470', '23471', '36120', '40990', '44094', '44290', '44379', '44409', '44449', '44452']
    hgrepos = []
    issue_num = 1602
    keywords = ['patch']
    message_count = 148.0
    messages = ['58487', '58621', '58651', '87086', '88059', '88077', '92854', '94445', '94480', '94483', '94496', '108173', '108228', '116801', '120414', '120415', '120416', '120700', '125823', '125824', '125826', '125833', '125852', '125877', '125889', '125890', '125898', '125899', '125938', '125942', '125947', '125956', '126286', '126288', '126303', '126304', '126308', '126319', '127782', '131657', '131854', '132060', '132061', '132062', '132064', '132065', '132067', '132184', '132191', '132208', '132266', '132268', '145898', '145899', '145963', '145964', '146471', '148990', '157569', '160812', '160813', '160897', '161151', '161153', '161308', '161651', '164572', '164578', '164580', '164618', '164619', '170899', '170915', '170999', '185135', '197700', '197751', '197752', '197773', '221175', '221178', '223403', '223404', '223507', '223509', '223945', '223946', '223947', '223948', '223949', '223951', '223952', '224019', '224086', '224095', '224596', '224605', '224690', '227329', '227330', '227332', '227333', '227337', '227338', '227347', '227354', '227373', '227374', '227441', '227450', '228191', '228208', '228210', '233347', '233350', '233916', '233937', '234019', '234020', '234096', '234371', '242884', '254405', '254407', '272596', '272605', '272645', '272662', '272675', '272716', '272718', '272720', '273999', '274449', '274673', '274884', '274906', '274912', '274939', '275003', '275004', '275005', '275157', '275362', '275510', '277047', '277048', '277050']
    nosy_count = 38.0
    nosy_names = ['lemburg', 'mhammond', 'terry.reedy', 'paul.moore', 'tzot', 'amaury.forgeotdarc', 'ncoghlan', 'pitrou', 'giampaolo.rodola', 'tim.golden', 'mark', 'ned.deily', 'christoph', 'ezio.melotti', 'v+python', 'hippietrail', 'flox', 'THRlWiTi', 'davidsarah', 'santoso.wijaya', 'akira', 'David.Sankel', 'python-dev', 'smerlin', 'lilydjwg', 'berker.peksag', 'martin.panter', 'piotr.dobrogost', 'eryksun', 'Drekin', 'steve.dower', 'wiz21', 'stijn', 'Jonitis', 'gurnec', 'escapewindow', 'dead1ne', 'davispuh']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = '28217'
    type = 'behavior'
    url = 'https://bugs.python.org/issue1602'
    versions = ['Python 3.6']

    @mark-summerfield
    Mannequin Author

    mark-summerfield mannequin commented Dec 12, 2007

    I am not sure if this is a Python bug or simply a limitation of cmd.exe.

    I am using Windows XP Home.
    I run cmd.exe with the /u option and I have set my console font to
    "Lucida Console" (the only TrueType font offered), and I run chcp 65001
    to set the utf8 code page.
    When I run the following program:

    for x in range(32, 2000):
        print("{0:5X} {0:c}".format(x))

    one blank line is output.

    But if I do chcp 1252 the program prints up to 7F before hitting a
    unicode encoding error.

    This is different behaviour from Python 2.5.1 which (with a suitably
    modified print line) after chcp 65001 prints up to 7F and then fails
    with "IOError: [Errno 0] Error".

    @mark-summerfield mark-summerfield mannequin added OS-windows topic-unicode type-bug An unexpected behavior, bug, or error labels Dec 12, 2007
    @mark-summerfield
    Mannequin Author

    mark-summerfield mannequin commented Dec 14, 2007

    I've looked into this a bit more, and from what I can see, code page
    65001 just doesn't work---so it is a Windows problem not a Python problem.
    A possible solution might be to read/write UTF16 which "managed" Windows
    applications can do.

    @tiran
    Member

    tiran commented Dec 15, 2007

    We are aware of multiple Windows related problems. We are planning to
    rewrite parts of the Windows specific API to use the widechar variants.
    Maybe that will help.

    @pitrou
    Member

    pitrou commented May 3, 2009

    Yes, it is a Windows problem. There simply doesn't seem to be a true
    Unicode codepage for command-line apps. Recommend closing.

    @tzot
    Mannequin

    tzot mannequin commented May 19, 2009

    Just in case it helps, this behaviour is on Win XP Pro, Python 2.5.1:

    First, I added an alias for 'cp65001' to 'utf_8' in
    Lib/encodings/aliases.py .

    Then, I opened a command prompt with a bitmap font.

    c:\windows\system32>python
    Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
    (Intel)] on
    win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print u"\N{EM DASH}"
    —

    I switched the font to Lucida Console, and retried (without exiting the
    python interpreter, although the behaviour is the same when exiting and
    entering again: )

    >>> print u"\N{EM DASH}"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 13] Permission denied

    Then I tried (by pressing Alt+0233 for é, which is invalid in my normal
    cp1253 codepage):

    >>> print u"née"

    and the interpreter exits without any information. So it does for:

    >>> a=u"née"

    Then I created a UTF-8 text file named 'test65001.py':

    # -*- coding: utf_8 -*-
    a=u"néeα"
    print a

    and tried to run it directly from the command line:

    c:\windows\system32>python d:\src\PYTHON\test65001.py
    néeαTraceback (most recent call last):
      File "d:\src\PYTHON\test65001.py", line 4, in <module>
        print a
    IOError: [Errno 2] No such file or directory

    You see? It printed all the characters before failing.

    Also the following works:

    c:\windows\system32>echo heéε
    heéε

    and

    c:\windows\system32>echo heéε >D:\src\PYTHON\dummy.txt

    creates successfully a UTF-8 file (without any UTF-8 BOM marks at the
    beginning).

    So it's possible that it is a python bug, or at least something can be
    done about it.

    @amauryfa
    Member

    An immediate thing to do is to declare cp65001 as an encoding:

    Index: Lib/encodings/aliases.py
    ===================================================================

    --- Lib/encodings/aliases.py    (revision 72757)
    +++ Lib/encodings/aliases.py    (working copy)
    @@ -511,6 +511,7 @@
         'utf8'               : 'utf_8',
         'utf8_ucs2'          : 'utf_8',
         'utf8_ucs4'          : 'utf_8',
    +    'cp65001'            : 'utf_8',
     ## uu_codec codec
     #'uu'                 : 'uu_codec',
    

    This is not enough unfortunately, because the win32 API function
    WriteFile() returns the number of characters written, not the number of
    (utf8) bytes:

    >>> print("\u0124\u0102" + 'abc')
    ĤĂabc
    c
    [44420 refs]
    >>>

    Additionally, there is a bug in the ReadFile, which returns an empty
    string (and no error) when a non-ascii character is entered, which is
    the behavior of an EOF condition...

    Maybe the solution is to use the win32 console API directly...
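The stray `c` in the transcript above is consistent with that count mismatch. The following is a small simulation only: it does not call `WriteFile()`, and it assumes the write loop treats the returned character count as a byte count and retries whatever it believes is left. `simulate_short_write` is a hypothetical helper.

```python
def simulate_short_write(text):
    """Model a writer that is told 'characters written' but treats the
    number as 'bytes written' and retries the supposed remainder."""
    data = text.encode("utf-8")
    reported = len(text)        # what the console reports: characters
    leftover = data[reported:]  # what the caller thinks is still unwritten
    return len(data), reported, leftover


total, reported, leftover = simulate_short_write("\u0124\u0102abc\n")
print(total, reported, leftover)
```

Under these assumptions the string is 8 bytes but only 6 are reported as written, so the last two bytes (`c` and the newline) are sent a second time.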

    @tzot
    Mannequin

    tzot mannequin commented Sep 19, 2009

    Another note:
    if one creates a dummy Stream object (having a softspace attribute and a
    write method that writes using os.write, as in
    http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/1432462#1432462
    ) to replace sys.stdout and sys.stderr, then writes occur correctly,
    without issues. Pre-requisites:
    chcp 65001, Lucida Console font and cp65001 as an alias for UTF-8 in
    encodings/aliases.py
    This is Python 2.5.4 on Windows.
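A rough Python 3 adaptation of that idea is sketched below. It is not the code from the linked answer: the class name `RawUTF8Writer` is invented here, the `softspace` attribute is dropped because Python 3's `print()` does not use it, and UTF-8 output is assumed.

```python
import os


class RawUTF8Writer:
    """File-like object that encodes text as UTF-8 and writes it with os.write()."""

    def __init__(self, fd):
        self.fd = fd

    def write(self, text):
        data = text.encode("utf-8")
        # os.write() may write fewer bytes than requested, so loop until done.
        while data:
            written = os.write(self.fd, data)
            data = data[written:]
        return len(text)

    def flush(self):
        pass  # nothing is buffered in this object
```

It would be installed with something like `sys.stdout = RawUTF8Writer(1)`; whether the console then renders the bytes correctly still depends on the code page and font prerequisites listed above.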

    @vpython
    Mannequin

    vpython mannequin commented Oct 25, 2009

    With Python 3.1.1, the following batch file seems to be necessary to use
    UTF-8 successfully from an XP console:

    set PYTHONIOENCODING=UTF-8
    cmd /u /k chcp 65001
    set PYTHONIOENCODING=
    exit

    the cmd line seems to be necessary because of Windows having
    compatibility issues, but it seems that Python should notice the cp65001
    and not need the PYTHONIOENCODING stuff.
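The effect of the environment variable in that batch file can be checked from Python itself. This sketch only shows that `PYTHONIOENCODING` overrides the encoding chosen for `sys.stdout` in a child interpreter; it says nothing about how the console renders the result. `child_stdout_encoding` is a name made up for this example.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and report
    the encoding its sys.stdout ends up with."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


print(child_stdout_encoding("utf-8"))
```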

    @mark-summerfield
    Mannequin Author

    mark-summerfield mannequin commented Oct 26, 2009

    Glenn Linderman's fix pretty well works for me on XP Home. I can print
    every Unicode character up to and including U+D7FF (although most just
    come out as rectangles, at least I don't get encoding errors).

    It fails at U+D800 with message:

    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
    position 17: surrogates not allowed

    I also tried U+D801 and got the same error.

    Nonetheless, this is *much* better than before.

    @malemburg
    Member

    Mark Summerfield wrote:

    Mark Summerfield <mark@qtrac.eu> added the comment:

    Glenn Linderman's fix pretty well works for me on XP Home. I can print
    every Unicode character up to and including U+D7FF (although most just
    come out as rectangles, at least I don't get encoding errors).

    It fails at U+D800 with message:

    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
    position 17: surrogates not allowed

    I also tried U+D801 and got the same error.

    That's normal and expected: D800 is the start of the surrogate
    ranges which are only allowed in pairs in UTF-8.
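The boundary is easy to confirm in Python 3. The sketch below shows that U+D7FF encodes, that a lone U+D800 does not, and that a non-BMP character is represented in UTF-16 as a high/low surrogate pair. `encodes_as_utf8` is an illustrative helper, not a standard function.

```python
def encodes_as_utf8(ch):
    """Return True if the single character can be encoded as UTF-8."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(encodes_as_utf8("\ud7ff"))  # last code point before the surrogate block
print(encodes_as_utf8("\ud800"))  # lone high surrogate

# A non-BMP character is stored in UTF-16 as a high/low surrogate pair.
pair_hex = "\U00010000".encode("utf-16-le").hex()
print(pair_hex)
```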

    @vpython
    Mannequin

    vpython mannequin commented Oct 26, 2009

    The choice of the Lucida Console or the Consolas font cures most of the
    rectangle problems. Those are just a limitation of the selected font
    for the console window.

    @christoph
    Mannequin

    christoph mannequin commented Jun 19, 2010

    Will this bug be tackled for Python 2.7?

    And is there a way to get hold of the access denied error?

    Here are my steps to reproduce:

    I started the console with "cmd /u /k chcp 65001"


    Aktive Codepage: 65001.

    C:\Dokumente und Einstellungen\root>set PYTHONIOENCODING=UTF-8

    C:\Dokumente und Einstellungen\root>d:

    D:\>cd Python31

    D:\Python31>python
    Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("\u573a")
    场
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 13] Permission denied
    >>>
    _______________________________________________________________________

    I see a rectangle on screen but obviously c&p works.

    @vstinner
    Member

    Maybe the solution is to use the win32 console API directly...

    Yes, it is the best solution because it avoids the horrible mbcs encoding.

    About cp65001: it is not *exactly* the same encoding as utf-8 and so it cannot be used as an alias to utf-8: see issue bpo-6058.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Sep 18, 2010

    @Brian/Tim what's your take on this?

    @vstinner
    Member

    vstinner commented Nov 4, 2010

    I wrote a small function to call WriteConsoleOutputA() and WriteConsoleOutputW() in Python to do some tests. It works correctly, except if I change the code page using the chcp command. It looks like the problem is that the chcp command changes the console code page and the ANSI code page, but it should only change the ANSI code page (and not the console code page).

    chcp command
    ============

    The chcp command changes the console code page, but in practice, the console still expects the OEM code page (eg. cp850 on my french setup). Example:

    C:\...> python.exe -c "import sys; print(sys.stdout.encoding)"
    cp850
    C:\...> chcp 65001
    C:\...> python.exe
    Fatal Python error: Py_Initialize: can't initialize sys standard streams
    LookupError: unknown encoding: cp65001
    C:\...> SET PYTHONIOENCODING=utf-8
    C:\...> python.exe
    >>> import sys
    >>> sys.stdout.write("\xe9\n")
    é
    2
    >>> sys.stdout.buffer.write("\xe9\n".encode("utf8"))
    é
    3
    >>> sys.stdout.buffer.write("\xe9\n".encode("cp850"))
    é
    2

    os.device_encoding(1) uses GetConsoleOutputCP() which gives 65001. It should maybe use GetOEMCP() instead? Or chcp command should be fixed?

    Setting the console code page looks like a bad idea, because if I type "é" using my keyboard, a random character (eg. U+0002) is displayed instead...
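The return values in the transcript above (2, 3 and 2) are the character count and the two encoded byte counts. The sketch below reproduces them without a console and also shows what UTF-8 bytes look like when they are interpreted as cp850, assuming the console really does still decode with the OEM code page as described.

```python
text = "\xe9\n"  # 'é' followed by a newline

utf8_bytes = text.encode("utf-8")
cp850_bytes = text.encode("cp850")
print(len(text), len(utf8_bytes), len(cp850_bytes))

# UTF-8 bytes interpreted by a console that still expects cp850:
mojibake = "\xe9".encode("utf-8").decode("cp850")
print(ascii(mojibake))
```

The two-byte UTF-8 sequence for é turns into two unrelated cp850 characters, which is the kind of mojibake the tests in this comment run into.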

    WriteConsoleOutputA() and WriteConsoleOutputW()
    ===============================================

    Without touching the code page
    ------------------------------

    If the character can be rendered by the current font (eg. U+00E9): WriteConsoleOutputA() and WriteConsoleOutputW() work correctly.

    If the character cannot be rendered by the current font, but there is a replacement character (eg. U+0141 replaced by U+0041): WriteConsoleOutputA() cannot be used (U+0141 cannot be encoded to the code page), WriteConsoleOutputW() writes U+0141 but the console contains U+0041 (I checked using ReadConsoleOutputW()) and U+0041 is displayed. It works like the mbcs encoding, the behaviour looks correct.

    If the character cannot be rendered by the current font, and there is no replacement character (eg. U+042D): WriteConsoleOutputA() cannot be used (U+042D cannot be encoded to the code page), WriteConsoleOutputW() writes U+042D but U+003F (?) is displayed instead. The behaviour looks correct.

    chcp 65001
    ----------

    Using "chcp 65001" command (+ "set PYTHONIOENCODING=utf-8" to avoid the fatal error), it becomes worse: the result depends on the font...

    Using raster font:

    • (ANSI) write "\xe9".encode("cp850") using WriteConsoleOutputA() displays U+00e9 (é), whereas the console output code page is cp65001 (I checked using GetConsoleOutputCP())
    • (ANSI) write "\xe9".encode("utf-8") using WriteConsoleOutputA() displays é (mojibake!)
    • (UNICODE) write "\xe9" using WriteConsoleOutputW() displays... a random character (U+0002, U+0008, U+0069, U+00b0, ...)

    Using Lucida (TrueType font):

    • (ANSI) write "\xe9".encode("cp850") using WriteConsoleOutputA() displays U+0000 !?
    • (UNICODE) write "\xe9" using WriteConsoleOutputW() works correctly (display U+00e9), even with "\u0141", it works correctly (display U+0141)

    @vstinner
    Member

    vstinner commented Nov 4, 2010

    sys_write_stdout.patch: Create sys.write_stdout() test function to call WriteConsoleOutputA() or WriteConsoleOutputW() depending on the input types (bytes or str).

    @tzot
    Mannequin

    tzot mannequin commented Nov 4, 2010

    http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

    If you want any kind of Unicode output in the console, the font must be an “official” MS console TTF (“official” as defined by the Windows version); I believe only Lucida Console and Consolas are the ones with all MS private settings turned on inside the font file.

    @vstinner
    Member

    vstinner commented Nov 8, 2010

    I don't understand exactly the goal of this issue. Different people described various bugs of the Windows console, but I don't see any problem with Python here. It looks like it's just not possible to display correctly unicode with the Windows console (the whole unicode charset, not the current code page subset).

    • 65001 code page: it's not the same encoding as utf-8 and so it cannot be set as an alias to utf-8 (see bpo-6058) => nothing to do, or maybe document that PYTHONIOENCODING=utf-8 workaround... But if you do that, you may get strange errors when writing to stdout or stderr like "IOError: [Errno 13] Permission denied" or "IOError: [Errno 2] No such file or directory" ...
    • chcp command sets the console encoding, which is stupid because the console still expects text encoded to the previous code page => Windows (chcp command) bug, chcp command should not be used (it doesn't solve any problem, it just makes the situation worse)
    • use the console API instead of read()/write() to fix this issue: it doesn't work, the console is completely buggy (msg120414) => Windows (console) bug
    • using the "Lucida Console" font avoids some issues => I don't think that the Python interpreter should configure the console (using SetCurrentConsoleFontEx?), it's not the role of Python

    To me, there is nothing to do, and so I close the bug.

    If you would like to fix a particular Python bug, open a new specific issue. If you consider that I'm wrong, Python should fix this issue and you know how, please reopen it.

    @davidsarah
    Mannequin

    davidsarah mannequin commented Jan 9, 2011

    It is certainly possible to write Unicode to the console successfully using WriteConsoleW. This works regardless of the console code page, including 65001. The code at http://tahoe-lafs.org/trac/tahoe-lafs/browser/src/allmydata/windows/fixups.py does so (it's for Python 2.x, but you'd be calling WriteConsoleW from C anyway).

    WriteConsoleW has one bug that I know of, which is that it fails when writing more than 26608 characters at once (http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1232). That's easy to work around by limiting the amount of data passed in a single call.

    Fonts are not Python's problem, but encoding is. It doesn't make sense to fail to output the right characters just because some users might not have selected fonts that can display those characters. This bug should be reopened.

    (For completeness, it is possible to display Unicode on the console using fonts other than Lucida Console and Consolas, but it requires a registry hack: http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/3259271#3259271.)
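The workaround for the WriteConsoleW size limit mentioned above can be sketched in pure Python. This is not the tahoe-lafs code: the default cap of 10000 is an arbitrary value below the reported limit, the helper names are invented, and `write_chunk` is a stand-in for whatever actually calls WriteConsoleW. Chunks are measured in UTF-16 code units and never split a surrogate pair.

```python
def iter_console_chunks(text, max_units=10000):
    """Yield pieces of text whose UTF-16 length is at most max_units
    code units (max_units must be at least 2)."""
    chunk, units = [], 0
    for ch in text:
        n = 2 if ord(ch) > 0xFFFF else 1  # non-BMP needs a surrogate pair
        if units + n > max_units and chunk:
            yield "".join(chunk)
            chunk, units = [], 0
        chunk.append(ch)
        units += n
    if chunk:
        yield "".join(chunk)


def write_in_chunks(text, write_chunk, max_units=10000):
    """Pass text to write_chunk piece by piece; return the number of calls."""
    calls = 0
    for piece in iter_console_chunks(text, max_units):
        write_chunk(piece)
        calls += 1
    return calls
```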

    @vpython
    Mannequin

    vpython mannequin commented Jan 9, 2011

    Interesting!

    I was able to tweak David-Sarah's code to work with Python 3.x, mostly doing things that 2to3 would probably do: changing unicode() to str(), dropping u from u'...', etc.

    I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol. But I'll attach David-Sarah's code + tweaks + a test case showing output of the Cyrillic alphabet to a console with code page 437 (at least, on my Win7-64 box, that is what it is).

    Nice work, David-Sarah. I'm quite sure this is not in a form usable inside Python 3, but it shows exactly what could be done inside Python 3 to make things work... and gives us a workaround if Python 3 is not fixed.

    @davidsarah
    Mannequin

    davidsarah mannequin commented Jan 9, 2011

    Glenn Linderman wrote:

    I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol.

    If I understand correctly, that part isn't needed on Python 3 because bpo-2128 is already fixed there.

    @vstinner
    Member

    vstinner commented Jan 9, 2011

    It is certainly possible to write Unicode to the console
    successfully using WriteConsoleW

    Did you try with characters not encodable to the code page and with characters that cannot be rendered by the font?

    See msg120414 for my tests with WriteConsoleOutputW.

    @davidsarah
    Mannequin

    davidsarah mannequin commented Jan 9, 2011

    haypo wrote:

    davidsarah wrote:
    > It is certainly possible to write Unicode to the console
    > successfully using WriteConsoleW

    Did you try with characters not encodable to the code page and with characters that cannot be rendered by the font?

    Yes, characters not encodable to the code page do work (as confirmed by Glenn Linderman, since code page 437 does not include Cyrillic).

    Characters that cannot be rendered by the font print as missing-glyph boxes, as expected. They don't cause any other problem, and they can be cut-and-pasted to other Unicode-aware applications, showing up as the original characters.

    See msg120414 for my tests with WriteConsoleOutputW

    Even if it handled encoding correctly, WriteConsoleOutputW (http://msdn.microsoft.com/en-us/library/ms687404%28v=vs.85%29.aspx) would not be the right API to use in any case, because it prints to a rectangle of characters without scrolling. WriteConsoleW does scroll in the same way that printing to a console output stream normally would. (Redirection to a non-console stream can be detected and handled differently, as the code in unicode2.py does.)

    @Drekin
    Mannequin

    Drekin mannequin commented Aug 13, 2016

    Hello Steve, that's great you are working on this!

    I've run through your patch and I have the following remarks:

    • Since wide chars have two bytes, there may be a problem when someone wants to read or write an odd number of bytes. If the number is > 1, it's ok since the code may read or write fewer bytes, but when the number is exactly 1, the code should maybe raise some exception.

    • WriteConsoleW always fails with ERROR_NOT_ENOUGH_MEMORY (8) if we try to write more than a certain number of bytes. For me, the number is something like 41000. Unfortunately, it depends on actual heap usage of the console process. I do len = min(len, 32767) in write. The value chosen comes from bpo-11395 .

    • If someone types something like ^Zfoo, the standard sys.stdin returns '' -- it ignores everything after EOF if it is the first byte read. I reproduce the behaviour in win_unicode_console to be compatible.

    • There may be some issue when someone hits Ctrl-C on input. It seems that in that case, ReadConsoleW fails with ERROR_OPERATION_ABORTED (995) and some signal is asynchronously fired. It may happen that the corresponding KeyboardInterrupt exception occurs later than it should. In my Python/ctypes situation I do an ugly hack – I detect ERROR_OPERATION_ABORTED and in that case I sleep for 0.1 seconds to wait for the exception. I understand that the situation may be different in C.

    @vadmium
    Member

    vadmium commented Aug 14, 2016

    For compatibility, I think it may be good to add custom implementations of the buffer attribute and detach() method to stdin/out. They should be able to at least read and write ASCII bytes. It might be easiest to keep them as the current BufferedReader/Writer objects. Probably also make stdin/out.fileno() defer to the buffer attribute.

    With the current patch that only allows reading and writing in UTF-16 pairs, I foresee a few problems:

    • I assume stdin.buffer.raw.readline() will try to read one byte at a time, and will therefore always indicate EOF.
    • Incompatibility with using stdin/out.buffer for ASCII character input and output. I suggest testing the patch with “python -m base64”, a use case mentioned earlier in this thread.

    @Drekin
    Mannequin

    Drekin mannequin commented Aug 14, 2016

    There is also the following consequence of (not) having the standard filenos: input() either considers the streams interactive or not. To consider them interactive, standard filenos and isatty are needed on sys.stdin and sys.stdout.

    If the streams are considered interactive, input() goes via readlinehook machinery, otherwise it just writes and reads an ordinary file.

    The latter means we don't have to touch readline machinery now, the downside is that custom rlcompleters like pyreadline won't work on input().

    @zooba
    Member

    zooba commented Aug 14, 2016

    The current patch actually only affects the raw IO, so the concern would be one of the wrappers trying to work in bytes when it should be dealing in characters. This should be no different from reading a UTF16 file, so either both work or both are broken.

    The readline API is most annoying because it assumes strlen is valid for any encoded text (and at so many places it's near unfixable), but there's another issue for this part.

    Also, I don't have answers for most of the questions in the review on the patch because I copied all of those bits from fileio.c. Can certainly clean parts of them up for the console API, but I count compatibility with the FileIO class a useful goal where possible.

    @zooba
    Member

    zooba commented Aug 15, 2016

    I'm fairly happy with where my current patch is at (not posted right now - too many different machines involved) and only one test is failing - test_cgi.

    The problem seems to be that urllib.parse.unquote() takes an encoding parameter to decode utf-8 encoded bytes with, and cgi infers this parameter from sys.stdin. I don't have the slightest idea why unquote/unquote_to_bytes unconditionally encodes with utf-8 and then allows decoding with an arbitrary encoding, but I guess it works okay for ASCII-compatible encodings?

    Unfortunately, utf-16-le is not ASCII compatible, and so this doesn't work. I'm not familiar enough with cgi or urllib.parse to know what to fix - any suggestions?

    @zooba
    Member

    zooba commented Aug 15, 2016

    For more info here, cgi.parse has code like this:

    def parse(fp, ...):
        if fp is None:
            fp = sys.stdin

        encoding = getattr(fp, 'encoding', 'latin-1')
        # later on...

        return urllib.parse.parse_qs(a_str, encoding=encoding, ...)
    

    As an easy hack, I added this after assigning encoding:

        if len(' '.encode(encoding, errors='replace')) > 1:
            encoding = 'latin-1'

    I have no idea if this is a good idea or not. The current behaviour of mojibake in the parsed result is certainly worse, since the choice of utf-16-le is entirely contained within the parse() function.
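The check in that hack can be tried in isolation. This is a sketch of the same test wrapped in a hypothetical helper (`pick_query_encoding` is not part of `cgi`); it falls back to latin-1 whenever the stream encoding needs more than one byte for a space.

```python
def pick_query_encoding(stream_encoding, fallback="latin-1"):
    """Return stream_encoding unless it encodes a space as more than one
    byte (as UTF-16 and UTF-32 do), in which case return fallback."""
    if len(" ".encode(stream_encoding, errors="replace")) > 1:
        return fallback
    return stream_encoding


for name in ("utf-16-le", "utf-8", "cp1252", "cp037"):
    print(name, "->", pick_query_encoding(name))
```

Note that the width test is only a heuristic: a single-byte encoding that is not ASCII-compatible, such as the EBCDIC code page cp037, passes it unchanged.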

    @vadmium
    Member

    vadmium commented Aug 15, 2016

    I think this CGI thing is a separate bug, just exacerbated by the stdin.encoding problem. :) The urllib.parse.parse_qs() function takes an encoding parameter to figure out what to do with percent-encoded values: "%A9" → b"\xA9".decode(...). This is different from the lower-level encoding: b"%A9".decode("ascii").

    Maybe the best solution is just to remove the encoding argument, and let it revert to UTF-8, as it did before r87998. Or maybe it really should use the locale encoding. (Is that ASCII-compatible on Windows?) It really depends on where the query string was generated (in a browser, pre-computed URL, etc).
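The role of the encoding parameter is easy to see with a one-value query string. A minimal illustration using only `urllib.parse.parse_qs`; the query string itself is made up for the example.

```python
from urllib.parse import parse_qs

query = "name=%A9"

# The percent-encoded byte 0xA9 is decoded with the requested encoding.
latin1_result = parse_qs(query, encoding="latin-1")
# Default is encoding="utf-8", errors="replace"; 0xA9 alone is invalid UTF-8.
utf8_result = parse_qs(query)

print(latin1_result)
print(utf8_result)
```

The same byte becomes the copyright sign under latin-1 and a replacement character under the UTF-8 default, so the right choice depends on how the query string was produced rather than on the encoding of stdin.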

    @zooba
    Member

    zooba commented Aug 31, 2016

    New patch attached (1602_2.patch - hopefully the review will work this time too).

    I discovered while researching for the PEP that a decent amount of code expects to be able to write ASCII to sys.stdout.buffer (or sys.stdout.buffer.raw). As my first patch required utf-16-le at this point, it was going to cause havoc.

    Rather than break that compatibility, I decided that exposing utf-8 and doing the reencoding at the latest possible stage was better. This is also more consistent with how other encoding issues are likely to be resolved, and shouldn't be any less performant, given that previously we were decoding to utf-16 anyway.

    The downside of this is that read(n) can now only read up to n/4 characters, and write(n) has a much more complicated time dealing with large buffers (as we need to cap the number of utf-16-le bytes but return the number of utf-8 bytes - it's not a direct relationship, so there's more work and a little bit of guessing in some cases).

    On the upside, the readline handling is simpler as utf-8 is compatible with the existing interface and now sys.stdin.encoding is accurate. I've rolled that fix into this patch (just the myreadline.c change) as they really ought to go in together.
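The bookkeeping problem on the write side can be illustrated in Python. This is not the logic in winconsoleio.c, only a sketch of the relationship it has to track: given UTF-8 input and a cap on UTF-16 code units, how many UTF-8 bytes were actually consumed. The function name is invented and the input is assumed to be complete, valid UTF-8.

```python
def utf8_len_for_wchar_cap(data, max_wchars):
    """Return how many leading UTF-8 bytes of data fit in max_wchars
    UTF-16 code units, without splitting a character."""
    text = data.decode("utf-8")  # assumes complete, valid UTF-8
    units = 0
    consumed = 0
    for ch in text:
        n = 2 if ord(ch) > 0xFFFF else 1  # non-BMP needs a surrogate pair
        if units + n > max_wchars:
            break
        units += n
        consumed += len(ch.encode("utf-8"))
    return consumed


# 1-, 2-, 3- and 4-byte UTF-8 characters: 10 bytes, 5 UTF-16 code units.
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print(len(data), utf8_len_for_wchar_cap(data, 3), utf8_len_for_wchar_cap(data, 5))
```

Because a character can take up to four UTF-8 bytes, a buffer of n bytes is only guaranteed to hold n/4 characters, which is where the read(n) limit comes from.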

    @zooba
    Member

    zooba commented Sep 5, 2016

    Updated patch. This implements everything we've been discussing on python-dev

    @zooba
    Member

    zooba commented Sep 6, 2016

    Latest patch is attached.

    PEP acceptance is sounding likely, so feel free to critically review.

    @zooba
    Member

    zooba commented Sep 7, 2016

    Updated patch based on some suggestions from Eryk.

    The PEP has been accepted, so now I just need to land it in the next two days.

    Currently "normal" usage here is fine, and some edge cases match the Python 3.5 behaviour. I'm going to go through now and bulk out the tests to try and catch more problems, but modulo that I hope the implementation is nearly ready.

    @zooba
    Member

    zooba commented Sep 7, 2016

    I can't actually come up with many useful tests for this... so far I can validate that we can open the console IO object from 0, 1, 2, "CON", "CONIN$" and "CONOUT$", get fileno(), check readable()/writable() and close (multiple times without crashing).

    Anything else requires a real console with a real person with a real keyboard.

    But I fixed a couple of issues in fd handling as a result of the tests, so it's not a complete waste.

    @berkerpeksag
    Member

    I left some minor comments for Doc/whatsnew/3.6.rst on Rietveld.

    In Lib/test/test_winconsoleio.py:

    • self.assert_() (deprecated) can be replaced by self.assertTrue()

    • We can add

      if __name__ == '__main__':
          unittest.main()

    @zooba
    Member

    zooba commented Sep 8, 2016

    Thanks! I've made the changes you suggested.

    @vadmium
    Member

    vadmium commented Sep 8, 2016

    +++ b/Lib/test/test_winconsoleio.py
    +to real people with real keyborads.
    Should be keyboards
    There are still assert_() calls in this file (1602_6.patch). Did you miss them?

    +++ b/Lib/io.py
    +from _io import WindowsConsoleIO
    +__all__.append('WindowsConsoleIO')
    I think you should either document this class, or remove it from __all__ to clarify it is just an implementation detail.

    +++ b/Modules/_io/winconsoleio.c
    +_io_WindowsConsoleIO___init___impl
    + PyObject *decodedname = Py_None;
    + Py_INCREF(decodedname);
    + int d = PyUnicode_FSDecoder(nameobj, (void*)&decodedname);
    Won’t this leak a reference to Py_None?
    (Also, I think needless casting like in the last line can mask mistakes that the compiler would otherwise pick up. Imagine if you got the parameters around the wrong way.)

    +read_console_w(HANDLE handle, DWORD maxlen, DWORD *readlen) {
    + /* If we didn't read a full buffer that time, don't try
    + again or we will block a second time. */
    I’m not familiar with the Windows APIs involved, but this doesn’t seem robust. What if there were exactly one full buffer waiting, would the next call block without returning anything?

    @vadmium
    Member

    vadmium commented Sep 8, 2016

    Ah sorry I see Berker’s assert_() comment was _after_ you posted 1602_6.patch, so ignore that bit :)

    @vadmium
    Member

    vadmium commented Sep 8, 2016

    Also as I understand it, the open() function can return this new class, so the documentation at <https://docs.python.org/3.6/library/functions.html#open> needs updating.

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 8, 2016

    New changeset 6142d2d3c471 by Steve Dower in branch 'default':
    Issue bpo-1602: Windows console doesn't input or print Unicode (PEP-528)
    https://hg.python.org/cpython/rev/6142d2d3c471

    @zooba zooba closed this as completed Sep 9, 2016
    @eryksun
    Contributor

    eryksun commented Sep 9, 2016

    Martin, the console should be in line-input mode, in which case ReadConsole will block if there isn't at least one line in the input buffer. It reads up to the lesser of a complete line or the number of UTF-16 codes requested. If the previous call read the entire request size but didn't stop on '\n', then we know the next call shouldn't block because the input buffer has at least one '\n' in it.

    I can validate that we can open the console IO object from
    0, 1, 2, "CON", "CONIN$" and "CONOUT$", get fileno(), check
    readable()/writable() and close (multiple times without
    crashing).

    I like the idea to have fileno() lazily get a file descriptor on demand, but _open_osfhandle is a low I/O function that uses _open flags -- not 'rb' (int 0x7262) or 'wb' (int 0x7762). ;-)

    You can use _O_RDONLY | _O_BINARY or _O_WRONLY | _O_BINARY. But really those values would be ignored anyway. It's not actually opening the file, so it only cares about a few flags. Specifically, in lowio\osfinfo.cpp I see that it looks for _O_APPEND, _O_TEXT, and _O_NOINHERIT.
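
As an illustration of the point about 'rb' and 'wb', the sketch below computes the integer value of those multi-character constants and checks which of the three flags named above happen to be set in them. The flag values are hard-coded from the Microsoft CRT header fcntl.h as an assumption (so the sketch runs on any platform); the helper names `multichar` and `inspected_flags` are hypothetical.

```python
# Flag values as defined in the Microsoft CRT header fcntl.h (assumed here,
# hard-coded so the sketch runs on any platform).
O_RDONLY = 0x0000
O_WRONLY = 0x0001
O_APPEND = 0x0008
O_NOINHERIT = 0x0080
O_TEXT = 0x4000
O_BINARY = 0x8000

# The three flags the comment above says _open_osfhandle looks for.
INSPECTED = {"_O_APPEND": O_APPEND, "_O_TEXT": O_TEXT, "_O_NOINHERIT": O_NOINHERIT}


def multichar(literal):
    """Integer value of a C multi-character constant such as 'rb' under MSVC."""
    return int.from_bytes(literal.encode("ascii"), "big")


def inspected_flags(value):
    """Names of the inspected flags whose bit is set in value."""
    return sorted(name for name, bit in INSPECTED.items() if value & bit)
```

Under these assumed values, both 0x7262 and 0x7762 contain the _O_TEXT bit, while _O_RDONLY | _O_BINARY and _O_WRONLY | _O_BINARY contain none of the three inspected flags.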

    On line 329, the following assignment

        if (self->writable)
            access |= GENERIC_WRITE;
    

    should be access = GENERIC_WRITE. Requesting both read and write access is an invalid parameter when opening "CON", as can be seen here:

        >>> f = open('CON', 'wb', buffering=0)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        OSError: [WinError 87] The parameter is incorrect: 'CON'

    CONOUT$ works, of course:

        >>> f = open('CONOUT$', 'wb', buffering=0)
        >>> f
        <_io._WindowsConsoleIO mode='wb' closefd=True>
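
The suggested change can be sketched in Python as a choice of exactly one access right. This mirrors the logic only; it is not the C source, and the function name `console_access` is hypothetical. GENERIC_READ and GENERIC_WRITE are the standard Win32 access-right values.

```python
GENERIC_READ = 0x80000000   # Win32 access-right constants
GENERIC_WRITE = 0x40000000


def console_access(writable):
    """Pick exactly one access right; 'CON' rejects read and write combined."""
    access = GENERIC_READ
    if writable:
        access = GENERIC_WRITE  # plain assignment, not |=
    return access
```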

    Lastly, for a readall that starts with ^Z, you're still breaking out of the loop before incrementing len, which is thus 0 when subsequently checked. It ends up calling WideCharToMultiByte with len == 0, which fails.

        >>> sys.stdin.buffer.raw.read()
        ^Z
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        OSError: [WinError 87] The parameter is incorrect

    I can't actually come up with many useful tests for this...

    ctypes can be used to write to the input buffer and read from a screen buffer. For the latter it helps to first create and activate a scratch screen buffer, initialized to NULs to make it easy to read back everything that was written up to the current cursor position. I have existing ctypes code for this, written to solve the problem of a subprocess that stubbornly writes directly to the console instead of writing to stdout/stderr pipes.
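
A rough sketch of the "write to the input buffer" half of that idea, using ctypes and WriteConsoleInputW. This is not the existing code mentioned above. It models only the key-event member of the INPUT_RECORD union, uses fixed-width ctypes types so the structures can be built on any platform, and assumes the process's standard input handle is a real console. The helper names `make_key_records` and `write_console_input` are hypothetical.

```python
import ctypes
import sys

KEY_EVENT = 0x0001
STD_INPUT_HANDLE = -10


class KEY_EVENT_RECORD(ctypes.Structure):
    _fields_ = [
        ("bKeyDown", ctypes.c_int32),            # BOOL
        ("wRepeatCount", ctypes.c_uint16),       # WORD
        ("wVirtualKeyCode", ctypes.c_uint16),    # WORD
        ("wVirtualScanCode", ctypes.c_uint16),   # WORD
        ("UnicodeChar", ctypes.c_uint16),        # WCHAR member of the uChar union
        ("dwControlKeyState", ctypes.c_uint32),  # DWORD
    ]


class INPUT_RECORD(ctypes.Structure):
    # Only the key-event member of the Event union is modelled here.
    _fields_ = [
        ("EventType", ctypes.c_uint16),
        ("KeyEvent", KEY_EVENT_RECORD),
    ]


def make_key_records(text):
    """Build a key-down/key-up INPUT_RECORD pair for each UTF-16 code unit."""
    data = text.encode("utf-16-le")
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    records = (INPUT_RECORD * (2 * len(units)))()
    for i, unit in enumerate(units):
        for j, down in enumerate((1, 0)):
            rec = records[2 * i + j]  # a view onto the array element
            rec.EventType = KEY_EVENT
            rec.KeyEvent.bKeyDown = down
            rec.KeyEvent.wRepeatCount = 1
            rec.KeyEvent.UnicodeChar = unit
    return records


def write_console_input(text):
    """Windows only: queue text in the console input buffer."""
    if sys.platform != "win32":
        raise OSError("WriteConsoleInputW is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.restype = ctypes.c_void_p
    handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)
    records = make_key_records(text)
    written = ctypes.c_uint32(0)
    ok = kernel32.WriteConsoleInputW(
        ctypes.c_void_p(handle), records, len(records), ctypes.byref(written))
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return written.value
```

Characters outside the Basic Multilingual Plane become two code units and therefore two key-event pairs, which is how the console itself represents them.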

    @vadmium
    Member

    vadmium commented Sep 10, 2016

    Okay so regarding blocking reads with a full buffer, what you are saying is the second check to break the read loop should be sufficient:

    +/* If the buffer ended with a newline, break out */
    +if (buf[*readlen - 1] == '\n')
    + break;
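
The two break conditions can be simulated without a console. The sketch below uses a stand-in object that behaves as line-input mode was described earlier in the thread (each read returns the lesser of one complete line or the requested size) and raises where a real call would block. `FakeLineConsole` and `read_line` are hypothetical names, and the loop is a simplified model of the C function rather than a copy of it.

```python
class FakeLineConsole:
    """Stand-in for a line-input console: read() returns at most one line."""

    def __init__(self, pending):
        self.pending = pending  # text already typed, made of whole lines
        self.reads = 0

    def read(self, maxlen):
        if not self.pending:
            raise RuntimeError("a real ReadConsole call would block here")
        self.reads += 1
        line_end = self.pending.find("\n") + 1  # pending always ends in '\n'
        n = min(line_end, maxlen)
        chunk, self.pending = self.pending[:n], self.pending[n:]
        return chunk


def read_line(console, bufsize):
    """Read one line in bufsize pieces without ever issuing a blocking read."""
    parts = []
    while True:
        chunk = console.read(bufsize)
        parts.append(chunk)
        if len(chunk) < bufsize:
            break  # short read: the line ended inside the buffer
        if chunk.endswith("\n"):
            break  # exactly one full buffer, ending on the newline
    return "".join(parts)
```

A full chunk that does not end in a newline means the rest of the line is still queued, so the next read returns immediately; that is the case the second check distinguishes from an exactly-full final chunk.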

    @davispuh
    Mannequin

    davispuh mannequin commented Sep 20, 2016

    Steve Dower (steve.dower)

    [...]
    Anything else requires a real console with a real person with a real keyboard.

    FYI, not really: it is possible to test the console's output/input fully automatically using WinAPI functions such as WriteConsoleInput, GetConsoleScreenBufferInfo, and ReadConsoleOutputCharacter.

    I wrote such a test very recently; you can look at it as an example: http://review.source.kitware.com/gitweb?p=KWSys.git;a=blob;f=testConsoleBuf.cxx;hb=HEAD

    It tests all three cases: when the output is an actual console, a redirected pipe, and a file.

    @zooba
    Member

    zooba commented Sep 20, 2016

    Oh nice, I like that. We should definitely add some tests using that (though it seems like quite a big task... maybe I'll open a new issue for it).

    @zooba
    Member

    zooba commented Sep 20, 2016

    Created bpo-28217 for adding these tests.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    abeir pushed a commit to abeir/depot_tools that referenced this issue Apr 24, 2024
    The fix_encoding module within depot_tools was included back in the python2[1] days as a be-all encoding fix boilerplate that is called across depot_tools scripts.

    However, now that depot_tools has officially deprecated py2 and supports >= 3.8[2], the boilerplate is not needed anymore.

    * `fix_win_codec()`[3] The 'cp65001' codec issue this fixes is fixed in python 3.3[4].
    * `fix_default_encoding()`[5] python3 defaults to utf8.
    * `fix_win_sys_argv()`[6] The sys.argv unicode issue is fixed in python3[7].
    * `fix_win_console()`[8] Fixed[9].
    
    TODO: <Get performance changes in windows>.
    
    [1] https://codereview.chromium.org/6721029
    [2] https://crrev.com/371aa997c04791d21e222ed43a1a0d55b450dd53/README.md
    [3] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=123-132;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [4] python/cpython#57425 (comment)
    [5] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=29-66;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [6] https://crsrc.org/d/fix_encoding.py;l=73-120;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [7] python/cpython#46381 (comment)
    [8] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=315-344;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [9] python/cpython#45943 (comment)
    
    Bug: 1501984
    Change-Id: I1d512a4b1bfe14e680ac0aa08027849b999cc638
    abeir pushed a commit to abeir/depot_tools that referenced this issue Apr 24, 2024
    The fix_encoding module within depot_tools was included back in the python2[1] days as a be-all encoding fix boilerplate that is called across depot_tools scripts.

    However, now that depot_tools has officially deprecated py2 and supports >= 3.8[2], the boilerplate is not needed anymore.

    * `fix_win_codec()`[3] The 'cp65001' codec issue this fixes is fixed in python 3.3[4].
    * `fix_default_encoding()`[5] python3 defaults to utf8.
    * `fix_win_sys_argv()`[6] The sys.argv unicode issue is fixed in python3[7].
    * `fix_win_console()`[8] Fixed[9].
    
    Benchmarking on windows:
    * Baseline (http://gpaste/6701096112750592): ~1min 41sec.
    
    [1] https://codereview.chromium.org/6721029
    [2] https://crrev.com/371aa997c04791d21e222ed43a1a0d55b450dd53/README.md
    [3] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=123-132;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [4] python/cpython#57425 (comment)
    [5] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=29-66;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [6] https://crsrc.org/d/fix_encoding.py;l=73-120;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [7] python/cpython#46381 (comment)
    [8] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=315-344;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [9] python/cpython#45943 (comment)
    
    Bug: 1501984
    Change-Id: I1d512a4b1bfe14e680ac0aa08027849b999cc638