classification
Title: Support reading long lines with io._WindowsConsoleIO
Type: enhancement
Stage: needs patch
Components: IO, Unicode, Windows
Versions: Python 3.10, Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: eryksun, ezio.melotti, paul.moore, serhiy.storchaka, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords:

Created on 2020-09-24 00:33 by eryksun, last changed 2020-09-24 22:27 by eryksun.

Messages (3)
msg377436 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-09-24 00:33
io._WindowsConsoleIO reads from the console via ReadConsoleW in line-input mode, which limits the line length to the maximum of 256 and the size of the client buffer, including the trailing CRLF (or just CR if processed-input mode is disabled). Text that's typed or pasted beyond the length limit is ignored. The call returns when a carriage return ("\r") is read or when the user presses Enter anywhere on the line.

Currently the buffer that _WindowsConsoleIO passes to ReadConsoleW is capped at 512 wide characters, based on the C runtime's BUFSIZ (512) macro. This is too small. Legacy mode (i.e. PYTHONLEGACYWINDOWSSTDIO) uses io.FileIO with an 8 KiB buffer, which is 8K characters when the console input codepage is a legacy single-byte encoding. _WindowsConsoleIO should support at least this much. 

I'd prefer that it allowed up to 32K characters, which is the upper limit for a process command line or for a long filepath. By way of comparison, input(), which calls _PyOS_WindowsConsoleReadline if stdin is a console file, is currently capped at 16K characters.

Reading up to 32K characters also requires increasing the BufferedReader default buffer size and the TextIOWrapper chunk size to 96 KiB (BMP codes in the range 0x0800-0xFFFF encode as a 3-byte UTF-8 sequence), so that f.readline() and f.buffer.readline() request the maximum size. This would mean changing _io__WindowsConsoleIO___init___impl to set `self->blksize = 96 * 1024` when `console_type == 'r'`, and changing _io_open_impl to manually set the _CHUNK_SIZE of the TextIOWrapper to 96 KiB for console input (type 'r'). Alternatively, TextIOWrapper could just query the raw _blksize as the initial chunk size, which would remove the need to set _CHUNK_SIZE manually.
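
The 96 KiB figure follows from the worst-case UTF-8 expansion of a 32K-character line. A minimal sketch of that arithmetic, where MAX_CHARS and the helper name are illustrative only and not part of CPython:

```python
# Proposed line limit in UTF-16 code units (illustrative constant, not a CPython name).
MAX_CHARS = 32 * 1024

def worst_case_utf8_bytes(n_chars):
    """Upper bound on the UTF-8 size of n_chars UTF-16 code units."""
    # A BMP character in U+0800..U+FFFF takes 3 bytes in UTF-8.
    # A non-BMP character takes 4 bytes but uses 2 code units, so
    # 3 bytes per code unit is the worst case.
    bytes_per_unit = len('\u0800'.encode('utf-8'))
    return n_chars * bytes_per_unit

print(worst_case_utf8_bytes(MAX_CHARS) // 1024, 'KiB')
```

A smaller buffer or chunk size would presumably make readline() request less than a full line from the raw console file, which is the situation the larger sizes are meant to avoid.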
msg377448 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-09-24 15:59
I'm in favour of this change in principle, but would want to look at the PR closely.

The biggest risk here is that we have to emulate GNU readline for compatibility, which severely limits the data that can be passed through, and also forces multiple encoding/decoding passes. It would be nice to be able to bypass this in cases where nobody is using it, though since so many host applications use hooks there'll likely only be a benefit to people at the plain console...
msg377463 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-09-24 22:27
> The biggest risk here is that we have to emulate GNU readline for 
> compatibility, which severely limits the data that can be passed 
> through, and also forces multiple encoding/decoding passes. 

I'm not suggesting that we disable the console's line-input and echo-input modes and implement our own line editor. If people don't like the built-in line editor that the console provides, they can use pyreadline, which directly uses the console's low-level API via ctypes.

Here's an example of what I would like to just work by default in Python:

    import sys
    import win32console

    def write_input(h, s):
        records = []
        for c in s:
            b = c.encode('utf-16le')
            for i in range(0, len(b), 2):
                r = win32console.PyINPUT_RECORDType(win32console.KEY_EVENT)
                r.KeyDown = True
                r.RepeatCount = 1
                r.Char = b[i:i+2].decode('utf-16le', 'surrogatepass')
                records.append(r)
        h.WriteConsoleInput(records)

    def write_and_read_line(s):
        if '\r' not in s:
            s += '\r'
        h = win32console.GetStdHandle(win32console.STD_INPUT_HANDLE)
        mode = h.GetConsoleMode()
        h.SetConsoleMode(mode & ~win32console.ENABLE_ECHO_INPUT)
        try:
            write_input(h, s)
            line = sys.stdin.readline()
        finally:
            h.SetConsoleMode(mode)
        return line

    >>> src_line = 'a' * 32765 + '\r'
    >>> res_line = write_and_read_line(src_line)
    >>> assert res_line == src_line.replace('\r', '\n')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError
    >>> len(res_line)
    511
    >>> res_line[:5], res_line[-5:]
    ('aaaaa', 'aaaa\n')


Currently the ReadConsoleW buffer in read_console_w is capped at 512 (BUFSIZ) characters. With the console's processed-input mode enabled, ReadConsoleW writes a trailing CRLF instead of the raw CR, so the user is limited to typing or pasting just 510 characters on a single line.
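
The 510-character limit, and the 511-character result in the session above, can be checked with a small model of the buffer accounting. This is only a sketch: the function name is hypothetical, and it assumes the line terminator always occupies space inside the buffer, as described above.

```python
BUFSIZ = 512  # current cap on the ReadConsoleW buffer, in wide characters

def max_typed_chars(buffer_chars, processed_input=True):
    """Characters the user can enter before the line terminator."""
    # Processed-input mode stores CRLF (2 chars); otherwise a lone CR (1 char).
    terminator = 2 if processed_input else 1
    return buffer_chars - terminator

typed = max_typed_chars(BUFSIZ)
# readline() returns the typed characters plus a single '\n'.
print(typed, typed + 1)
```

Under the same model, a 32K-character buffer would leave room for 32766 typed characters in processed-input mode.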

I was only thinking of increasing the default maximum size up to 32K in read_console_w, removing the fixed BUFSIZ aspect of the implementation in favor of capping the buffer used at BUFMAX. In practice this also requires similar default increases for the BufferedReader size and TextIOWrapper chunk size.
History
Date                 User              Action  Args
2020-09-24 22:27:40  eryksun           set     messages: + msg377463
2020-09-24 15:59:55  steve.dower       set     messages: + msg377448
2020-09-24 09:50:27  vstinner          set     nosy: - vstinner
2020-09-24 06:45:16  serhiy.storchaka  set     nosy: + serhiy.storchaka
2020-09-24 00:33:10  eryksun           create