Author eryksun
Recipients eryksun, paul.moore, python-dev, steve.dower, tim.golden, zach.ware
Date 2016-09-21.05:46:31
Message-id <1474436793.57.0.776081344241.issue28162@psf.upfronthosting.co.za>
In-reply-to
Content
For breaking out of the readall while loop, you only need to check if the current read is empty:

        /* when the read is empty we break */
        if (n == 0)
            break;

Also, the logic is wrong here:

    if (len == 0 || buf[0] == '\x1a' && _buflen(self) == 0) {
        /* when the result starts with ^Z we return an empty buffer */
        PyMem_Free(buf);
        return PyBytes_FromStringAndSize(NULL, 0);
    }

Because && binds more tightly than ||, this is true when len is 0, or when buf[0] is Ctrl+Z and _buflen(self) is 0. Since buf[0] should never be Ctrl+Z here (low-level EOF handling is abstracted in read_console_w), the internal buffer is effectively never checked. We can easily see this going wrong here:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    Ā^Z
    >>> a
    b'\xc4'
    >>> b
    b''

It misses the remaining byte in the internal buffer.

This check can be simplified as follows:

    rn = _buflen(self);

    if (len == 0 && rn == 0) {
        /* return an empty buffer */
        PyMem_Free(buf);
        return PyBytes_FromStringAndSize(NULL, 0);
    }
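
To make the difference between the two conditions concrete, here is a small Python model. It is only a sketch: old_check and new_check are hypothetical names, and rn stands in for _buflen(self). It relies on nothing more than && binding more tightly than || in C, just as "and" binds more tightly than "or" in Python.

```python
CTRL_Z = "\x1a"

def old_check(length, first_char, rn):
    # Mirrors: len == 0 || buf[0] == '\x1a' && _buflen(self) == 0
    # '&&' binds tighter than '||', so rn is ignored whenever length == 0.
    return length == 0 or (first_char == CTRL_Z and rn == 0)

def new_check(length, rn):
    # Mirrors: len == 0 && rn == 0
    return length == 0 and rn == 0

# Case from the session above: nothing new was read, but one byte
# (b'\x80') is still pending in the internal buffer.
print(old_check(0, "", 1))  # True: returns b'' and drops the pending byte
print(new_check(0, 1))      # False: falls through so the byte is returned
```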

After this, the code assumes that len isn't 0, which leads to more WideCharToMultiByte failure cases.

In the last conversion, it overwrites bytes_size without including rn.

I'm not sure what's going on with _PyBytes_Resize(&bytes, n * sizeof(wchar_t)). ISTM it should be resized to bytes_size, making sure that this includes rn.

Finally, _copyfrombuf is repeatedly overwriting buf[0] instead of writing to buf[n]. 
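
The effect of that indexing mistake can be modeled in a few lines of Python. This is only an illustrative sketch, not the C code from the patch: copy_from_buf_buggy and copy_from_buf_fixed are hypothetical names, pending plays the role of the internal buffer, and out plays the role of the caller's buffer.

```python
def copy_from_buf_buggy(pending, out):
    n = 0
    while pending and n < len(out):
        out[0] = pending.pop(0)  # bug: always writes to index 0
        n += 1
    return n

def copy_from_buf_fixed(pending, out):
    n = 0
    while pending and n < len(out):
        out[n] = pending.pop(0)  # writes to the next free slot
        n += 1
    return n

buggy = bytearray(2)
copy_from_buf_buggy(bytearray(b"\x88\xb4"), buggy)
fixed = bytearray(2)
copy_from_buf_fixed(bytearray(b"\x88\xb4"), fixed)
print(bytes(buggy))  # b'\xb4\x00'
print(bytes(fixed))  # b'\x88\xb4'
```

Both versions report two bytes copied, but only the second leaves them in order in the output buffer.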

With the attached patch, the behavior seems correct now:

    >>> sys.stdin.buffer.raw.read()
    ^Z
    b''

    >>> sys.stdin.buffer.raw.read()
    abc^Z
    ^Z
    b'abc\x1a\r\n'

Split U+0100:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    Ā^Z
    >>> a
    b'\xc4'
    >>> b
    b'\x80'

Split U+1234:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    ሴ^Z
    >>> a
    b'\xe1'
    >>> b
    b'\x88\xb4'
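
As a cross-check on the expected bytes, the UTF-8 encodings can be split the same way in plain Python, independently of the console. split_utf8 is a hypothetical helper used only for this illustration.

```python
def split_utf8(ch):
    # Encode one character as UTF-8 and split off the first byte,
    # as read(1) followed by read() would.
    encoded = ch.encode("utf-8")
    return encoded[:1], encoded[1:]

for ch in ("\u0100", "\u1234"):
    first, rest = split_utf8(ch)
    print(f"U+{ord(ch):04X}: {first!r} + {rest!r}")
# U+0100: b'\xc4' + b'\x80'
# U+1234: b'\xe1' + b'\x88\xb4'
```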

The buffer still can't handle splitting an initial non-BMP character, which is stored as a surrogate pair. Both surrogate codes end up as replacement characters (U+FFFD) because they aren't transcoded as a unit.

Split U+00010000:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    𐀀^Z
    ^Z
    >>> a
    b'\xef'
    >>> b
    b'\xbf\xbd\xef\xbf\xbd\x1a\r\n'
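
The replacement characters in that output can be reproduced outside the console. The sketch below uses Python's UTF-16 codec as a stand-in for WideCharToMultiByte, on the assumption that both substitute U+FFFD for a lone surrogate, which is consistent with the bytes shown above. transcode is a hypothetical helper.

```python
ch = "\U00010000"
utf16 = ch.encode("utf-16-le")  # two 16-bit code units: D800 DC00
high, low = utf16[:2], utf16[2:]

def transcode(units):
    # Decode UTF-16 code units, substituting U+FFFD for invalid input,
    # then encode the result as UTF-8.
    return units.decode("utf-16-le", errors="replace").encode("utf-8")

as_unit = transcode(utf16)
separately = transcode(high) + transcode(low)
print(as_unit)     # b'\xf0\x90\x80\x80'
print(separately)  # b'\xef\xbf\xbd\xef\xbf\xbd'
```

Transcoded as a unit, the pair yields the expected four-byte sequence. Transcoded one code unit at a time, each half becomes b'\xef\xbf\xbd', matching the b'\xef' returned by read(1) and the leading bytes of b.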