Message 277093 - Python tracker

Author	eryksun
Recipients	eryksun, paul.moore, python-dev, steve.dower, tim.golden, zach.ware
Date	2016-09-21.05:46:31
Message-id	<1474436793.57.0.776081344241.issue28162@psf.upfronthosting.co.za>
In-reply-to

Content

For breaking out of the readall while loop, you only need to check if the current read is empty:

        /* when the read is empty we break */
        if (n == 0)
            break;

Also, the logic is wrong here:

    if (len == 0 || buf[0] == '\x1a' && _buflen(self) == 0) {
        /* when the result starts with ^Z we return an empty buffer */
        PyMem_Free(buf);
        return PyBytes_FromStringAndSize(NULL, 0);
    }

Because && binds tighter than ||, this is true when len is 0, or when buf[0] is Ctrl+Z and _buflen(self) is 0. Since buf[0] shouldn't ever be Ctrl+Z here (low-level EOF handling is abstracted in read_console_w), the internal buffer is effectively never checked. We can easily see this going wrong here:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    Ā^Z
    >>> a
    b'\xc4'
    >>> b
    b''

It misses the remaining byte in the internal buffer.
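To make the grouping of that condition concrete, here is a small Python model of it. This is only a sketch: `original_check` and its parameter names are hypothetical stand-ins for the C variables `len`, `buf[0]` and `_buflen(self)`, not code from the patch.

```python
CTRL_Z = 0x1A


def original_check(length: int, first_unit: int, pending: int) -> bool:
    """Model of: len == 0 || buf[0] == '\\x1a' && _buflen(self) == 0."""
    # Like && and || in C, `and` binds tighter than `or`,
    # so the expression groups as A or (B and C).
    return length == 0 or (first_unit == CTRL_Z and pending == 0)


# The failing session: read(1) left one byte (0x80) pending, and the
# following console read returned no characters (len == 0).
print(original_check(length=0, first_unit=0, pending=1))  # True
```

The condition is true even though one byte is still pending, so the early return hands back b'' and the pending byte is never delivered.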

This check can be simplified as follows:

    rn = _buflen(self);

    if (len == 0 && rn == 0) {
        /* return an empty buffer */
        PyMem_Free(buf);
        return PyBytes_FromStringAndSize(NULL, 0);
    }

After this the code assumes that len isn't 0, which leads to more WideCharToMultiByte failure cases. 

In the last conversion it overwrites bytes_size without including rn.

I'm not sure what's going on with _PyBytes_Resize(&bytes, n * sizeof(wchar_t)). ISTM it should be resized to bytes_size, and bytes_size should include rn.

Finally, _copyfrombuf is repeatedly overwriting buf[0] instead of writing to buf[n]. 
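The indexing mistake is easy to see in a Python model. This is a sketch of the pattern described, not the actual C code of _copyfrombuf; `copy_buggy` and `copy_fixed` are hypothetical names.

```python
def copy_buggy(pending: bytes, dest: bytearray) -> int:
    """Copy pending bytes into dest, but always write to index 0."""
    n = 0
    while n < len(dest) and n < len(pending):
        dest[0] = pending[n]  # bug: destination index never advances
        n += 1
    return n


def copy_fixed(pending: bytes, dest: bytearray) -> int:
    """Copy pending bytes into dest at the matching index."""
    n = 0
    while n < len(dest) and n < len(pending):
        dest[n] = pending[n]  # write at the running offset
        n += 1
    return n


buggy, fixed = bytearray(3), bytearray(3)
copy_buggy(b"abc", buggy)
copy_fixed(b"abc", fixed)
print(bytes(buggy))  # b'c\x00\x00'
print(bytes(fixed))  # b'abc'
```

Both versions report three bytes copied, but the buggy one leaves only the last byte in the destination.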

With the attached patch, the behavior seems correct now:

    >>> sys.stdin.buffer.raw.read()
    ^Z
    b''

    >>> sys.stdin.buffer.raw.read()
    abc^Z
    ^Z
    b'abc\x1a\r\n'

Split U+0100:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    Ā^Z
    >>> a
    b'\xc4'
    >>> b
    b'\x80'

Split U+1234:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    ሴ^Z
    >>> a
    b'\xe1'
    >>> b
    b'\x88\xb4'
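The expected bytes in these two sessions follow from the UTF-8 encodings of the characters. The sketch below models read(1) followed by read() over an in-memory byte buffer; `PendingByteReader` and `split_read` are hypothetical helpers, not part of the console I/O code, and Ctrl+Z handling is left out.

```python
class PendingByteReader:
    """Hypothetical stand-in for a reader that keeps leftover UTF-8 bytes."""

    def __init__(self, text: str) -> None:
        self._data = text.encode("utf-8")
        self._pos = 0

    def read(self, size: int = -1) -> bytes:
        # A negative size means "everything that is left".
        if size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


def split_read(text: str) -> tuple:
    reader = PendingByteReader(text)
    first = reader.read(1)  # first byte of the encoded character
    rest = reader.read()    # whatever remained pending
    return first, rest


print(split_read("\u0100"))  # (b'\xc4', b'\x80')
print(split_read("\u1234"))  # (b'\xe1', b'\x88\xb4')
```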

The buffer still can't handle splitting an initial non-BMP character, which is stored as a surrogate pair. Both surrogate codes end up as replacement characters (U+FFFD, b'\xef\xbf\xbd' in UTF-8) because they aren't transcoded as a unit.

Split U+00010000:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    𐀀^Z
    ^Z
    >>> a
    b'\xef'
    >>> b
    b'\xbf\xbd\xef\xbf\xbd\x1a\r\n'
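The replacement bytes above are what you get when each half of a surrogate pair is converted on its own. The following sketch converts UTF-16 code units to UTF-8 and substitutes U+FFFD for unpaired surrogates, on the assumption that this matches the conversion's default behavior here; `transcode_units` is a hypothetical helper, not the function the C code calls.

```python
def transcode_units(units: list) -> bytes:
    """Convert UTF-16 code units to UTF-8, replacing lone surrogates."""
    chars = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A complete pair combines into one code point above U+FFFF.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(cp))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            chars.append("\ufffd")  # unpaired surrogate
            i += 1
        else:
            chars.append(chr(u))
            i += 1
    return "".join(chars).encode("utf-8")


# U+10000 is stored as the pair D800 DC00.
print(transcode_units([0xD800, 0xDC00]))                      # b'\xf0\x90\x80\x80'
print(transcode_units([0xD800]) + transcode_units([0xDC00]))  # b'\xef\xbf\xbd\xef\xbf\xbd'
```

Transcoded together, the pair yields the four-byte UTF-8 sequence. Transcoded one code unit at a time, each half becomes its own three-byte replacement character, which is the byte pattern seen in a and b above.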

History
Date	User	Action	Args
2016-09-21 05:46:33	eryksun	set	recipients: + eryksun, paul.moore, tim.golden, python-dev, zach.ware, steve.dower
2016-09-21 05:46:33	eryksun	set	messageid: <1474436793.57.0.776081344241.issue28162@psf.upfronthosting.co.za>
2016-09-21 05:46:33	eryksun	link	issue28162 messages
2016-09-21 05:46:31	eryksun	create