Author serhiy.storchaka
Recipients Brian.Cain, benjamin.peterson, serhiy.storchaka, terry.reedy
Date 2015-11-06.21:34:38
Message-id <1446845678.91.0.520538985619.issue25388@psf.upfronthosting.co.za>
In-reply-to
Content
Yes, there is a bug. When decoding_fgets() encounters non-UTF-8 bytes, it fails and frees the input buffer in error_ret(). But since tok->cur != tok->inp still holds, the next call to tok_nextc() takes the fast path and reads freed memory.

        if (tok->cur != tok->inp) {
            return Py_CHARMASK(*tok->cur++); /* Fast path */
        }

If Python does not crash here, a new buffer is allocated and assigned to tok->buf, then PyTokenizer_Get() returns an error, and parsetok() calculates the position of the error:

            err_ret->offset = (int)(tok->cur - tok->buf);

but tok->cur still points inside the old, freed buffer, so the computed offset is far too large. err_input() tries to decode the part of the string before the error with the "replace" error handler, but since the position was calculated wrongly, it reads outside the allocated memory.
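
The offset arithmetic can be illustrated with a small model. This is only a sketch, not CPython code: buffer addresses are modeled as plain integers (subtracting real pointers into different allocations is undefined behavior in C), and the names model_offset and offset_in_bounds, as well as any addresses passed to them, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: addresses are plain integers so the arithmetic
   stays well defined. Mirrors the shape of
       offset = tok->cur - tok->buf
   but is not taken from CPython. */
int64_t model_offset(uint64_t cur_addr, uint64_t buf_addr)
{
    /* If cur_addr < buf_addr the unsigned difference wraps; on the usual
       two's-complement platforms the cast yields a negative value. */
    return (int64_t)(cur_addr - buf_addr);
}

/* A usable offset must lie within [0, buf_len] of the current buffer. */
int offset_in_bounds(int64_t offset, uint64_t buf_len)
{
    return offset >= 0 && (uint64_t)offset <= buf_len;
}
```

With cur and buf in the same buffer the offset is small and in bounds. Once buf has been reassigned to a new allocation while cur still refers to the old one, the difference is just the distance between two unrelated addresses, and a check like offset_in_bounds() rejects it.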

The proposed patch fixes the issue. It sets tok->done and the pointers when a decoding error occurs, so they are left in a consistent state. It also removes some duplicated or dead code.
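
The idea of leaving the state consistent can be sketched as follows. This is a minimal illustration and not the patch itself: struct mini_tok, the mini_tok_* functions and the MINI_* status codes are hypothetical. Only the field names buf, cur, inp and done follow the tokenizer fields mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, stripped-down tokenizer state (not the CPython struct). */
struct mini_tok {
    char *buf;   /* start of the input buffer */
    char *cur;   /* next character to read */
    char *inp;   /* end of the data in the buffer */
    int done;    /* nonzero once the tokenizer has stopped */
};

enum { MINI_OK = 0, MINI_DECODE_ERROR = 1 };  /* hypothetical status codes */

/* Fill the tokenizer with a copy of text; returns 0 on success. */
int mini_tok_load(struct mini_tok *tok, const char *text)
{
    size_t n = strlen(text);
    tok->buf = malloc(n + 1);
    if (tok->buf == NULL)
        return -1;
    memcpy(tok->buf, text, n + 1);
    tok->cur = tok->buf;
    tok->inp = tok->buf + n;
    tok->done = MINI_OK;
    return 0;
}

/* On a decoding error, release the buffer AND reset every pointer that
   referred to it, so that cur == inp and no stale address survives. */
void mini_tok_fail(struct mini_tok *tok)
{
    free(tok->buf);
    tok->buf = tok->cur = tok->inp = NULL;
    tok->done = MINI_DECODE_ERROR;
}

/* Reader with the same shape as the fast path: dereference cur only
   while unread data remains; otherwise return -1. */
int mini_tok_nextc(struct mini_tok *tok)
{
    if (tok->done != MINI_OK)
        return -1;
    if (tok->cur != tok->inp)
        return (unsigned char)*tok->cur++;
    return -1;
}
```

Because mini_tok_fail() resets cur and inp together with freeing buf, the fast-path condition cur != inp is false afterwards, and a later read returns -1 instead of touching freed memory.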