Author serhiy.storchaka
Recipients doerwalter, lemburg, paul.moore, serhiy.storchaka, steve.dower, tim.golden, zach.ware
Date 2019-03-16.08:06:40
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1552723600.79.0.686719955113.issue36311@roundup.psfhosted.org>
In-reply-to
Content
There is a flaw in PyUnicode_DecodeCodePageStateful() (exposed at the Python level as _codecs.code_page_decode()). Since MultiByteToWideChar() takes the size of its input as a C int, it cannot be used to decode more than 2 GiB at once. Large input is therefore split into chunks of 2 GiB which are decoded separately. The problem arises when the split falls in the middle of a multibyte character: decoding the chunks will then either always fail or replace the incomplete halves of the multibyte character at both chunk boundaries with whatever the error handler returns.

It is hard to reproduce this bug, because you need to decode more than 2 GiB of data, and that requires at least 14 GiB of RAM (maybe more).
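While the full-scale bug needs multi-gigabyte input, the failure mode itself can be sketched at small scale. In this hypothetical illustration, UTF-8 stands in for the Windows code page and a 2-byte chunk size stands in for the 2 GiB limit; the chunk sizes and encoding are illustrative assumptions, not the actual code path:

```python
import codecs

data = "héllo".encode("utf-8")  # b'h\xc3\xa9llo' -- 'é' occupies two bytes

CHUNK = 2  # the chunk boundary falls between b'\xc3' and b'\xa9'

# Decoding each chunk independently (analogous to the flawed chunking
# described above) corrupts the split character: each incomplete half
# is replaced by what the error handler returns (here U+FFFD).
broken = "".join(
    data[i:i + CHUNK].decode("utf-8", errors="replace")
    for i in range(0, len(data), CHUNK)
)

# A stateful incremental decoder keeps pending bytes between chunks,
# so the character split across the boundary is decoded correctly.
dec = codecs.getincrementaldecoder("utf-8")()
correct = "".join(
    dec.decode(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)
) + dec.decode(b"", final=True)

print(broken)   # 'h\ufffd\ufffdllo' -- replacement chars at both boundary halves
print(correct)  # 'héllo'
```

The contrast between the two loops shows why carrying decoder state across chunk boundaries matters: without it, both halves of the split character are handed to the error handler.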
History
Date User Action Args
2019-03-16 08:06:40serhiy.storchakasetrecipients: + serhiy.storchaka, lemburg, doerwalter, paul.moore, tim.golden, zach.ware, steve.dower
2019-03-16 08:06:40serhiy.storchakasetmessageid: <1552723600.79.0.686719955113.issue36311@roundup.psfhosted.org>
2019-03-16 08:06:40serhiy.storchakalinkissue36311 messages
2019-03-16 08:06:40serhiy.storchakacreate