This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Flaw in Windows code page decoder for large input
Type: behavior Stage: resolved
Components: Interpreter Core, Windows Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: steve.dower Nosy List: doerwalter, lemburg, miss-islington, paul.moore, serhiy.storchaka, steve.dower, terry.reedy, tim.golden, zach.ware
Priority: normal Keywords: patch

Created on 2019-03-16 08:06 by serhiy.storchaka, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15083 merged steve.dower, 2019-08-02 22:04
PR 15374 merged miss-islington, 2019-08-21 23:22
PR 15375 merged miss-islington, 2019-08-21 23:22
Messages (8)
msg338061 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-16 08:06
There is a flaw in PyUnicode_DecodeCodePageStateful() (exposed as _codecs.code_page_decode() at Python level). Since MultiByteToWideChar() takes the size of the input as C int, it can not be used for decoding more than 2 GiB. Large input is split on chunks of size 2 GiB which are decoded separately. The problem is if it split in the middle of a multibyte character. In this case decoding chunks will always fail or replace incomplete parts of the multibyte character at both ends with what the error handler returns.

It is hard to reproduce this bug, because you need to decode more than 2 GiB, and you will need at least 14 GiB of RAM for this (maybe more).
msg338626 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-03-22 23:30
I have 24G if all working and would be willing to try to run a test case.
msg348921 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-08-02 21:34
If we reduce our chunk size below INT_MAX, then we avoid the issue entirely. Our logic for hitting the middle of a multibyte character is fine (perhaps fixed since this issue was opened?), there's just a weird edge case at 2 GiB in the API call.

As a bonus, smaller chunks seems to have a performance benefit too. It seems like INT_MAX/4 is the sweet spot - it took about a quarter of the time for my 2GiB test case as INT_MAX (and we're measuring in tens of seconds here, so I'm pretty comfortable with the direction of the result). INT_MAX/2 and INT_MAX/8 were both slower than INT_MAX/4.
msg350131 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-08-21 23:22
New changeset 7ebdda0dbee7df6f0c945a7e1e623e47676e112d by Steve Dower in branch 'master':
bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083)
https://github.com/python/cpython/commit/7ebdda0dbee7df6f0c945a7e1e623e47676e112d
msg350133 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-08-21 23:29
I'll get the 3.7 and 3.8 backports merged - looks like they're trivial. 

Going to need some help with the 2.7 backport, but I'm happy to approve a PR.
msg350136 - (view) Author: miss-islington (miss-islington) Date: 2019-08-21 23:53
New changeset f93c15aedc2ea2cb8b56fc9dbb0d412918992e86 by Miss Islington (bot) in branch '3.8':
bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083)
https://github.com/python/cpython/commit/f93c15aedc2ea2cb8b56fc9dbb0d412918992e86
msg350137 - (view) Author: miss-islington (miss-islington) Date: 2019-08-21 23:56
New changeset 735a960ac98cf414caf910565220ab2761fa542a by Miss Islington (bot) in branch '3.7':
bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083)
https://github.com/python/cpython/commit/735a960ac98cf414caf910565220ab2761fa542a
msg351391 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-09-09 09:53
Declaring this out-of-scope for 2.7, unless someone wants to insist (and provide a PR).
History
Date User Action Args
2022-04-11 14:59:12adminsetgithub: 80492
2019-09-09 09:53:49steve.dowersetstatus: open -> closed
versions: - Python 2.7
messages: + msg351391

resolution: fixed
stage: backport needed -> resolved
2019-08-21 23:56:01miss-islingtonsetmessages: + msg350137
2019-08-21 23:53:59miss-islingtonsetnosy: + miss-islington
messages: + msg350136
2019-08-21 23:29:04steve.dowersetmessages: + msg350133
stage: patch review -> backport needed
2019-08-21 23:22:53miss-islingtonsetpull_requests: + pull_request15086
2019-08-21 23:22:46miss-islingtonsetpull_requests: + pull_request15085
2019-08-21 23:22:36steve.dowersetmessages: + msg350131
2019-08-02 22:04:07steve.dowersetkeywords: + patch
stage: test needed -> patch review
pull_requests: + pull_request14828
2019-08-02 21:58:26steve.dowersetassignee: steve.dower
versions: + Python 3.9
2019-08-02 21:34:20steve.dowersetmessages: + msg348921
2019-03-22 23:30:25terry.reedysetnosy: + terry.reedy

messages: + msg338626
stage: test needed
2019-03-16 08:06:40serhiy.storchakacreate