Flaw in Windows code page decoder for large input #80492

Closed
serhiy-storchaka opened this issue Mar 16, 2019 · 8 comments
Labels: 3.7 (EOL) end of life; 3.8 only security fixes; 3.9 only security fixes; interpreter-core (Objects, Python, Grammar, and Parser dirs); OS-windows; type-bug (An unexpected behavior, bug, or error)

Comments

@serhiy-storchaka
Member

BPO 36311
Nosy @malemburg, @doerwalter, @terryjreedy, @pfmoore, @tjguk, @zware, @serhiy-storchaka, @zooba, @miss-islington
PRs
  • bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance #15083
  • [3.8] bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083) #15374
  • [3.7] bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083) #15375
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/zooba'
    closed_at = <Date 2019-09-09.09:53:49.558>
    created_at = <Date 2019-03-16.08:06:40.740>
    labels = ['interpreter-core', 'type-bug', '3.8', '3.9', '3.7', 'OS-windows']
    title = 'Flaw in Windows code page decoder for large input'
    updated_at = <Date 2019-09-09.09:53:49.556>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2019-09-09.09:53:49.556>
    actor = 'steve.dower'
    assignee = 'steve.dower'
    closed = True
    closed_date = <Date 2019-09-09.09:53:49.558>
    closer = 'steve.dower'
    components = ['Interpreter Core', 'Windows']
    creation = <Date 2019-03-16.08:06:40.740>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36311
    keywords = ['patch']
    message_count = 8.0
    messages = ['338061', '338626', '348921', '350131', '350133', '350136', '350137', '351391']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'doerwalter', 'terry.reedy', 'paul.moore', 'tim.golden', 'zach.ware', 'serhiy.storchaka', 'steve.dower', 'miss-islington']
    pr_nums = ['15083', '15374', '15375']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue36311'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @serhiy-storchaka
    Member Author

    There is a flaw in PyUnicode_DecodeCodePageStateful() (exposed as _codecs.code_page_decode() at the Python level). Since MultiByteToWideChar() takes the size of the input as a C int, it cannot be used to decode more than 2 GiB at once. Large input is split into chunks of 2 GiB each, which are decoded separately. The problem arises when a split falls in the middle of a multibyte character. In that case, decoding the chunks will either fail or replace the incomplete parts of the multibyte character at both ends with whatever the error handler returns.

    It is hard to reproduce this bug, because you need to decode more than 2 GiB, and you will need at least 14 GiB of RAM for this (maybe more).
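The boundary problem itself can be illustrated at a tiny scale without 2 GiB of input. The following is only a minimal sketch, assuming Python's portable "utf-8" codec as a stand-in for a multibyte Windows code page (it does not call _codecs.code_page_decode() or MultiByteToWideChar()); the helper names `decode_chunks_naively` and `decode_chunks_statefully` are hypothetical and exist only for this illustration.

```python
import codecs

text = "é" * 4
data = text.encode("utf-8")   # each "é" is the two bytes C3 A9
split = 3                     # odd offset: falls inside the second character


def decode_chunks_naively(data, split, errors):
    # Decode each chunk on its own, as if the chunks were unrelated inputs.
    head = data[:split].decode("utf-8", errors)
    tail = data[split:].decode("utf-8", errors)
    return head + tail


def decode_chunks_statefully(data, split):
    # An incremental decoder keeps the dangling lead byte of the first
    # chunk and completes the character when the next chunk arrives.
    decoder = codecs.getincrementaldecoder("utf-8")()
    head = decoder.decode(data[:split])
    tail = decoder.decode(data[split:], final=True)
    return head + tail


# With errors="replace", both halves of the split character are replaced.
print(ascii(decode_chunks_naively(data, split, "replace")))

# With errors="strict", the first chunk already fails.
try:
    decode_chunks_naively(data, split, "strict")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# Carrying state across the boundary recovers the original text.
print(decode_chunks_statefully(data, split) == text)
```

The naive variant shows both failure modes described above: an exception under the strict handler, and replacement characters on each side of the boundary under the "replace" handler.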

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-windows type-bug An unexpected behavior, bug, or error labels Mar 16, 2019
    @terryjreedy
    Member

    I have 24 GB of RAM (if it is all working) and would be willing to try running a test case.

    @zooba
    Member

    zooba commented Aug 2, 2019

    If we reduce our chunk size below INT_MAX, then we avoid the issue entirely. Our logic for hitting the middle of a multibyte character is fine (perhaps fixed since this issue was opened?); there's just a weird edge case at 2 GiB in the API call.

    As a bonus, smaller chunks seem to have a performance benefit too. INT_MAX/4 looks like the sweet spot: my 2 GiB test case took about a quarter of the time it took with INT_MAX (and we're measuring in tens of seconds here, so I'm pretty comfortable with the direction of the result). INT_MAX/2 and INT_MAX/8 were both slower than INT_MAX/4.
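The idea of decoding in chunks well below INT_MAX while carrying incomplete trailing bytes into the next chunk can be sketched in pure Python. This is an illustration under stated assumptions, not the C code from the fix: `decode_in_chunks` and `CHUNK_SIZE` are hypothetical names, `INT_MAX` is assumed to be 2**31 - 1 (a 32-bit C int), and the portable "utf-8" codec stands in for a Windows code page.

```python
import codecs

INT_MAX = 2**31 - 1         # assumption: C int is 32 bits
CHUNK_SIZE = INT_MAX // 4   # the chunk size reported as fastest above


def decode_in_chunks(data, encoding="utf-8", chunk_size=CHUNK_SIZE):
    """Decode bytes chunk by chunk, keeping state across chunk boundaries."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Only the last chunk is final; earlier chunks may end in the
        # middle of a character, which the decoder buffers for later.
        is_last = start + chunk_size >= len(data)
        parts.append(decoder.decode(chunk, final=is_last))
    return "".join(parts)


# Small chunk sizes exercise every possible boundary position.
sample = "é" * 5 + "日本語"
encoded = sample.encode("utf-8")
for size in range(1, 8):
    assert decode_in_chunks(encoded, chunk_size=size) == sample
print("all chunk sizes round-trip")
```

Passing a tiny `chunk_size` makes the boundary handling testable without gigabytes of input; the default only matters for inputs larger than about 512 MiB.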

    @zooba zooba added the 3.9 only security fixes label Aug 2, 2019
    @zooba zooba self-assigned this Aug 2, 2019
    @zooba
    Member

    zooba commented Aug 21, 2019

    New changeset 7ebdda0 by Steve Dower in branch 'master':
    bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083)
    7ebdda0

    @zooba
    Member

    zooba commented Aug 21, 2019

    I'll get the 3.7 and 3.8 backports merged - looks like they're trivial.

    Going to need some help with the 2.7 backport, but I'm happy to approve a PR.

    @miss-islington
    Contributor

    New changeset f93c15a by Miss Islington (bot) in branch '3.8':
    bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083)
    f93c15a

    @miss-islington
    Contributor

    New changeset 735a960 by Miss Islington (bot) in branch '3.7':
    bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083)
    735a960

    @zooba
    Member

    zooba commented Sep 9, 2019

    Declaring this out-of-scope for 2.7, unless someone wants to insist (and provide a PR).

    @zooba zooba closed this as completed Sep 9, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022

    4 participants