Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range checking in GB18030 decoder #74176

Closed
animalize mannequin opened this issue Apr 5, 2017 · 11 comments
Closed

Range checking in GB18030 decoder #74176

animalize mannequin opened this issue Apr 5, 2017 · 11 comments
Labels
3.7 (EOL) end of life topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@animalize
Copy link
Mannequin

animalize mannequin commented Apr 5, 2017

BPO 29990
Nosy @vstinner, @ezio-melotti, @animalize, @zhangyangyu
PRs
  • bpo-29990: Range checking in GB18030 decoder #994
  • bpo-29990: Range checking in GB18030 decoder #999
  • bpo-29990: fix range checking in GB18030 decoder #1495
  • [3.6] bpo-29990: Fix range checking in GB18030 decoder #1507
  • [3.5] bpo-29990: Fix range checking in GB18030 decoder #1508
  • [2.7] bpo-29990: Fix range checking in GB18030 decoder  #1509
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-05-09.04:19:26.799>
    created_at = <Date 2017-04-05.03:50:03.567>
    labels = ['type-bug', '3.7', 'expert-unicode']
    title = 'Range checking in GB18030 decoder'
    updated_at = <Date 2017-05-09.04:22:14.406>
    user = 'https://github.com/animalize'

    bugs.python.org fields:

    activity = <Date 2017-05-09.04:22:14.406>
    actor = 'xiang.zhang'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-05-09.04:19:26.799>
    closer = 'xiang.zhang'
    components = ['Unicode']
    creation = <Date 2017-04-05.03:50:03.567>
    creator = 'malin'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 29990
    keywords = []
    message_count = 11.0
    messages = ['291153', '291205', '291208', '291210', '291213', '291215', '293275', '293276', '293277', '293278', '293279']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'ezio.melotti', 'malin', 'xiang.zhang']
    pr_nums = ['994', '999', '1495', '1507', '1508', '1509']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue29990'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7']

    @animalize
    Copy link
    Mannequin Author

    animalize mannequin commented Apr 5, 2017

    This issue is split from bpo-24117, that issue became a soup of small issues, so I'm going to close it.

    For 4-byte GB18030 sequence, the legal range is:
    0x81-0xFE for the 1st byte
    0x30-0x39 for the 2nd byte
    0x81-0xFE for the 3rd byte
    0x30-0x39 for the 4th byte
    GB18030 standard:
    https://en.wikipedia.org/wiki/GB_18030
    https://pan.baidu.com/share/link?shareid=2606985291&uk=3341026630

    The current code forgets to check 0xFE for the 1st and 3rd byte.
    Therefore, there are 8630 illegal 4-byte sequences can be decoded by GB18030 codec, here is an example:

    # legal sequence b'\x81\x31\x81\x30' is decoded to U+060A, it's fine.
    uchar = b'\x81\x31\x81\x30'.decode('gb18030')
    print(hex(ord(uchar)))
    
    # illegal sequence 0x8130FF30 can be decoded to U+060A as well, this should not happen.
    uchar = b'\x81\x30\xFF\x30'  .decode('gb18030')
    print(hex(ord(uchar)))

    @animalize animalize mannequin added 3.7 (EOL) end of life topic-unicode type-bug An unexpected behavior, bug, or error labels Apr 5, 2017
    @zhangyangyu
    Copy link
    Member

    The table in wikipedia is somewhat complex. I find ftp://ftp.software.ibm.com/software/globalization/documents/gb18030m.pdf and the table in it is same as https://pan.baidu.com/share/link?shareid=2606985291&uk=3341026630 (except 0x80) but in English. I agree with Ma Lin bytes sequences like b'\x81\x30\xFF\x30' are invalid.

    For current implementation, you could see:

    >>> invalid = b'\x81\x30\xff\x30'
    >>> invalid.decode('gb18030').encode('gb18030') == invalid
    False

    @animalize
    Copy link
    Mannequin Author

    animalize mannequin commented Apr 6, 2017

    except 0x80 (€)

    I suppose the English edition is not the final release of GB18030-2000.

    At the end of official Chinese edition of GB18030-2005, listed the difference between GB18030-2000 and GB18030-2005 clearly, it doesn't mention 0x80 (€), so GB18030-2000 should not has 0x80 as well.

    Why 0x80 (€) appear in English edition?
    I searched on Google, this topic said 0x80 appears in *draft* of GB18030-2000.
    http://www.pkucn.com/thread-304395-1-1.html
    So maybe the English edition is a translation of GB18030-2000 draft, this logic seems ok.

    Anyway, 0x80 is another story, not conflict with this issue.
    Zhang, do I need to make PR for 3.6/3.5/2.7 respectively?

    @zhangyangyu
    Copy link
    Member

    Yes, 0x80 doesn't matter here.

    It's nice to make the backporting PRs. But let's wait some time for ezio and haypo's comments and reviews. Get the master PR merged first and then continue on backporting. :-)

    @vstinner
    Copy link
    Member

    vstinner commented Apr 6, 2017

    An incorrect implementation of a decoder might lead to security vulnerabilities:
    http://unicodebook.readthedocs.io/issues.html#security-vulnerabilities

    *But* UTF-8 decoder of Python 2 is *not* strict and nobody complained.

    I suggest that, once the changed is merged in master, backport the fix to 3.6 and 3.5.

    But I'm not sure that it's worth it to backport it to 2.7? Is there a risk to break an application?

    @animalize
    Copy link
    Mannequin Author

    animalize mannequin commented Apr 6, 2017

    This is a very trivial bug, it's hard to imagine a scene that someone trying to decode those 8630 illegal 4-byte sequences with GB18030 decoder.
    And I think this bug can't lead to security vulnerabilities.

    As far as I can see, GB2312/GBK/GB18030 codecs are bugfree except this bug, of course maybe I'm wrong.

    @zhangyangyu
    Copy link
    Member

    New changeset 9da408d by Xiang Zhang in branch 'master':
    bpo-29990: Fix range checking in GB18030 decoder (bpo-1495)
    9da408d

    @zhangyangyu
    Copy link
    Member

    New changeset 72e1b61 by Xiang Zhang in branch '3.6':
    bpo-29990: Fix range checking in GB18030 decoder (bpo-1495) (bpo-1507)
    72e1b61

    @zhangyangyu
    Copy link
    Member

    New changeset f5f7870 by Xiang Zhang in branch '3.5':
    bpo-29990: Fix range checking in GB18030 decoder (bpo-1495) (bpo-1508)
    f5f7870

    @zhangyangyu
    Copy link
    Member

    New changeset 4e7457b by Xiang Zhang in branch '2.7':
    bpo-29990: Fix range checking in GB18030 decoder (bpo-1509)
    4e7457b

    @zhangyangyu
    Copy link
    Member

    Thanks Ma Lin for finding the problem! Don't know why you close the PR but anyway, we solve it finally.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants