Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong behavior for '\xff\n'.decode('gb2312', 'ignore') #56225

Closed
cdqzzy mannequin opened this issue May 6, 2011 · 14 comments
Closed

Wrong behavior for '\xff\n'.decode('gb2312', 'ignore') #56225

cdqzzy mannequin opened this issue May 6, 2011 · 14 comments
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@cdqzzy
Copy link
Mannequin

cdqzzy mannequin commented May 6, 2011

BPO 12016
Nosy @malemburg, @terryjreedy, @hyeshik, @vstinner, @ezio-melotti
Dependencies
  • bpo-12057: HZ codec has no test
  • Files
  • cjk_decode.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-07-07.23:52:16.790>
    created_at = <Date 2011-05-06.09:16:14.586>
    labels = ['type-bug', 'expert-unicode']
    title = "Wrong behavior for '\\xff\\n'.decode('gb2312', 'ignore')"
    updated_at = <Date 2011-07-07.23:52:16.788>
    user = 'https://bugs.python.org/cdqzzy'

    bugs.python.org fields:

    activity = <Date 2011-07-07.23:52:16.788>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-07-07.23:52:16.790>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2011-05-06.09:16:14.586>
    creator = 'cdqzzy'
    dependencies = ['12057']
    files = ['22241']
    hgrepos = []
    issue_num = 12016
    keywords = ['patch']
    message_count = 14.0
    messages = ['135268', '135410', '135419', '135422', '135423', '135447', '135767', '135769', '137340', '137591', '137595', '137602', '140004', '140006']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'terry.reedy', 'hyeshik.chang', 'vstinner', 'ezio.melotti', 'python-dev', 'cdqzzy']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12016'
    versions = ['Python 3.3']

    @cdqzzy
    Copy link
    Mannequin Author

    cdqzzy mannequin commented May 6, 2011

    let s='\xff\n'
    The expected result of s.decode('gb2312', 'ignore') is u"\n", while in 2.6.6 it is u"".
    s can be replaced with chr(m) + chr(n) , where m is in range of 128~255, and n in 0~127.
    In the above cases, try decoding from chr(n) will never interfere with later parts in the string if there is any, since chr(n) do not start a multibyte sequence.

    @cdqzzy cdqzzy mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels May 6, 2011
    @terryjreedy
    Copy link
    Member

    u'' in 2.7.1 also, on winxp

    @vstinner
    Copy link
    Member

    vstinner commented May 7, 2011

    So the correct result for b'\xff\n'.decode('gb2312', 'replace') is u'?\n'?

    @cdqzzy
    Copy link
    Mannequin Author

    cdqzzy mannequin commented May 7, 2011

    So the correct result for b'\xff\n'.decode('gb2312', 'replace') is u'?\n'?

    I think it should be so. This behavior does not leave out possible information, has no side-effect on later decodings, and should the '\n' indeed be redundant, an output of u'?\n' would unlikely cause confusions.

    Though, I have no knowledge on this subject code-wise. If a change of the behavior will have an impact on performance, maybe the change should not come in.

    @vstinner
    Copy link
    Member

    vstinner commented May 7, 2011

    _codecs_cn implements different multibyte encodings: gb2312, gbkext, gbcommon, gb18030ext, gbk, gb18030.

    And there are other Asian multibyte encodings: big5 family, ISO 2202 family, JIS family, korean encodings (KSX1001, EUC_KR, CP949, ...), Big5, CP950, ...

    All of them ignore the all bytes if one byte of a multibyte sequence is invalid (lile 0xFF 0x0A: replaced by ? instead of ?\n using replace error handler).

    I don't think that you can/should patch only one encoding: we should use the same rule for all encodings.

    By the way, do you have any document explaining which result is the good one (? or ?\n)? For UTF-8, we have well defined standards explaining exactly what to do with invalid byte sequences => see issue bpo-8271. It is easy to fix the decoders, but I would like to be sure that your proposed change is the right way to decode these encodings.

    Change the multibyte encodings can also concern the security. Read for example the following section "Check byte strings before decoding them to character strings" of my book:
    http://www.haypocalc.com/tmp/unicode-2011-03-25/html/issues.html#check-byte-strings-before-decoding-them-to-character-strings
    (https://github.com/haypo/unicode_book/wiki)

    @cdqzzy
    Copy link
    Mannequin Author

    cdqzzy mannequin commented May 7, 2011

    I do not have documents on this subject. Though, I found that GNU iconv(1) behaves the same as my proposed behavior. My reading of the source code suggests that iconv(1) treat all encodings equally, which I think should also be true for python.

    As of security concerns, I do not think the change in decoding function itself would introduce any security vulnerabilities. If a security issue arises because of the proposed change, there must be improper code out side of python, which is out of python's control. That said, the proposed change is unlikely to introduce new security vulnerability, as all it does in effect is retaining a few ascii characters in the string to the output as opposed to removing. In the issue of wordpress, if we suppose that wordpress was written in python, and that the attacker was using gb2312 encoded strings instead of gbk, then my proposed change would by chance fix the issue, as the backslash would be retained when we decode the string.

    @vstinner
    Copy link
    Member

    I asked if the change is correct on iconv mail list. Here is a copy of an answer.

    De: Bruno Haible
    À: [iconv mailing list]
    Cc: Victor Stinner
    Sujet: Re: [bug-gnu-libiconv] Invalid byte sequences and multiybyte encodings
    Date: Tue, 10 May 2011 14:52:09 +0200

    Hi,

    Someone opened an issue in Python bug tracker asking to change how
    invalid multibyte sequences are handled.
    http://bugs.python.org/issue12016

    For UTF-8 the recommended way of handling malformed input is written down
    in <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt\>. But the
    principle applies to any encoding with a variable number of bytes per
    character:
    When an invalid or malformed byte sequence is found, the smallest
    such byte sequence is transformed to U+FFFD (replacement character).

    In particular, normally, if the first byte that is considered "wrong"
    or "invalid" is a valid starter byte, the malformed byte sequence should
    be considered to end before that byte. If it is not a valid starter
    byte, then use your judgement.

    For an example implementation, see
    <http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/unistr/u8-mbtouc.c;hb=HEAD\>
    Here the return value is the number of bytes consumed. Look carefully
    when it is 1, 2, 3, or 4.

    b'\xffabc'.decode('gb2312', 'replace') gives "�bc". The 'a' character is
    seen as part of a multibyte character of 2 bytes. Because {0xFF, 0x61}
    is invalid in GB2312, the two bytes are replaced by U+FFFD.

    Is it the "right" way to to do?

    It is better to replace only the 0xFF byte with U+FFFD, because 0x61 is a
    valid first byte (even a complete character).

    UTF-8 decoder changed recently to ignore a single byte and restart the
    decoder, so '\xF1\x80\x41\x42\x43' is now decoded "�ABC" instead "�C".
    Should we do the same for all encodings?

    Generally, yes.

    Or at least for asian encodings
    (gb2312, gbk, gb18030, big5 family, ISO 2202 family, JIS family, EUC_KR,
    CP949, Big5, CP950, ...)?

    For stateful encodings of the ISO 2202 family, you may want to ignore/replace
    a complete escape sequence, where the syntax of escape sequences is defined
    through general rules.

    Bruno

    In memoriam Siegfried Rädel <http://en.wikipedia.org/wiki/Siegfried_Rädel\>

    @vstinner
    Copy link
    Member

    Oh, the HZ codec has no test! And what is this horrible BLOB, Lib/test/cjkencodings_test.py?

    @vstinner
    Copy link
    Member

    • I added tests for the HZ codec and some ISO 2022 codecs: bpo-12057
    • I fixed IncrementalEncoder.encode() (of multibytecodec ): bpo-12100
    • I fixed IncrementalEncoder.reset() (of multibytecodec): bpo-12171

    I can now work confidently on this issue. I will try to patch all CJK decoders to only replace 1 invalid byte by U+FFFD (and not 2, 3 or 4 bytes) and try to write a test for each case (each byte sequence generating a different error).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jun 3, 2011

    New changeset 3610841f7357 by Victor Stinner in branch '3.2':
    Issue bpo-12016: Reindent decoders of HK and JP codecs
    http://hg.python.org/cpython/rev/3610841f7357

    New changeset aa07c1237f4e by Victor Stinner in branch 'default':
    (Merge 3.2) Issue bpo-12016: Reindent decoders of HK and JP codecs
    http://hg.python.org/cpython/rev/aa07c1237f4e

    New changeset 685351d65592 by Victor Stinner in branch '2.7':
    Issue bpo-12016: Reindent decoders of HK and JP codecs
    http://hg.python.org/cpython/rev/685351d65592

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jun 3, 2011

    New changeset 8572bf1b56ec by Victor Stinner in branch '3.2':
    Issue bpo-12016: Add test_errorhandle() to TestBase_Mapping of
    http://hg.python.org/cpython/rev/8572bf1b56ec

    New changeset c3dc94d53ef8 by Victor Stinner in branch 'default':
    (Merge 3.2) Issue bpo-12016: Add test_errorhandle() to TestBase_Mapping of
    http://hg.python.org/cpython/rev/c3dc94d53ef8

    New changeset 53912b58eee6 by Victor Stinner in branch '2.7':
    Issue bpo-12016: Add test_errorhandle() to TestBase_Mapping of
    http://hg.python.org/cpython/rev/53912b58eee6

    @vstinner
    Copy link
    Member

    vstinner commented Jun 3, 2011

    cjk_decode.patch:

    • patch *all* CJK decoders to replace only the first byte of an invalid byte sequence (by U+FFFD). Example from the issue title: b'\xff\n'.decode('gb2312', 'replace') gives now '�\n' instead of just '�'
    • add at least one unit test for *each* path in the decoder (sometimes it was really hard to see how to go into a specific path, especially for the johab decoder!)
    • add testcases for euc_jis_2004 and shift_jis_2004
    • factorize "codec tests" (codectests) of all japanese EUC tests (euc_commontests)

    Because I consider this issue as a bug, I would like to apply this patch to 2.7, 3.2 and 3.3.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 7, 2011

    New changeset 16cbd84de848 by Victor Stinner in branch 'default':
    Issue bpo-12016: Multibyte CJK decoders now resynchronize faster
    http://hg.python.org/cpython/rev/16cbd84de848

    @vstinner
    Copy link
    Member

    vstinner commented Jul 7, 2011

    Because I consider this issue as a bug, I would like
    to apply this patch to 2.7, 3.2 and 3.3.

    It is maybe a bug but it is also an important change on Python behaviour, so finally I prefer to only change (fix) Python 3.3.

    Thanks for reporting the bug zy (cdqzzy). Tell me if it now behaves as you expected.

    I'm closing this issue because the initial issue is now fixed.

    @vstinner vstinner closed this as completed Jul 7, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants