Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xml.etree.ElementTree says unknown encoding of a regular encoding #57821

Closed
dongying mannequin opened this issue Dec 16, 2011 · 25 comments
Closed

xml.etree.ElementTree says unknown encoding of a regular encoding #57821

dongying mannequin opened this issue Dec 16, 2011 · 25 comments
Labels
extension-modules C modules in the Modules dir topic-unicode topic-XML type-bug An unexpected behavior, bug, or error

Comments

@dongying
Copy link
Mannequin

dongying mannequin commented Dec 16, 2011

BPO 13612
Nosy @terryjreedy, @amauryfa, @vstinner, @ezio-melotti, @florentx, @serhiy-storchaka
Files
  • test.py: source code
  • expat_unknown_encoding_handler.patch
  • expat_unknown_encoding_handler_2.patch
  • expat_et_share_unknown_handler.1.patch
  • expat_buffer_overflow-2.7.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2013-08-04.13:11:32.579>
    created_at = <Date 2011-12-16.09:04:10.905>
    labels = ['extension-modules', 'expert-XML', 'type-bug', 'expert-unicode']
    title = 'xml.etree.ElementTree says unknown encoding of a regular encoding'
    updated_at = <Date 2013-08-04.13:11:32.578>
    user = 'https://bugs.python.org/dongying'

    bugs.python.org fields:

    activity = <Date 2013-08-04.13:11:32.578>
    actor = 'eli.bendersky'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-08-04.13:11:32.579>
    closer = 'eli.bendersky'
    components = ['Extension Modules', 'Unicode', 'XML']
    creation = <Date 2011-12-16.09:04:10.905>
    creator = 'dongying'
    dependencies = []
    files = ['23974', '30328', '30342', '30347', '31150']
    hgrepos = []
    issue_num = 13612
    keywords = ['patch']
    message_count = 25.0
    messages = ['149602', '149612', '149652', '189660', '189671', '189685', '189694', '189720', '189785', '189796', '189809', '189810', '189818', '189845', '189847', '189865', '189898', '189956', '189957', '189960', '189961', '194061', '194340', '194364', '194365']
    nosy_count = 9.0
    nosy_names = ['terry.reedy', 'amaury.forgeotdarc', 'vstinner', 'ezio.melotti', 'eli.bendersky', 'flox', 'dongying', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13612'
    versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']

    @dongying
    Copy link
    Mannequin Author

    dongying mannequin commented Dec 16, 2011

    I've been trying to parse xml string using python, codes following:

        #-*- coding: utf-8 -*-
        import xml.etree.ElementTree as xmlet
        s = '<?xml version="1.0" encoding="GBK"?><info></info>'
        xmlet.fromstring(s)

    Then:
    $ python2.6 test.py
    or:
    $ pypy test.py

    Traceback message came out like this:

        Traceback (most recent call last):
          File "test.py", line 4, in <module>
            xmlet.fromstring(s)
          File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML
            parser.feed(text)
          File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed
            self._parser.Parse(data, 0)
        xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

    I've seen this in following cases:

    Python2.5.6 on Mac OS X Lion
    Python2.6.5 on Ubuntu 10.04
    pypy1.7.0 on Mac OS X Lion
    

    But not in these cases:

    Python2.6.7 on Mac OS X Lion
    Python2.7.1 on Mac OS X Lion
    

    @dongying dongying mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 16, 2011
    @amauryfa
    Copy link
    Member

    Actually, this fails on 2.6 and 2.7 on wide unicode builds, and passes with narrow unicode builds (on my 64bit Linux box).

    In pyexpat.c, PyUnknownEncodingHandler accesses 256 characters of a unicode buffer, without checking its length... which happens to be 192 chars long.
    So buffers overflow, etc. The function has a comment "supports only 8bit encodings"; indeed.
    Versions 3.2 and 3.3 happen to pass the test, probably by pure luck.

    Supporting multibytes codecs won't be easy: pyexpat requires to fill an array which specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-bytes sequence, 0xef starts a 3-bytes sequence). Our codecs framwork does not provide this information, and some codecs (gb18030 for example) need the second char to determine whether it will need 4 bytes.

    @terryjreedy
    Copy link
    Member

    3.2, Win7 (a narrow build) it indeed works and returns
    <Element 'info' at 0x3389128>

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented May 20, 2013

    In 3.3+ there's no distinction between wide and narrow builds. Does anyone know how this should be affected? [I don't know much about unicode and encodings, unfortunately]

    @amauryfa
    Copy link
    Member

    encoding="GBK" causes a buffer overflow in PyUnknownEncodingHandler, because the result of PyUnicode_Decode() is only 192 characters long.
    Exact behavior is not defined...

    @terryjreedy
    Copy link
    Member

    3.3 shifted the wide-build problem to all builds ;-). I now get

    File "C:\Python\mypy\tem.py", line 4, in <module>
    xmlet.fromstring(s)
    File "C:...33\lib\xml\etree\ElementTree.py", line 1356, in XML
    parser.feed(text)
    File "<string>", line None
    xml.etree.ElementTree.ParseError: unknown encoding: line 1, column 30

    I do not understand the 'unknown encoding' bit. Replacing 'GBK' with a truly unknown encoding changes the last line to
    LookupError: unknown encoding: xyz, so the lookup of 'GBK' succeeded.

    I get the same two messages if I add a 'b' prefix to make s be bytes, which it logically should be (and was in 2.7). (I presume .fromstring 'encodes' unicode input to bytes with the ascii or latin-1 encoder and then decodes back to unicode according to the announced encoding.)

    With s so prefixed, s.decode(encoding="GBK") works and returns the original unicode version of s, so Python does know "GBK". And it indeed is in the list of official IANA charset names.

    I don't know unicode internals to understand Amaury's comment. However, it almost reads to me as if this is a unicode bug, not ET bug.

    @serhiy-storchaka
    Copy link
    Member

    PyUnknownEncodingHandler gets an encoding name and fills a special XML_Encoding structure which expat parser uses for decoding. Currently PyUnknownEncodingHandler works only with 8-bit encodings and I don't see an efficient method how extent it to handle general multibyte encoding.

    @serhiy-storchaka
    Copy link
    Member

    I can propose only raise a specialized exception instead of general xml.etree.ElementTree.ParseError. Here is a patch. It also fixes a buffer overread bug mentioned by Amaury.

    @serhiy-storchaka serhiy-storchaka added extension-modules C modules in the Modules dir topic-unicode topic-XML and removed stdlib Python modules in the Lib dir labels May 21, 2013
    @amauryfa
    Copy link
    Member

    The patch goes in the right direction: consistently reject non-8bit encodings which the current implementation does not support.

    • please add a unit test
    • remove usage of PyUnicodeObject and the "Stupid to access directly" comment, they are outdated.

    Real support for GBK is more work, http://wang.yuxuan.org/blog/?itemid=63 shows how to set a conversion callback. It will difficult to implement in the general case, see my older comment.

    @serhiy-storchaka
    Copy link
    Member

    For unit tests we first should fix bpo-16986.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented May 22, 2013

    For unit tests we first should fix bpo-16986.

    I did another round of code review on bpo-16986 now.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented May 22, 2013

    Looked at Serhiy's patch here too: LGTM with a unit test :)

    @serhiy-storchaka
    Copy link
    Member

    Here is an updated patch. PyUnknownEncodingHandler() and expat_unknown_encoding_handler() are synchronized. Added tests.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented May 23, 2013

    Serhiy, would it make sense to share the code somewhere instead of duplicating it?

    @serhiy-storchaka
    Copy link
    Member

    May be. But it needs more work for building. It is simpler just duplicate a code (the function is small enough) and add comments:

    /* The code is duplicated as xx() in xxx.c. Keep both functions synchronized. */

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented May 23, 2013

    How about this patch (not tested it too much - just as a proof of concept).

    We're pretty free in the C API exported by pyexpat through a capsule to _elementtree, so we can also add a default handler there. This API already has some general utilities like ErrorString.

    ET can then simply call the handler from pyexpat.

    @serhiy-storchaka
    Copy link
    Member

    LGTM. But why not just inline expat_unknown_encoding_handler()?

    For 2.7 we perhaps should add SetStartDoctypeDeclHandler and SetEncoding in the exported C API. Shouldn't we update the C API version (for this issue and for bpo-16986)?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 25, 2013

    New changeset f7b47fb30169 by Eli Bendersky in branch '3.3':
    Issue bpo-13612: handle unknown encodings without a buffer overflow.
    http://hg.python.org/cpython/rev/f7b47fb30169

    New changeset 47e719b11c46 by Eli Bendersky in branch 'default':
    Issue bpo-13612: handle unknown encodings without a buffer overflow.
    http://hg.python.org/cpython/rev/47e719b11c46

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented May 25, 2013

    A few notes:

    1. If by C API version you mean PyExpat_CAPI_MAGIC, I'm not sure what difference that makes. It has never been updated and it's also being used only in _elementtree. Since the latter is statically compiled against pyexpat, I don't see a reason to touch it.

    2. I wouldn't backport it to 2.7; again, not critical enough.

    3. Should we open a separate issue for adding support to GBK and similar encodings "sometime in the future"? This issue kind-of got derailed to fix the bug.

    @serhiy-storchaka
    Copy link
    Member

    Tests were duplicated. There is no Misc/NEWS entry.

    I think a buffer overflow is critical enough for backporting.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented May 25, 2013

    Oh, I didn't notice that your patches had duplication in the tests. Fixed now. I'll wait to see what unfolds for the Misc/NEWS discussion on python-committers.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Aug 1, 2013

    Serhiy, do you want to backport the buffer overflow fix to 2.7?

    @serhiy-storchaka
    Copy link
    Member

    Here is a patch for 2.7.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 4, 2013

    New changeset b3efc140d8a6 by Eli Bendersky in branch '2.7':
    Issue bpo-13612: Fix a buffer overflow in case of a multi-byte encoding.
    http://hg.python.org/cpython/rev/b3efc140d8a6

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Aug 4, 2013

    Thanks, Serhiy.

    @elibendersky elibendersky mannequin closed this as completed Aug 4, 2013
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir topic-unicode topic-XML type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants