Expat parser parses strings only when XML encoding is UTF-8 #61291

Closed
serhiy-storchaka opened this issue Jan 31, 2013 · 2 comments
Labels
extension-modules C modules in the Modules dir topic-unicode topic-XML type-bug An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Member

BPO 17089
Nosy @ezio-melotti, @serhiy-storchaka
Files
  • pyexpat_parse_str.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2013-02-13.13:46:43.900>
    created_at = <Date 2013-01-31.10:01:19.212>
    labels = ['extension-modules', 'expert-XML', 'type-bug', 'expert-unicode']
    title = 'Expat parser parses strings only when XML encoding is UTF-8'
    updated_at = <Date 2013-05-22.18:17:25.389>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2013-05-22.18:17:25.389>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2013-02-13.13:46:43.900>
    closer = 'serhiy.storchaka'
    components = ['Extension Modules', 'Unicode', 'XML']
    creation = <Date 2013-01-31.10:01:19.212>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['28916']
    hgrepos = []
    issue_num = 17089
    keywords = ['patch']
    message_count = 2.0
    messages = ['181014', '181347']
    nosy_count = 3.0
    nosy_names = ['ezio.melotti', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17089'
    versions = ['Python 3.2', 'Python 3.3', 'Python 3.4']

    @serhiy-storchaka
    Member Author

    xmlparser.Parse() works with string data only if XML encoding is utf-8 (or ascii). Examples:

    >>> import xml.parsers.expat
    >>> parser = xml.parsers.expat.ParserCreate()
    >>> content = []
    >>> parser.CharacterDataHandler = content.append
    >>> parser.Parse("<?xml version='1.0' encoding='utf-8'?><tag>\xb5</tag>")
    1
    >>> content
    ['µ']
    >>> parser = xml.parsers.expat.ParserCreate()
    >>> content = []
    >>> parser.CharacterDataHandler = content.append
    >>> parser.Parse("<?xml version='1.0' encoding='iso8859'?><tag>\xb5</tag>")
    1
    >>> content
    ['Âµ']
    >>> parser = xml.parsers.expat.ParserCreate()
    >>> content = []
    >>> parser.CharacterDataHandler = content.append
    >>> parser.Parse("<?xml version='1.0' encoding='utf-16'?><tag>\xb5</tag>")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    xml.parsers.expat.ExpatError: encoding specified in XML declaration is incorrect: line 1, column 30

    This affects all other modules that work with XML: xml.sax, xml.dom.minidom, xml.dom.pulldom, xml.etree.ElementTree.
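The higher-level modules hand their input to Expat, so the same distinction between str and bytes input applies to them. A minimal sketch with xml.etree.ElementTree, assuming an interpreter that already includes the fix for this issue; the `doc` string and the variable names are illustrative and not taken from the report. Bytes encoded to match the declaration are decoded by the parser itself, and str input is expected to yield the same text:

```python
import xml.etree.ElementTree as ET

# Illustrative document: declares ISO-8859-1 and contains U+00B5 (micro sign).
doc = "<?xml version='1.0' encoding='iso-8859-1'?><tag>\xb5</tag>"

# Bytes encoded to match the declaration: the parser decodes them itself.
root_from_bytes = ET.fromstring(doc.encode("iso-8859-1"))

# str input: assumes an interpreter with the fix from this issue applied.
root_from_str = ET.fromstring(doc)

print(ascii(root_from_bytes.text), ascii(root_from_str.text))
```

On an interpreter without the fix, only the bytes path should be relied on.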

    Here is a patch which fixes parsing string data with non-UTF-8 XML.
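For code that has to run without the patch, one possible workaround is to override the document encoding explicitly. This is only a sketch and not a description of what the attached patch does: it relies on the documented `encoding` argument of `ParserCreate()`, which overrides the encoding named in the XML declaration. The helper name `parse_str` and the `results` list are hypothetical.

```python
import xml.parsers.expat


def parse_str(xml_text):
    """Parse a str, forcing UTF-8 regardless of the XML declaration."""
    # The encoding argument overrides the encoding declared in the document.
    parser = xml.parsers.expat.ParserCreate("utf-8")
    content = []
    parser.CharacterDataHandler = content.append
    # Encode explicitly so the bytes match the forced encoding.
    parser.Parse(xml_text.encode("utf-8"), True)
    # Character data may arrive in several chunks, so join them.
    return "".join(content)


# The three declarations used in the report above.
results = [
    parse_str("<?xml version='1.0' encoding='%s'?><tag>\xb5</tag>" % enc)
    for enc in ("utf-8", "iso8859", "utf-16")
]
print([ascii(r) for r in results])
```

Because the declaration is ignored, this approach is only appropriate when the input really is a decoded string; raw bytes read from a file should be passed to an ordinary parser unchanged.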

    @serhiy-storchaka serhiy-storchaka self-assigned this Jan 31, 2013
    @serhiy-storchaka serhiy-storchaka added extension-modules C modules in the Modules dir topic-unicode topic-XML type-bug An unexpected behavior, bug, or error labels Jan 31, 2013
    @python-dev
    Mannequin

    python-dev mannequin commented Feb 4, 2013

    New changeset 3cc2a2de36e3 by Serhiy Storchaka in branch '3.2':
    Issue bpo-17089: Expat parser now correctly works with string input not only when
    http://hg.python.org/cpython/rev/3cc2a2de36e3

    New changeset 6c27b0e09c43 by Serhiy Storchaka in branch '3.3':
    Issue bpo-17089: Expat parser now correctly works with string input not only when
    http://hg.python.org/cpython/rev/6c27b0e09c43

    New changeset c4e6e560e6f5 by Serhiy Storchaka in branch 'default':
    Issue bpo-17089: Expat parser now correctly works with string input not only when
    http://hg.python.org/cpython/rev/c4e6e560e6f5

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022