In xml.etree.ElementTree findall() can't search all elements in a namespace #72425

Closed
py-user mannequin opened this issue Sep 21, 2016 · 7 comments

py-user mannequin commented Sep 21, 2016

BPO 28238
Nosy @scoder, @py-user, @serhiy-storchaka
PRs
  • bpo-28238: Implement "{*}tag" and "{ns}*" wildcard tag selection support for ElementPath #12997
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/scoder'
    closed_at = <Date 2019-05-03.18:59:05.111>
    created_at = <Date 2016-09-21.12:38:12.968>
    labels = ['expert-XML', '3.8', 'type-feature', 'library']
    title = "In xml.etree.ElementTree findall() can't search all elements in a namespace"
    updated_at = <Date 2019-05-03.18:59:05.110>
    user = 'https://github.com/py-user'

    bugs.python.org fields:

    activity = <Date 2019-05-03.18:59:05.110>
    actor = 'scoder'
    assignee = 'scoder'
    closed = True
    closed_date = <Date 2019-05-03.18:59:05.111>
    closer = 'scoder'
    components = ['Library (Lib)', 'XML']
    creation = <Date 2016-09-21.12:38:12.968>
    creator = 'py.user'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 28238
    keywords = ['patch']
    message_count = 5.0
    messages = ['277130', '340301', '341030', '341043', '341351']
    nosy_count = 4.0
    nosy_names = ['scoder', 'eli.bendersky', 'py.user', 'serhiy.storchaka']
    pr_nums = ['12997']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue28238'
    versions = ['Python 3.8']

    py-user mannequin commented Sep 21, 2016

    The example document uses two namespaces, but it is impossible to find all elements that belong to just one of them:

    >>> import xml.etree.ElementTree as etree
    >>>
    >>> s = '<feed xmlns="http://def" xmlns:x="http://x"><a/><x:b/></feed>'
    >>>
    >>> root = etree.fromstring(s)
    >>>
    >>> root.findall('*')
    [<Element '{http://def}a' at 0xb73961bc>, <Element '{http://x}b' at 0xb7396c34>]
    >>>
    >>> root.findall('{http://def}*')
    []
    >>>

    The same attempt with the third-party package lxml works fine:

    >>> import lxml.etree as etree
    >>>
    >>> s = '<feed xmlns="http://def" xmlns:x="http://x"><a/><x:b/></feed>'
    >>>
    >>> root = etree.fromstring(s)
    >>>
    >>> root.findall('*')
    [<Element {http://def}a at 0xb70ab11c>, <Element {http://x}b at 0xb70ab144>]
    >>>
    >>> root.findall('{http://def}*')
    [<Element {http://def}a at 0xb70ab11c>]
    >>>
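
On Python versions before 3.8, where `findall('{http://def}*')` returns an empty list as shown above, a similar result can be approximated by filtering the child tags by hand. This is a minimal sketch that assumes only direct children are wanted; `children_in_namespace` is a hypothetical helper name, not part of ElementTree:

```python
import xml.etree.ElementTree as ET


def children_in_namespace(parent, namespace):
    """Return the direct children of `parent` whose tag is in `namespace`.

    Hypothetical helper for illustration; not an ElementTree API.
    """
    # ElementTree stores qualified names as '{namespace-uri}localname'.
    prefix = '{%s}' % namespace
    return [
        child for child in parent
        # Comments and processing instructions have non-string tags.
        if isinstance(child.tag, str) and child.tag.startswith(prefix)
    ]


root = ET.fromstring(
    '<feed xmlns="http://def" xmlns:x="http://x"><a/><x:b/></feed>'
)
matches = children_in_namespace(root, 'http://def')
print([el.tag for el in matches])
```

Because the prefix includes the closing brace, a namespace URI that is merely a prefix of another one (for example `http://de`) does not match by accident.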

    @py-user py-user mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir topic-XML labels Sep 21, 2016
    @tirkarthi tirkarthi added the 3.8 only security fixes label Apr 15, 2019

    scoder commented Apr 15, 2019

    lxml has a couple of nice features here:

    • all tags in a namespace: "{namespace}*"
    • a local name 'tag' in any (or no) namespace: "{*}tag"
    • a tag without namespace: "{}tag"
    • all tags without namespace: "{}*"

    "{*}*" is also accepted but is the same as "*". Note that "*" is actually allowed as an XML tag name by the spec, but rare enough to hijack it for this purpose. I've actually never seen it used anywhere in the wild.
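
For reference, the four patterns listed above can be tried against `xml.etree.ElementTree` as well. This is a sketch that assumes Python 3.8 or later, where the feature requested in this issue is available. The sample document and the `tags` helper are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Two namespaces plus one element that is in no namespace at all
# (xmlns="" undeclares the default namespace for <c>).
XML = (
    '<feed xmlns="http://def" xmlns:x="http://x">'
    '<a/><x:b/><x:a/><c xmlns=""/>'
    '</feed>'
)
root = ET.fromstring(XML)


def tags(path):
    """Hypothetical helper: tag names of all matches for `path`."""
    return [el.tag for el in root.findall(path)]


in_default_ns = tags('{http://def}*')  # all tags in one namespace
any_ns_a = tags('{*}a')                # local name 'a' in any (or no) namespace
no_ns_c = tags('{}c')                  # tag 'c' without a namespace
no_ns_all = tags('{}*')                # all tags without a namespace

for name, result in [
    ('{http://def}*', in_default_ns),
    ('{*}a', any_ns_a),
    ('{}c', no_ns_c),
    ('{}*', no_ns_all),
]:
    print(name, result)
```

On older Python versions the wildcard searches are expected to come back empty, as in the original report.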

    lxml's implementation isn't applicable to ElementTree (searching has been subject to excessive optimisation), but it shouldn't be hard to extend the one in ET's ElementPath.py module, as well as Element.iter() in ElementTree.py, to support this kind of tag comparison.

    PR welcome.

    lxml's tests are here (and in the following test methods):

    https://github.com/lxml/lxml/blob/359f693b972c2e6b0d83d26a329d2d20b7581c48/src/lxml/tests/test_etree.py#L2911

    Note that they actually test the deprecated .getiterator() method for historical reasons. They should probably call .iter() instead these days. lxml's ElementPath implementation is under src/lxml/_elementpath.py, but the tag comparison itself is done elsewhere in Cython code (here, in case it matters:)

    https://github.com/lxml/lxml/blob/359f693b972c2e6b0d83d26a329d2d20b7581c48/src/lxml/apihelpers.pxi#L921-L1048

    scoder commented Apr 28, 2019

    PR submitted, feedback welcome.

    @scoder scoder self-assigned this Apr 28, 2019
    @scoder scoder added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Apr 28, 2019

    scoder commented Apr 29, 2019

    BTW, I found that lxml and ET differ in their behaviour when searching for '*'. ET takes it as meaning "any tree node", whereas lxml interprets it as "any Element". Since ET's parser does not create comments and processing instructions by default, this does not make a difference in most cases, but when the tree contains comments or PIs, then they will be found by '*' in ET but not in lxml.

    At least for "{*}*", they now both return only Elements. Changing either behaviour for '*' is probably not a good idea at this point.
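
The difference can be observed in ElementTree by building a small tree that contains a comment node. This is a sketch that assumes Python 3.8 or later and uses the `{*}*` pattern; the element names are made up:

```python
import xml.etree.ElementTree as ET

# Build the tree by hand so that it definitely contains a comment node;
# the default parser would drop comments.
root = ET.Element('root')
root.append(ET.Comment('a note'))
ET.SubElement(root, 'item')

# '*' matches every child node, including the comment.
all_nodes = root.findall('*')

# '{*}*' matches only real elements (nodes whose tag is a string).
elements_only = root.findall('{*}*')

print(len(all_nodes), len(elements_only))
print([el.tag for el in elements_only])
```

Comment nodes can be told apart by their tag, which is the `ET.Comment` factory function rather than a string.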

    scoder commented May 3, 2019

    New changeset 4754168 by Stefan Behnel in branch 'master':
    bpo-28238: Implement "{*}tag" and "{ns}*" wildcard tag selection support for ElementPath, and extend the surrounding tests and docs. (GH-12997)
    4754168

    @scoder scoder closed this as completed May 3, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @sleepyhollo

    Is this an issue with Python's xml module or something specific to CPython? I am having this with the xml module right now

    scoder commented Mar 24, 2023

    > Is this an issue with Python's xml module or something specific to CPython? I am having this with the xml module right now

    This feature was added to the xml.etree package in the standard library of Python 3.8. It's not specific to CPython: every Python implementation that ships the same standard library module (3.8 or later) should have the same feature.
