classification
Title: ElementTree -- provide a way to ignore namespace in tags and searches
Type: enhancement Stage:
Components: Library (Lib), XML Versions: Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: brycenesbitt, eli.bendersky, jjmiller50, martin.panter, pocek, rhettinger, scoder
Priority: normal Keywords: patch

Created on 2013-06-26 03:48 by brycenesbitt, last changed 2016-07-28 04:37 by martin.panter.

Files
File name Uploaded Description
etree_strip_namespaces.patch scoder, 2016-07-27 17:59
Messages (15)
msg191894 - (view) Author: Bryce Nesbitt (brycenesbitt) Date: 2013-06-26 03:48
ElementTree offers a wonderful and easy API for parsing XML... but if there is a namespace involved it suddenly gets ugly.  This is a proposal to fix that.  First an example:

------------------
#!/usr/bin/python
# Demonstrate awkward behavior of namespaces in ElementTree
import xml.etree.ElementTree as ET

xml_sample_one = """\
<?xml version="1.0"?>
<presets>
<thing stuff="some stuff"/>
<thing stuff="more stuff"/>
</presets>
"""
root = ET.fromstring(xml_sample_one)
for child in root.iter('thing'):
    print(child.tag)

xml_sample_two = """\
<?xml version="1.0"?>
<presets xmlns="http://josm.openstreetmap.de/tagging-preset-1.0">
<thing stuff="some stuff"/>
<thing stuff="more stuff"/>
</presets>
"""
root = ET.fromstring(xml_sample_two)
for child in root.iter('{http://josm.openstreetmap.de/tagging-preset-1.0}thing'):
    print(child.tag)
------------------

Because of the namespace in the 2nd example, a {namespace} name keeps {namespace} getting {namespace} in {namespace} {namespace} the way.

Online there are dozens of questions on how to deal with this, for example: http://stackoverflow.com/questions/11226247/python-ignore-xmlns-in-elementtree-elementtree

With wonderfully syntactic solutions like 'item.tag.split("}")[1][0:]'

-----
How about if I could set any root to have an array of namespaces to suppress:

root = ET.fromstring(xml_sample_two)
root.xmlns_at_root.append('{namespace}')

Or even just a boolean that says I'll take all my namespaces without qualification?
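
For comparison, find(), findall() and iterfind() already accept a prefix-to-URI mapping as a second argument. That shortens the search expression, although the tags themselves stay fully qualified. A minimal sketch reusing the second sample; the prefix name "josm" is an arbitrary choice made for this illustration:

```python
import xml.etree.ElementTree as ET

xml_sample_two = """\
<?xml version="1.0"?>
<presets xmlns="http://josm.openstreetmap.de/tagging-preset-1.0">
<thing stuff="some stuff"/>
<thing stuff="more stuff"/>
</presets>
"""

# The prefix "josm" is arbitrary; only the URI has to match the document.
ns = {"josm": "http://josm.openstreetmap.de/tagging-preset-1.0"}

root = ET.fromstring(xml_sample_two)
things = root.findall("josm:thing", ns)
stuff_values = [el.get("stuff") for el in things]
print(stuff_values)

# The elements found still carry the qualified tag name.
print(things[0].tag)
```

Note that Element.iter() takes no such mapping, so the loop in the example above still needs the full '{uri}thing' form.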
msg194896 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-11 15:53
FWIW, lxml.etree supports wildcards like '{*}tag' in searches, and this is otherwise quite rarely a problem in practice.
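
A minimal sketch of such a wildcard search. It uses the standard library rather than lxml, on the assumption of Python 3.8 or later, where xml.etree.ElementTree accepts the same '{*}tag' form in find(), findall() and iterfind(). The sample document is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented sample document with a default namespace.
doc = """\
<presets xmlns="http://example.com/ns">
<thing stuff="some stuff"/>
<thing stuff="more stuff"/>
</presets>
"""

root = ET.fromstring(doc)

# "{*}thing" matches "thing" in any namespace, or in none.
# Assumption: Python 3.8+ for the standard library parser used here.
matches = root.findall("{*}thing")
stuff_values = [el.get("stuff") for el in matches]
print(stuff_values)
```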

I'm -1 on the proposed feature and wouldn't mind rejecting this altogether. (At least change the title to something more appropriate.)
msg194899 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-08-11 16:05
I was planning to look more closely at the namespace support in ET at some point, but haven't found the time yet.

[changing the title to be more helpful]
msg194901 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-11 16:17
There's also the QName class which can be used to split qualified tag names. And it's pretty trivial to pre-process the entire tree by stripping all namespaces from it if the intention is really to do namespace-agnostic processing. However, in my experience, most people who want to do that haven't actually understood namespaces (although, admittedly, sometimes it's those who designed the XML format who didn't understand namespaces ...).
msg194906 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-08-11 17:41
> (although, admittedly, sometimes it's those who designed the XML format
> who didn't understand namespaces ...).

I fully concur. The design of XML, in general, is not the best
demonstration of aesthetics in programming. But namespaces always seem to
me to be one further step in the WTF direction. This is precisely why I
didn't reject this issue right away: perhaps it's not a bad idea to provide
Python programmers with *some* way to ease namespace-related tasks (even if
they go against the questionable design principles behind XML).
msg194919 - (view) Author: Bryce Nesbitt (brycenesbitt) Date: 2013-08-12 04:06
The mere existence of popular solutions like
'item.tag.split("}")[1][0:]' argues something is wrong.  What could
lxml do to make this cleaner (even if the ticket proposal is junk)?
msg194920 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-12 04:21
Please leave the title as it is now.
msg194923 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-12 04:34
As I already suggested for lxml, you can use the QName class to process qualified names, e.g.

    QName(some_element.tag).localname

Or even just

    QName(some_element).localname

It appears that ElementTree doesn't support this. It lists the QName type as "opaque". However, it does provide a "text" attribute that contains the qualified tag name.

http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.QName

Here is the corresponding documentation from lxml:

http://lxml.de/api/lxml.etree.QName-class.html

QName instances in lxml provide the properties "localname", "namespace" and "text".
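
Since the standard-library QName exposes only "text", a small helper can stand in for the lxml properties. A sketch only: split_qname is a hypothetical name and is not part of either library.

```python
from xml.etree.ElementTree import QName


def split_qname(tag):
    """Return (namespace, localname) for a tag string or a QName.

    Hypothetical helper; not part of xml.etree.ElementTree or lxml.
    The namespace is None for unqualified names.
    """
    text = tag.text if isinstance(tag, QName) else tag
    if text.startswith("{"):
        namespace, _, localname = text[1:].rpartition("}")
        return namespace, localname
    return None, text


print(split_qname("{http://example.com/ns}thing"))
print(split_qname(QName("http://example.com/ns", "thing")))
print(split_qname("thing"))
```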
msg216768 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-04-18 01:12
FWIW, I would like to have a way to ignore namespaces.

For many day-to-day problems (parsing Microsoft Excel
files saved in an XML format or parsing RSS feeds),
this would be a nice simplification.

I teach Python for a living and have found that it is
common for namespaces to be an obstacle for people
trying to get something done.

Giving them the following answer is an unsatisfactory response to legitimate needs:
"""
And it's pretty trivial to pre-process the entire tree by stripping all namespaces from it if the intention is really to do namespace-agnostic processing. However, in my experience, most people who want to do that haven't actually understood namespaces (although, admittedly, sometimes it's those who designed the XML format who didn't understand namespaces ...).
"""
msg216774 - (view) Author: Stefan Behnel (scoder) * Date: 2014-04-18 05:08
You can already use iterparse for this.

    it = ET.iterparse('somefile.xml')
    for _, el in it:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    root = it.root

As I said, this would be a little friendlier with support in the QName class, but it's not really complex code. Could be added to the docs as a recipe, with a visible warning that this can easily lead to incorrect data processing and therefore should not be used in systems where the input is not entirely under control.

Note that it's unclear what the "right way to do it" is, though. Is it better to 1) alter the data by stripping all namespaces off, or 2) let the tree API itself provide a namespace agnostic mode? Depends on the use case, but the more generic way 2) should be fairly involved in terms of implementation complexity, for just a minor use case. 1) would be ok in most cases where this "feature" is useful, I guess, and can be done as shown above.

In fact, the advantage of doing it explicitly with iterparse() is that instead of stripping all namespaces, only the expected namespaces can be discarded. And errors can be raised when finding unnamespaced elements, for example. This allows for a safety guard that prevents the code from completely misinterpreting input. There is a reason why namespaces were added to XML at some point.
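
A sketch of that guarded variant, assuming the application expects exactly one namespace. The helper name parse_stripping and the choice of ValueError are assumptions made for this illustration:

```python
import io
import xml.etree.ElementTree as ET

# Assumption for this sketch: the only namespace the application expects.
EXPECTED_NS = "http://josm.openstreetmap.de/tagging-preset-1.0"


def parse_stripping(source, expected_ns=EXPECTED_NS):
    """Parse source, removing expected_ns from element tags only.

    Hypothetical helper. Raises ValueError on any other tag, including
    unnamespaced ones, instead of silently misreading the input.
    """
    prefix = "{" + expected_ns + "}"
    it = ET.iterparse(source)  # default: "end" events only
    for _, el in it:
        if not el.tag.startswith(prefix):
            raise ValueError("unexpected tag: %r" % el.tag)
        el.tag = el.tag[len(prefix):]
    return it.root  # available once the iterator is exhausted


sample = b"""<presets xmlns="http://josm.openstreetmap.de/tagging-preset-1.0">
<thing stuff="some stuff"/>
</presets>"""

root = parse_stripping(io.BytesIO(sample))
print(root.tag, [child.tag for child in root])
```

Attribute names are left untouched here; a document with namespaced attributes would need the same check applied to el.attrib.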
msg235724 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-02-11 04:05
See Issue 8583 for a proposal that would apparently allow all namespaces to be ignored.
msg271448 - (view) Author: jonathan miller (jjmiller50) Date: 2016-07-27 13:00
A flexible and pretty simple way of loosening up the handling of namespaces would be to OPTIONALLY change what is done at parse time:

1.  Don't handle xmlns declarations specially.  Leave them as normal attributes, and the Element.attrib would have a normal entry for each.

2. Leave the abbreviated, colon-separated prefix in front of the element tags as they come in.

If the using code wants, it can walk the ElementTree contents making dictionaries of the active namespace declarations, tucking a dict reference into each Element.  Maybe put in an ElementTree method that does this, why not?

I'm interested in this topic because I wish to handle XML from a variety of different tools, some of which had their XML elements defined without namespaces. They can use element names which are really common - like 'project' - and have no namespace definitions. Worse: if you put one in, the tool that originally used the element breaks.

Doing things as suggested gives the user the opportunity to look for matches using the colonized names, to shift namespace abbrevs easily, and to write out nicely namespaced code with abbrevs on the elements easily.

This would be OPTIONAL:  the way etree does it now, full prefixing of URI, is the safe way and should be retained as the default.
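
Something close to the colonized names can be approximated with the current parser by listening for "start-ns" events in iterparse(). A rough sketch under simplifying assumptions; prefixed_tags is a hypothetical helper and does not implement the proposed parse-time option itself:

```python
import io
import xml.etree.ElementTree as ET


def prefixed_tags(source):
    """Yield each element's tag rewritten as 'prefix:name'.

    Hypothetical helper approximating the proposal with today's parser.
    Simplification: a redeclared prefix or URI overwrites the earlier
    mapping instead of being scoped to its subtree.
    """
    uri_to_prefix = {}
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = payload  # payload is a (prefix, uri) tuple
            uri_to_prefix[uri] = prefix
            continue
        tag = payload.tag  # payload is the finished element
        if tag.startswith("{"):
            uri, _, local = tag[1:].rpartition("}")
            prefix = uri_to_prefix.get(uri, "")
            tag = prefix + ":" + local if prefix else local
        yield tag


sample = b'<p:project xmlns:p="urn:tool-a"><p:item/><plain/></p:project>'
names = list(prefixed_tags(io.BytesIO(sample)))
print(names)
```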
msg271466 - (view) Author: Stefan Behnel (scoder) * Date: 2016-07-27 17:59
Here is a proposed patch for a new function "strip_namespaces(tree)" that discards all namespaces from tags and attributes in a (sub-)tree, so that subsequent processing does not have to deal with them.

The "__all__" test is failing (have to figure out how to fix that), and docs are missing (it's only a proposal for now). Comments welcome.
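
For readers without the patch at hand, a function of that shape might look roughly like the following. This is an illustrative sketch only, not the code from etree_strip_namespaces.patch, and its handling of colliding attribute names is an assumption:

```python
import xml.etree.ElementTree as ET


def strip_namespaces(tree):
    """Remove '{uri}' prefixes from tags and attribute names, in place.

    Illustrative only; not the code from the attached patch.
    Attributes whose local names collide overwrite one another.
    """
    root = tree.getroot() if isinstance(tree, ET.ElementTree) else tree
    for el in root.iter():
        if isinstance(el.tag, str):  # comments and PIs have non-string tags
            el.tag = el.tag.rpartition("}")[2]
        for name in list(el.attrib):  # copy keys: the dict is modified below
            local = name.rpartition("}")[2]
            if local != name:
                el.attrib[local] = el.attrib.pop(name)
    return tree


root = ET.fromstring('<a xmlns="urn:x" xmlns:y="urn:y"><b y:c="1" d="2"/></a>')
strip_namespaces(root)
print(root.tag, root[0].tag, root[0].attrib)
```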
msg271472 - (view) Author: Stefan Behnel (scoder) * Date: 2016-07-27 19:47
On second thought, I think it should be supported (also?) in the parser. Otherwise, using it with an async parser would be different from (and more involved than) one-shot parsing. That seems wrong.
msg271475 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-07-27 23:22
Perhaps it would make more sense to use rpartition() or rstrip(). It seems possible to have a closing curly bracket in a namespace, but not in an element tag or attribute name.
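
A small illustration of the difference for rpartition(), using a contrived tag string whose namespace part contains a closing curly bracket:

```python
# Contrived tag whose namespace URI itself contains a closing curly bracket.
tag = "{urn:odd}namespace}thing"

by_split = tag.split("}", 1)[1]        # cuts at the first "}"
by_rpartition = tag.rpartition("}")[2]  # cuts at the last "}"
print(by_split)
print(by_rpartition)

# rpartition() also copes with a tag that has no namespace at all,
# where split("}", 1)[1] would raise IndexError.
plain = "thing".rpartition("}")[2]
print(plain)
```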

My guess is the __all__ failure is just a sign to add the new function to the __all__ variable at the top of the module.
History
Date User Action Args
2016-07-28 04:37:21 martin.panter set title: ElementTree -- provide a way to ignore namespace in tags and seaches -> ElementTree -- provide a way to ignore namespace in tags and searches
2016-07-27 23:22:48 martin.panter set messages: + msg271475
2016-07-27 19:47:37 scoder set messages: + msg271472
2016-07-27 17:59:04 scoder set files: + etree_strip_namespaces.patch
    keywords: + patch
    messages: + msg271466
    versions: + Python 3.6, - Python 3.4
2016-07-27 13:00:50 jjmiller50 set nosy: + jjmiller50
    messages: + msg271448
2015-02-11 04:05:25 martin.panter set messages: + msg235724
2014-04-18 05:08:06 scoder set messages: + msg216774
2014-04-18 01:12:08 rhettinger set nosy: + rhettinger
    messages: + msg216768
2014-04-17 23:14:32 martin.panter set nosy: + martin.panter
2014-04-11 09:00:07 pocek set nosy: + pocek
2013-08-12 04:34:27 scoder set messages: + msg194923
2013-08-12 04:21:23 scoder set messages: + msg194920
    title: ElementTree gets awkward to use if there is an xmlns -> ElementTree -- provide a way to ignore namespace in tags and seaches
2013-08-12 04:06:42 brycenesbitt set messages: + msg194919
    title: ElementTree -- provide a way to ignore namespace in tags and seaches -> ElementTree gets awkward to use if there is an xmlns
2013-08-11 17:41:00 eli.bendersky set messages: + msg194906
2013-08-11 16:17:22 scoder set messages: + msg194901
2013-08-11 16:05:44 eli.bendersky set messages: + msg194899
    title: ElementTree gets awkward to use if there is an xmlns -> ElementTree -- provide a way to ignore namespace in tags and seaches
2013-08-11 15:53:23 scoder set versions: + Python 3.4, - Python 2.7
    nosy: + scoder, eli.bendersky
    messages: + msg194896
    components: + Library (Lib), XML, - Extension Modules
2013-06-26 03:48:17 brycenesbitt create