classification
Title: Hardcoded namespace_separator in the cElementTree.XMLParser
Type: enhancement
Stage: resolved
Components: Library (Lib), XML
Versions: Python 3.4
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: ElementTree -- provide a way to ignore namespace in tags and searches (issue 18304)
Assigned To:
Nosy List: dmtr, effbot, eli.bendersky, flox, library.engine, loewis, martin.panter, scoder
Priority: normal
Keywords: patch

Created on 2010-04-30 22:57 by dmtr, last changed 2019-04-27 16:32 by scoder. This issue is now closed.

Files
File name | Uploaded | Description
issue-8583.patch | dmtr, 2010-04-30 23:02 | Issue 8583.patch Target: cElementTree-1.0.5-20051216
Messages (13)
msg104671 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-04-30 22:57
The namespace_separator parameter is hard-coded in the cElementTree.XMLParser class, which rules out ignoring XML namespaces when using the cElementTree library.

Here's the code example:
 from xml.etree.cElementTree import iterparse
 from StringIO import StringIO
 xml = """<root xmlns="http://www.very_long_url.com"><child/></root>"""
 for event, elem in iterparse(StringIO(xml)): print event, elem

It produces:
 end <Element '{http://www.very_long_url.com}child' at 0xb7ddfa58>
 end <Element '{http://www.very_long_url.com}root' at 0xb7ddfa40> 

In the current implementation, local tag names are forcibly concatenated with namespace URIs, which often results in ugly code on the user's side and in performance degradation (at least from the extra concatenations and the lengthier comparisons in the element-matching code).

Internally, cElementTree uses the Expat parser, which performs namespace processing only optionally; it is enabled by passing a value for the namespace_separator argument. That argument is hard-coded in cElementTree:
 self->parser = EXPAT(ParserCreate_MM)(encoding, &memory_handler, "}");
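
The effect of that argument can be seen with the standard library's own Expat binding. The following is a minimal Python 3 sketch, not part of the attached patch; the helper name element_names is made up for illustration. It collects the element names that Expat reports with and without a namespace separator:

```python
from xml.parsers import expat

XML = '<root xmlns="http://www.very_long_url.com"><child/></root>'

def element_names(namespace_separator):
    """Return the element names Expat reports for XML (illustrative helper)."""
    names = []
    # Passing a one-character separator turns on Expat's namespace processing;
    # passing None leaves it off.
    parser = expat.ParserCreate(namespace_separator=namespace_separator)
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(XML, True)
    return names

print(element_names("}"))
# ['http://www.very_long_url.com}root', 'http://www.very_long_url.com}child']
print(element_names(None))
# ['root', 'child']
```

With the separator set, Expat hands back the URI, the separator, and the local name joined together; cElementTree then turns that into the familiar '{uri}tag' form. Without it, the raw tag name comes through and the xmlns declaration is reported as an ordinary attribute.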

Attached is a patch exposing this parameter as an argument of cElementTree.XMLParser(). The parameter is optional, and the default behavior should be unchanged. Here's the test code:

import cElementTree

x = """<root xmlns="http://www.very_long_url.com"><child>text</child></root>"""

parser = cElementTree.XMLParser()
parser.feed(x)
elem = parser.close()
print elem

parser = cElementTree.XMLParser(namespace_separator="}")
parser.feed(x)
elem = parser.close()
print elem

parser = cElementTree.XMLParser(namespace_separator=None)
parser.feed(x)
elem = parser.close()
print elem

The resulting output:
<Element '{http://www.very_long_url.com}root' at 0xb7e885f0>
<Element '{http://www.very_long_url.com}root' at 0xb7e88608>
<Element 'root' at 0xb7e88458>
msg104676 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-04-30 23:25
And obviously iterparse can be either overridden in the local user code or patched in the library. Here's the iterparse code/test code:

import  cElementTree
from cStringIO import StringIO

class iterparse(object):
    root = None
    def __init__(self, file, events=None, namespace_separator = "}"):
        if not hasattr(file, 'read'):
            file = open(file, 'rb')
        self._file = file
        self._events = events
        self._namespace_separator = namespace_separator
    def __iter__(self):
        events = []
        b = cElementTree.TreeBuilder()
        p = cElementTree.XMLParser(b, namespace_separator= \
                                        self._namespace_separator)
        p._setevents(events, self._events)
        while 1:
          data = self._file.read(16384)
          if not data:
            break
          p.feed(data)
          for event in events:
            yield event
          del events[:]
        root = p.close()
        for event in events:
          yield event
        self.root = root


x = """<root xmlns="http://www.very_long_url.com"><child>text</child></root>"""
context = iterparse(StringIO(x), events=("start", "end", "start-ns"))
for event, elem in context: print event, elem

context = iterparse(StringIO(x), events=("start", "end", "start-ns"), namespace_separator = None)
for event, elem in context: print event, elem


It produces:
start-ns ('', 'http://www.very_long_url.com')
start <Element '{http://www.very_long_url.com}root' at 0xb7ccf650>
start <Element '{http://www.very_long_url.com}child' at 0xb7ccf5a8>
end <Element '{http://www.very_long_url.com}child' at 0xb7ccf5a8>
end <Element '{http://www.very_long_url.com}root' at 0xb7ccf650>
start <Element 'root' at 0xb7ccf620>
start <Element 'child' at 0xb7ccf458>
end <Element 'child' at 0xb7ccf458>
end <Element 'root' at 0xb7ccf620>

Note the absence of URIs and the ignored start-ns events in the 'namespace_separator = None' version.
msg104733 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-05-01 17:56
Namespaces are a fundamental part of the XML information model (both xpath and infoset) and all modern XML document formats, so I'm not sure what problem you're trying to solve by pretending that they don't exist.

It's a bit like modifying "import foo" to work like "from foo import *"...
msg104764 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-05-02 02:55
This patch does not modify the existing behavior of the library. The namespace_separator parameter is optional. The parameter already exists in the Expat library, but it is hard-coded in the cElementTree.XMLParser code.

Fredrik, yes, namespaces are a fundamental part of the XML information model. Yet the option of ignoring them is a very valuable one in performance-critical code.
msg104795 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2010-05-02 17:30
There is at least one valid use case: code that needs to deal with HTML and XHTML currently has to normalise the tag names in some way, which usually means that it will want to remove the namespaces from XHTML documents to make it look like plain HTML. It would be nice if the library could do this efficiently right in the parser by simply removing all namespace declarations. However, this doesn't really apply to (c)ElementTree where the parser does not support HTML parsing.

I'm -1 on the interface that the proposed patch adds. The keyword argument name and its semantics are badly chosen. A boolean flag will work much better.

The proposed feature will have to be used with great care by users. Code that depends on it is very fragile and will fail when an input document uses unexpected namespaces, e.g. to embed foreign content, or because it is actually written in a different XML language that just happens to have similar local tag names. This kind of code is rather hard to fix, as fixing it means that it will stop accepting documents that previously passed without problems. Rejecting broken input early is a virtue.
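
The fragility described above can be illustrated with a small Python 3 sketch. The two sample documents and the helper names (local_name, titles_ignoring_namespaces, titles_xhtml_only) are assumptions made up for this example; they do not come from the patch:

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'
XHTML = '<html xmlns="%s"><head><title>Page</title></head></html>' % XHTML_NS
SVG = '<svg xmlns="http://www.w3.org/2000/svg"><title>Drawing</title></svg>'

def local_name(tag):
    """Drop a leading '{uri}' from an ElementTree tag, if there is one."""
    return tag.rsplit('}', 1)[-1]

def titles_ignoring_namespaces(xml_text):
    """Collect <title> text while pretending namespaces do not exist."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if local_name(el.tag) == 'title']

def titles_xhtml_only(xml_text):
    """Collect <title> text only from elements in the XHTML namespace."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter('{%s}title' % XHTML_NS)]

# The namespace-blind version accepts an SVG document as if it were XHTML.
print(titles_ignoring_namespaces(SVG))   # ['Drawing']
# The namespace-aware version finds nothing in the SVG document.
print(titles_xhtml_only(SVG))            # []
print(titles_xhtml_only(XHTML))          # ['Page']
```

Code written against the first function keeps "working" on documents in an unrelated XML language, which is the silent failure mode this message warns about.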

All in all, I'm -0.5 on this feature as I'd expect most use cases to be premature optimisations with potentially dangerous side effects more than anything else.
msg104815 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-05-03 04:38
I agree that the argument name choice is poor. But it has already been made by whoever coded the Expat parser that cElementTree.XMLParser wraps, so there is not much room here.

As for "the proposed feature will have to be used with great care by users": this is already taken care of. The cElementTree.XMLParser class is a rather obscure one. As I understand it, it is only used by users who need high-performance XML parsing for large datasets (10GB - 10TB range) in data-mining applications.
msg104816 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-05-03 05:03
Interestingly, in precisely these applications you often don't care about namespaces at all. Often all you need is to extract 'text' or 'name' elements regardless of the namespace.
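
For comparison, a namespace-agnostic search of this kind can be written in later Python versions without touching the parser. The sketch below assumes Python 3.8 or newer, where the find*() methods of xml.etree.ElementTree accept the '{*}tag' wildcard; the sample document and the second namespace URI are made up for illustration:

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://www.very_long_url.com" xmlns:o="http://example.com/other">
  <name>first</name>
  <o:name>second</o:name>
</root>"""

root = ET.fromstring(DOC)

# '{*}name' matches elements called 'name' in any namespace (or in none).
names = [el.text for el in root.findall('.//{*}name')]
print(names)  # ['first', 'second']
```

The tags stored in the tree still carry their '{uri}' prefix; only the search ignores it, so this does not remove the concatenation cost discussed earlier in the thread.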
msg137104 - (view) Author: (library.engine) Date: 2011-05-28 01:55
I second the request for tag names that are not prefixed with the root namespace in Python, mostly because of the ugly code; the performance degradation is negligible on relatively small files. But this ubiquitous repetition (even if you append a variable to every tag name) goes against the DRY principle, and I don't like it.
I think an extra option to pass a list of namespaces that should NOT be prepended to the tag names would be sufficient.
msg137106 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2011-05-28 07:04
I don't see this having much to do with the DRY principle. It's "explicit is better than implicit" and "better safe than sorry" that applies here.
msg137164 - (view) Author: (library.engine) Date: 2011-05-29 01:54
What is so implicit about passing a list of undesired namespaces to the parse function?
This is quite explicit, in my humble opinion, and it also lets you avoid repeating yourself for each and every tag you want to find in the tree.
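
One way to approximate that idea today is to post-process the parsed tree. This is a rough Python 3 sketch rather than an existing ElementTree option: the function strip_namespaces and the sample document are hypothetical, and attribute names are left untouched.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root, namespaces):
    """Remove the '{uri}' prefix from tags in the listed namespaces (hypothetical helper)."""
    prefixes = tuple('{%s}' % uri for uri in namespaces)
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(el.tag, str):
            for prefix in prefixes:
                if el.tag.startswith(prefix):
                    el.tag = el.tag[len(prefix):]
                    break
    return root

DOC = """<root xmlns="http://www.very_long_url.com" xmlns:o="http://example.com/other">
  <child>text</child>
  <o:child>foreign</o:child>
</root>"""

root = strip_namespaces(ET.fromstring(DOC), ['http://www.very_long_url.com'])
print([el.tag for el in root.iter()])
# ['root', 'child', '{http://example.com/other}child']
```

Only the namespaces named explicitly are dropped, so content from an unexpected namespace stays distinguishable, at the cost of an extra pass over the tree after parsing.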
msg166024 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2012-07-21 14:05
See also issue 13378 which proposes custom namespace maps for serializing.
msg235725 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-02-11 04:06
See also issue 18304 for more discussion on simplifying namespaces.
msg340999 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-04-27 16:32
Closing as a duplicate of the more general issue 18304.
History
Date User Action Args
2019-04-27 16:32:07 scoder set status: open -> closed; superseder: ElementTree -- provide a way to ignore namespace in tags and searches; messages: + msg340999; resolution: duplicate; stage: resolved
2015-02-11 04:06:49 martin.panter set nosy: + martin.panter; messages: + msg235725
2012-07-21 14:05:36 flox set versions: + Python 3.4, - Python 3.2; nosy: + eli.bendersky; messages: + msg166024; components: + XML
2011-05-29 01:54:35 library.engine set messages: + msg137164
2011-05-28 07:28:52 loewis set messages: - msg137107
2011-05-28 07:27:47 loewis set nosy: + loewis; messages: + msg137107
2011-05-28 07:04:35 scoder set messages: + msg137106
2011-05-28 01:55:24 library.engine set nosy: + library.engine; messages: + msg137104
2010-08-04 20:11:46 terry.reedy set type: performance -> enhancement; versions: + Python 3.2, - Python 2.6, Python 2.5, Python 2.7
2010-05-03 05:03:47 dmtr set messages: + msg104816
2010-05-03 04:38:35 dmtr set messages: + msg104815
2010-05-02 17:30:22 scoder set nosy: + scoder; messages: + msg104795
2010-05-02 02:55:03 dmtr set messages: + msg104764
2010-05-01 17:56:49 effbot set nosy: + effbot; messages: + msg104733
2010-04-30 23:25:56 dmtr set messages: + msg104676
2010-04-30 23:02:12 dmtr set files: + issue-8583.patch; keywords: + patch
2010-04-30 23:00:52 brian.curtin set nosy: + flox
2010-04-30 22:57:27 dmtr create