Classification
Title: Add StopParser(), ResumeParser, and GetParsingStatus to expat
Type: enhancement
Stage:
Components: Documentation, XML
Versions: Python 3.4
Process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: amaury.forgeotdarc, docs@python, loewis, nemeskeyd
Priority: normal
Keywords:

Created on 2012-08-24 08:18 by nemeskeyd, last changed 2012-09-06 11:26 by loewis.

Messages (9)
msg168980 - (view) Author: Dávid Nemeskey (nemeskeyd) Date: 2012-08-24 08:18
The C expat library provides an XML_StopParser() function that allows parsing to be stopped from within the handler functions. It would be nice to have this option in Python as well, maybe by adding a StopParser() method to the XMLParser class.
msg169207 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-08-27 16:22
If a handler function raises an exception, the Parse() method exits and the exception is propagated; internally, this also calls XML_StopParser().
Why would one call XML_StopParser() explicitly?
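
The propagation described above can be checked with a short sketch. This is only an illustration: HandlerError and parse_with_failing_handler are hypothetical names introduced here, not part of the expat module.

```python
import xml.parsers.expat


class HandlerError(Exception):
    """Hypothetical exception type, used only for this illustration."""


def parse_with_failing_handler(data):
    """Parse data; report whether a handler exception left Parse()."""

    def start_element(name, attrs):
        # Raise from inside a handler when a <bad> element starts.
        if name == "bad":
            raise HandlerError(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    try:
        parser.Parse(data, True)
    except HandlerError as exc:
        # The exception raised in the handler surfaces at the Parse() call.
        return "propagated: %s" % exc
    return "no exception"
```

Calling parse_with_failing_handler("<root><bad/></root>") should take the except branch, while a document without a <bad> element should parse to the end.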
msg169255 - (view) Author: Dávid Nemeskey (nemeskeyd) Date: 2012-08-28 08:17
OK, then this issue has a "bug" part, too: the documentation does not mention that exceptions from the handler methods propagate through the Parse() method. I guess the parser can then be stopped this way too, but it is a dirty method compared to calling StopParser().

To answer your question, there are several situations where StopParser() could come in handy. For instance, the XML file might contain records (such as the output of a search engine), of which we only need the first n. Another example: while reading through the file we realize halfway that it does not contain the information we need, contains wrong information, etc., so we want to skip the rest of it. Since the file might be huge and XML parsing can in no way be considered fast, being able to stop the parsing in a clean way would spare the superfluous and possibly lengthy computation.
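
As a rough sketch of the "first n records" use case with what is available today, parsing can be cut short by raising an exception from a handler once enough records have been collected. The names EnoughRecords and first_n_records, and the idea that each record is the text of one element, are assumptions made for this illustration only.

```python
import xml.parsers.expat


class EnoughRecords(Exception):
    """Hypothetical signal: the requested number of records was reached."""


def first_n_records(data, tag, n):
    """Return the text of the first n <tag> elements found in data."""
    if n <= 0:
        return []

    records = []
    current = None  # text chunks of the record being read, or None

    def start_element(name, attrs):
        nonlocal current
        if name == tag:
            current = []

    def char_data(text):
        # Character data may arrive in several chunks per element.
        if current is not None:
            current.append(text)

    def end_element(name):
        nonlocal current
        if name == tag and current is not None:
            records.append("".join(current))
            current = None
            if len(records) >= n:
                # Leave Parse() early; the rest of the input is skipped.
                raise EnoughRecords()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end_element
    try:
        parser.Parse(data, True)
    except EnoughRecords:
        pass
    return records
```

A dedicated StopParser() method would let end_element stop the parser directly instead of raising; the collected result would be the same.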
msg169281 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-08-28 13:39
nemeskeyd: would you like to work on a patch (for Python 3.4)?
msg169285 - (view) Author: Dávid Nemeskey (nemeskeyd) Date: 2012-08-28 15:34
loewis: I don't think it would be difficult to fix, so theoretically I'd be in. However, I don't really have the time to work on this right now.
msg169879 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-09-05 16:33
Below is a sample script that shows that it's possible to stop parsing XML in the middle, without an explicit call to XML_StopParser(): raise StopParsing from any handler, and catch it around the Parse() call.

This method covers the two proposed use cases.  Do we need another way to do it?


import xml.parsers.expat

class StopParsing(Exception):
    pass

def findFirstElementByName(data, what):
    # Handler: abort parsing as soon as the wanted element closes.
    def end_element(name):
        if name == what:
            raise StopParsing(name)

    p = xml.parsers.expat.ParserCreate()
    p.EndElementHandler = end_element

    try:
        p.Parse(data, True)
    except StopParsing as e:
        # The exception raised in the handler propagates out of Parse().
        print("Element found:", e)
    else:
        print("Element not found")

data = """<?xml version="1.0"?>
         <parent id="top"><child1 name="paul">Text goes here</child1>
         <child2 name="fred">More text</child2>
         </parent>"""
findFirstElementByName(data, "child2")   # Found
findFirstElementByName(data, "child3")   # Not found
msg169905 - (view) Author: Dávid Nemeskey (nemeskeyd) Date: 2012-09-06 07:28
Amaury: see my previous comment. There are two problems with the method you proposed:

1. It is not mentioned in the documentation that exceptions are propagated through Parse().
2. Exceptions usually mean that an error has happened, and are not the preferred way to do flow control (at least this is the policy in other languages, e.g. Java; I don't know about Python).
msg169906 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-09-06 09:30
Your first point is true, even if the Zen of Python (try "import this") states that "Errors should never pass silently."

For your second point: exceptions are a common thing in Python code.  This is similar to the EAFP principle http://docs.python.org/glossary.html#term-eafp
Also, this example http://docs.python.org/release/2.7.3/library/imp.html#examples shows that exceptions can be part of the normal flow control.
msg169913 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-09-06 11:26
Dávid: Another (similar) example is the Python for loop. In its original form, it would increase an index and invoke __getitem__ until that *raised* IndexError. In the current definition, it converts the iterated-over object into an iterator, and keeps calling .next (__next__ in Python 3) until that *raises* StopIteration.

So raising an exception to indicate that something is finished is an established Python idiom.
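
Both forms of the loop protocol described above can be spelled out by hand. This is a minimal sketch in Python 3; Squares and manual_for are hypothetical names used only for illustration.

```python
class Squares:
    """Sequence-style iterable: only __getitem__, no __iter__."""

    def __init__(self, count):
        self.count = count

    def __getitem__(self, index):
        # Raising IndexError is how the old protocol signals "finished".
        if index >= self.count:
            raise IndexError(index)
        return index * index


def manual_for(iterable):
    """Do by hand what a for loop does: call next() until StopIteration."""
    result = []
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            # Normal end of iteration, not an error.
            break
        result.append(item)
    return result
```

Here manual_for(Squares(4)) and list(Squares(4)) both yield [0, 1, 4, 9]: iter() falls back to the __getitem__ protocol and turns the IndexError into the end of iteration.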

In any case, I still think adding StopParser is a useful addition, in particular since that would also allow giving True as the "resumable" argument. Any such change needs to be accompanied by also exposing XML_ResumeParser, and possibly XML_GetParsingStatus.

Since we all agree that this is not an important change, I don't mind keeping this issue around until someone comes along to contribute code for it.
History
Date User Action Args
2012-09-06 11:26:29 loewis set messages: + msg169913
title: Add StopParser() to expat -> Add StopParser(), ResumeParser, and GetParsingStatus to expat
2012-09-06 09:30:44 amaury.forgeotdarc set nosy: + docs@python
messages: + msg169906
assignee: docs@python
components: + Documentation
2012-09-06 07:28:31 nemeskeyd set messages: + msg169905
2012-09-05 16:33:22 amaury.forgeotdarc set messages: + msg169879
2012-08-31 21:29:01 berker.peksag set versions: + Python 3.4
2012-08-28 15:34:58 nemeskeyd set messages: + msg169285
2012-08-28 13:39:37 loewis set nosy: + loewis
messages: + msg169281
2012-08-28 08:17:04 nemeskeyd set messages: + msg169255
2012-08-27 16:22:20 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
messages: + msg169207
2012-08-24 08:18:17 nemeskeyd create