Issue 43560: Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87726

classification

Title:	Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks
Type:	enhancement	Stage:
Components:	XML	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ridgerat1611
Priority:	normal	Keywords:

Created on 2021-03-19 18:39 by ridgerat1611, last changed 2022-04-11 14:59 by admin.

Messages (1)
msg389108 - (view)

Author: Larry Trammell (ridgerat1611) *

Date: 2021-03-19 18:39

Issue 43483 was posted as a "bug" but retracted.  Though the problem is real, it is tricky to declare an UNSPECIFIED behavior to be a bug.  See that issue page for more discussion and a test case.  A brief overview is repeated here.

SCENARIO - XML PARSING LOSES DATA (or not)

The parsing attempts to capture text consisting of very tiny quoted strings. A typical content line reads something like this: 

   <p>Colchuck</p>

The parser implements a scheme presented at various tutorial Web sites, using two member functions. 

   # Note the name attribute of the current tag group
   def element_handler(self, tagname, attrs):
       self.CurrentTag = tagname

   # Record the content from each "p" tag when encountered
   def characters(self, content):
       if self.CurrentTag == "p":
           self.name = content

   ...

   > print(parser.name)
   "Colchuck" 

But then, after successfully extracting content from perhaps hundreds of thousands of XML tag sets in this way, the parsing suddenly "drops" a few characters of content. 

   > print(parser.name)
   "lchuck" 

While this problem was observed with a SAX parser, it can affect expat parsers as well.  It affects 32-bit and 64-bit implementations alike, across several major releases of Python 3.
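
For the expat case, one existing mitigation may be worth noting. The sketch below assumes the documented pyexpat attribute buffer_text, which asks the parser object to buffer text and reduce the number of CharacterDataHandler calls where possible. The helper name collect_text is made up for illustration, and this is not the guarantee requested in this issue: text may still arrive in more than one call.

```python
import xml.parsers.expat


def collect_text(document, buffer_text):
    """Parse `document` with pyexpat and return the character-data chunks."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # buffer_text is a documented pyexpat attribute; when true, pyexpat
    # tries to merge text before calling CharacterDataHandler.
    parser.buffer_text = buffer_text
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)  # True marks this as the final piece of input
    return chunks


# The chunk boundaries are not promised either way, so only the joined
# text should be relied upon.
for flag in (False, True):
    chunks = collect_text("<p>Colchuck</p>", buffer_text=flag)
    print(flag, chunks, "".join(chunks))
```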

SPECIFIED BEHAVIOR (or not) 

The "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (and many prior versions) states:

-----------
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data.  SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks...
-----------

If it happens that the content is delivered in two chunks instead of one, the characters() method shown above overwrites the first part of the text with the second part, and some content seems lost.  This completely explains the observed behavior.  
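
The overwrite can be reproduced without a parser by calling the handler methods directly. The following is a minimal sketch: it uses the standard xml.sax method names (startElement, characters, endElement) rather than the element_handler name in the snippet above, and the class names and the simulate_split_delivery helper are invented for illustration. It compares the tutorial-style handler with one that accumulates chunks until the end tag.

```python
from xml.sax.handler import ContentHandler


class OverwritingHandler(ContentHandler):
    """Tutorial-style handler: keeps only the most recent chunk."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.name = None

    def startElement(self, name, attrs):
        self.current_tag = name

    def characters(self, content):
        if self.current_tag == "p":
            self.name = content  # any earlier chunk is overwritten here


class AccumulatingHandler(ContentHandler):
    """Collects every chunk and joins them when the element closes."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self._parts = []
        self.name = None

    def startElement(self, name, attrs):
        self.current_tag = name
        self._parts = []

    def characters(self, content):
        if self.current_tag == "p":
            self._parts.append(content)

    def endElement(self, name):
        if name == "p":
            self.name = "".join(self._parts)
        self.current_tag = None


def simulate_split_delivery(handler):
    """Deliver the text of <p>Colchuck</p> as two chunks, as a parser may."""
    handler.startElement("p", {})
    handler.characters("Co")
    handler.characters("lchuck")
    handler.endElement("p")
    return handler.name


print(simulate_split_delivery(OverwritingHandler()))   # lchuck
print(simulate_split_delivery(AccumulatingHandler()))  # Colchuck
```

Accumulating until endElement works under the documented behavior regardless of how the parser chooses to split the text.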

EXPECTED BEHAVIOR (or not)

Even though the behavior is unspecified, users can have certain expectations about what a reasonable parser should do.  Among these:

  -- EFFICIENCY: the parser should do simple things simply, and complicated things as simply as possible
  -- CONSISTENCY: the parser behavior should be repeatable and dependable

The design can be considered "poor" if thorough testing cannot identify what the actual behaviors are going to be, because those behaviors are rare and unpredictable.

The obvious "simple thing," from the user perspective, is that the parser should return each tiny text string as one tiny text chunk.  In fact, this is precisely what it does... 99.999% of the time.  But then, suddenly, it doesn't.  

One hypothesis is that when the parsing scan of raw input text reaches the end of a large internal text buffer, it is easier from the implementer's perspective to flush any text remaining in the old buffer prior to fetching a new one, even if that produces a fragmented chunk with only a couple of characters.  
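
One way to probe this hypothesis is to control where the input is cut and record what characters() receives. The sketch below assumes the incremental feed()/close() interface of the reader returned by xml.sax.make_parser(); the names ChunkRecorder and chunks_for_split are made up for illustration. The chunk boundaries it prints depend on the Python and expat versions in use, so no particular output is claimed.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ChunkRecorder(ContentHandler):
    """Records every chunk passed to characters(), in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def chunks_for_split(document, split_at):
    """Feed `document` in two pieces, cut at `split_at`, and return the chunks."""
    recorder = ChunkRecorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(recorder)
    parser.feed(document[:split_at])
    parser.feed(document[split_at:])
    parser.close()
    return recorder.chunks


document = "<p>Colchuck</p>"
for split_at in range(1, len(document)):
    # Only the joined text is dependable; the boundaries may vary.
    print(split_at, chunks_for_split(document, split_at))
```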

IMPROVEMENTS REQUIRED

Review the code to determine whether the text buffer scenario is in fact the primary cause of inconsistent behavior. Modify the data handling to defer delivery of content fragments that are small, carrying over a small amount of previously scanned text so that small contiguous text chunks are recombined rather than reported as multiple fragments. If the length of the content text to carry over is greater than some configurable xml.sax.handler.ContiguousChunkLength, the parser can go ahead and deliver it as a fragment.  
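
The requested behavior can be approximated today in user code. The following is a rough sketch only, not the proposed parser change: a wrapper ContentHandler holds back character data and releases it either when another event arrives or when the held text reaches a limit. The class names and the chunk_limit parameter are hypothetical; chunk_limit stands in for the proposed xml.sax.handler.ContiguousChunkLength, which does not exist. Only a few events are forwarded, to keep the example short.

```python
from xml.sax.handler import ContentHandler


class ChunkRecorder(ContentHandler):
    """Downstream handler used for the demonstration; records text chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


class CoalescingHandler(ContentHandler):
    """Forwards events to `downstream`, merging adjacent character chunks."""

    def __init__(self, downstream, chunk_limit=1024):
        super().__init__()
        self._downstream = downstream
        self._chunk_limit = chunk_limit  # hypothetical ContiguousChunkLength
        self._pending = []
        self._pending_len = 0

    def _flush(self):
        # Deliver everything held so far as one chunk.
        if self._pending:
            self._downstream.characters("".join(self._pending))
            self._pending = []
            self._pending_len = 0

    def characters(self, content):
        self._pending.append(content)
        self._pending_len += len(content)
        # Text that has reached the limit may be delivered as a fragment.
        if self._pending_len >= self._chunk_limit:
            self._flush()

    def startElement(self, name, attrs):
        self._flush()
        self._downstream.startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        self._downstream.endElement(name)

    def endDocument(self):
        self._flush()
        self._downstream.endDocument()


recorder = ChunkRecorder()
coalescer = CoalescingHandler(recorder, chunk_limit=1024)
coalescer.startElement("p", {})
coalescer.characters("Co")       # a fragment, as a parser might deliver it
coalescer.characters("lchuck")
coalescer.endElement("p")
print(recorder.chunks)           # ['Colchuck']
```

A complete wrapper would also flush before the remaining ContentHandler events (processingInstruction, ignorableWhitespace, and so on) and forward them.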

DOCUMENTING THE IMPROVEMENTS 

Strictly speaking:  none required.  Undefined behaviors are undefined, whether consistent or otherwise.  But after the improvements are implemented, it would be helpful to modify documentation to expose the new performance guarantees, making users more aware of the possible hazards.  For example, a new description in the "xml.sax.handler" page might read as follows: 

-----------
ContentHandler.characters(content) -- The Parser will call this method to report chunks of character data.  In general, character data may be reported as a single chunk or as a sequence of chunks; but character data sequences with fewer than xml.sax.handler.ContiguousChunkLength characters, when uninterrupted by any other xml.sax.handler.ContentHandler event, are guaranteed to be delivered as a single chunk...
-----------

History
Date	User	Action	Args
2022-04-11 14:59:43	admin	set	github: 87726
2021-03-19 18:39:18	ridgerat1611	create