This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Title: Loss of content in simple (but oversize) SAX parsing
Type: behavior
Stage: resolved
Components: XML
Versions: Python 3.8, Python 3.7
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.smith, ridgerat1611
Priority: normal
Keywords:

Created on 2021-03-13 01:35 by ridgerat1611, last changed 2022-04-11 14:59 by admin. This issue is now closed.

File uploaded by ridgerat1611, 2021-03-13 01:35: Test script and (ugh) bulky test dataset for SAX data loss
Messages (17)
msg388582 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-13 01:35
== The Problem ==

I have observed a "loss of data" problem using the Python SAX parser when processing an oversize but very simple machine-generated XHTML file. The file represents a single N x 11 data table. W3C "tidy" reports no XML errors. The table is constructed in an entirely plausible manner, using table, tr, and td tags to define the table structure, and p tags to bracket content, which consists of small chunks of quoted text. There is nothing pathological: no extraneous whitespace characters, no empty data fields.

Everything works perfectly in small test cases.  But when a very large number of rows are present, a few characters of content strings are occasionally lost. I have observed 2 or 6 characters dropped.  But here's the strange part.  The pathological behavior disappears (or moves to another location) when one or more non-significant whitespace characters are inserted at an arbitrary location early in the file... e.g. an extra linefeed before the first tr tag. 

== Context ==

I have observed identical behavior on desktop systems using an Intel Xeon E5-1607 or a Core-2 processor, running 32-bit or 64-bit Linux operating systems, variously using Python 3.8.5, 3.8, 3.7.3, and 3.5.1.

== Observing the Problem == 

Sorry that the test data is so bulky (even at 0.5% of original size), but bulk appears to be a necessary condition to observe the problem. Run the following command line.  

python3  EnchTestData.html 

The test script invokes the SAX parser and generates messages on stdout. Using the original test data as provided, the test should run correctly to completion. Now modify the test data file, deleting the extraneous comment line (there is only one) found near the top of the file. Repeat the test run, and this time look for missing content characters in parsed content fields of the last record.

== Any guesses? ==

Beyond "user is oblivious," possibly something abnormal can occur at seams between large blocks of buffered text.  The presence or absence of an extra character early in the data stream results in a corresponding shift in content location at the end of the buffer.  Other clues: is it relevant that the problem appears in a string field that contains slash characters?
msg388638 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-13 23:49
Not a bug, strictly speaking... more like user abuse.

The parsers (expat as well as SAX) must be able to return content text as a sequence of pieces when necessary. For example, as a text sequence interrupted by grouping or styling tags (like <span> or <i>).  Or, extensive text blocks might need to be subdivided for efficient processing.  Users would expect hazards like these and be wary.  But how many users would suspect that a quoted string of length 8 characters would be returned in multiple pieces?  Or that an entity notation would be split down the middle?  Virtually all existing tutorial examples showing content extraction are WRONG -- because the ONLY content that can be trusted must be filtered through some kind of aggregator object.  How many users will know this instinctively?  

It would be very useful for the parser systems to provide some kind of support for a text-aggregation function. A guarantee that "small contiguous" text items will not be chopped might also be helpful.
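As a sketch of the aggregation pattern argued for above (the class and attribute names here are illustrative, not an existing API): text is accumulated across `characters()` callbacks and trusted only once an element boundary shows the run is complete.

```python
import xml.sax

class AggregatingHandler(xml.sax.ContentHandler):
    """Buffers characters() chunks; trusts text only once it is complete."""

    def __init__(self):
        super().__init__()
        self._buf = []
        self.texts = []              # finished text runs, one per element

    def characters(self, content):
        # Never assume one callback per logical string: just accumulate.
        self._buf.append(content)

    def _flush(self):
        if self._buf:
            self.texts.append("".join(self._buf))
            self._buf = []

    def startElement(self, name, attrs):
        self._flush()                # text before a child element is complete

    def endElement(self, name):
        self._flush()                # element closed, so its text is complete

handler = AggregatingHandler()
xml.sax.parseString(b"<catalog><type>SciFi</type></catalog>", handler)
print(handler.texts)                 # ['SciFi'] however the parser chunked it
```

An application that instead assigns `self.type = content` inside `characters()` silently keeps only the last chunk whenever the parser splits the run.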
msg388713 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2021-03-15 08:52
Perhaps you could open a documentation bug? I think specific examples of where the documentation is wrong, and how it could be improved, would be helpful.

msg388801 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-16 02:42
I can't find any real errors in documentation.  There are subtle design and implementation decisions that result in unexpected rare side effects.  After processing hundreds of thousands of lines one way, why would the parser suddenly decide to process the next line differently?  Well, because it can, and it happens to be convenient.  And that can catch users off-guard.

I'm considering whether posting an "enhancement" issue would be more appropriate... maybe there is a way to make the parser systems work more nearly the way people currently expect, without breaking things.
msg388818 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2021-03-16 05:10
I think we could document where a "quoted string of length 8 characters would be returned in multiple pieces" occurs. Which API is that?

If we change that, and if we call it an enhancement instead of a bug fix, then it can't be backported. It would be worth doing both: document the behavior for old releases, and change the behavior for new releases.
msg388930 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-17 15:58
Assuming that my understanding is correct, the situation is that the XML parser has an unspecified behavior. This is true in any text content handler, at any time, and applies to the expat parser as well as SAX. In some rare cases the behavior of the current implementation (and also many past ones) seems inconsistent and can catch users by surprise -- even some who are relatively knowledgeable (which does not include me).

This is a little abstract, but two things could be done to improve this:

1. Modify the implementation so that the behavior remains unspecified but falls more in line with plausible expectations of the users.  This makes things a little more complicated for the implementer, but does not invalidate the documentation of present or past versions. 

2. The documentation could be updated to expose the new constraints on the previously unspecified behavior, giving users a better chance to recognize and prepare for any remaining difficulties.  However, the implementation changes could be made even without these documentation changes.

So I remain confused about whether this is really a "bug" -- it is an "easy but unfortunate implementation choice" that is technically not wrong, even if sometimes baffling.  Established applications that already use older parser versions are relatively unlikely to start failing given the kind of documents they process, so backport changes might be helpful but do not seem urgent. 

Eric, with this clarification, what is your opinion about how to properly post a new issue -- improvement or bug fix?  I can provide a more detailed technical explanation where a new issue is posted.
msg388932 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2021-03-17 16:12
Could you give an example (using a list of callbacks and values or something) that shows how it's behaving that you think is problematic? That's the part I'm not understanding. This doesn't have to be a real example, just show what the user is getting that's not obvious to the normal user.
msg388938 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-17 16:56
Sure...  I'll cut and paste some of the text I was organizing to go into a possible new issue page.

The only relevant documentation I could find was in the "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (as it has been through many versions):

ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data.  SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks...

As an example, here is a typical snippet taken from a Web tutorial page.

The application example records the tag name "type" in the "CurrentData" member, and shortly thereafter, the "type" tag's content is received:

   # Call when a character is read
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content

Suppose that the parser receives the following text line from the input file:

   <type>SciFi</type>

Though there seems no reason for it, the parser could decide to deliver the content text as "Sc" followed by "iFi".  In that case, a second invocation of the "characters" method would overwrite the characters received in the first invocation, and some of the content text seems "lost."  

Given how rarely it happens, I suspect that when internal processing reaches the end of a block of buffered text from the input file, the easiest thing to do is to report any fragments of text that happen to remain at the end, no matter how tiny, and start fresh with the next internal buffer. Easy for the implementer, but baffling to the application developer.  And rare enough to elude application testing.
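The buffer-seam behavior described above can be reproduced deliberately by feeding expat a document in two pieces. Here the chunk boundary is chosen by hand; with a large file, the parser's own read buffer simply happens to land mid-string. This is a minimal sketch, not the exact mechanism inside `xml.sax`.

```python
from xml.parsers.expat import ParserCreate

pieces = []
parser = ParserCreate()
parser.buffer_text = False            # report text exactly as it is scanned
parser.CharacterDataHandler = pieces.append

# Feed the document in two arbitrary chunks, the way a streaming
# reader does when its input buffer happens to end mid-string.
parser.Parse("<type>Sc", False)
parser.Parse("iFi</type>", True)

print(pieces)   # typically ['Sc', 'iFi']: one 5-character string, two callbacks
```

Joining the pieces recovers the full content; overwriting on each callback, as in the tutorial snippet above, does not.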
msg388939 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2021-03-17 17:04
Thanks, that's very helpful. Does this only affect content text?

This should definitely be documented.

As far as changing it, I think the best thing to do is say that if the content text is less than some size (I don't know, maybe 1MB?) it's guaranteed to be in one callback, but if it's larger than that it might be in multiple chunks. I think you could open this as a feature request. I have no idea how difficult or expensive it would be to implement this.
msg388940 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-17 17:14
Great minds think alike I guess... 

I was thinking of a much smaller carryover size... maybe 1K. With individual text blocks longer than that, the user will almost certainly be dealing with collecting and aggregating content text anyway, and in that case, the problem is solved before it happens. 
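For reference, the expat binding already exposes knobs in this spirit: `buffer_text` coalesces adjacent character-data callbacks and `buffer_size` sets the flush threshold. This does not implement the guarantee discussed above, since the buffer can still flush at a seam, but it shows where such a setting would live.

```python
from xml.parsers.expat import ParserCreate

pieces = []
parser = ParserCreate()
parser.buffer_text = True             # coalesce adjacent character data
parser.buffer_size = 1024             # flush threshold, in bytes
parser.CharacterDataHandler = pieces.append

parser.Parse("<type>SciFi</type>", True)
print(pieces)                         # ['SciFi'] -- a single callback here
```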

Here is a documentation change I was experimenting with...

ContentHandler.characters(content) -- The Parser will call this method to report chunks of character data. In general, character data may be reported as a single chunk or as a sequence of chunks; but character data sequences with fewer than xml.sax.handler.ContiguousChunkLength characters, when uninterrupted by any other xml.sax.handler.ContentHandler event, are guaranteed to be delivered as a single chunk...

That puts users on notice, "...wait, are my chunks of text smaller than that?" and they are less likely to be caught unaware.  But of course, the implementation change would be helpful even without this extra warning.
msg388942 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2021-03-17 17:18
I think that's good text, once the enhancement is made. But for existing versions of python, shouldn't we just document that the text might come back in chunks?

I don't have a feel for what the limit should be.
msg388946 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-17 17:20
Oh, and whether this affects only content text...

I would presume so, but I don't know how to tell for sure.  Unspecified behaviors can be very mysterious!
msg388952 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-17 18:19
I think the existing ContentHandler.characters(content) documentation DOES say that the text can come back in chunks... but it is subtle. It might be possible to say more explicitly that any content, no matter how small, is allowed to be returned as any number of chunks at any time. Though true, that is harsh, overstating considerably what actually happens. Concentrating on a better implementation would be more effective than worrying about existing documentation, given how long the existing conditions have prevailed. My opinion, as one who has been bitten.
msg388959 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-17 18:56
If there were a decision NOT TO FIX... maybe then it would make sense to consider documentation patches at a higher priority.  That way, SAX-Python (and expat-Python) tutorials across the Web could start patching their presentations accordingly.
msg389029 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-18 16:33
Eric, now that you know as much as I do about the nature and scope of the peculiar parsing behavior, do you have any suggestions about how to proceed from here?
msg389030 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2021-03-18 16:42
I'd add a note to the docs about it, then open a feature request to change the behavior. You could turn this issue into a documentation fix.

Unfortunately I don't know if there's a core dev who pays attention to the XML parsers. But I can probably find out.
msg389112 - (view) Author: Larry Trammell (ridgerat1611) * Date: 2021-03-19 18:56
Check out issues:

43560 (an enhancement issue to improve handling of small XML content chunks)

43561 (a documentation issue to give users warning about the hazard in the interim before the changes are implemented)
Date User Action Args
2022-04-11 14:59:42  admin  set  github: 87649
2021-03-19 18:56:06  ridgerat1611  set  messages: + msg389112
2021-03-18 16:42:51  eric.smith  set  messages: + msg389030
2021-03-18 16:33:56  ridgerat1611  set  messages: + msg389029
2021-03-17 18:56:17  ridgerat1611  set  messages: + msg388959
2021-03-17 18:19:37  ridgerat1611  set  messages: + msg388952
2021-03-17 17:20:17  ridgerat1611  set  messages: + msg388946
2021-03-17 17:18:00  eric.smith  set  messages: + msg388942
2021-03-17 17:14:44  ridgerat1611  set  messages: + msg388940
2021-03-17 17:04:22  eric.smith  set  messages: + msg388939
2021-03-17 16:56:50  ridgerat1611  set  messages: + msg388938
2021-03-17 16:12:47  eric.smith  set  messages: + msg388932
2021-03-17 15:58:32  ridgerat1611  set  messages: + msg388930
2021-03-16 05:10:03  eric.smith  set  messages: + msg388818
2021-03-16 02:42:02  ridgerat1611  set  messages: + msg388801
2021-03-15 08:52:53  eric.smith  set  nosy: + eric.smith; messages: + msg388713
2021-03-13 23:49:17  ridgerat1611  set  status: open -> closed; resolution: not a bug; messages: + msg388638; stage: resolved
2021-03-13 01:35:52  ridgerat1611  create