
Author ridgerat1611
Recipients ridgerat1611
Date 2021-03-13.01:35:51
Message-id <1615599352.35.0.560655392831.issue43483@roundup.psfhosted.org>
Content
== The Problem ==

I have observed a "loss of data" problem with the Python SAX parser when processing an oversized but very simple machine-generated XHTML file. The file represents a single N x 11 data table. W3C "tidy" reports no XML errors. The table is constructed in an entirely plausible manner, using table, tr, and td tags to define the table structure and p tags to bracket the content, which consists of small chunks of quoted text. There is nothing pathological: no extraneous whitespace characters and no empty data fields.

Everything works perfectly in small test cases. But when a very large number of rows is present, a few characters of a content string are occasionally lost; I have observed 2 or 6 characters dropped. But here's the strange part: the faulty behavior disappears (or moves to another location) when one or more non-significant whitespace characters are inserted at an arbitrary location early in the file, e.g. an extra linefeed before the first tr tag.

== Context ==

I have observed identical behavior on desktop systems with an Intel Xeon E5-1607 or a Core 2 processor, running 32-bit or 64-bit Linux, variously using Python 3.8.5, 3.8, 3.7.3, and 3.5.1.

== Observing the Problem == 

Sorry that the test data is so bulky (even at 0.5% of the original size), but bulk appears to be a necessary condition for observing the problem. To reproduce, run the following command:

python3 EnchXMLTest.py EnchTestData.html

The test script invokes the SAX parser and prints messages to stdout. With the original test data as provided, the test should run correctly to completion. Now modify the test data file by deleting the extraneous comment line (there is only one) near the top of the file. Repeat the test run, and this time look for missing characters in the parsed content fields of the last record.
 
== Any guesses? ==

Beyond "user is oblivious," possibly something abnormal occurs at the seams between large blocks of buffered text. The presence or absence of an extra character early in the data stream shifts which content falls at the end of a buffer. Another possible clue: is it relevant that the problem appears in a string field that contains slash characters?
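
One reading of the buffer-seam guess that is consistent with documented SAX behavior: the xml.sax documentation notes that a parser may deliver contiguous character data to ContentHandler.characters() in several chunks rather than one. Whether that is what happens with this test data is an assumption, not something confirmed above. A minimal sketch, using a made-up one-cell table and hypothetical names (FieldCollector, parse_in_chunks), that feeds the parser in small pieces and joins the fragments when the p element closes:

```python
import xml.sax


class FieldCollector(xml.sax.ContentHandler):
    """Hypothetical handler: joins character fragments per p element."""

    def __init__(self):
        super().__init__()
        self.fragments = []   # every individual characters() call inside p
        self.fields = []      # one complete string per p element
        self._parts = None    # None while outside a p element

    def startElement(self, name, attrs):
        if name == "p":
            self._parts = []

    def characters(self, content):
        # May be called more than once for a single run of text.
        if self._parts is not None:
            self.fragments.append(content)
            self._parts.append(content)

    def endElement(self, name):
        if name == "p" and self._parts is not None:
            self.fields.append("".join(self._parts))
            self._parts = None


def parse_in_chunks(data, handler, chunk_size):
    """Feed bytes to the incremental parser chunk_size bytes at a time."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for start in range(0, len(data), chunk_size):
        parser.feed(data[start:start + chunk_size])
    parser.close()
    return handler.fields


if __name__ == "__main__":
    # Made-up sample, not the attached test data.
    sample = b'<table><tr><td><p>"alpha/beta/gamma"</p></td></tr></table>'
    collector = FieldCollector()
    print(parse_in_chunks(sample, collector, 7))
    print(collector.fragments)
```

If the fragments list holds more than one entry for a single field, a handler that treats each characters() call as a complete field would appear to lose text, and where the split falls would depend on how the input lines up with the chunk boundaries.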
History
Date                 User          Action  Args
2021-03-13 01:35:52  ridgerat1611  set     recipients: + ridgerat1611
2021-03-13 01:35:52  ridgerat1611  set     messageid: <1615599352.35.0.560655392831.issue43483@roundup.psfhosted.org>
2021-03-13 01:35:52  ridgerat1611  link    issue43483 messages
2021-03-13 01:35:51  ridgerat1611  create