
Author ridgerat1611
Recipients ridgerat1611
Date 2021-03-13.23:49:17
Message-id <1615679357.96.0.422940065812.issue43483@roundup.psfhosted.org>
In-reply-to
Content
Not a bug, strictly speaking... more like user abuse.

The parsers (expat as well as SAX) must be able to return content text as a sequence of pieces when necessary. For example, text may arrive as a sequence interrupted by grouping or styling tags (like <span> or <i>), or an extensive text block might need to be subdivided for efficient processing. Users would expect hazards like these and be wary. But how many users would suspect that a quoted string only 8 characters long could be returned in multiple pieces? Or that an entity notation could be split down the middle? Virtually all existing tutorial examples showing content extraction are WRONG, because the only content that can be trusted is content that has been passed through some kind of aggregator object. How many users will know this instinctively?
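To make the aggregator idea concrete, here is a minimal sketch of what such an object might look like on the SAX side. The names AggregatingHandler and extract_texts are made up for illustration and are not part of xml.sax. The sketch assumes it is enough to join the pieces that belong directly to each element and report them once the element closes.

```python
import xml.sax


class AggregatingHandler(xml.sax.ContentHandler):
    """Hypothetical aggregator: joins the text pieces of each element."""

    def __init__(self):
        super().__init__()
        self._stack = []   # one list of text pieces per currently open element
        self.texts = []    # (element name, complete text), in closing order

    def startElement(self, name, attrs):
        # Start a fresh piece list for the element that just opened.
        self._stack.append([])

    def characters(self, content):
        # The parser may call this several times for one logical run of text,
        # so only store the piece here; never treat it as complete.
        if self._stack:
            self._stack[-1].append(content)

    def endElement(self, name):
        # Only now is the element's direct text known to be complete.
        pieces = self._stack.pop()
        self.texts.append((name, "".join(pieces)))


def extract_texts(xml_bytes):
    """Parse xml_bytes and return the aggregated text of every element."""
    handler = AggregatingHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.texts
```

With this sketch, extract_texts(b"<doc><q>AT&amp;T rocks</q></doc>") returns [("q", "AT&T rocks"), ("doc", "")] no matter how many characters() calls the parser made for the quoted string. Text on either side of a child tag is joined into the parent's entry, which may or may not be what a given application wants.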

It would be very useful for the parser systems to provide some kind of support for text aggregation. A guarantee that "small contiguous" text items will not be chopped might also be helpful.
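On the expat side, the xmlparser object in xml.parsers.expat has a buffer_text attribute that may cover part of this request. When set to true, it asks the parser to buffer character data and avoid multiple CharacterDataHandler calls whenever possible. The sketch below compares the two settings. The helper name collect_chunks is made up for illustration, and the exact split points without buffering are an implementation detail, so the sketch does not assume them.

```python
import xml.parsers.expat


def collect_chunks(xml_bytes, buffer_text):
    """Return every string that expat passes to CharacterDataHandler."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # With buffer_text enabled, pending text is delivered in one call
    # where possible instead of piece by piece.
    parser.buffer_text = buffer_text
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk of input
    return chunks


data = b"<q>AT&amp;T rocks</q>"
unbuffered = collect_chunks(data, buffer_text=False)  # possibly several pieces
buffered = collect_chunks(data, buffer_text=True)     # pieces merged by the parser
```

Joining the unbuffered pieces always gives the same text as the buffered result. Buffering is still limited by the parser's buffer_size, so very long text can arrive in more than one call, and this is not a hard guarantee of the kind asked for above.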
History
Date User Action Args
2021-03-13 23:49:17 ridgerat1611 set recipients: + ridgerat1611
2021-03-13 23:49:17 ridgerat1611 set messageid: <1615679357.96.0.422940065812.issue43483@roundup.psfhosted.org>
2021-03-13 23:49:17 ridgerat1611 link issue43483 messages
2021-03-13 23:49:17 ridgerat1611 create