Message 388638 - Python tracker

Author	ridgerat1611
Recipients	ridgerat1611
Date	2021-03-13.23:49:17
Message-id	<1615679357.96.0.422940065812.issue43483@roundup.psfhosted.org>
In-reply-to

Content
Not a bug, strictly speaking... more like user abuse.

The parsers (expat as well as SAX) must be able to return content text as a sequence of pieces when necessary: for example, when the text is interrupted by grouping or styling tags (like <span> or <i>), or when an extensive text block needs to be subdivided for efficient processing. Users would expect hazards like these and be wary. But how many users would suspect that a quoted string only 8 characters long could be returned in multiple pieces? Or that an entity reference could be split down the middle? Virtually all existing tutorial examples showing content extraction are WRONG, because the ONLY content that can be trusted is content that has been passed through some kind of aggregator object. How many users will know this instinctively?
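
To illustrate the kind of aggregator object meant here, a minimal sketch follows. The names TextAggregator and extract_texts are hypothetical, made up for this example, and the handler is deliberately simplified: it only keeps text that sits directly between a start tag and the next tag, so mixed content (text interrupted by child elements) is not handled.

```python
import xml.sax


class TextAggregator(xml.sax.ContentHandler):
    """Hypothetical helper: join character data pieces before using them."""

    def __init__(self):
        super().__init__()
        self._chunks = []  # pieces received since the most recent tag
        self.texts = []    # (element name, joined text) pairs

    def startElement(self, name, attrs):
        # Simplification: text seen before a child element is discarded.
        self._chunks = []

    def characters(self, content):
        # The parser may call this several times for one run of text,
        # so only collect here and never treat a piece as complete.
        self._chunks.append(content)

    def endElement(self, name):
        text = "".join(self._chunks)
        if text.strip():
            self.texts.append((name, text))
        self._chunks = []


def extract_texts(xml_bytes):
    """Parse a bytes document and return the aggregated leaf texts."""
    handler = TextAggregator()
    xml.sax.parseString(xml_bytes, handler)
    return handler.texts
```

The point of the design is that nothing is consumed inside characters(); the text is only trusted once the end tag shows that no more pieces are coming.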

It would be very useful for the parser systems to provide some kind of support for text aggregation. A guarantee that "small contiguous" text items will not be chopped might also be helpful.
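
For the low-level expat interface, one existing knob that may cover part of this request is the buffer_text attribute of the parser object in xml.parsers.expat, which asks the parser to merge adjacent character data before calling the handler whenever possible. The sketch below is only an illustration under that assumption; collect_text is a hypothetical helper name, and nothing here is a claim about what the SAX layer does.

```python
import xml.parsers.expat


def collect_text(xml_bytes, buffered=True):
    """Return the strings that expat passes to CharacterDataHandler."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()
    # With buffer_text enabled, pyexpat accumulates character data and
    # delivers it in as few handler calls as it can.
    parser.buffer_text = buffered
    parser.CharacterDataHandler = pieces.append
    parser.Parse(xml_bytes, True)
    return pieces
```

Calling collect_text(b"<t>A &amp; B</t>") with the default buffered=True should deliver the text in one piece, while buffered=False leaves the splitting up to the parser; joining the pieces gives the same text either way.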
