Issue 43561: Modify XML parsing library descriptions to forewarn of content loss hazard


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87727

classification

Title:	Modify XML parsing library descriptions to forewarn of content loss hazard
Type:		Stage:
Components:	Documentation	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	docs@python, ridgerat1611
Priority:	normal	Keywords:

Created on 2021-03-19 18:52 by ridgerat1611, last changed 2022-04-11 14:59 by admin.

Messages (1)

msg389111 - (view)

Author: Larry Trammell (ridgerat1611) *

Date: 2021-03-19 18:52

With reference to improvement issue 43560:

If those improvements remain unimplemented, or are demoted to "don't fix", users are left in the tricky situation where XML parsing applications can fail, apparently "losing content" in a rare and unpredictable manner.  It would be useful to patch the documentation to give users fair warning of this hazard. 

For example: the "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (and many prior versions) currently states:

-----------
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data.  SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks...
-----------
 
The modified documentation would read something like the following:

-----------
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data.  SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks... To avoid a situation in which one small content fragment unexpectedly overwrites another one, it is essential for the characters() method to collect content by appending, rather than by assignment.
-----------

To give a concrete example, suppose that a Python programming site recommends the following code to preserve a small text chunk bracketed by "<p>" tags:

   # Note the name of the current tag group
   def element_handler(self, tagname, attrs):
       self.CurrentTag = tagname

   # Record the content from each "p" tag when encountered
   def characters(self, content):
       if self.CurrentTag == "p":
           self.name = content

Even though that code could be expected to work most of the time, it is exposed to a hazard: if the parser splits the text of one element across several calls to characters(), each call overwrites what the previous call stored, and only the last chunk survives.
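The overwrite can be reproduced without a real parser by calling the handler methods by hand. This is a minimal sketch, not code from the documentation: the class name OverwritingHandler is hypothetical, the callback is named startElement (the xml.sax.handler.ContentHandler method) in place of element_handler above, and the two-chunk split is simulated rather than produced by any particular parser.

```python
import xml.sax


class OverwritingHandler(xml.sax.ContentHandler):
    """Hypothetical handler that stores character data by assignment."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.name = None

    def startElement(self, tagname, attrs):
        # Remember which element we are currently inside.
        self.current_tag = tagname

    def characters(self, content):
        if self.current_tag == "p":
            # Assignment: any chunk stored by an earlier call is discarded.
            self.name = content


handler = OverwritingHandler()
handler.startElement("p", {})
# Simulate a parser that reports "Hello, world" as two separate chunks.
handler.characters("Hello, ")
handler.characters("world")
print(handler.name)  # prints "world"; the first chunk "Hello, " is lost
```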

Instead, the code should look something like this:

   # Note the name of the current tag group and start with empty content
   def element_handler(self, tagname, attrs):
       self.CurrentTag = tagname
       self.name = ""

   # Accumulate the content from each "p" tag when encountered
   # (str has no append() method, so concatenate with +=)
   def characters(self, content):
       if self.CurrentTag == "p":
           self.name += content
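For a version that can be run end to end, the following is one possible sketch of the accumulate-then-join approach, under some assumptions of mine: the class name ParagraphCollector, the paragraphs and _chunks attributes, and the sample document are all hypothetical. It uses the ContentHandler callbacks startElement and endElement, collects chunks in a list, and joins them only when the closing tag arrives, so the result does not depend on how the parser happens to split the text.

```python
import xml.sax


class ParagraphCollector(xml.sax.ContentHandler):
    """Hypothetical handler that accumulates the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self._chunks = None    # list of chunks while inside <p>, else None
        self.paragraphs = []   # completed paragraph texts

    def startElement(self, tagname, attrs):
        if tagname == "p":
            self._chunks = []  # start a fresh buffer for this paragraph

    def characters(self, content):
        if self._chunks is not None:
            # Append rather than assign: earlier chunks are preserved.
            self._chunks.append(content)

    def endElement(self, tagname):
        if tagname == "p" and self._chunks is not None:
            # The element is complete, so it is now safe to join the chunks.
            self.paragraphs.append("".join(self._chunks))
            self._chunks = None


handler = ParagraphCollector()
# The entity reference gives the parser a natural place to split the text.
xml.sax.parseString(
    b"<doc><p>Hello, &amp; welcome</p>\n<p>second</p></doc>", handler
)
print(handler.paragraphs)  # ['Hello, & welcome', 'second']
```

Closing the buffer in endElement also keeps whitespace that follows "</p>" from being added to the paragraph, which the shorter snippet above does not guard against.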

History
Date	User	Action	Args
2022-04-11 14:59:43	admin	set	github: 87727
2021-03-19 18:52:35	ridgerat1611	create