Message 97244 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author	terry.reedy
Recipients	georg.brandl, terry.reedy
Date	2010-01-05.02:43:51
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<1262659434.4.0.874382323645.issue7637@psf.upfronthosting.co.za>
In-reply-to

Content

1. "When you are finished with a DOM, you should clean it up. This is necessary because some versions of Python do not support garbage collection of objects that refer to each other in a cycle. Until this restriction is removed from all versions of Python, it is safest to write your code as if cycles would not be cleaned up."

This appears to refer to early 2.x CPython versions without the gc module. Such (cryptic) back references are not appropriate for 3.x docs. Even in 3.x, immediate unlink might be a good idea, especially for CPython (which would then clean up immediately). But none of these issues are specific to DOM objects. Suggested replacement for the above and for the sentence that currently follows it ("The way to clean up a DOM is to call its unlink() method:"):

"When you are finished with a DOM, you can call the unlink method to encourage early cleanup of unneeded objects:"

Anything more is redundant with the doc for the method.
'''
dom1.unlink()
dom2.unlink()
dom3.unlink()
'''
One example at most is quite sufficient.
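
For illustration, a single-example version might look like the sketch below. The XML literal and the variable names are made up for this note and are not taken from the docs; it only shows that unlink() detaches the document's children so the tree can be reclaimed early.

```python
from xml.dom import minidom

# Hypothetical document, used only for this illustration.
dom = minidom.parseString("<root><item>text</item></root>")

# Use the tree while it is still intact.
root_tag = dom.documentElement.tagName

# Break the parent/child reference cycles so the nodes can be
# reclaimed without waiting for the cyclic garbage collector.
dom.unlink()

# After unlink() the document no longer holds any child nodes.
children_after = len(dom.childNodes)
```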

2. '''Node.toxml([encoding]) 
Return the XML that the DOM represents as a string.

With no argument, the XML header does not specify an encoding, and the result is Unicode string if the default encoding cannot represent all characters in the document. Encoding this string in an encoding other than UTF-8 is likely incorrect, since UTF-8 is the default encoding of XML.

With an explicit encoding [1] argument, the result is a byte string in the specified encoding. It is recommended that this argument is always specified. To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as “utf-8”.
'''
I find this API a bit confusing.

In 3.x, "Return ... a string." means str (unicode), but the rest implies that 'string' should be 'string or bytes'.

"default encoding": what is it? ascii, utf-8 as almost implied, something in sys module (if so, please specify).

A cleaner API would have been 1. always return str (unicode) or 2. always return bytes, with encoding='utf-8' default or 3. return str if no encoding given or bytes if one is given, with no default.
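
To make the str-versus-bytes question concrete, here is a small sketch of what toxml() returns on a recent CPython 3 release, where (as far as I can tell) the behavior matches option 3. The sample document is invented for this note.

```python
from xml.dom import minidom

# Hypothetical document containing a non-ASCII character.
dom = minidom.parseString("<greeting>caf\u00e9</greeting>")

# No encoding argument: the result is a str, and the XML
# declaration does not name an encoding.
as_text = dom.toxml()

# Explicit encoding: the result is bytes in that encoding, and the
# XML declaration names it.
as_bytes = dom.toxml(encoding="utf-8")

dom.unlink()
```

Checking type(as_text) and type(as_bytes) interactively shows the two different return types that the word "string" in the docs has to cover.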

3. The following antipattern example should be revised, and the revision would apply to the 2.x docs as well:
'''
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc
'''
should be (not tested, but pretty straightforward):

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)
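
A quick way to check the join-based version is to run it on a small parsed document. The sketch below repeats the function so that it is self-contained; the XML snippet and the name title_text are made up for this note.

```python
from xml.dom import minidom


def getText(nodelist):
    # Collect the text pieces in a list and join once at the end,
    # instead of repeatedly concatenating strings.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)


# Hypothetical input: two text nodes separated by an element.
dom = minidom.parseString("<title>Hello <b>big</b>world</title>")
title = dom.getElementsByTagName("title")[0]

# Only the direct text children are collected; the text inside <b>
# is skipped because <b> is an element node, not a text node.
title_text = getText(title.childNodes)

dom.unlink()
```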

History
Date	User	Action	Args
2010-01-05 02:43:54	terry.reedy	set	recipients: + terry.reedy, georg.brandl
2010-01-05 02:43:54	terry.reedy	set	messageid: <1262659434.4.0.874382323645.issue7637@psf.upfronthosting.co.za>
2010-01-05 02:43:52	terry.reedy	link	issue7637 messages
2010-01-05 02:43:51	terry.reedy	create