Issue 38011: xml.dom.pulldom splits text data at buffer size when parsing from file

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/82192

classification

Title:	xml.dom.pulldom splits text data at buffer size when parsing from file
Type:	behavior	Stage:
Components:	Documentation, Library (Lib)	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	docs@python, nsturmwind, scoder
Priority:	normal	Keywords:

Created on 2019-09-02 16:25 by nsturmwind, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
test.py	nsturmwind, 2019-09-02 16:25	Reproduction of issue

Messages (4)
msg351021	Author: Noam Sturmwind (nsturmwind)	Date: 2019-09-02 16:25
Python 3.7.4
When parsing a file using xml.dom.pulldom.parse(), if the parser is in the middle of text data when default_bufsize is reached, it splits the text into multiple DOM Text nodes. This breaks code that reads the text data using node.firstChild.data and expects it to hold the element's entire text.
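The behavior described above can be sketched as follows. This is not the attached test.py; it is a minimal illustration that assumes a hypothetical <item> element and passes a deliberately small bufsize so that the chunk boundaries fall inside the text. The helper name item_children is made up for this example.

```python
import io
from xml.dom import pulldom


def item_children(xml_bytes, bufsize):
    """Parse xml_bytes from a stream and return the children of the first <item>."""
    events = pulldom.parse(io.BytesIO(xml_bytes), bufsize=bufsize)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            # Pull the rest of the element's subtree into the DOM node.
            events.expandNode(node)
            return list(node.childNodes)
    return []


payload = "x" * 100
document = ("<root><item>" + payload + "</item></root>").encode("ascii")

# A 16-byte buffer forces several reads while the parser is inside the text.
children = item_children(document, bufsize=16)
print(len(children), "child node(s) under <item>")
print(len(children[0].data), "of", len(payload), "characters in firstChild.data")
```

On the versions listed in this issue, the expectation is that <item> ends up with more than one Text child and that firstChild.data holds only part of the text. The exact number of nodes depends on the underlying parser, so the sketch prints the counts instead of assuming them.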
msg351023	Author: Noam Sturmwind (nsturmwind)	Date: 2019-09-02 16:27
Note that the parser handles it correctly if the buffer boundary lies in the middle of a tag name; the split occurs only when the boundary lies in the middle of text data.
msg351024	Author: Noam Sturmwind (nsturmwind)	Date: 2019-09-02 16:35
I believe this is working as intended, but it is potentially surprising behavior. If so, perhaps a note could be added to the xml.dom documentation mentioning that this needs to be accounted for.
Per https://stackoverflow.com/a/317494, a correct way to read the text is:
''.join(t.nodeValue for t in node.childNodes if t.nodeType == t.TEXT_NODE)
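As a small self-contained sketch of that workaround, the example below builds an element with two adjacent Text nodes by hand using xml.dom.minidom (standing in for what a split parse produces) and compares node.firstChild.data with the joined text. The helper name get_text is an assumption for this example. Node.normalize(), which merges adjacent Text nodes in place, is shown as an alternative.

```python
from xml.dom import minidom


def get_text(node):
    """Concatenate the data of all direct Text children of node."""
    return "".join(
        child.nodeValue
        for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


# Build <item> with two adjacent Text nodes, as a split parse would leave it.
doc = minidom.Document()
item = doc.createElement("item")
doc.appendChild(item)
item.appendChild(doc.createTextNode("hello "))
item.appendChild(doc.createTextNode("world"))

first_only = item.firstChild.data  # "hello " (only the first fragment)
joined = get_text(item)            # "hello world"

# Alternative: merge adjacent Text nodes in place, then firstChild.data is safe.
item.normalize()
print(repr(first_only), repr(joined), repr(item.firstChild.data))
```

Joining leaves the tree untouched, while normalize() modifies it; which one fits better depends on whether the caller still needs the original nodes.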
msg351026	Author: Stefan Behnel (scoder) *	Date: 2019-09-02 17:39
I don't see anything inherently wrong with having multiple text nodes. In fact, input with very large text content can be considered a security threat (cf. compression bombs), so a tool like pulldom (which is designed for incremental processing) should not start collecting more content than the user asked for. Getting multiple text nodes in some cases seems an ok-ish price to pay. A documentation PR is welcome.
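One way to act on this point is to consume the CHARACTERS events directly and stop once a size limit is exceeded, instead of expanding the whole node. The following is only a sketch under assumptions: the function name read_text_capped and the max_chars parameter are hypothetical, and text inside nested child elements is counted as well.

```python
import io
from xml.dom import pulldom


def read_text_capped(stream, tag, max_chars, bufsize=None):
    """Return the text of the first <tag> element, refusing more than max_chars."""
    parts = []
    size = 0
    inside = False
    for event, node in pulldom.parse(stream, bufsize=bufsize):
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            inside = True
        elif event == pulldom.END_ELEMENT and node.tagName == tag:
            return "".join(parts)
        elif inside and event == pulldom.CHARACTERS:
            # Each event may carry only a fragment of the element's text.
            size += len(node.data)
            if size > max_chars:
                raise ValueError("text content exceeds %d characters" % max_chars)
            parts.append(node.data)
    return None  # element not found


source = b"<root><item>abcdef</item></root>"
text = read_text_capped(io.BytesIO(source), "item", max_chars=10, bufsize=4)
print(text)
```

Because the limit is checked per fragment, the function never holds much more than max_chars characters, no matter how large the element's text is.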

History
Date	User	Action	Args
2022-04-11 14:59:19	admin	set	github: 82192
2019-09-02 17:39:20	scoder	set	versions: + Python 3.8, Python 3.9 nosy: + docs@python messages: + msg351026 assignee: docs@python components: + Documentation
2019-09-02 16:43:57	xtreak	set	nosy: + scoder
2019-09-02 16:35:31	nsturmwind	set	messages: + msg351024
2019-09-02 16:27:21	nsturmwind	set	messages: + msg351023
2019-09-02 16:25:02	nsturmwind	create