xml.dom.pulldom splits text data at buffer size when parsing from file
Python 3.9, Python 3.8, Python 3.7
Noam Sturmwind (nsturmwind), Stefan Behnel (scoder)
Author: Noam Sturmwind (nsturmwind) Date: 2019-09-02 16:25
Python 3.7.4

When parsing a file with xml.dom.pulldom.parse(), if a read of default_bufsize ends in the middle of text data, the parser splits that text into multiple DOM Text nodes.

This breaks code that reads the text data using
Author: Noam Sturmwind (nsturmwind) Date: 2019-09-02 16:27
Note that the parser handles it correctly if the buffer boundary lies in the middle of a tag name; only if it lies in the middle of text data does it result in this behavior.
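
A minimal sketch that should reproduce the behavior described above without needing a large file. It passes a small bufsize to pulldom.parse() so that reads end inside the text data. The sample document, the "item" tag name and the helper name read_item_text_nodes are made up for illustration. The exact number of Text nodes may vary with the Python and expat versions, so no output is shown.

```python
import io
from xml.dom import pulldom

# Hypothetical sample document; the text is longer than the chosen bufsize.
TEXT = "0123456789" * 5
XML = "<root><item>" + TEXT + "</item></root>"

def read_item_text_nodes(xml, bufsize):
    """Return the data of each Text child of the first <item> element."""
    stream = io.StringIO(xml)
    events = pulldom.parse(stream, bufsize=bufsize)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            # Pull the rest of the element's subtree into the DOM node.
            events.expandNode(node)
            return [c.data for c in node.childNodes
                    if c.nodeType == c.TEXT_NODE]
    return []

# A small bufsize makes it likely that a read ends in the middle of the text.
parts = read_item_text_nodes(XML, bufsize=16)
print(len(parts), parts)
```

If more than one entry is printed, the text was split at read boundaries. Code that only looks at node.firstChild.data would then see just the first piece.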
Author: Noam Sturmwind (nsturmwind) Date: 2019-09-02 16:35
I believe this is working as intended, but is potentially surprising behavior. If so, perhaps a note could be added to the xml.dom documentation mentioning that this needs to be accounted for.

Per a correct way to read the text is

''.join(t.nodeValue for t in node.childNodes if t.nodeType == t.TEXT_NODE)
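
A short sketch of how the join above could be used after expandNode(), alongside Node.normalize(), which merges adjacent Text nodes in place. The helper name text_of, the "name" tag and the sample payload are assumptions made for this example only.

```python
import io
from xml.dom import pulldom

def text_of(node):
    """Concatenate the direct Text children of an already expanded element."""
    return "".join(t.nodeValue for t in node.childNodes
                   if t.nodeType == t.TEXT_NODE)

# Hypothetical input; a tiny bufsize forces reads to end inside the text.
payload = "abcdefghij" * 4
stream = io.StringIO("<doc><name>" + payload + "</name></doc>")

events = pulldom.parse(stream, bufsize=8)
joined = normalized = None
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "name":
        events.expandNode(node)
        joined = text_of(node)             # works however the text was split
        node.normalize()                   # merge adjacent Text nodes in place
        normalized = node.firstChild.data  # now a single Text node
        break

print(joined == payload, normalized == payload)
```

Both approaches collect the whole text in memory, so they only fit cases where the caller actually wants the complete string.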
Author: Stefan Behnel (scoder) Date: 2019-09-02 17:39
I don't see anything inherently wrong with having multiple text nodes.

In fact, input with very large text content can be considered a security threat (cf. compression bombs), so a tool like pulldom (which is designed for incremental processing) should not start collecting more content than the user asked for. Getting multiple text nodes in some cases seems an ok-ish price to pay.

A documentation PR is welcome.
