classification
Title: htmlparser unclosed script tag causes data loss
Type: behavior
Stage: patch review
Components: Library (Lib)
Versions: Python 3.10
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: ezio.melotti, terry.reedy, waylan
Priority: normal
Keywords: patch

Created on 2020-10-10 01:08 by waylan, last changed 2020-10-16 20:42 by terry.reedy.

Files
File name Uploaded Description
test_html.py waylan, 2020-10-10 01:08 A simple test
Pull Requests
URL Status Linked
PR 22658 open waylan, 2020-10-12 01:15
Messages (2)
msg378359 - Author: Waylan Limberg (waylan) Date: 2020-10-10 01:08
When the `close` method of `HTMLParser` is called, any buffered text data is generally flushed and passed to `handle_data`, except when the parser is in CDATA mode (that is, when `self.cdata_elem` is set). Specifically, if an unclosed `script` or `style` tag has been encountered, a call to `close` does not flush the data.

A simple test which demonstrates the issue is attached.
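
The attached file is not reproduced here. A minimal sketch of such a reproduction might look like the following; `CollectingParser` and `collect` are hypothetical helper names, not part of the attachment or of the standard library. The `div` case should return the text on any version. On interpreters that exhibit the behavior reported here, the `script` case is expected to come back empty, so the result is printed rather than asserted.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect(markup):
    """Feed markup, close the parser, and return all text data seen."""
    parser = CollectingParser()
    parser.feed(markup)
    parser.close()  # close() is where any remaining buffered text should be flushed
    return "".join(parser.chunks)


# Unclosed ordinary tag: the text is delivered to handle_data.
in_div = collect("<div>some text")

# Unclosed script tag: the result depends on whether the interpreter
# has the behavior described in this report.
in_script = collect("<script>some text")

print("div:   ", repr(in_div))
print("script:", repr(in_script))
```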

I see that in Lib/html/parser.py#L244-L249 there are two nested `if` statements which both check for `not self.cdata_elem`. If execution gets past the outer check, the inner check is always true, so the nested condition is redundant. This block of code needs a branch for the case where `self.cdata_elem` is set.
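
To make the control flow easier to discuss, here is a simplified stand-in for that block. It is modeled on the description in this report, not copied from Lib/html/parser.py, and it is not the patch in the linked pull request. The function names, the parameters, and the `end and i < n` guard are assumptions made for illustration. The second function shows one hypothetical way to add the missing branch: flush the remaining text, but skip unescaping when `cdata_elem` is set.

```python
from html import unescape


def flush_as_described(rawdata, i, end, cdata_elem, convert_charrefs):
    """Stand-in for the reported structure: nothing is flushed in CDATA mode."""
    out = []
    n = len(rawdata)
    if end and i < n and not cdata_elem:
        # The inner `not cdata_elem` can never be false here, because the
        # outer condition already required it.
        if convert_charrefs and not cdata_elem:
            out.append(unescape(rawdata[i:n]))
        else:
            out.append(rawdata[i:n])
    return out


def flush_with_cdata_branch(rawdata, i, end, cdata_elem, convert_charrefs):
    """Hypothetical variant: also flush in CDATA mode, leaving the text raw."""
    out = []
    n = len(rawdata)
    if end and i < n:
        if convert_charrefs and not cdata_elem:
            out.append(unescape(rawdata[i:n]))
        else:
            # Reached for CDATA content (script/style) or when charref
            # conversion is disabled: pass the text through unchanged.
            out.append(rawdata[i:n])
    return out


print(flush_as_described("a &amp; b", 0, True, "script", True))
print(flush_with_cdata_branch("a &amp; b", 0, True, "script", True))
```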

I should note that the input is invalid HTML. However, the existing behavior results in data loss. Within any other unclosed tag (anything other than `script` or `style`), the data is still flushed and passed to `handle_data`. I would expect the same behavior here, although the escaping behavior should perhaps match what is applied to data within properly closed tags.
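
For reference, the escaping behavior for properly closed tags can be checked with a short sketch like the one below. `TextCollector` and `text_of` are hypothetical names used only for this illustration. With `convert_charrefs=True`, character references in ordinary text are converted before reaching `handle_data`, while the content of a closed `script` element is passed through unchanged.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers all text passed to handle_data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_of(markup):
    """Parse markup to completion and return the concatenated text data."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Ordinary element: the character reference is converted.
in_p = text_of("<p>a &amp; b</p>")

# Properly closed script element: the content is delivered raw.
in_closed_script = text_of("<script>a &amp; b</script>")

print(repr(in_p))
print(repr(in_closed_script))
```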
msg378748 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2020-10-16 20:42
Waylan, 3.7 and before only get security fixes.

To me, this might be considered an enhancement rather than a bug fix, but I will leave that to Ezio.
History
Date User Action Args
2020-10-16 20:42:44 terry.reedy set nosy: + terry.reedy; messages: + msg378748; versions: - Python 3.5, Python 3.6, Python 3.7, Python 3.8, Python 3.9
2020-10-12 07:27:04 ezio.melotti set assignee: ezio.melotti; nosy: + ezio.melotti
2020-10-12 01:15:04 waylan set keywords: + patch; stage: patch review; pull_requests: + pull_request21635
2020-10-10 01:08:29 waylan create