HTMLParser: parsing error #58743

MichelLeunen · 2012-04-09T20:01:56Z

BPO	14538
Nosy	@birkenfeld, @ezio-melotti, @merwok, @bitdancer, @JimJJewett, @serhiy-storchaka
Files	issue14538.diff: Patch against 2.7 issue14538-2.diff: Patch against 2.7

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/ezio-melotti'
closed_at = <Date 2012-04-19.01:45:30.476>
created_at = <Date 2012-04-09.20:01:55.571>
labels = ['type-bug', 'library']
title = 'HTMLParser: parsing error'
updated_at = <Date 2012-04-19.11:55:29.621>
user = 'https://bugs.python.org/MichelLeunen'

bugs.python.org fields:

activity = <Date 2012-04-19.11:55:29.621>
actor = 'Michel.Leunen'
assignee = 'ezio.melotti'
closed = True
closed_date = <Date 2012-04-19.01:45:30.476>
closer = 'ezio.melotti'
components = ['Library (Lib)']
creation = <Date 2012-04-09.20:01:55.571>
creator = 'Michel.Leunen'
dependencies = []
files = ['25188', '25200']
hgrepos = []
issue_num = 14538
keywords = ['patch']
message_count = 18.0
messages = ['157890', '157899', '157906', '158109', '158111', '158130', '158132', '158140', '158143', '158168', '158185', '158195', '158198', '158215', '158220', '158695', '158696', '158718']
nosy_count = 8.0
nosy_names = ['georg.brandl', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'python-dev', 'Jim.Jewett', 'serhiy.storchaka', 'Michel.Leunen']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue14538'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

MichelLeunen · 2012-04-09T20:01:55Z

HTMLParser fails to parse this structure of tags:

'<a></a><script></script><meta><meta / ><body></body>'

Parsing stops after the first meta tag ignoring the remainers

from HTMLParser import HTMLParser
parser = process_html()
parser.feed('<a></a><script></script><meta><meta / ><body></body>')

Python 2.7.2+ Ubuntu 11.10

jimjjewett · 2012-04-09T20:50:56Z

What do you think it should do?

My thought is that meta tags may or may not be void, but certainly should not be nested. As XML, I would parse that as

<a></a>
<script></script>
<meta>
<meta / >
<body></body>
*missing closing tag

But for html, there is more cleanup. The catch is that this module probably can't keep up with BeautifulSoup or HTML5lib ... is this particular tag situation common enough to be even have a defined handling, let alone one that is worth adding to a "simple" module?

ezio-melotti · 2012-04-09T21:39:38Z

With Python 2.7.3rc2 and 3.3.0a0 (strict=False) I get:

Start tag: a
End tag : a
Start tag: script
End tag : script
Start tag: meta
Data : <meta / >
Start tag: body
End tag : body

This is better, but still not 100% correct, the "<meta / >" shouldn't be seen as data.

birkenfeld · 2012-04-12T07:22:31Z

ISTM that "<meta / >" is neither valid HTML nor valid XHTML.

ezio-melotti · 2012-04-12T07:48:32Z

Here's a patch.

jimjjewett · 2012-04-12T14:03:57Z

-1 on that particular patch.

(with only whitespace between "/" and ">") strikes me as obviously intending to close the tag, and a reasonably common error.

I can't think of any reason to support nested meta tags while not supporting sloppy self-closing tags.

jimjjewett · 2012-04-12T14:09:12Z

This issue is also marked for (bugfix-only) 2.7 and 3.2.

Unless there is a specification somewhere (or at least an editor's draft), I can't really see any particular parse as a bugfix. Was the goal just to make the parse finish, as opposed to stopping part way through the text?

serhiy-storchaka · 2012-04-12T14:58:19Z

To be consistent, this patch should remove the references to http://www.w3.org/TR/html5/tokenization.html#tag-open-state and http://www.w3.org/TR/html5/tokenization.html#tag-open-state as irrelevant.

bitdancer · 2012-04-12T15:26:45Z

Yes, after considerable discussion those of working on this stuff decided that the goal should be that the parser be able to complete parsing, without error, anything the typical browsers can parse (which means, pretty much anything, though that says nothing about whether the result of the parse is useful in any way). In other words, we've been treating it as a bug when the parser throws an error, since one generally uses the library to parse web pages from the internet and having the parse fail leaves you SOL for doing anything useful with the bad pages one gets therefrom. (Note that if the parser was doing strict adherence to the older RFCs our decision would have been different...but it is not. It has always accepted *some* badly formed documents, and rejected others.)

Also note that BeautifulSoup in Python2 used the sgml parser, which didn't throw errors, but that is gone in Python3. In Python3 BeautifulSoup uses the html parser...which is what started us down this road to begin with.

serhiy-storchaka · 2012-04-12T21:03:53Z

I apologize for misplaced sarcasm. After more careful reading of the source code, I found out that the patch really meets the specifications and the behavior of all tested me browsers. Despite its awkward appearance, the patch fixes a flaw of the original code for this corner case. But it allows the passage of an invalid code in strict mode. I think, this problem can be solved by a more direct way, perhaps even simplifying the original code.

ezio-melotti · 2012-04-12T23:26:37Z

Attached a new patch with a few more tests and a simplification of the attrfind regex.

But it allows the passage of an invalid code in strict mode.

HTMLParser is following the HTML5 specs, and doesn't do validation, so there's no strict mode (and the strict mode of Python 3 will be removed/deprecated soon).

I think, this problem can be solved by a more direct way,
perhaps even simplifying the original code.

This might be true, but I'm quite happy with this patch for now. Maybe one day I'll rewrite it.

Unless there is a specification somewhere (or at least an editor's
draft), I can't really see any particular parse as a bugfix.

If HTMLParser doesn't parse as the HTML5 specs say, then it's considered a bug.

jimjjewett · 2012-04-13T02:55:34Z

On Thu, Apr 12, 2012 at 7:26 PM, Ezio Melotti wrote:

If HTMLParser doesn't parse as the HTML5 specs say,
then it's considered a bug.

I had thought that was the goal of the html5lib, and that HTMLParser
was explicitly aiming at a much reduced model in order to remain
simple. (And also because the HTML5 spec is still changing fairly
often, particularly compared to the lifecycle of a feature release of
python.)

ezio-melotti · 2012-04-13T06:20:33Z

HTMLParser is still simpler than html5lib, but if/when possible we are following the HTML5 standard rather than taking arbitrary decisions (like we used to do before HTML5). HTMLParser doesn't claim to be a fully compliant HTML5 parser (and probably never will) -- its goal is rather being able to parse real world HTML in the same way browsers do. There are also a couple of corner cases that are not implemented in HTMLParser because it would make the code too complicated for a little gain.

jimjjewett · 2012-04-13T17:34:02Z

It sounds like this is a case where the docs should mention an external library; perhaps something like changing the intro of http://docs.python.org/dev/library/html.parser.html from:

"""
19.2. html.parser — Simple HTML and XHTML parser
Source code: Lib/html/parser.py

This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
"""

to:

"""
19.2. html.parser — Simple HTML and XHTML parser
Source code: Lib/html/parser.py

This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

Note that mainstream web browsers also attempt to repair invalid markup; the algorithms for this can be quite complex, and are evolving too quickly for the Python release cycle. Applications handling arbitrary web pages should consider using 3rd-party modules. The python version of html5lib ( http://code.google.com/p/html5lib/ ) is being developed in parallel with the HTML standard itself, and serves as a reference implementation.
"""

merwok · 2012-04-13T17:50:31Z

Sure, the docs should explain better that html.parser tries its best to parse stuff, is not a validating parser, and is actively developed, contrary to the popular belief that standard library modules never get improved. I’m less sure about links, there is more than one popular library (html5lib, lxml, BeautifulSoup). Another bug needs to be opened, we’re off-topic for this one — but thanks for the remark and proposed patch.

python-dev · 2012-04-19T01:36:31Z

New changeset 36c901fcfcda by Ezio Melotti in branch '2.7':
bpo-14538: HTMLParser can now parse correctly start tags that contain a bare /.
http://hg.python.org/cpython/rev/36c901fcfcda

New changeset ba4baaddac8d by Ezio Melotti in branch '3.2':
bpo-14538: HTMLParser can now parse correctly start tags that contain a bare /.
http://hg.python.org/cpython/rev/ba4baaddac8d

New changeset 0f837071fd97 by Ezio Melotti in branch 'default':
bpo-14538: merge with 3.2.
http://hg.python.org/cpython/rev/0f837071fd97

ezio-melotti · 2012-04-19T01:45:30Z

This is now fixed, thanks for the report!

Regarding the documentation feel free to open another issue, but at some point I'll probably update it anyway and/or write down somewhere what the future plans for HTMLParser and its goals.

MichelLeunen · 2012-04-19T11:55:30Z

Thanks guys for your comments and for solving this issue. Great work!

MichelLeunen mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Apr 9, 2012

MichelLeunen mannequin changed the title ~~HTMLParser~~ HTMLParser: parsing error Apr 9, 2012

ezio-melotti self-assigned this Apr 9, 2012

ezio-melotti closed this as completed Apr 19, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTMLParser: parsing error #58743

HTMLParser: parsing error #58743

MichelLeunen mannequin commented Apr 9, 2012

MichelLeunen mannequin commented Apr 9, 2012

jimjjewett mannequin commented Apr 9, 2012

ezio-melotti commented Apr 9, 2012

birkenfeld commented Apr 12, 2012

ezio-melotti commented Apr 12, 2012

jimjjewett mannequin commented Apr 12, 2012

jimjjewett mannequin commented Apr 12, 2012

serhiy-storchaka commented Apr 12, 2012

bitdancer commented Apr 12, 2012

serhiy-storchaka commented Apr 12, 2012

ezio-melotti commented Apr 12, 2012

jimjjewett mannequin commented Apr 13, 2012

ezio-melotti commented Apr 13, 2012

jimjjewett mannequin commented Apr 13, 2012

merwok commented Apr 13, 2012

python-dev mannequin commented Apr 19, 2012

ezio-melotti commented Apr 19, 2012

MichelLeunen mannequin commented Apr 19, 2012

Navigation Menu

HTMLParser: parsing error #58743

HTMLParser: parsing error #58743

Comments

MichelLeunen mannequin commented Apr 9, 2012

MichelLeunen mannequin commented Apr 9, 2012

jimjjewett mannequin commented Apr 9, 2012

ezio-melotti commented Apr 9, 2012

birkenfeld commented Apr 12, 2012

ezio-melotti commented Apr 12, 2012

jimjjewett mannequin commented Apr 12, 2012

jimjjewett mannequin commented Apr 12, 2012

serhiy-storchaka commented Apr 12, 2012

bitdancer commented Apr 12, 2012

serhiy-storchaka commented Apr 12, 2012

ezio-melotti commented Apr 12, 2012

jimjjewett mannequin commented Apr 13, 2012

ezio-melotti commented Apr 13, 2012

jimjjewett mannequin commented Apr 13, 2012

merwok commented Apr 13, 2012

python-dev mannequin commented Apr 19, 2012

ezio-melotti commented Apr 19, 2012

MichelLeunen mannequin commented Apr 19, 2012