HTMLParser attribute parsing - 2 test cases when it fails #50441

Closed
momat mannequin opened this issue Jun 4, 2009 · 12 comments
Labels
stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

@momat
Mannequin

momat mannequin commented Jun 4, 2009

BPO 6191
Nosy @birkenfeld, @ezio-melotti, @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2011-04-14.06:07:44.403>
created_at = <Date 2009-06-04.07:46:47.751>
labels = ['type-bug', 'library']
title = 'HTMLParser attribute parsing - 2 test cases when it fails'
updated_at = <Date 2011-05-14.06:36:58.641>
user = 'https://bugs.python.org/momat'

bugs.python.org fields:

activity = <Date 2011-05-14.06:36:58.641>
actor = 'ezio.melotti'
assignee = 'none'
closed = True
closed_date = <Date 2011-04-14.06:07:44.403>
closer = 'ezio.melotti'
components = ['Library (Lib)']
creation = <Date 2009-06-04.07:46:47.751>
creator = 'momat'
dependencies = []
files = []
hgrepos = []
issue_num = 6191
keywords = []
message_count = 12.0
messages = ['88867', '88899', '88903', '88906', '88910', '88913', '89018', '133715', '133731', '133732', '134229', '135959']
nosy_count = 4.0
nosy_names = ['georg.brandl', 'ezio.melotti', 'r.david.murray', 'momat']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue6191'
versions = ['Python 3.2', 'Python 3.3']

@momat
Mannequin Author

momat mannequin commented Jun 4, 2009

Of course neither is correct HTML, but both are easy to guess, so I believe
the parser should not give up too quickly here.

  1. extra comma between attributes
    <form action="/xxx.php?a=1&b=2&amp", method="post">

  2. missing closing quotation mark for the first attribute
    <a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>
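To check how a given interpreter treats these two inputs, a small reproduction along the following lines can help. This is only a sketch: it assumes Python 3, where the module is `html.parser` rather than the Python 2 `HTMLParser` module current when this report was filed, and `StartTagLogger` and `try_parse` are hypothetical helper names. Whether an exception is raised, and which attributes are reported, depends on the Python version.

```python
from html.parser import HTMLParser


class StartTagLogger(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def try_parse(markup):
    """Return (recorded start tags, exception or None) for one input."""
    parser = StartTagLogger()
    try:
        parser.feed(markup)
        parser.close()
    except Exception as exc:  # older releases could raise a parse error here
        return parser.seen, exc
    return parser.seen, None


# The two inputs from this report, copied verbatim.
CASES = [
    '<form action="/xxx.php?a=1&b=2&amp", method="post">',
    '<a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>',
]

results = [try_parse(case) for case in CASES]
for case, (seen, error) in zip(CASES, results):
    print(case)
    print("  start tags:", seen)
    print("  error:", error)
```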

@momat momat mannequin added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels Jun 4, 2009
@birkenfeld
Member

I do not think HTMLParser should guess. Guessing always opens the door
to misinterpretation.

@momat
Mannequin Author

momat mannequin commented Jun 4, 2009

It depends on whether you want HTMLParser to be a useful tool that can
deal with real-world HTML or just a toy without practical meaning.
Crashing on every little deviation from the standard, where a more relaxed
approach is possible, doesn't sound like a reasonable choice to me.

Maybe "guess" is not the proper word... If the standard strict approach
fails, the parser should fall back to a less strict one in an attempt to
actually parse the document. Throwing an exception and giving up is just
not good enough.

Could somebody else comment on this one, please?

@momat momat mannequin reopened this Jun 4, 2009
@birkenfeld
Member

> Throwing an exception and giving up is just not good enough.

Yes it is, in some cases. There are "forgiving" HTML parsers out there,
HTMLParser does not strive to be one.

There are *so many* cases where HTML is a bit malformed that it takes
more than just two exceptions to get it right. It's for a reason that
browsers' parsers are so complex. If you add these corner cases, people
will come asking for this exception, and that one, etc.

@bitdancer
Member

In doing web scraping I started using BeautifulSoup precisely because it
was very lenient in what HTML it accepted (I haven't written such an app
for a while, so I'm not sure what BeautifulSoup currently does...I
thought I heard it was now using HTMLParser...).

There are a lot of messed up web pages out there.

I don't have time right now to evaluate your particular cases, but my
rule of thumb would be that if the major web browsers do something
"reasonable" with these cases, then a Python tool designed to read web
pages should do so as well, where possible. ("Be liberal in what you
accept, and strict in what you generate.")

That said, I'm not sure what HTMLParser's design goals are, so this may
not be an appropriate goal for the module.

@birkenfeld
Member

So BeautifulSoup is using HTMLParser? That is interesting, because they
claim to support "broken" HTML.

In any case, if a "quirky" mode is added, it should have to be turned on
explicitly by a flag.

@ezio-melotti
Member

BeautifulSoup uses SGMLParser for all versions < 3.1. BeautifulSoup
3.1 is supposed to be compatible with Python 3, and since SGMLParser is
gone it's now using HTMLParser, but it's not able to handle some things
anymore.

For more information:
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

(FWIW I tried BeautifulSoup 3.1 but it failed where BeautifulSoup 3.0.7
was working, so I went back to 3.0.7)

@ezio-melotti
Member

The first case has been fixed already in 1cbfeffea19f, the second case is not even handled by browsers, so I'm closing this.

@momat
Mannequin Author

momat mannequin commented Apr 14, 2011

Great! With one "but"... the second case *is* handled by browsers. Browsers do not throw an exception on it as HTMLParser does. So improvement is definitely possible here. Whether it is worth the effort is not for me to judge.

@ezio-melotti
Member

So you are suggesting that
<a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>
should result in an 'a' element with an href attribute equal to "http://xxx.org/xxx.php?a=1 target=" and then discard _blank" as extra data?
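The interpretation described above can be compared with what a particular Python 3 interpreter reports. The sketch below assumes the Python 3 `html.parser` module; `AttrCollector` is a hypothetical helper name, and the exact attribute list it prints may differ between Python versions.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: collects attribute pairs per start tag name."""

    def __init__(self):
        super().__init__()
        self.attrs_by_tag = {}

    def handle_starttag(self, tag, attrs):
        self.attrs_by_tag.setdefault(tag, []).extend(attrs)


collector = AttrCollector()
collector.feed('<a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>')
collector.close()

a_attrs = collector.attrs_by_tag.get("a", [])
href = dict(a_attrs).get("href")

# Does the href value run on to the next double quote, and is there a
# separate attribute literally named "target"?
print("attributes:", a_attrs)
print("href:", repr(href))
print("separate target attribute:", "target" in dict(a_attrs))
```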

@momat
Mannequin Author

momat mannequin commented Apr 21, 2011

No. As the value of the href attribute is not supposed to contain spaces, I'd rather expect the parser to assume that there is an ending " missing before the space.

@ezio-melotti
Member

What I described in my previous message is what Firefox does. If you think this should be changed, I suggest you open another issue, possibly attaching a test case with the desired behavior and a patch to change it.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022