HTMLParser attribute parsing - 2 test cases when it fails #50441
Comments
Of course both are not correct HTML but are easy to guess, so I believe
I do not think HTMLParser should guess. Guessing always opens the door
It depends whether you want HTMLParser to be a useful tool that can Maybe guess is not a proper word... If the standard strict approach Can we have somebody else commenting on this one please?
Yes it is, in some cases. There are "forgiving" HTML parsers out there, There are *so many* cases where HTML is a bit malformed that it takes
In doing web scraping I started using BeautifulSoup precisely because it There are a lot of messed up web pages out there. I don't have time right now to evaluate your particular cases, but my That said, I'm not sure what HTMLParser's design goals are, so this may
So BeautifulSoup is using HTMLParser? That is interesting, because they In any case, if a "quirky" mode is added, it should have to be turned on
BeautifulSoup uses SGMLParser for all versions <3.1. BeautifulSoup For more information: (FWIW I tried BeautifulSoup 3.1 but it failed where BeautifulSoup 3.0.7
The first case has been fixed already in 1cbfeffea19f; the second case is not even handled by browsers, so I'm closing this.
Great! With one "but"... the second case *is* handled by browsers. Browsers do not throw an exception on it as HTMLParser does, so improvement is definitely possible here. Whether it is worth the effort is not for me to judge.
So you are suggesting that
No. As the value of the href attribute is not supposed to contain spaces, I'd rather expect the parser to assume that there is an ending " missing before the space.
What I described in my previous message is what Firefox does. If you think this should be changed, I suggest you open another issue, possibly attaching a test case with the desired behavior and a patch to change it.
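
For anyone following up on that suggestion, a test case could start from something like the sketch below. It assumes Python 3's `html.parser` module; the `AttrCollector` class, the `collect_start_tags` helper, and the sample markup are hypothetical illustrations, not the test cases originally attached to this issue. It only records what the parser reports for each start tag, so the reported attributes can be compared with what a browser does.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as reported by the parser.
        self.start_tags.append((tag, attrs))


def collect_start_tags(markup):
    """Feed markup to a fresh parser and return the start tags it saw."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


if __name__ == "__main__":
    # Hypothetical malformed input: no whitespace between two attributes.
    for tag, attrs in collect_start_tags('<a href="x"class="y">text</a>'):
        print(tag, attrs)
```

Printing the collected pairs for a malformed snippet, then asserting on the desired result, would give a follow-up issue a concrete statement of the expected behavior.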