Issue 500073: HTMLParser fail to handle '&foobar'


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/35871

classification

Title:	HTMLParser fail to handle '&foobar'
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 2.3

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	fdrake	Nosy List:	berniey, fdrake, gvanrossum, loewis, nobody, skip.montanaro
Priority:	high	Keywords:

Created on 2002-01-06 08:06 by berniey, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Files
File name	Uploaded	Description
sgmllib.py	berniey, 2002-01-08 09:00	Suggested Changes
test.html	berniey, 2002-01-09 00:44	Testing HTML
test.py	berniey, 2002-01-09 00:45	Testing script
htmllib.py	berniey, 2002-01-09 06:35	Suggested Changes
test.html	berniey, 2002-01-09 06:37	New test html

Messages (13)
msg8607 - (view)	Author: Bernard YUE (berniey)	Date: 2002-01-06 08:06
HTMLParser does not distinguish between &foobar; and &foobar. The latter is still treated as a charref/entityref. Below is my proposed fix:

File: sgmllib.py
# SGMLParser.goahead()
# line 162-176
# from
    elif rawdata[i] == '&':
        match = charref.match(rawdata, i)
        if match:
            name = match.group(1)
            self.handle_charref(name)
            i = match.end(0)
            if rawdata[i-1] != ';': i = i-1
            continue
        match = entityref.match(rawdata, i)
        if match:
            name = match.group(1)
            self.handle_entityref(name)
            i = match.end(0)
            if rawdata[i-1] != ';': i = i-1
            continue
# to
    elif rawdata[i] == '&':
        match = charref.match(rawdata, i)
        if match:
            if rawdata[match.end(0)-1] != ';':
                # not really a charref
                self.handle_data(rawdata[i])
                i = i+1
            else:
                name = match.group(1)
                self.handle_charref(name)
                i = match.end(0)
            continue
        match = entityref.match(rawdata, i)
        if match:
            if rawdata[match.end(0)-1] != ';':
                # not really an entityref
                self.handle_data(rawdata[i])
                i = i+1
            else:
                name = match.group(1)
                self.handle_entityref(name)
                i = match.end(0)
            continue
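A minimal sketch of the two behaviours being compared in this report, written for current Python rather than the 2002 sgmllib. The `ENTITYREF` pattern, the `tokenize` helper and its `require_semicolon` flag are hypothetical names made up for this illustration; they are not sgmllib's API, and the sketch ignores character references and incomplete input.

```python
import re

# Hypothetical pattern: '&', an entity name, then exactly one terminating
# character (which may or may not be a semicolon).
ENTITYREF = re.compile(r"&([a-zA-Z][-.a-zA-Z0-9]*)([^a-zA-Z0-9])")


def tokenize(text, require_semicolon):
    """Split text into ('data', str) and ('entityref', name) events.

    require_semicolon=False: '&name' followed by any terminator is a reference.
    require_semicolon=True:  only '&name;' is a reference; '&name' stays text.
    """
    events = []
    pending = []  # literal characters not yet emitted as a data event
    i = 0
    while i < len(text):
        match = ENTITYREF.match(text, i) if text[i] == "&" else None
        if match and (match.group(2) == ";" or not require_semicolon):
            if pending:
                events.append(("data", "".join(pending)))
                pending = []
            events.append(("entityref", match.group(1)))
            # Consume the terminator only when it is the semicolon.
            i = match.end(1) + (1 if match.group(2) == ";" else 0)
        else:
            pending.append(text[i])
            i += 1
    if pending:
        events.append(("data", "".join(pending)))
    return events
```

With `require_semicolon=False`, the input `"Me&you! &copy;"` yields an `entityref` event for `you`; with `require_semicolon=True`, `&you` stays inside the surrounding data and only `copy` is reported as an entity reference.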
msg8608 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2002-01-08 21:03
Logged In: YES user_id=44345 Bernie, I see nothing wrong in principle with recognizing "&nbsp" when the user should have typed "&nbsp;", but I wonder about the validity of "&nbsp". You mentioned it's still a charref or entityref. Is that documented somewhere or is it simply a practical approach to a common problem? Thanks, Skip
msg8609 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-01-08 22:02
Logged In: YES user_id=21627 I fail to see the problem as well. Please attach an example document to this report. Without a detailed analysis of the problem in question, there is zero chance that any change like this will be accepted. Here is my analysis from your report: It seems that you complain that sgmllib, when it sees an ill-formed document, behaves in a particular way, whereas you expect it to behave in a different way. Since the document is ill-formed anyway, any behaviour is as good as any other.
msg8610 - (view)	Author: Bernard YUE (berniey)	Date: 2002-01-09 00:43
Logged In: YES user_id=419276 Hi Martin and Skip, Sorry for not explaining myself clearly. What I mean is that &foobar should be treated as '&foobar' literally (i.e. text), while &foobar; should be an entityref and &#foobar; a charref. Currently, sgmllib treats &foobar as an entityref and &#foobar as a charref, matches them against the entityref and charref tables, and ignores the entity when no match is found. My suggested change should fix this problem. Run test.py (test.py and test.html attached):

>./test.py
Me! Me & You! Copyright@copy;abc Copyright©abc © ©

But we are expecting:

Me&you! Me & You! Copyright@copy;abc Copyright©abc © ©

My suggested change will print the expected output.

# test.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3c.org/TR/html4/strict.dtd">
<html>
<head dir="ltr" lang="en">
<TITLE>Testing Page</TITLE>
<META name="AUTHOR" content="Bernard Yue">
<META name="DESCRIPTION" content="Testing Page">
</head>
<body>
<p>Me&you! Me & You! Copyright@copy;abc Copyright©abc © © </p>
</body>
</html>

# test.py
#!/usr/bin/env python
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter

def test():
    _formatter = AbstractFormatter( DumbWriter())
    _parser = HTMLParser( _formatter)
    _f = open( './test.html')
    _parser.feed( _f.read())
    _f.close()
    _parser.close()
    print ''

if __name__ == '__main__':
    test()
msg8611 - (view)	Author: Bernard YUE (berniey)	Date: 2002-01-09 01:04
Logged In: YES user_id=419276 Hi again, I just ran test.html through the W3C's HTML validator. &you is indeed treated as an invalid entityref in HTML 4.01. I've displayed test.html in IE, Netscape and Konqueror, and they all gave the result I expected. I am not sure whether sgmllib.py should stick with the standard or go with the general de facto interpretation, but I think it is more sensible to treat &you as text. Bernie
msg8612 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2002-01-09 04:33
Logged In: YES user_id=44345 Bernie, I tried your patch. It looks good to me. I was a tad confused when I first read your bug report. I thought you were suggesting that "&foo" be interpreted as a charref/entityref. Instead you are tightening up the parser. That seems reasonable to me. Martin, what do you think? Skip
msg8613 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2002-01-09 04:42
Logged In: YES user_id=6380 I'm reassigning this to Fred. In 2.2, the new HTMLParser may or may not still have this problem. In 2.1.2, I think that "fixing" it would be too big a risk of breaking existing code, so I think it should not be fixed.
msg8614 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-01-09 05:30
Logged In: YES user_id=21627 I still recommend rejecting this patch; it is plain wrong. Do we all agree that an HTML document containing &you is ill-formed (in all HTML versions)? If so, what to do with it is a matter of best effort. In SGML, it is well-formed to omit the semicolon from the entity name in an entity reference in certain cases, see http://bip.cnrs-mrs.fr/bip10/scowl.htm#semi Therefore, omission of the semicolon does not mean that you don't have an entity reference, and sgmllib's processing of entity references is completely correct: it would be an error to treat &you as data. Therefore, your document is correct SGML. It just fails to be correct HTML, since the entity 'you' is not defined. If you want to process such a document in a specific way, I recommend subclassing HTMLParser and overriding unknown_entityref.
msg8615 - (view)	Author: Bernard YUE (berniey)	Date: 2002-01-09 06:35
Logged In: YES user_id=419276 Hi Guys, I feel embarrassed for confusing everybody here. Martin is nearly 100% right, except that &foo, &foo;, &#bar and &#bar; are all valid entities in HTML 4.01 as well, if they are defined. (I did not put enough test cases in the old test.html to spot my mistake when I ran it through the W3C HTML validator; the new one should include all cases.) Hence the existing sgmllib.py was correct <Oops!>. However, all the major browsers (IE, Netscape, Konqueror, Opera) choose to print the invalid HTML as plain text, so I think htmllib.py might as well follow the crowd. My suggestion is to add the functions HTMLParser.unknown_charref() and HTMLParser.unknown_entityref() as follows (files attached):

    # --- treat unknown entity as plain text
    def unknown_charref(self, ref):
        self.handle_data( '&#' + ref)

    def unknown_entityref(self, ref):
        self.handle_data( '&'+ ref)

Sorry again for my previous incorrect patches. Bernie
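The subclassing approach discussed in this thread can be sketched against the present-day standard library. This is only an analogue, under the assumption that Python 3's `html.parser.HTMLParser` (with `convert_charrefs=False`) and `html.entities.name2codepoint` stand in for the 2002 `htmllib`/`sgmllib` pair, which no longer ships with Python 3. `LenientTextExtractor` and `extract_text` are hypothetical names invented for the example.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class LenientTextExtractor(HTMLParser):
    """Collect document text, passing unknown entity references through as text."""

    def __init__(self):
        # With convert_charrefs=False the parser reports entity references
        # through handle_entityref() instead of converting them itself.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is not None:
            # Known entity: substitute the character it names.
            self.chunks.append(chr(codepoint))
        else:
            # Unknown entity: emit it as plain text, as browsers tend to do.
            # The handler is not told whether a ';' followed the name,
            # so a trailing semicolon is not reproduced.
            self.chunks.append("&" + name)


def extract_text(markup):
    """Hypothetical helper: return the text content of an HTML fragment."""
    parser = LenientTextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

Calling `extract_text("<p>Me&you! &copy;</p>")` keeps `&you` as literal text while `&copy;` is replaced by the copyright sign. Character references such as `&#169;` are not handled in this sketch; a fuller version would also override `handle_charref`.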
msg8616 - (view)	Author: Fred Drake (fdrake)	Date: 2002-03-13 06:02
Logged In: YES user_id=3066 Bump the priority so I'll have to look at this when I'm not too tired to think straight.
msg8617 - (view)	Author: Nobody/Anonymous (nobody)	Date: 2002-03-18 20:55
Logged In: NO I don't understand the Python project or how the server works. I haven't found any PDF file that could explain the development to me in Spanish. Regards, sebastia
msg8618 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2002-03-18 21:00
Logged In: YES user_id=6380 http://www.python.org/doc/NonEnglish.html#spanish
msg8619 - (view)	Author: Fred Drake (fdrake)	Date: 2002-06-14 01:35
Logged In: YES user_id=3066 I agree that this should be rejected; this is not a recurring complaint about the module, and there's no reason to further exacerbate the HTML-as-deployed problem. Let's stick with the (relatively) strict interpretation.

History
Date	User	Action	Args
2022-04-10 16:04:51	admin	set	github: 35871
2002-01-06 08:06:19	berniey	create