Issue500073
Created on 2002-01-06 08:06 by berniey, last changed 2022-04-10 16:04 by admin. This issue is now closed.
Files | ||
---|---|---|
File name | Uploaded | Description |
sgmllib.py | berniey, 2002-01-08 09:00 | Suggested Changes |
test.html | berniey, 2002-01-09 00:44 | Testing HTML |
test.py | berniey, 2002-01-09 00:45 | Testing script |
htmllib.py | berniey, 2002-01-09 06:35 | Suggested Changes |
test.html | berniey, 2002-01-09 06:37 | New test html |
Messages (13) | |||
---|---|---|---|
msg8607 - (view) | Author: Bernard YUE (berniey) | Date: 2002-01-06 08:06 | |
HTMLParser does not distinguish between &foobar; and &foobar. The latter is still treated as a charref/entityref. Below is my proposed fix:

File: sgmllib.py
# SGMLParser.goahead(), lines 162-176

# from
    elif rawdata[i] == '&':
        match = charref.match(rawdata, i)
        if match:
            name = match.group(1)
            self.handle_charref(name)
            i = match.end(0)
            if rawdata[i-1] != ';': i = i-1
            continue
        match = entityref.match(rawdata, i)
        if match:
            name = match.group(1)
            self.handle_entityref(name)
            i = match.end(0)
            if rawdata[i-1] != ';': i = i-1
            continue

# to
    elif rawdata[i] == '&':
        match = charref.match(rawdata, i)
        if match:
            if rawdata[match.end(0)-1] != ';':
                # not really a charref
                self.handle_data(rawdata[i])
                i = i+1
            else:
                name = match.group(1)
                self.handle_charref(name)
                i = match.end(0)
            continue
        match = entityref.match(rawdata, i)
        if match:
            if rawdata[match.end(0)-1] != ';':
                # not really an entityref
                self.handle_data(rawdata[i])
                i = i+1
            else:
                name = match.group(1)
                self.handle_entityref(name)
                i = match.end(0)
            continue |
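The rule proposed in this message can be sketched outside sgmllib as a small standalone function. This is only an illustration of the idea (a reference counts only when it ends in a semicolon), not the code that sgmllib shipped or the attached patch. The name `resolve_strict` and the regular expression are assumptions made for the example; `html.entities.name2codepoint` is the standard-library table of named HTML entities in Python 3.

```python
import re
from html.entities import name2codepoint

# Only references that end in ';' are recognised. This mirrors the proposal,
# not the behaviour of sgmllib itself.
_REFERENCE = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def resolve_strict(text):
    """Replace semicolon-terminated references; leave everything else as text."""

    def _replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            try:
                return chr(int(ref[1:]))
            except (ValueError, OverflowError):
                # Out-of-range code point: keep the original text.
                return match.group(0)
        codepoint = name2codepoint.get(ref)
        # Unknown names are kept verbatim rather than dropped.
        return chr(codepoint) if codepoint is not None else match.group(0)

    return _REFERENCE.sub(_replace, text)


print(resolve_strict("Me&you! Me &amp; You! Copyright&copy;abc"))
```

Under this rule `&you` stays in the output as literal text, which is the behaviour the report asks for.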
|||
msg8608 - (view) | Author: Skip Montanaro (skip.montanaro) * | Date: 2002-01-08 21:03 | |
Logged In: YES user_id=44345 Bernie, I see nothing wrong in principle with recognizing "&nbsp" when the user should have typed "&nbsp;", but I wonder about the validity of "&nbsp". You mentioned it's still a charref or entityref. Is that documented somewhere or is it simply a practical approach to a common problem? Thanks, Skip |
|||
msg8609 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2002-01-08 22:02 | |
Logged In: YES user_id=21627 I fail to see the problem as well. Please attach an example document to this report. Without a detailed analysis of the problem in question, there is zero chance that any change like this will be accepted. Here is my analysis from your report: It seems that you complain that sgmllib, when it sees an ill-formed document, behaves in a particular way, whereas you expect it to behave in a different way. Since the document is ill-formed anyway, any behaviour is as good as any other. |
|||
msg8610 - (view) | Author: Bernard YUE (berniey) | Date: 2002-01-09 00:43 | |
Logged In: YES user_id=419276 Hi Martin and Skip, Sorry for not explaining myself clearly. What I mean is that &foobar should be treated as '&foobar' literally (i.e. as text), &foobar; should be an entityref, and &#foobar; a charref. Currently, sgmllib treats &foobar as an entityref and &#foobar as a charref, matches them against the entityref and charref tables, and ignores the entity when no match is found. My suggested change should fix this problem.

Run test.py (test.py and test.html attached):

    >./test.py
    Me! Me & You! Copyright@copy;abc Copyright©abc © ©

But we are expecting:

    Me&you! Me & You! Copyright@copy;abc Copyright©abc © ©

My suggested change will print the expected output.

# test.html
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3c.org/TR/html4/strict.dtd">
    <html>
    <head dir="ltr" lang="en">
    <TITLE>Testing Page</TITLE>
    <META name="AUTHOR" content="Bernard Yue">
    <META name="DESCRIPTION" content="Testing Page">
    </head>
    <body>
    <p>Me&you! Me & You! Copyright@copy;abc Copyright©abc © © </p>
    </body>
    </html>

# test.py
    #!/usr/bin/env python
    from htmllib import HTMLParser
    from formatter import AbstractFormatter, DumbWriter

    def test():
        _formatter = AbstractFormatter(DumbWriter())
        _parser = HTMLParser(_formatter)
        _f = open('./test.html')
        _parser.feed(_f.read())
        _f.close()
        _parser.close()
        print ''

    if __name__ == '__main__':
        test() |
|||
msg8611 - (view) | Author: Bernard YUE (berniey) | Date: 2002-01-09 01:04 | |
Logged In: YES user_id=419276 Hi again, I just ran test.html through the W3C HTML validator. &you is indeed treated as an invalid entityref in HTML 4.01. I displayed test.html in IE, Netscape and Konqueror, and they all gave the result I expected. I am not sure whether sgmllib.py should stick with the standard or go with the general de facto interpretation, but I think it is more sensible to treat &you as text. Bernie |
|||
msg8612 - (view) | Author: Skip Montanaro (skip.montanaro) * | Date: 2002-01-09 04:33 | |
Logged In: YES user_id=44345 Bernie, I tried your patch. It looks good to me. I was a tad confused when I first read your bug report. I thought you were suggesting that "&foo" be interpreted as a charref/entityref. Instead you are tightening up the parser. That seems reasonable to me. Martin, what do you think? Skip |
|||
msg8613 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2002-01-09 04:42 | |
Logged In: YES user_id=6380 I'm reassigning this to Fred. In 2.2, the new HTMLParser may or may not still have this problem. In 2.1.2, I think that "fixing" it would be too big a risk of breaking existing code, so I think it should not be fixed. |
|||
msg8614 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2002-01-09 05:30 | |
Logged In: YES user_id=21627 I still recommend rejecting this patch; it is plain wrong. Do we all agree that an HTML document containing &you is ill-formed (in all HTML versions)? If so, what to do with it is a matter of best effort. In SGML, it is well-formed to omit the semicolon from the entity name in an entity reference in certain cases, see http://bip.cnrs-mrs.fr/bip10/scowl.htm#semi Therefore, omission of the semicolon does *not* mean that you don't have an entity reference, and sgmllib's processing of entity references is completely correct: it would be an error to treat &you as data. Therefore, your document is correct SGML. It just fails to be correct HTML, since the entity 'you' is not defined. If you want to process such a document in a specific way, I recommend subclassing HTMLParser and overriding unknown_entityref. |
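The point about the optional semicolon can be illustrated with a short sketch. The pattern below is modelled on the style of expression sgmllib used for entity references at the time (that it matches the shipped module exactly is an assumption), and `scan_entityref` is a hypothetical helper written for this example. It shows a reference being recognised whether or not a semicolon follows the name, with the semicolon consumed only when present.

```python
import re

# Modelled on the kind of pattern sgmllib used (an assumption): a name
# followed by any single non-alphanumeric character, which need not be ';'.
ENTITYREF = re.compile(r"&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]")


def scan_entityref(rawdata, i):
    """Return (name, next_index) for a reference at rawdata[i], or None."""
    match = ENTITYREF.match(rawdata, i)
    if match is None:
        return None
    end = match.end(0)
    # The terminator is consumed only when it is a semicolon; any other
    # character is handed back to be processed as ordinary data.
    if rawdata[end - 1] != ";":
        end -= 1
    return match.group(1), end


print(scan_entityref("Me&you! ok", 2))
print(scan_entityref("Me&you; ok", 2))
```

Both calls report the name `you`; they differ only in where scanning resumes, which is why `&you` reaches the entity handler in the first place.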
|||
msg8615 - (view) | Author: Bernard YUE (berniey) | Date: 2002-01-09 06:35 | |
Logged In: YES user_id=419276 Hi Guys, I feel embarrassed that I confused everybody here. Martin is nearly 100% right, except that &foo, &foo;, &#bar and &#bar; are all valid entities in HTML 4.01 as well, provided they are defined. (I did not put enough test cases in the old test.html to spot my mistake when I ran it through the W3C HTML validator; the new one should include all cases.) Hence the existing sgmllib.py was correct (oops!). However, all the major browsers (IE, Netscape, Konqueror, Opera) choose to print the invalid HTML as plain text, so I think htmllib.py might as well follow the crowd. My suggestion is to add the functions HTMLParser.unknown_charref() and HTMLParser.unknown_entityref() as follows (files attached):

    # --- treat unknown entity as plain text
    def unknown_charref(self, ref):
        self.handle_data('&#' + ref)

    def unknown_entityref(self, ref):
        self.handle_data('&' + ref)

Sorry again for my previous incorrect patches. Bernie |
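The hook-based approach in this message can be tried without htmllib, which no longer exists in current Python. The sketch below uses a toy class, `TinyEntityParser`, invented for this example; only the hook names `handle_data`, `handle_entityref` and `unknown_entityref` are taken from the thread. Its default `unknown_entityref` silently drops the reference, matching the behaviour described in the report, and the subclass applies the override proposed above.

```python
import re
from html.entities import name2codepoint


class TinyEntityParser:
    """Hypothetical stand-in for the parser hooks discussed in this thread."""

    # Name with an optional trailing semicolon (an assumption for the demo).
    _ref = re.compile(r"&([a-zA-Z][-.a-zA-Z0-9]*);?")

    def __init__(self):
        self.output = []

    def feed(self, text):
        pos = 0
        for match in self._ref.finditer(text):
            self.handle_data(text[pos:match.start()])
            self.handle_entityref(match.group(1))
            pos = match.end()
        self.handle_data(text[pos:])

    def handle_data(self, data):
        if data:
            self.output.append(data)

    def handle_entityref(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            self.unknown_entityref(name)
        else:
            self.handle_data(chr(codepoint))

    def unknown_entityref(self, ref):
        # Default: the unknown reference is silently dropped.
        pass

    def text(self):
        return "".join(self.output)


class LenientEntityParser(TinyEntityParser):
    def unknown_entityref(self, ref):
        # The proposed override: emit the reference as literal text.
        self.handle_data("&" + ref)


for cls in (TinyEntityParser, LenientEntityParser):
    parser = cls()
    parser.feed("Me&you! Me &amp; You!")
    print(cls.__name__, repr(parser.text()))
```

Note that the override as proposed re-emits only `'&' + ref`, so a trailing semicolon on an unknown reference such as `&you;` would not be reproduced in the output.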
|||
msg8616 - (view) | Author: Fred Drake (fdrake) | Date: 2002-03-13 06:02 | |
Logged In: YES user_id=3066 Bump the priority so I'll have to look at this when I'm not too tired to think straight. |
|||
msg8617 - (view) | Author: Nobody/Anonymous (nobody) | Date: 2002-03-18 20:55 | |
Logged In: NO I don't understand the Python project or how the server works. I haven't found any PDF file that explains the development in Spanish. Regards, sebastia |
|||
msg8618 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2002-03-18 21:00 | |
Logged In: YES user_id=6380 http://www.python.org/doc/NonEnglish.html#spanish |
|||
msg8619 - (view) | Author: Fred Drake (fdrake) | Date: 2002-06-14 01:35 | |
Logged In: YES user_id=3066 I agree that this should be rejected; this is not a recurring complaint about the module, and there's no reason to further exacerbate the HTML-as-deployed problem. Let's stick with the (relatively) strict interpretation. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-10 16:04:51 | admin | set | github: 35871 |
2002-01-06 08:06:19 | berniey | create |