
classification
Title: HTMLParser fail to handle '&foobar'
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.3
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To: fdrake
Nosy List: berniey, fdrake, gvanrossum, loewis, nobody, skip.montanaro
Priority: high
Keywords:

Created on 2002-01-06 08:06 by berniey, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Files
File name    Uploaded                     Description
sgmllib.py   berniey, 2002-01-08 09:00    Suggested Changes
test.html    berniey, 2002-01-09 00:44    Testing HTML
test.py      berniey, 2002-01-09 00:45    Testing script
htmllib.py   berniey, 2002-01-09 06:35    Suggested Changes
test.html    berniey, 2002-01-09 06:37    New test html
Messages (13)
msg8607 - Author: Bernard YUE (berniey) Date: 2002-01-06 08:06
HTMLParser does not distinguish between &foobar; and 
&foobar.  The latter is still treated as a 
charref/entityref.  Below is my proposed fix:

File:  sgmllib.py

# SGMLParser.goahead()
# line 162-176
# from
            elif rawdata[i] == '&':
                match = charref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_charref(name)
                    i = match.end(0)
                    if rawdata[i-1] != ';': i = i-1
                    continue
                match = entityref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_entityref(name)
                    i = match.end(0)
                    if rawdata[i-1] != ';': i = i-1
                    continue

# to
            elif rawdata[i] == '&':
                match = charref.match(rawdata, i)
                if match:
                    if rawdata[match.end(0)-1] != ';':
                        # not really a charref
                        self.handle_data(rawdata[i])
                        i = i+1
                    else:
                        name = match.group(1)
                        self.handle_charref(name)
                        i = match.end(0)
                    continue
                match = entityref.match(rawdata, i)
                if match:
                    if rawdata[match.end(0)-1] != ';':
                        # not really an entityref
                        self.handle_data(rawdata[i])
                        i = i+1
                    else: 
                        name = match.group(1)
                        self.handle_entityref(name)
                        i = match.end(0)
                    continue
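
To make the difference between the two code paths concrete, here is a
minimal, self-contained sketch in modern Python. It is not sgmllib's
code: the ENTITY pattern, the tokenize() function and its
require_semicolon flag are hypothetical names chosen for illustration.
With require_semicolon=False a name without a trailing ';' is still
reported as an entity reference (roughly the "from" behaviour); with
require_semicolon=True it is left in the text (roughly the "to"
behaviour).

```python
import re

# Hypothetical pattern, not sgmllib's own: a name plus an optional ';'.
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def tokenize(text, require_semicolon):
    """Split text into ('data', s) and ('entityref', name) tokens."""
    tokens = []
    pos = 0
    for match in ENTITY.finditer(text):
        name, semicolon = match.groups()
        if require_semicolon and not semicolon:
            # Strict mode: leave "&name" in the text as ordinary data.
            continue
        if match.start() > pos:
            tokens.append(("data", text[pos:match.start()]))
        tokens.append(("entityref", name))
        pos = match.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens


if __name__ == "__main__":
    sample = "Me&you! &copy; ok"
    print(tokenize(sample, require_semicolon=False))
    print(tokenize(sample, require_semicolon=True))
```

In the lenient call "&you" comes back as an entityref token; in the
strict call it stays inside the leading data token.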

msg8608 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2002-01-08 21:03

Bernie,

I see nothing wrong in principle with recognizing "&nbsp" when the
user should have typed "&nbsp;", but I wonder about the validity of
"&nbsp".  You mentioned it's still a charref or entityref.  Is that
documented somewhere or is it simply a practical approach to a common
problem?

Thanks,

Skip
msg8609 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-01-08 22:02

I fail to see the problem as well. Please attach an example
document to this report. Without a detailed analysis of the
problem in question, there is zero chance that any change
like this is accepted.

Here is my analysis from your report: It seems that you
complain that sgmllib, when it sees an ill-formed document,
behaves in a particular way, whereas you expect it to behave
in a different way. Since the document is ill-formed
anyway, any behaviour is as good as any other.
msg8610 - Author: Bernard YUE (berniey) Date: 2002-01-09 00:43

Hi Martin and Skip,

Sorry for not explaining myself clearly.  What I mean is that &foobar 
should be treated as '&foobar' literally (i.e. text), while 
&foobar; should be an entityref and &#foobar; a charref.

Currently, sgmllib treats &foobar as an entityref and &#foobar as a 
charref and matches them against the entityref and charref tables.  
It ignores the entity when no match is found.
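
The effect described above (an unknown name is looked up, not found,
and silently dropped) can be illustrated with a small sketch. It is
written for modern Python and does not use sgmllib; the ENTITIES
table, the ENTITY pattern and render_dropping_unknown() are
hypothetical names introduced only for this example.

```python
import re

# Hypothetical entity table for illustration; real tables are much larger.
ENTITIES = {"amp": "&", "copy": "\u00a9"}

# Hypothetical pattern: the trailing ';' is optional.
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def render_dropping_unknown(text):
    """Replace known entities and silently drop unknown ones."""
    return ENTITY.sub(lambda m: ENTITIES.get(m.group(1), ""), text)


if __name__ == "__main__":
    print(render_dropping_unknown("Me&you! Me &amp; You! &copy;"))
```

Because "you" is not in the table, "&you" vanishes from the output,
which is the kind of loss this report is about.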

My suggested change should fix this problem.  Run test.py 
(test.py and test.html attached)

>./test.py

Me! Me & You! Copyright@copy;abc Copyright©abc © ©

But we are expecting:
Me&you! Me & You! Copyright@copy;abc Copyright©abc © ©

My suggested change will print the expected output.

# test.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3c.org/TR/html4/strict.dtd">

<html>
<head dir="ltr" lang="en">
  <TITLE>Testing Page</TITLE>
  <META name="AUTHOR" content="Bernard Yue">
  <META name="DESCRIPTION" content="Testing Page">
</head>
<body>
  <p>Me&you!  Me & You! Copyright@copy;abc 
Copyright©abc &copy; ©
  </p>
</body>
</html>

# test.py
#!/usr/bin/env python

from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter


def test():
    _formatter = AbstractFormatter( DumbWriter())
    _parser = HTMLParser( _formatter)
    _f = open( './test.html')

    _parser.feed( _f.read())
    _f.close()
    _parser.close()
    print ''

if __name__ == '__main__':
    test()


msg8611 - Author: Bernard YUE (berniey) Date: 2002-01-09 01:04

Hi again,

I just ran test.html through the W3C HTML validator.  &you is 
indeed treated as an invalid entityref in HTML 4.01.  I displayed 
test.html in IE, Netscape and Konqueror and they all gave the 
result I expected.  I am not sure whether sgmllib.py should stick 
with the standard or go with the general de facto interpretation.

But I think it is more sensible to treat &you as text.


Bernie
msg8612 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2002-01-09 04:33

Bernie,

I tried your patch.  It looks good to me.  I was a tad confused when
I first read your bug report.  I thought you were suggesting that
"&foo" be interpreted as a charref/entityref.  Instead you are
tightening up the parser.

That seems reasonable to me.  Martin, what do you think?

Skip
msg8613 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-01-09 04:42

I'm reassigning this to Fred.

In 2.2, the new HTMLParser may or may not still have this
problem.

In 2.1.2, I think that "fixing" it would be too big a risk
of breaking existing code, so I think it should not be
fixed.
msg8614 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-01-09 05:30

I still recommend rejecting this patch; it is plain wrong.
Do we all agree that an HTML Document containing &you is
ill-formed (all HTML versions)? If so, it is a matter of
best-effort what to do with it.

In SGML, it is well-formed to omit the semicolon from the
entity name in an entity reference in certain cases, see

http://bip.cnrs-mrs.fr/bip10/scowl.htm#semi

Therefore, omission of the semicolon does *not* mean that
you don't have an entity reference, and sgmllib's processing
of entity references is completely correct - it would be an
error to treat &you as data. 

Therefore, your document is correct SGML. It just fails to
be correct HTML, since the entity 'you' is not defined.

If you want to process such a document in a specific way, I
recommend subclassing HTMLParser and overriding unknown_entityref.
msg8615 - Author: Bernard YUE (berniey) Date: 2002-01-09 06:35

Hi Guys,

I feel embarrassed that I confused everybody here.  Martin is nearly 
100% right, except that &foo, &foo;, &#bar and &#bar; are all 
valid entities in HTML 4.01 as well, if they are defined (I did not 
put enough test cases in the old test.html to spot my mistake when I 
ran it through the W3C HTML validator; the new one should include 
all cases).  Hence the existing sgmllib.py was correct <Oops!>.

However, all the major browsers (IE, Netscape, Konqueror, Opera) 
choose to print the invalid HTML as plain text.  Hence I think 
htmllib.py might as well follow the crowd.

My suggestion is to add the functions 
HTMLParser.unknown_charref() and 
HTMLParser.unknown_entityref() as follows (files attached):

    # --- treat unknown entity as plain text

    def unknown_charref(self, ref):
        self.handle_data( '&#' + ref)

    def unknown_entityref(self, ref): 
        self.handle_data( '&'+ ref)
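
As a rough illustration of how such hook overrides change the output,
here is a self-contained sketch for modern Python. It does not use
htmllib or sgmllib: MiniParser, TextFallbackParser, KNOWN and the
ENTITY pattern are hypothetical stand-ins that only mimic the
hook-based design, with unknown_charref() and unknown_entityref()
doing nothing by default.

```python
import re

# Hypothetical stand-in for a hook-based parser; not htmllib's code.
ENTITY = re.compile(r"&(#?)([A-Za-z0-9]+);?")
KNOWN = {"amp": "&", "copy": "\u00a9"}


class MiniParser:
    """Tiny parser that sends unknown references to overridable hooks."""

    def __init__(self):
        self.out = []

    def handle_data(self, data):
        self.out.append(data)

    def unknown_charref(self, ref):
        pass  # default: drop silently

    def unknown_entityref(self, ref):
        pass  # default: drop silently

    def feed(self, text):
        pos = 0
        for m in ENTITY.finditer(text):
            self.handle_data(text[pos:m.start()])
            hash_mark, ref = m.groups()
            if hash_mark:
                if ref.isdigit():
                    self.handle_data(chr(int(ref)))
                else:
                    self.unknown_charref(ref)
            elif ref in KNOWN:
                self.handle_data(KNOWN[ref])
            else:
                self.unknown_entityref(ref)
            pos = m.end()
        self.handle_data(text[pos:])
        return "".join(self.out)


class TextFallbackParser(MiniParser):
    """Treat unknown references as plain text, as proposed above."""

    def unknown_charref(self, ref):
        self.handle_data("&#" + ref)

    def unknown_entityref(self, ref):
        self.handle_data("&" + ref)


if __name__ == "__main__":
    sample = "Me&you! &copy; &#169;"
    print(MiniParser().feed(sample))
    print(TextFallbackParser().feed(sample))
```

In this sketch the hooks receive only the reference name, so a
trailing ';' on an unknown reference is not restored in the output.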

Sorry again for my previous incorrect patches.

Bernie
msg8616 - Author: Fred Drake (fdrake) (Python committer) Date: 2002-03-13 06:02

Bump the priority so I'll have to look at this when I'm not
too tired to think straight.
msg8617 - Author: Nobody/Anonymous (nobody) Date: 2002-03-18 20:55

I don't understand the Python project or how the server works. 
I haven't found any PDF file that explains the development 
in Spanish. Regards, sebastia
msg8618 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-03-18 21:00

http://www.python.org/doc/NonEnglish.html#spanish
msg8619 - Author: Fred Drake (fdrake) (Python committer) Date: 2002-06-14 01:35

I agree that this should be rejected; this is not a
recurring complaint about the module, and there's no reason
to further exacerbate the HTML-as-deployed problem.  Let's
stick with the (relatively) strict interpretation.
History
Date User Action Args
2022-04-10 16:04:51 admin set github: 35871
2002-01-06 08:06:19 berniey create