HTMLParser attribute parsing bug #37954

Closed
freddrake opened this issue Feb 10, 2003 · 6 comments
Labels: easy, stdlib (Python modules in the Lib dir), type-feature (a feature request or enhancement)

@freddrake
Member

BPO 683938
Nosy @freddrake, @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/freddrake'
closed_at = <Date 2010-08-18.13:15:34.348>
created_at = <Date 2003-02-10.14:57:40.000>
labels = ['easy', 'type-feature', 'library']
title = 'HTMLParser attribute parsing bug'
updated_at = <Date 2010-08-18.13:15:34.346>
user = 'https://github.com/freddrake'

bugs.python.org fields:

activity = <Date 2010-08-18.13:15:34.346>
actor = 'BreamoreBoy'
assignee = 'fdrake'
closed = True
closed_date = <Date 2010-08-18.13:15:34.348>
closer = 'BreamoreBoy'
components = ['Library (Lib)']
creation = <Date 2003-02-10.14:57:40.000>
creator = 'fdrake'
dependencies = []
files = []
hgrepos = []
issue_num = 683938
keywords = ['easy']
message_count = 6.0
messages = ['60305', '60306', '60307', '60308', '60309', '114217']
nosy_count = 6.0
nosy_names = ['fdrake', 'calvin', 'titus', 'smroid', 'r.david.murray', 'BreamoreBoy']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'test needed'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue683938'
versions = ['Python 2.7']

@freddrake
Member Author

HTMLParser (reportedly) fails to parse this construct:

<a href="http://ss"title="pe">P</a>

(Note that the required space between the two attributes
of the "a" tag has been omitted.) The W3C validator
apparently treats this differently, so there's no
point in arguing the letter of the law.

Assigned to me.
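
For anyone wanting to check the reported construct against a current interpreter, here is a minimal sketch. It assumes Python 3's `html.parser` module, the successor of the Python 2 `HTMLParser` module this report concerns, so behavior may differ from what was originally observed. `TagCollector` is a hypothetical helper name, not part of the standard library.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.start_tags.append((tag, attrs))


collector = TagCollector()
# The construct from the report: no space between the two attributes.
collector.feed('<a href="http://ss"title="pe">P</a>')
collector.close()
print(collector.start_tags)
```

If the parser copes with the missing space, both `href` and `title` should show up in the recorded attribute list.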

@freddrake freddrake self-assigned this Feb 10, 2003
@freddrake freddrake added the stdlib Python modules in the Lib dir label Feb 10, 2003
@calvin
Mannequin

calvin mannequin commented Mar 31, 2003

Logged In: YES
user_id=9205

HTMLParser (and lots of other parsers I tried) definitely has
limits when it comes to error recovery. I don't know if it's
worth putting further development effort into HTMLParser, as
it will IMHO never be able to cope with all the crappy HTML
out there.
If you really want an HTML parser in Python, I suggest you
look at my htmlsax module packaged with linkchecker
(linkchecker.sf.net) and webcleaner (webcleaner.sf.net); the
parser is tested with lots of real-world examples.
The parser packaged with linkchecker has line counting; the
one with webcleaner does not.

Cheers, Bastian

@smroid
Mannequin

smroid mannequin commented May 14, 2003

Logged In: YES
user_id=159908

Two troublesome input examples:
<table border=0 width="100%"cellspacing=0 cellpadding=0>
<option selected value=>

Here's a fix I came up with in HTMLParser.py: replace the
definition of locatestarttagend with:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  \s*                                # whitespace after tag name
  (?:
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )?
       )?
     )
     \s*                             # whitespace between attrs
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
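
A quick way to sanity-check the proposed pattern outside of HTMLParser.py is sketched below. It copies the regular expression from this comment unchanged and runs it against the three start tags quoted in this thread. `consumed_whole_tag` is a hypothetical helper written for this check only, and a match here shows only that the pattern reaches the closing ">", not that the rest of the parser handles the tag.

```python
import re

# Pattern copied verbatim from the comment above.
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  \s*                                # whitespace after tag name
  (?:
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )?
       )?
     )
     \s*                             # whitespace between attrs
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

# Start tags quoted in this thread.
samples = [
    '<a href="http://ss"title="pe">',
    '<table border=0 width="100%"cellspacing=0 cellpadding=0>',
    '<option selected value=>',
]


def consumed_whole_tag(tag):
    """Hypothetical check: True if the match stops right before the final '>'."""
    m = locatestarttagend.match(tag)
    return m is not None and tag[m.end():] == ">"


for tag in samples:
    print(tag, consumed_whole_tag(tag))
```

The check compares the unmatched remainder with ">" because the pattern itself deliberately stops before the closing bracket.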

@bitdancer
Member

Logged In: YES
user_id=100308

I'm using python 2.3.3.

I note that bug 699079, which addresses this same issue, was closed
as "not a bug". As far as I can tell the current behavior of
HTMLParser, unlike what was reported in that bug report, is to
silently stop parsing. This is a problem, as it took me quite a
while to track down why my application wasn't working, whereas if
an exception had been generated I'd have figured it out right quick.

If it's going to stop parsing when the error occurs, then I'd much
rather it generate an exception. I can always trap the exception
if I want to keep going. Since it apparently used to work that
way, I'm hoping a quick poke through CVS by someone familiar
with the code can restore the exception behavior, pending a more
satisfactory resolution to the problem.

@titus
Mannequin

titus mannequin commented Dec 19, 2004

Logged In: YES
user_id=23486

In response to rdmurray's comment: in Python 2.4, at least, an exception
is raised.

Not sure why this bug is being kept open... but see bug 736428 and
patch 755660 for related issues.

@devdanzin devdanzin mannequin added the type-feature A feature request or enhancement label Feb 12, 2009
@devdanzin devdanzin mannequin added the easy label Apr 22, 2009
@BreamoreBoy
Mannequin

BreamoreBoy mannequin commented Aug 18, 2010

Closed as fixed in r23322.

@BreamoreBoy BreamoreBoy mannequin closed this as completed Aug 18, 2010
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022