HTMLParser.locatestartagend regex too stringent #41113

dyoo · 2004-11-01T18:05:39Z

BPO	1058305
Nosy	@devdanzin, @bitdancer
Superseder	bpo-1486713: HTMLParser : A auto-tolerant parsing mode
Files	HTMLParser.py.diff: diff against Lib/HTMLParser.py from Python 2.3.3

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2010-12-03.03:00:05.751>
created_at = <Date 2004-11-01.18:05:39.000>
labels = ['easy', 'type-feature', 'library']
title = 'HTMLParser.locatestartagend regex too stringent'
updated_at = <Date 2010-12-03.03:00:05.749>
user = 'https://bugs.python.org/dyoo'

bugs.python.org fields:

activity = <Date 2010-12-03.03:00:05.749>
actor = 'r.david.murray'
assignee = 'none'
closed = True
closed_date = <Date 2010-12-03.03:00:05.751>
closer = 'r.david.murray'
components = ['Library (Lib)']
creation = <Date 2004-11-01.18:05:39.000>
creator = 'dyoo'
dependencies = []
files = ['1469']
hgrepos = []
issue_num = 1058305
keywords = ['patch', 'easy']
message_count = 6.0
messages = ['22976', '82104', '114390', '115604', '115623', '123168']
nosy_count = 3.0
nosy_names = ['dyoo', 'ajaksu2', 'r.david.murray']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = '1486713'
type = 'enhancement'
url = 'https://bugs.python.org/issue1058305'
versions = ['Python 3.2']

dyoo · 2004-11-01T18:05:39Z

In Python 2.3.3, HTMLParser uses a regex for locating the end
of a start tag that is too strict, so it does not handle
slightly malformed HTML gracefully.

The current definition of HTMLParser.locatestarttagend:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*       # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

does not capture strings like:

<IMG SRC = "abc.jpg"WIDTH=5>

where there is no space between the closing quote and
the next attribute name. Many sources of HTML are
slightly malformed this way (CNN.com in particular),
so being a little lenient might be good. We can relax
the constraint by making that whitespace optional:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*       # tag name
  (?:\s*                             # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

which allows the parser to process more of the HTML out
there.
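
The difference between the two patterns can be checked directly against the sample tag. The following is only a rough sketch: build_locatestarttagend is a hypothetical helper written for this comparison (it is not part of HTMLParser), and it only exercises the regex itself, not the rest of the parser.

```python
import re


def build_locatestarttagend(separator):
    """Hypothetical helper: build the pattern from this report with the
    given sub-pattern required before each attribute name."""
    return re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
      (?:""" + separator + r"""          # whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
         )
       )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)


strict = build_locatestarttagend(r"\s+")   # pattern quoted from Python 2.3.3
relaxed = build_locatestarttagend(r"\s*")  # proposed relaxed pattern

tag = '<IMG SRC = "abc.jpg"WIDTH=5>'
strict_end = strict.match(tag).end()
relaxed_end = relaxed.match(tag).end()

# What is left over after each match; a well-formed start tag should
# leave only the closing ">" (or "/>").
print(repr(tag[strict_end:]))   # 'WIDTH=5>'
print(repr(tag[relaxed_end:]))  # '>'
```

The strict pattern stops right after the closing quote, so the text that follows is not the end of the tag; the relaxed pattern consumes the second attribute and stops at the ">".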

See:

http://mail.python.org/pipermail/tutor/2004-October/032835.html

and:

http://mail.python.org/pipermail/tutor/2004-October/032869.html

for an explanation of what motivates this change.

Thanks!

devdanzin · 2009-02-14T18:45:26Z

The regex is still the same. This is one of many 'HTMLParser regex for
attributes' issues.

BreamoreBoy · 2010-08-19T18:21:52Z

I'll close this in a couple of weeks unless anyone objects.

BreamoreBoy · 2010-09-04T18:44:33Z

No reply to msg114390.

bitdancer · 2010-09-05T02:45:05Z

Closing this issue as out of date was inappropriate. It may be a duplicate, but someone with an interest should go through and evaluate all the related 'tolerant HTML parser' issues.

bpo-1486713 could perhaps serve as a master issue for this set.

bitdancer · 2010-12-03T03:00:06Z

Closing this in favor of 1486713, which has a patch and covers additional issues.

dyoo mannequin added stdlib Python modules in the Lib dir labels Nov 1, 2004

devdanzin mannequin added easy type-feature A feature request or enhancement labels Feb 14, 2009

BreamoreBoy mannequin closed this as completed Sep 4, 2010

bitdancer reopened this Sep 5, 2010

bitdancer closed this as completed Dec 3, 2010

ezio-melotti transferred this issue from another repository Apr 9, 2022
