HTMLParser.locatestartagend regex too stringent #41113

Closed
dyoo mannequin opened this issue Nov 1, 2004 · 6 comments
Labels
easy stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@dyoo

dyoo mannequin commented Nov 1, 2004

BPO 1058305
Nosy @devdanzin, @bitdancer
Superseder
  • bpo-1486713: HTMLParser : A auto-tolerant parsing mode
Files
  • HTMLParser.py.diff: diff against Lib/HTMLParser.py from Python 2.3.3

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-12-03.03:00:05.751>
    created_at = <Date 2004-11-01.18:05:39.000>
    labels = ['easy', 'type-feature', 'library']
    title = 'HTMLParser.locatestartagend regex too stringent'
    updated_at = <Date 2010-12-03.03:00:05.749>
    user = 'https://bugs.python.org/dyoo'

    bugs.python.org fields:

    activity = <Date 2010-12-03.03:00:05.749>
    actor = 'r.david.murray'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-12-03.03:00:05.751>
    closer = 'r.david.murray'
    components = ['Library (Lib)']
    creation = <Date 2004-11-01.18:05:39.000>
    creator = 'dyoo'
    dependencies = []
    files = ['1469']
    hgrepos = []
    issue_num = 1058305
    keywords = ['patch', 'easy']
    message_count = 6.0
    messages = ['22976', '82104', '114390', '115604', '115623', '123168']
    nosy_count = 3.0
    nosy_names = ['dyoo', 'ajaksu2', 'r.david.murray']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '1486713'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1058305'
    versions = ['Python 3.2']

    @dyoo

    dyoo mannequin commented Nov 1, 2004

    In Python 2.3.3, the regex that HTMLParser uses to find the end
    of a start tag is too strict, so it does not handle slightly
    malformed HTML gracefully.

    The current definition of HTMLParser.locatestarttagend:

    locatestarttagend = re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*       # tag name
      (?:\s+                  # whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
         )
       )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)

    does not capture strings like:

    <IMG SRC = "abc.jpg"WIDTH=5>
    

    where there is no space between the closing quote and
    the next attribute name. Many sources of HTML, CNN.com in
    particular, are slightly malformed this way, so some leniency
    would be useful. We can relax the constraint by making the
    whitespace before an attribute name optional:

    locatestarttagend = re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*       # tag name
      (?:\s*          # optional whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
         )
       )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)

    which allows the parser to process more of the HTML out
    there.
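
    The difference between the two patterns can be checked directly with the re module. The following is only a minimal sketch: the names STRICT, RELAXED and match_end are made up for this illustration and are not part of HTMLParser, and the relaxed pattern is derived by swapping the single `\s+` for `\s*` as proposed above. It reports where each pattern stops matching, which is the position at which the parser would expect to see the closing `>`.

```python
import re

# Pattern from Python 2.3.3 as quoted above: whitespace is required
# (\s+) before every attribute name.
STRICT = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

# Proposed relaxation: the only change is \s+ -> \s* before attribute names.
# (Hypothetical helper construction, used here to avoid repeating the pattern.)
RELAXED = re.compile(STRICT.pattern.replace(r"(?:\s+", r"(?:\s*", 1), re.VERBOSE)


def match_end(pattern, text):
    """Return the index where the pattern stops matching, or -1 on no match."""
    m = pattern.match(text)
    return m.end() if m else -1


malformed = '<IMG SRC = "abc.jpg"WIDTH=5>'
well_formed = '<IMG SRC="abc.jpg" WIDTH=5>'

strict_end = match_end(STRICT, malformed)
relaxed_end = match_end(RELAXED, malformed)

# The strict pattern stops right after the closing quote, so the next
# character is 'W' rather than '>'; the relaxed pattern reaches the '>'.
print(strict_end, repr(malformed[strict_end]))
print(relaxed_end, repr(malformed[relaxed_end]))
```

    On well-formed input such as `<IMG SRC="abc.jpg" WIDTH=5>` both patterns stop at the same place, so the relaxation should only affect tags that the strict pattern currently cuts short.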

    See:

    http://mail.python.org/pipermail/tutor/2004-October/032835.html

    and:

    http://mail.python.org/pipermail/tutor/2004-October/032869.html

    for an explanation of what motivates this change.

    Thanks!

    @dyoo dyoo mannequin added stdlib Python modules in the Lib dir labels Nov 1, 2004
    @devdanzin

    devdanzin mannequin commented Feb 14, 2009

    The regex is still the same. This is one of many 'HTMLParser regex for
    attributes' issues.

    @devdanzin devdanzin mannequin added easy type-feature A feature request or enhancement labels Feb 14, 2009
    @BreamoreBoy

    BreamoreBoy mannequin commented Aug 19, 2010

    I'll close this in a couple of weeks unless anyone objects.

    @BreamoreBoy

    BreamoreBoy mannequin commented Sep 4, 2010

    No reply to msg114390.

    @BreamoreBoy BreamoreBoy mannequin closed this as completed Sep 4, 2010
    @bitdancer
    Member

    Closing this issue as out of date was inappropriate. It may be a duplicate, but someone with an interest should go through and evaluate all the related 'tolerant HTML parser' issues.

    bpo-1486713 could perhaps serve as a master issue for this set.

    @bitdancer bitdancer reopened this Sep 5, 2010
    @bitdancer
    Member

    Closing this in favor of 1486713, which has a patch and covers additional issues.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022