HTMLParser fix to accept malformed tag attributes #41013

Closed
nnseva mannequin opened this issue Oct 13, 2004 · 11 comments
Labels
easy, stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@nnseva
Mannequin

nnseva mannequin commented Oct 13, 2004

BPO 1046092
Nosy @devdanzin, @ezio-melotti, @bitdancer
Superseder
  • bpo-1486713: HTMLParser : A auto-tolerant parsing mode
Files
  • HTMLParser.py.patch: This is a patch
  • html.parser.diff: patch to limit nonstrict-regexp from eating too much
  • test-htmlparser-attrs.py: test with unquoted attributes
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-12-03.04:17:17.428>
    created_at = <Date 2004-10-13.10:11:24.000>
    labels = ['easy', 'type-feature', 'library']
    title = 'HTMLParser fix to accept malformed tag attributes'
    updated_at = <Date 2011-05-10.14:06:07.334>
    user = 'https://bugs.python.org/nnseva'

    bugs.python.org fields:

    activity = <Date 2011-05-10.14:06:07.334>
    actor = 'ezio.melotti'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-12-03.04:17:17.428>
    closer = 'r.david.murray'
    components = ['Library (Lib)']
    creation = <Date 2004-10-13.10:11:24.000>
    creator = 'nnseva'
    dependencies = []
    files = ['1423', '21891', '21892']
    hgrepos = []
    issue_num = 1046092
    keywords = ['patch', 'easy']
    message_count = 11.0
    messages = ['22675', '22676', '22677', '22678', '81692', '114333', '121677', '123176', '135179', '135180', '135701']
    nosy_count = 7.0
    nosy_names = ['jlgijsbers', 'nnseva', 'ajaksu2', 'ezio.melotti', 'Neil Muller', 'r.david.murray', 'svilend']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = 'resolved'
    status = 'closed'
    superseder = '1486713'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1046092'
    versions = ['Python 3.2']

    @nnseva
    Mannequin Author

    nnseva mannequin commented Oct 13, 2004

    This is a patch to fix bugs bpo-975556 and bpo-921657.

    I think it should be applied simply because the parser
    should accept as many pages as it can. On the other
    hand, the code near the fix already contains a regexp that
    accepts malformed attributes in other cases: compare the values
    of the attrfind and locatestarttagend variables.
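
To make the comparison concrete, here is a minimal sketch of how a strict attribute pattern and a more tolerant one can disagree on the same start tag. The two patterns and the `parse_attrs` helper are simplified, hypothetical stand-ins written for illustration; they are not the actual `attrfind` and `locatestarttagend` values from HTMLParser.py.

```python
import re

# Hypothetical, simplified patterns (NOT the real HTMLParser.py regexps).
# Strict: an unquoted value may only use a small set of characters.
ATTR_STRICT = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z0-9_]*)'                      # attribute name
    r'(?:\s*=\s*(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:_]+))?'  # optional value
)
# Tolerant: an unquoted value is any run of characters up to whitespace or '>'.
ATTR_TOLERANT = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z0-9_]*)'
    r'(?:\s*=\s*(\'[^\']*\'|"[^"]*"|[^\s>]+))?'
)


def parse_attrs(pattern, text):
    """Consume attributes from text; return (attrs, unparsed remainder)."""
    pos = 0
    attrs = []
    while pos < len(text):
        m = pattern.match(text, pos)
        if not m:
            break  # the pattern cannot continue from here
        attrs.append((m.group(1).lower(), m.group(2)))
        pos = m.end()
    return attrs, text[pos:]


sample = ' width=100% height=50'
print(parse_attrs(ATTR_STRICT, sample))    # stops at the '%'
print(parse_attrs(ATTR_TOLERANT, sample))  # reads both attributes
```

In this sketch the strict pattern gives up at `%` and leaves the rest of the tag unparsed, while the tolerant pattern accepts `100%` as a value. That is the kind of mismatch the comment above points at.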

    @nnseva nnseva mannequin added stdlib Python modules in the Lib dir labels Oct 13, 2004
    @jlgijsbers
    Mannequin

    jlgijsbers mannequin commented Oct 13, 2004

    Logged In: YES
    user_id=469548

    There's no uploaded file! You have to check the
    checkbox labeled "Check to Upload & Attach File"
    when you upload a file.

    Please try again.

    (This is a SourceForge annoyance that we can do
    nothing about. :-( )

    @nnseva
    Mannequin Author

    nnseva mannequin commented Oct 15, 2004

    Logged In: YES
    user_id=325678

    There's no uploaded file! You have to check the
    checkbox labeled "Check to Upload & Attach File"
    when you upload a file.

    Please try again.

    (This is a SourceForge annoyance that we can do
    nothing about. :-( )

    @nnseva
    Mannequin Author

    nnseva mannequin commented Oct 15, 2004

    Logged In: YES
    user_id=325678

    Missed patch, sorry ...

    @devdanzin devdanzin mannequin added type-feature A feature request or enhancement labels Feb 9, 2009
    @devdanzin
    Mannequin

    devdanzin mannequin commented Feb 11, 2009

    Heh, the patch applies cleanly to trunk more than four years later and
    tests pass fine. We'll surely need better tests if the behavior change
    is considered an improvement.

    @devdanzin devdanzin mannequin added easy labels Apr 22, 2009
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Aug 19, 2010

    The patch is a one line change to a compiled regex. Would someone with html and/or regex knowledge like to comment, thanks, as I've no idea as to the implications. I also agree with comments in msg81692 regarding better unit tests. Please don't ask me! :)

    @NeilMuller
    Mannequin

    NeilMuller mannequin commented Nov 20, 2010

    I think this change makes the parser far too lenient. Something like the explicit tolerant mode proposed in bpo-1486713 is a better solution.

    @bitdancer
    Member

    Included this in the 'strict=False' mode in the bpo-1486713 patch.

    @bitdancer bitdancer changed the title HTMLParser fix to accept mailformed tag attributes HTMLParser fix to accept malformed tag attributes Dec 3, 2010
    @svilend
    Mannequin

    svilend mannequin commented May 5, 2011

    This seems to eat too much into the data and gets past the endpos of the chunk being processed; the parser then gets confused and treats everything after it as data. I didn't work out how to fix the regexp as such, but instead limited its span to :endpos so it does not eat too much.
    This seems to happen with unquoted attributes.
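
A minimal sketch of the effect described here, under stated assumptions: `ATTR_GREEDY` and `scan_attrs` are hypothetical names, and the pattern is deliberately over-lenient rather than a copy of the one in the attached patch. It relies only on the documented `pos` and `endpos` arguments of a compiled pattern's `match()` method to cap how far the regexp may read.

```python
import re

# Hypothetical over-lenient pattern: an unquoted value is any run of
# non-whitespace characters, so it can run straight through '>'.
ATTR_GREEDY = re.compile(r'\s*([a-zA-Z_][-.:a-zA-Z0-9_]*)(?:\s*=\s*(\S+))?')


def scan_attrs(rawdata, start, endpos, limit):
    """Collect attributes between start and endpos; return (attrs, final pos).

    With limit=True the regexp is not allowed to look past endpos.
    """
    attrs = []
    pos = start
    while pos < endpos:
        if limit:
            m = ATTR_GREEDY.match(rawdata, pos, endpos)
        else:
            m = ATTR_GREEDY.match(rawdata, pos)
        if not m:
            break
        attrs.append((m.group(1), m.group(2)))
        pos = m.end()
    return attrs, pos


rawdata = '<a href=foo>bar</a> tail'
start = 2                      # just after '<a'
endpos = rawdata.index('>')    # position of the '>' that ends the start tag

print(scan_attrs(rawdata, start, endpos, limit=False))  # value swallows 'bar</a>'
print(scan_attrs(rawdata, start, endpos, limit=True))   # value is just 'foo'
```

Without the limit, the match ends well past the closing `>` of the tag, so a caller that trusts the returned position would skip the element content. Capping the match at `endpos` sidesteps that without changing the regexp itself.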

    @ezio-melotti
    Member

    This issue is closed, so it's better if you create a new issue.
    Even better if you can attach a patch that adds a testcase to Lib/test/test_htmlparser.py
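
A test case along these lines might look like the sketch below. It is only an illustration: `StartTagCollector`, `collect_starttags` and `UnquotedAttributeTest` are hypothetical names, and the code targets the `html.parser` module of current Python 3 rather than the 2011 source tree or the layout of Lib/test/test_htmlparser.py.

```python
import unittest
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.starttags = []

    def handle_starttag(self, tag, attrs):
        self.starttags.append((tag, attrs))


def collect_starttags(source):
    """Feed source to a fresh parser and return the start tags it saw."""
    parser = StartTagCollector()
    parser.feed(source)
    parser.close()
    return parser.starttags


class UnquotedAttributeTest(unittest.TestCase):
    def test_markup_after_unquoted_attribute(self):
        tags = collect_starttags('<a href=foo>bar</a><b>baz</b>')
        # The unquoted value must stop at '>' ...
        self.assertEqual(tags[0], ('a', [('href', 'foo')]))
        # ... and markup after it must still be parsed as tags, not data.
        self.assertEqual([tag for tag, _ in tags], ['a', 'b'])
```

Saved in a module, this can be run with `python -m unittest`. The second assertion covers the reported symptom, where everything after the unquoted attribute was treated as data.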

    @ezio-melotti
    Member

    For the record, the new issue is bpo-12008.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022