HTMLParser ParseError in start tag #40065

berndzedv · 2004-03-23T10:17:42Z

BPO	921657
Nosy	@akuchling

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/akuchling'
closed_at = <Date 2004-10-13.10:16:26.000>
created_at = <Date 2004-03-23.10:17:42.000>
labels = ['library']
title = 'HTMLParser ParseError in start tag'
updated_at = <Date 2004-10-13.10:16:26.000>
user = 'https://bugs.python.org/berndzedv'

bugs.python.org fields:

activity = <Date 2004-10-13.10:16:26.000>
actor = 'nnseva'
assignee = 'akuchling'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2004-03-23.10:17:42.000>
creator = 'bernd_zedv'
dependencies = []
files = []
hgrepos = []
issue_num = 921657
keywords = []
message_count = 4.0
messages = ['20293', '20294', '20295', '20296']
nosy_count = 3.0
nosy_names = ['akuchling', 'bernd_zedv', 'nnseva']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue921657'
versions = ['Python 2.3']

berndzedv · 2004-03-23T10:17:42Z

When this (obviously correct) HTML is parsed:

this exception is raised:
HTMLParseError: junk characters in start tag: '@domain.com>', at line 1, column 1

I work around this by adding '@' to the allowed character class:

import re
import HTMLParser

# Override the module-level attribute regex so that unquoted values may contain '@'.
HTMLParser.attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

myparser = HTMLParser.HTMLParser()
myparser.feed('<a ... ')
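
To illustrate what the extra '@' changes, here is a minimal sketch that runs both versions of the pattern against a made-up attribute. It assumes the stock pattern is the one above without '@'. The sample string and the helper `match_attr` are hypothetical, not taken from the report, and only the `re` module is used so the sketch does not depend on a particular Python version.

```python
import re

# Shared prefix of the pattern: attribute name and the '=' sign.
NAME = r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'

# Assumption: the stock pattern equals the workaround pattern minus '@'.
attrfind_old = re.compile(
    NAME + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~]*))?')
attrfind_new = re.compile(
    NAME + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical attribute text; the HTML from the report is not reproduced.
sample = ' href=mailto:user@domain.com>'


def match_attr(pattern, text):
    """Return (attribute value, text left over after the match)."""
    m = pattern.match(text)
    return m.group(3), text[m.end():]


old_value, old_rest = match_attr(attrfind_old, sample)
new_value, new_rest = match_attr(attrfind_new, sample)

print(old_value, repr(old_rest))
print(new_value, repr(new_rest))
```

With the assumed stock pattern the match stops in front of '@', so the parser is left with text it cannot interpret as another attribute. With '@' in the class the whole value is consumed up to the closing '>'.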

akuchling · 2004-04-19T13:01:09Z

I don't believe this HTML is obviously correct.
The section on attributes in the HTML 4.01 Recommendation
(http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2) says:

In certain cases, authors may specify the value of an
attribute without any quotation marks. The attribute value
may only contain letters (a-z and A-Z), digits (0-9),
hyphens (ASCII decimal 45), periods (ASCII decimal 46),
underscores (ASCII decimal 95), and colons (ASCII decimal
58). We recommend using quotation marks even when it is
possible to eliminate them.

The regex is already more liberal than this, allowing slashes
and various other symbols, so we might as well add '@', but
you should also consider adding quotation marks to the
original attribute.
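
The rule quoted from the Recommendation can be sketched as a small check that decides whether a value may stay unquoted. The names `may_be_unquoted` and `render_attribute` are hypothetical helpers for illustration only, and the escaping shown is a simplification rather than a complete HTML serializer.

```python
import re

# Characters HTML 4.01 permits in an unquoted attribute value:
# letters, digits, hyphens, periods, underscores, and colons.
_UNQUOTED_OK = re.compile(r'[A-Za-z0-9._:-]+')


def may_be_unquoted(value):
    """True if the value consists only of the characters listed above."""
    return _UNQUOTED_OK.fullmatch(value) is not None


def render_attribute(name, value):
    """Emit name=value, adding double quotes whenever the rule requires them."""
    if may_be_unquoted(value):
        return '%s=%s' % (name, value)
    escaped = value.replace('&', '&amp;').replace('"', '&quot;')
    return '%s="%s"' % (name, escaped)


print(render_attribute('align', 'left'))
print(render_attribute('href', 'mailto:user@domain.com'))
```

A value containing '@' fails the check and is therefore quoted, which is the form the parser accepts regardless of the character class used for unquoted values.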

akuchling · 2004-06-05T15:32:16Z

Committed to the CVS HEAD; thanks!

nnseva · 2004-10-13T10:16:26Z

see request bpo-1046092 to fix it

berndzedv mannequin closed this as completed Mar 23, 2004

berndzedv mannequin assigned akuchling Mar 23, 2004

berndzedv mannequin added the stdlib Python modules in the Lib dir label Mar 23, 2004

ezio-melotti transferred this issue from another repository Apr 9, 2022
