HTMLParser ParseError in start tag #40065

Closed
berndzedv mannequin opened this issue Mar 23, 2004 · 4 comments

@berndzedv
Mannequin

berndzedv mannequin commented Mar 23, 2004

BPO 921657
Nosy @akuchling

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/akuchling'
closed_at = <Date 2004-10-13.10:16:26.000>
created_at = <Date 2004-03-23.10:17:42.000>
labels = ['library']
title = 'HTMLParser ParseError in start tag'
updated_at = <Date 2004-10-13.10:16:26.000>
user = 'https://bugs.python.org/berndzedv'

bugs.python.org fields:

activity = <Date 2004-10-13.10:16:26.000>
actor = 'nnseva'
assignee = 'akuchling'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2004-03-23.10:17:42.000>
creator = 'bernd_zedv'
dependencies = []
files = []
hgrepos = []
issue_num = 921657
keywords = []
message_count = 4.0
messages = ['20293', '20294', '20295', '20296']
nosy_count = 3.0
nosy_names = ['akuchling', 'bernd_zedv', 'nnseva']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue921657'
versions = ['Python 2.3']

@berndzedv
Mannequin Author

berndzedv mannequin commented Mar 23, 2004

When this (obviously correct) HTML is parsed:

<a href=mailto:xyz@domain.com>xyz</a>

this exception is raised:

HTMLParseError: junk characters in start tag: '@domain.com>', at line 1, column 1

I work around this by adding '@' to the
allowed character class:

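import re
import HTMLParser

# Replace the module-level attribute pattern so that '@' is accepted
# in unquoted attribute values.
HTMLParser.attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

myparser = HTMLParser.HTMLParser()
myparser.feed('<a ... ')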
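The effect of the workaround above can be checked with `re` alone. The following is a minimal sketch, not the module's actual source: it assumes the stock pattern is identical to the patched one except for the missing '@', and the names `make_attrfind`, `BASE`, and `tag_rest` are illustrative.

```python
import re

# Attribute pattern from the workaround, parameterised on the characters
# allowed in an unquoted value. ASSUMPTION: the stock pattern differs
# from the patched one only by lacking '@'.
def make_attrfind(unquoted_chars):
    return re.compile(
        r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[' + unquoted_chars + r']*))?')

BASE = r'-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~'
attrfind_old = make_attrfind(BASE)
attrfind_new = make_attrfind(BASE + '@')

# The part of the start tag that follows the tag name.
tag_rest = ' href=mailto:xyz@domain.com>'

old_match = attrfind_old.match(tag_rest)
new_match = attrfind_new.match(tag_rest)

# Whatever follows the match is left over for the parser to deal with.
old_leftover = tag_rest[old_match.end():]
new_leftover = tag_rest[new_match.end():]

print(repr(old_match.group(3)), repr(old_leftover))
print(repr(new_match.group(3)), repr(new_leftover))
```

Without '@' the value stops at `mailto:xyz`, and the leftover text is the same `'@domain.com>'` that appears in the error message. With '@' only the closing `>` remains.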

@berndzedv berndzedv mannequin closed this as completed Mar 23, 2004
@berndzedv berndzedv mannequin assigned akuchling Mar 23, 2004
@berndzedv berndzedv mannequin added the stdlib Python modules in the Lib dir label Mar 23, 2004
@akuchling
Member

I don't believe this HTML is obviously correct.
The section on attributes in the HTML 4.01 Recommendation
(http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2) says:

In certain cases, authors may specify the value of an
attribute without any quotation marks. The attribute value
may only contain letters (a-z and A-Z), digits (0-9),
hyphens (ASCII decimal 45), periods (ASCII decimal 46),
underscores (ASCII decimal 95), and colons (ASCII decimal
58). We recommend using quotation marks even when it is
possible to eliminate them.

The regex is already more liberal than this, allowing slashes
and various other symbols, so we might as well add '@', but
you should also consider adding quotation marks to the
original attribute.
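The rule quoted above can be turned into a quick check. This is a sketch under stated assumptions: `may_omit_quotes` and `render_attr` are hypothetical helpers, not part of any library, and the character set is taken only from the passage quoted from the Recommendation.

```python
import re

# Characters the quoted HTML 4.01 passage allows in an unquoted value:
# letters, digits, hyphens, periods, underscores, and colons.
HTML401_UNQUOTED = re.compile(r'[A-Za-z0-9\-._:]+')

def may_omit_quotes(value):
    """Return True if value uses only the characters listed above."""
    return HTML401_UNQUOTED.fullmatch(value) is not None

def render_attr(name, value):
    """Hypothetical helper: always emit the value in double quotes."""
    escaped = value.replace('&', '&amp;').replace('"', '&quot;')
    return '%s="%s"' % (name, escaped)

print(may_omit_quotes('mailto:xyz'))
print(may_omit_quotes('mailto:xyz@domain.com'))
print(render_attr('href', 'mailto:xyz@domain.com'))
```

Always quoting, as `render_attr` does, sidesteps the question of which characters a given parser tolerates in unquoted values.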

@akuchling
Member

Committed to the CVS HEAD; thanks!
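
For readers on a current interpreter, the reported input can be checked directly. This is a minimal sketch assuming Python 3, where the module is named `html.parser` (the report itself concerns the Python 2.3 `HTMLParser` module); `AttrCollector` is an illustrative subclass, not part of the library.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Illustrative subclass that records (tag, attrs) for each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))

collector = AttrCollector()
collector.feed('<a href=mailto:xyz@domain.com>xyz</a>')
collector.close()

# Each entry is (tag name, list of (attribute, value) pairs).
print(collector.seen)
```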

@nnseva
Mannequin

nnseva mannequin commented Oct 13, 2004

See request bpo-1046092 to fix it.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022