Issue921657
Created on 2004-03-23 10:17 by bernd_zedv, last changed 2004-10-13 10:16 by nnseva. This issue is now closed.
| Messages (4) | |||
|---|---|---|---|
| msg20293 - (view) | Author: Bernd Zimmermann (bernd_zedv) | Date: 2004-03-23 10:17 | |
When this (obviously correct) HTML is parsed: <a href=mailto:xyz@domain.com>xyz</a> this exception is raised: HTMLParseError: junk characters in start tag: '@domain.com>', at line 1, column 1. I work around this by adding '@' to the allowed character class: import re; import HTMLParser; HTMLParser.attrfind = re.compile( r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*' r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\) _#=~@]*))?'); myparser = HTMLParser.HTMLParser(); myparser.feed('<a ... ') |
|||
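The effect of the workaround can be seen with the regular expression alone, without the parser. The sketch below is only an illustration: `ATTRFIND_STRICT` is an assumption (the reported pattern with '@' removed, which may not be identical to the pattern the module shipped with at the time), and the helper name `unquoted_value` is hypothetical.

```python
import re

# Assumption: the pattern from the report with '@' removed from the
# unquoted-value character class, approximating the pre-workaround state.
ATTRFIND_STRICT = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\) _#=~]*))?')

# The pattern exactly as posted in the report, with '@' allowed.
ATTRFIND_RELAXED = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\) _#=~@]*))?')


def unquoted_value(pattern, text):
    """Return (captured value, unconsumed remainder) for one attribute."""
    match = pattern.match(text)
    return match.group(3), text[match.end():]


# The attribute portion of the start tag from the report.
attrs = ' href=mailto:xyz@domain.com>'

strict_value, strict_rest = unquoted_value(ATTRFIND_STRICT, attrs)
relaxed_value, relaxed_rest = unquoted_value(ATTRFIND_RELAXED, attrs)

print(strict_value, repr(strict_rest))
print(relaxed_value, repr(relaxed_rest))
```

With the strict pattern the match stops at '@', leaving the same leftover text that the error message reports as junk; with the relaxed pattern the whole address is captured and only the closing '>' remains.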
| msg20294 - (view) | Author: A.M. Kuchling (akuchling) | Date: 2004-04-19 13:01 | |
I don't believe this HTML is obviously correct. The section on attributes in the HTML 4.01 Recommendation (http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2) says: "In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them." The regex is already more liberal than this, allowing slashes and various other symbols, so we might as well add '@', but you should also consider adding quotation marks to the original attribute. |
|||
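The rule quoted from the HTML 4.01 Recommendation can be written as a small check. This is a sketch rather than part of any library: the names `may_be_unquoted` and `render_attr` are hypothetical, and the character class is transcribed from the passage quoted above.

```python
import re
from html import escape

# Characters the quoted passage permits in an unquoted attribute value:
# letters, digits, hyphens, periods, underscores, and colons.
HTML401_UNQUOTED = re.compile(r'[A-Za-z0-9._:-]+')


def may_be_unquoted(value):
    """True if the value consists only of characters HTML 4.01 allows unquoted."""
    return HTML401_UNQUOTED.fullmatch(value) is not None


def render_attr(name, value):
    """Always emit the value in double quotes, as the Recommendation advises."""
    return '%s="%s"' % (name, escape(value, quote=True))


print(may_be_unquoted('mailto:xyz@domain.com'))  # '@' is outside the allowed set
print(render_attr('href', 'mailto:xyz@domain.com'))
```

Quoting the value sidesteps the question entirely, since a quoted attribute value may contain '@' regardless of how liberal the parser's unquoted-value pattern is.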
| msg20295 - (view) | Author: A.M. Kuchling (akuchling) | Date: 2004-06-05 15:32 | |
Committed to the CVS HEAD; thanks! |
|||
| msg20296 - (view) | Author: Vsevolod Novikov (nnseva) | Date: 2004-10-13 10:16 | |
See request #1046092 to fix it. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2004-03-23 10:17:42 | bernd_zedv | create | |
