
Author nerby
Date 2006-03-27.12:51:59
Content
According to the HTML 4.0 specification, numeric character
references may be written in hexadecimal as well as in
decimal (see
http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1).

However, sgmllib.SGMLParser does not recognize the
hexadecimal form.

More and more HTML pages now use entities with a high
codepoint, not in the official HTML entity list, so
proper handling of these references should be implemented.

A possible solution could be:
- improving the "charref" regular expression so that it
also matches hexadecimal values;
- considering all numeric references valid: those with
n <= 255 should be converted to the corresponding
characters, those above 255 should be left as numerical
charrefs.
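
The two points above could be sketched roughly as follows. This is only an illustration of the proposal, not a patch against sgmllib: the names CHARREF, convert_charref and replace_charrefs are hypothetical, the pattern assumes references are terminated by ";", and treating 255 itself as convertible is an assumption.

```python
import re

# Hypothetical pattern (not sgmllib's own): matches decimal references
# such as &#233; and hexadecimal ones such as &#xE9; or &#XE9;.
CHARREF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def convert_charref(match):
    """Return the character for small code points, else the reference unchanged."""
    hex_digits, dec_digits = match.groups()
    if hex_digits is not None:
        n = int(hex_digits, 16)
    else:
        n = int(dec_digits)
    if n <= 255:
        # Assumption: values up to and including 255 are converted.
        return chr(n)
    # Higher code points are left as numeric character references.
    return match.group(0)


def replace_charrefs(text):
    """Apply convert_charref to every numeric reference found in text."""
    return CHARREF.sub(convert_charref, text)
```

With this sketch, replace_charrefs("caf&#xE9;") and replace_charrefs("caf&#233;") both give "café", while "&#x20AC;" and "&#8364;" pass through untouched.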
History
Date                 User   Action  Args
2008-01-20 09:58:32  admin  link    issue1459279 messages
2008-01-20 09:58:32  admin  create