Message 60894 - Python tracker


Author	nerby
Date	2006-03-27.12:51:59

Content
According to the HTML 4.0 specification, numeric character
references may be hexadecimal as well as decimal (see
http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1).

However, sgmllib.SGMLParser does not recognize the
hexadecimal form.

More and more HTML pages now use numeric references to
high code points that have no name in the official HTML
entity list, so proper handling of these references
should be implemented.

A possible solution could be:
- improving the "charref" regular expression so that it
also matches hexadecimal values;
- considering all numeric references valid: those with
n < 255 should be converted to the corresponding
characters, those above 255 should be left as numeric
charrefs.
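
The two points above could be sketched roughly as follows. This is a standalone illustration, not a patch against sgmllib: the names CHARREF and convert_charrefs and the limit parameter are hypothetical, and since the proposal does not say what happens to n = 255 itself, the sketch assumes it is left untouched.

```python
import re

# Hypothetical pattern (not sgmllib's own "charref" regex): matches both
# decimal (&#233;) and hexadecimal (&#xE9;) numeric character references.
CHARREF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def convert_charrefs(text, limit=255):
    """Replace numeric references below `limit` with the character itself.

    References at or above `limit` are kept verbatim, as the proposal
    suggests. Treating 255 itself as "kept" is an assumption.
    """
    def replace(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            n = int(hex_digits, 16)
        else:
            n = int(dec_digits)
        if n < limit:
            return chr(n)
        # Leave high code points as numeric character references.
        return match.group(0)

    return CHARREF.sub(replace, text)


if __name__ == "__main__":
    print(convert_charrefs("caf&#xE9; costs 3 &#x20AC;"))
```

Both forms of a low code point are converted the same way, while a reference such as &#x20AC; passes through unchanged for a later stage to handle.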

History
Date	User	Action	Args
2008-01-20 09:58:32	admin	link	issue1459279 messages
2008-01-20 09:58:32	admin	create