Title: sgmllib.SGMLParser and hexadecimal numeric character refs
Type: enhancement Stage: test needed
Components: Library (Lib) Versions: Python 3.2
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, nerby
Priority: normal Keywords: easy

Created on 2006-03-27 12:51 by nerby, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (3)
msg60894 - (view) Author: Francesco Ricciardi (nerby) Date: 2006-03-27 12:51
According to the HTML 4.0 specification, it is possible to
have hexadecimal numeric character references, not only
decimal (see

However, sgmllib.SGMLParser does not recognize the
hexadecimal form.

More and more HTML pages now use entities with a high
codepoint, not in the official HTML entity list, so
proper handling of these references should be implemented.

A possible solution could be:
- improving the "charref" regular expression so as to
include hexadecimal values;
- considering all numeric references valid: those with
n < 255 should be converted to the corresponding
characters, those above 255 should be left as numerical
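
The two points above could be sketched roughly as follows. This is a standalone Python 3 illustration, not a patch against sgmllib: the names CHARREF, replace_charrefs and max_codepoint are hypothetical, and treating 255 itself as convertible is an assumption, since the description leaves that boundary open.

```python
import re

# Hypothetical pattern (not sgmllib's own "charref" expression):
# matches decimal (&#233;) and hexadecimal (&#xE9; or &#XE9;) references.
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def replace_charrefs(text, max_codepoint=255):
    """Convert numeric character references up to max_codepoint.

    References above max_codepoint are left in the text unchanged.
    """
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            n = int(hex_digits, 16)
        else:
            n = int(dec_digits)
        if n <= max_codepoint:
            return chr(n)
        # Assumption: high code points stay as the original reference text.
        return match.group(0)

    return CHARREF.sub(_sub, text)
```

With this sketch, replace_charrefs("caf&#xE9;") and replace_charrefs("caf&#233;") both give "café", while "&#x20AC;" is returned unchanged.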
msg109853 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-10 11:21
sgmllib has been removed from py3k.
msg114670 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-22 10:45
sgmllib has been deprecated since 2.6 and has been removed from py3k.
