classification
Title: sgmllib.SGMLparser and hexadecimal numeric character refs
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 3.2
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, nerby
Priority: normal
Keywords: easy

Created on 2006-03-27 12:51 by nerby, last changed 2010-08-22 10:45 by BreamoreBoy. This issue is now closed.

Messages (3)
msg60894 - Author: Francesco Ricciardi (nerby) Date: 2006-03-27 12:51
According to the HTML 4.0 specification, numeric character
references may be hexadecimal as well as decimal (see
http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1).

However, sgmllib.SGMLParser does not recognize the
hexadecimal form.

More and more HTML pages now use characters with a high
code point that are not in the official HTML entity list, so
proper handling of these references should be implemented.

A possible solution could be:
- improving the "charref" regular expression so that it
also matches hexadecimal values;
- considering all numeric references valid: those with
n < 255 should be converted to the corresponding
characters, those above 255 should be left as numerical
charrefs. 
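
The two points above can be sketched roughly as follows. This is a standalone illustration, not a patch against sgmllib: the names CHARREF and convert_charrefs, the limit parameter, and the requirement of a closing semicolon are all assumptions made for the example.

```python
import re

# Hypothetical pattern (not sgmllib's own "charref"): matches a decimal
# reference such as &#233; or a hexadecimal one such as &#xE9; / &#XE9;.
# Assumption: the reference is terminated by a semicolon.
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def convert_charrefs(text, limit=255):
    """Replace numeric character references with n < limit by the
    corresponding character; leave all other references untouched."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            n = int(hex_digits, 16)
        else:
            n = int(dec_digits)
        if n < limit:
            return chr(n)
        # High code points are kept as numeric charrefs, as proposed above.
        return match.group(0)

    return CHARREF.sub(_replace, text)


if __name__ == "__main__":
    print(convert_charrefs("caf&#xE9; costs 3 &#x20AC;"))
```

With these assumptions, both &#233; and &#xE9; become the same character, while a reference such as &#x20AC; passes through unchanged for a later stage to handle.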
msg109853 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-10 11:21
sgmllib has been removed from py3k.
msg114670 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-22 10:45
sgmllib has been deprecated since 2.6 and has been removed from py3k.
History
Date | User | Action | Args
2010-08-22 10:45:52 | BreamoreBoy | set | status: open -> closed; resolution: out of date; messages: + msg114670; versions: + Python 3.2, - Python 2.7
2010-07-10 11:21:22 | BreamoreBoy | set | nosy: + BreamoreBoy; messages: + msg109853; versions: - Python 3.1
2009-04-22 12:45:50 | ajaksu2 | set | keywords: + easy
2009-03-21 02:02:53 | ajaksu2 | set | stage: test needed; type: enhancement; versions: + Python 3.1, Python 2.7, - Python 2.4
2006-03-27 12:51:59 | nerby | create |