Title: sgmllib doesn't support hex or Unicode character references
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 2.7
Status: closed
Resolution: rejected
Assigned To:
Nosy List: aaronsw, fdrake, loewis, meatballhat
Priority: low
Keywords: easy

Created on 2003-09-09 20:53 by aaronsw, last changed 2010-08-02 01:13 by fdrake. This issue is now closed.

Messages (7)
msg60380 - Author: Aaron Swartz (aaronsw) Date: 2003-09-09 20:53
sgmllib doesn't support hexadecimal character references or
Unicode characters, both of which are commonly seen on web pages.
The following replacements fix both problems.

charref = re.compile('&#([0-9a-fA-F]+)[^0-9a-fA-F]')

    def handle_charref(self, ref):
        try:
            if ref[0] == 'x' or ref[0] == 'X': m = int(ref[1:], 16)
            else: m = int(ref)
        except ValueError:
msg60381 - Author: Aaron Swartz (aaronsw) Date: 2003-09-09 21:00
Oops, that should be: 

charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')
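
To illustrate what the corrected pattern accepts, here is a minimal standalone sketch. It uses only the `re` module rather than sgmllib itself, and the helper names `charref_to_codepoint` and `decode_first_charref` are hypothetical, introduced only for this example; they are not part of sgmllib or of the proposed patch.

```python
import re

# Corrected pattern from the message above: the reference body may start
# with x/X or a hex digit, continues with hex digits, and must be followed
# by one terminating character that is not a hex digit.
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')


def charref_to_codepoint(ref):
    """Return the integer code point for a reference body like '65' or 'x41'.

    Hypothetical helper that mirrors the decimal/hex branch in the
    proposed handle_charref.
    """
    if ref[0] in 'xX':
        return int(ref[1:], 16)
    return int(ref)


def decode_first_charref(text):
    """Return the character for the first reference in text, or None."""
    match = charref.search(text)
    if match is None:
        return None
    try:
        return chr(charref_to_codepoint(match.group(1)))
    except (ValueError, OverflowError):
        # Malformed body (for example '&#x;' or '&#1a;') or a value
        # outside the Unicode range.
        return None


if __name__ == '__main__':
    for sample in ('&#65;', '&#x41;', '&#X2014;', '&#x;'):
        print(sample, repr(decode_first_charref(sample)))
```

Note that the pattern requires a terminating character, so a reference at the very end of the input with no trailing `;` is not matched. It also accepts bodies such as `1a` that are neither valid decimal nor prefixed hex, which is why the conversion still has to handle `ValueError`.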
msg60382 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-09-10 16:58
Are you sure hexadecimal character references are part of
the SGML standard?
msg60383 - Author: Aaron Swartz (aaronsw) Date: 2003-09-10 22:42
I don't have the money to shell out for the XML spec, but according to http:// they were 
added in SGML TC 2.
msg63530 - Author: Fred Drake (fdrake) (Python committer) Date: 2008-03-14 16:30
SGML TC 2 can be found here:

See section K.4.1 for hexadecimal character references.

Since this is really an update to the SGML standard, and not part of the
original, any support for this should be an optional feature.  It's
really only interesting on the web, where standards compliance is... a
little on the lax side.  It would be reasonable to enable this by
default from htmllib (if not already supported in htmllib; I don't

I'm fairly sure hex character references are already supported in
msg112262 - Author: Dan Buch (meatballhat) Date: 2010-08-01 03:45
gads ... didn't mean to submit a title change there

Since this is removed from Python 3, should the status be changed to Rejected?
msg112414 - Author: Fred Drake (fdrake) (Python committer) Date: 2010-08-02 01:13
Rejected since this didn't make it into Python 2.7.
