classification
Title: sgmllib doesn't support hex or Unicode character references
Type: feature request Stage: test needed
Components: Library (Lib) Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: aaronsw, fdrake, loewis (3)
Priority: low Keywords: easy

Created on 2003-09-09 20:53 by aaronsw, last changed 2009-04-22 17:21 by ajaksu2.

Messages (5)
msg60380 - (view) Author: Aaron Swartz (aaronsw) Date: 2003-09-09 20:53
sgmllib doesn't support hexadecimal character references, nor
Unicode characters, both of which are commonly seen on web pages.
The following replacements fix both problems.

charref = re.compile('&#([0-9a-fA-F]+)[^0-9a-fA-F]')

	def handle_charref(self, ref):
		try:
			if ref[0] == 'x' or ref[0] == 'X':
				m = int(ref[1:], 16)
			else:
				m = int(ref)
			self.handle_data(unichr(m).encode('utf-8'))
		except ValueError:
			self.unknown_charref(ref)
msg60381 - (view) Author: Aaron Swartz (aaronsw) Date: 2003-09-09 21:00
Oops, that should be: 

charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')
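For illustration, here is a minimal standalone sketch of how the corrected pattern and the proposed decoding logic fit together. It is written for Python 3 (so `chr` stands in for `unichr`), and the helper names `decode_charref` and `first_charref` are hypothetical, not part of sgmllib.

```python
import re

# Corrected pattern from the message above: an optional leading x/X, then
# hex digits, followed by one terminating character that is not a hex digit.
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')


def decode_charref(ref):
    """Return the character for a reference body such as '65' or 'x41'.

    Hypothetical helper: returns None where the proposed handle_charref
    would fall back to unknown_charref.
    """
    try:
        if ref[0] in 'xX':
            code = int(ref[1:], 16)   # hexadecimal form, e.g. 'x41'
        else:
            code = int(ref)           # decimal form, e.g. '65'
        return chr(code)
    except (ValueError, OverflowError):
        # Not a number (e.g. 'ab' or a bare 'x') or outside the Unicode range.
        return None


def first_charref(text):
    """Decode the first character reference found in text, if any."""
    m = charref.search(text)
    return decode_charref(m.group(1)) if m else None
```

Note that the pattern also accepts bodies such as `ab` (hex digits without the `x` prefix); in this sketch those fail `int()` and yield `None`, mirroring the `ValueError` branch of the proposal.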
msg60382 - (view) Author: Martin v. Löwis (loewis) Date: 2003-09-10 16:58
Are you sure hexadecimal character references are part of
the SGML standard?
msg60383 - (view) Author: Aaron Swartz (aaronsw) Date: 2003-09-10 22:42
I don't have the money to shell out for the XML spec, but according to
http://developers.omnimark.com/documentation/concept/764.htm they were
added in SGML TC 2.
msg63530 - (view) Author: Fred L. Drake, Jr. (fdrake) Date: 2008-03-14 16:30
SGML TC 2 can be found here:
http://www1.y12.doe.gov/capabilities/sgml/wg8/document/1955.htm

See section K.4.1 for hexadecimal character references.

Since this is really an update to the SGML standard, and not part of the
original, any support for this should be an optional feature.  It's
really only interesting on the web, where standards compliance is... a
little on the lax side.  It would be reasonable to enable this by
default from htmllib (if not already supported in htmllib; I don't
remember).

I'm fairly sure hex character references are already supported in
HTMLParser.
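As a rough check of that last point, the sketch below uses Python 3's `html.parser.HTMLParser` (the successor of the Python 2 `HTMLParser` module, so this is an assumption about the modern library rather than the 2.x code under discussion). The `CharrefCollector` class is hypothetical; `convert_charrefs=False` is passed so that `handle_charref` is actually invoked.

```python
from html.parser import HTMLParser


class CharrefCollector(HTMLParser):
    """Hypothetical subclass that records the raw body of each numeric reference."""

    def __init__(self):
        # With convert_charrefs=True (the default), references are folded
        # into handle_data and handle_charref is never called.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        # name is the text between '&#' and the terminator, e.g. '65' or 'x41'.
        self.refs.append(name)


parser = CharrefCollector()
parser.feed('<p>&#65; &#x41;</p>')
parser.close()
print(parser.refs)
```

Both the decimal and the hexadecimal form reach `handle_charref`; the hexadecimal body arrives with its leading `x`, so the handler still has to choose the base itself.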
History
Date                 User     Action  Args
2009-04-22 17:21:11  ajaksu2  set     keywords: + easy
2009-02-13 03:41:24  ajaksu2  set     priority: normal -> low; stage: test needed; type: feature request; versions: + Python 2.7, - Python 2.3
2008-03-14 16:30:02  fdrake   set     nosy: + fdrake; messages: + msg63530
2003-09-09 20:53:13  aaronsw  create