classification
Title: sgmllib doesn't support hex or Unicode character references
Type: enhancement Stage: test needed
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: aaronsw, fdrake, loewis, meatballhat
Priority: low Keywords: easy

Created on 2003-09-09 20:53 by aaronsw, last changed 2010-08-02 01:13 by fdrake. This issue is now closed.

Messages (7)
msg60380 - (view) Author: Aaron Swartz (aaronsw) Date: 2003-09-09 20:53
sgmllib supports neither hexadecimal character references nor
Unicode characters, both of which are commonly seen on web pages.
The following replacements fix both problems.

charref = re.compile('&#([0-9a-fA-F]+)[^0-9a-fA-F]')

    def handle_charref(self, ref):
        try:
            if ref[0] == 'x' or ref[0] == 'X':
                m = int(ref[1:], 16)
            else:
                m = int(ref)
            self.handle_data(unichr(m).encode('utf-8'))
        except ValueError:
            self.unknown_charref(ref)
msg60381 - (view) Author: Aaron Swartz (aaronsw) Date: 2003-09-09 21:00
Oops, that should be: 

charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')
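For readers who want to try the corrected pattern outside sgmllib, here is a minimal standalone sketch. It is written for Python 3 (so `chr` stands in for `unichr`), and the helper names `first_charref` and `decode_charref` are hypothetical, not part of sgmllib. It only shows what the pattern captures and how the captured text could be turned into a character.

```python
import re

# Corrected pattern from the message above: an optional leading x/X,
# then hex digits, then one terminating non-hex character (usually ';').
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')


def first_charref(text):
    """Return the body of the first character reference in text, or None."""
    match = charref.search(text)
    return match.group(1) if match else None


def decode_charref(ref):
    """Turn a reference body such as '65' or 'x41' into a character.

    Returns None when the body is not a valid number or code point,
    which is the case the proposed patch hands to unknown_charref().
    """
    try:
        if ref[0] in 'xX':
            return chr(int(ref[1:], 16))
        return chr(int(ref))
    except (ValueError, OverflowError):
        return None
```

Note that the pattern also accepts bodies such as `ff` with no leading `x`; those fail the decimal conversion and fall through to the error path.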
msg60382 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-09-10 16:58
Are you sure hexadecimal character references are part of
the SGML standard?
msg60383 - (view) Author: Aaron Swartz (aaronsw) Date: 2003-09-10 22:42
I don't have the money to shell out for the XML spec, but according to
http://developers.omnimark.com/documentation/concept/764.htm they were
added in SGML TC 2.
msg63530 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2008-03-14 16:30
SGML TC 2 can be found here:
http://www1.y12.doe.gov/capabilities/sgml/wg8/document/1955.htm

See section K.4.1 for hexadecimal character references.

Since this is really an update to the SGML standard, and not part of the
original, any support for this should be an optional feature.  It's
really only interesting on the web, where standards compliance is... a
little on the lax side.  It would be reasonable to enable this by
default from htmllib (if not already supported in htmllib; I don't
remember).

I'm fairly sure hex character references are already supported in
HTMLParser.
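As a quick check of that recollection, the sketch below uses Python 3's `html.parser.HTMLParser`, the successor of the Python 2 `HTMLParser` module; whether the 2.x module behaved identically is an assumption here. The class name `CharrefCollector` is made up for illustration. With `convert_charrefs=False`, the parser passes both decimal and hexadecimal reference bodies to `handle_charref`.

```python
from html.parser import HTMLParser


class CharrefCollector(HTMLParser):
    """Record the body of every numeric character reference seen."""

    def __init__(self):
        # convert_charrefs=False keeps references out of handle_data()
        # so that handle_charref() is actually called.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        # name is the text between '&#' and ';', e.g. '62' or 'x3E'.
        self.refs.append(name)


collector = CharrefCollector()
collector.feed('<p>&#62; &#x3E;</p>')
collector.close()
```

After feeding the sample markup, `collector.refs` holds one decimal and one hexadecimal body, so the hex form is recognised without any patching.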
msg112262 - (view) Author: Dan Buch (meatballhat) Date: 2010-08-01 03:45
gads ... didn't mean to submit a title change there

Since this is removed from Python 3, should the status be changed to Rejected?
msg112414 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2010-08-02 01:13
Rejected since this didn't make it into Python 2.7.
History
Date User Action Args
2010-08-02 01:13:52 fdrake set status: open -> closed; resolution: rejected; messages: + msg112414
2010-08-01 03:45:12 meatballhat set nosy: + meatballhat; messages: + msg112262; title: gmllib doesn't support hex or Unicode character references -> sgmllib doesn't support hex or Unicode character references
2010-08-01 03:43:28 meatballhat set title: sgmllib doesn't support hex or Unicode character references -> gmllib doesn't support hex or Unicode character references
2009-04-22 17:21:11 ajaksu2 set keywords: + easy
2009-02-13 03:41:24 ajaksu2 set priority: normal -> low; stage: test needed; type: enhancement; versions: + Python 2.7, - Python 2.3
2008-03-14 16:30:02 fdrake set nosy: + fdrake; messages: + msg63530
2003-09-09 20:53:13 aaronsw create