Issue803422
Created on 2003-09-09 20:53 by aaronsw, last changed 2009-04-22 17:21 by ajaksu2.
|
msg60380 - (view) |
Author: Aaron Swartz (aaronsw) |
Date: 2003-09-09 20:53 |
|
sgmllib doesn't support hexadecimal character references, nor
Unicode characters, both of which are commonly seen on web pages.
The following replacements fix both problems.
charref = re.compile('&#([0-9a-fA-F]+)[^0-9a-fA-F]')
def handle_charref(self, ref):
    try:
        if ref[0] == 'x' or ref[0] == 'X': m = int(ref[1:], 16)
        else: m = int(ref)
        self.handle_data(unichr(m).encode('utf-8'))
    except ValueError:
        self.unknown_charref(ref)
|
|
msg60381 - (view) |
Author: Aaron Swartz (aaronsw) |
Date: 2003-09-09 21:00 |
|
Oops, that should be:
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')
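
As a rough illustration of what the corrected pattern accepts, here is a minimal standalone sketch adapted to Python 3 (where `unichr` is `chr` and strings are already Unicode, so the UTF-8 encoding step is dropped). The helper names `decode_charref` and `first_charref` are hypothetical and are not part of sgmllib; only the regular expression is taken from this report.

```python
import re

# Corrected pattern from this report: an optional leading x/X, then hex
# digits, then one terminating character that is not a hex digit.
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')


def decode_charref(ref):
    """Hypothetical helper: turn '65' or 'x41' into the character 'A'."""
    if ref[0] in 'xX':
        return chr(int(ref[1:], 16))
    return chr(int(ref))


def first_charref(text):
    """Hypothetical helper: decode the first reference found, else None."""
    m = charref.search(text)
    if m is None:
        return None
    try:
        return decode_charref(m.group(1))
    except ValueError:
        # Mirrors the unknown_charref() fallback in the proposed patch,
        # e.g. for '&#6a;' (hex digits without the x prefix) or '&#x;'.
        return None
```

Note that the pattern is deliberately loose: it lets hex digits through for decimal references too, and relies on `int()` raising `ValueError` to reject them.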
|
|
msg60382 - (view) |
Author: Martin v. Löwis (loewis) |
Date: 2003-09-10 16:58 |
|
Are you sure hexadecimal character references are part of
the SGML standard?
|
|
msg60383 - (view) |
Author: Aaron Swartz (aaronsw) |
Date: 2003-09-10 22:42 |
|
I don't have the money to shell out for the SGML spec, but according to
http://developers.omnimark.com/documentation/concept/764.htm they were
added in SGML TC 2.
|
|
msg63530 - (view) |
Author: Fred L. Drake, Jr. (fdrake) |
Date: 2008-03-14 16:30 |
|
SGML TC 2 can be found here:
http://www1.y12.doe.gov/capabilities/sgml/wg8/document/1955.htm
See section K.4.1 for hexadecimal character references.
Since this is really an update to the SGML standard, and not part of the
original, any support for this should be an optional feature. It's
really only interesting on the web, where standards compliance is... a
little on the lax side. It would be reasonable to enable this by
default from htmllib (if not already supported in htmllib; I don't
remember).
I'm fairly sure hex character references are already supported in
HTMLParser.
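
For comparison, a small sketch of how hex references surface in Python 3's `html.parser.HTMLParser`, the successor to the HTMLParser module mentioned here. This assumes Python 3.4 or later, where the `convert_charrefs` argument exists; the `CharrefCollector` class is a hypothetical name used only for illustration.

```python
from html import unescape
from html.parser import HTMLParser


class CharrefCollector(HTMLParser):
    """Hypothetical subclass that records numeric character references."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref()
        # rather than folding references into the text given to handle_data().
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        # name arrives without the leading '&#' and trailing ';'.
        self.refs.append(name)


collector = CharrefCollector()
collector.feed('<p>&#65; and &#x42;</p>')
collector.close()

# html.unescape decodes both decimal and hexadecimal forms directly.
decoded = unescape('&#65; and &#x42;')
```

Here `collector.refs` holds the raw reference names, decimal and hex alike, so a subclass can decode them the same way the patch proposed for sgmllib does.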
|
|
| Date | User | Action | Args |
| 2009-04-22 17:21:11 | ajaksu2 | set | keywords: + easy |
| 2009-02-13 03:41:24 | ajaksu2 | set | priority: normal -> low; stage: test needed; type: feature request; versions: + Python 2.7, - Python 2.3 |
| 2008-03-14 16:30:02 | fdrake | set | nosy: + fdrake; messages: + msg63530 |
| 2003-09-09 20:53:13 | aaronsw | create | |