Message95689
Here is a regexp I use to clean up text (note that I don't touch
"compatibility characters", which are also discouraged in XML; some
other developers remove them too):
import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and
# FFFF)
_char_tail = ''
if sys.maxunicode > 0x10000:
    # Wide build: also allow the supplementary planes.
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
# Matches any single character outside the Char production.
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    return _nontext_sub(replacement, text)
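For readers on Python 3, the same idea might look roughly like the sketch below. This is an assumption-laden port, not part of the original message: it assumes Python 3.3 or later, where every build covers the full Unicode range (so the `sys.maxunicode` check is no longer needed) and where `re` understands `\uXXXX` and `\UXXXXXXXX` escapes inside a raw pattern string. The function name and default replacement are kept from the snippet above.

```python
import re

# Hypothetical Python 3 port of the snippet above (assumes Python 3.3+).
# The character class lists everything allowed by the XML 1.0 Char
# production; the leading ^ makes the pattern match anything else.
_nontext_sub = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
).sub


def replace_nontext(text, replacement='\uFFFD'):
    """Replace characters that are not legal XML 1.0 text."""
    return _nontext_sub(replacement, text)
```

For example, `replace_nontext('ok\x00\x0b', '')` returns `'ok'`, while tabs, newlines and carriage returns pass through unchanged.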
Date | User | Action | Args
2009-11-24 17:26:35 | ods | set | recipients: + ods, effbot, strangefeatures
2009-11-24 17:26:35 | ods | set | messageid: <1259083595.27.0.0857744224848.issue5166@psf.upfronthosting.co.za>
2009-11-24 17:26:33 | ods | link | issue5166 messages
2009-11-24 17:26:33 | ods | create |