This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author ods
Recipients effbot, ods, strangefeatures
Date 2009-11-24.17:26:33
Message-id <1259083595.27.0.0857744224848.issue5166@psf.upfronthosting.co.za>
In-reply-to
Content
Here is a regexp I use to clean up text (note that I don't touch the
"compatibility characters", which are also discouraged in XML; some
other developers remove those too):

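import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and FFFF)

# Only wide (UCS-4) builds can represent code points above 0xFFFF as single
# characters, so the supplementary range is added to the class only there.
_char_tail = u''
if sys.maxunicode > 0x10000:
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
# Matches any single character outside the XML 1.0 Char production.
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    # Substitute every character that XML 1.0 does not allow.
    return _nontext_sub(replacement, text)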
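
For readers on Python 3, the same idea can be written without the narrow/wide build check, since str there always covers the full code point range. The following is only a sketch under that assumption, not part of the original message; the name _NONTEXT_RE is introduced here for illustration, and it relies on the re module accepting \u and \U escapes in patterns (Python 3.3 and later).

```python
import re

# Assumption: Python 3.3+, where every str is a sequence of full code points
# and the re module understands \uXXXX and \UXXXXXXXX escapes in patterns.
# The character class is the negation of the XML 1.0 Char production:
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NONTEXT_RE = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def replace_nontext(text, replacement='\uFFFD'):
    """Replace every character not allowed by XML 1.0 with `replacement`."""
    return _NONTEXT_RE.sub(replacement, text)


if __name__ == '__main__':
    # A NUL byte is not a valid XML character; tab and newline are.
    print(repr(replace_nontext('a\x00b\tc\n')))
    # Passing an empty replacement strips the offending characters instead.
    print(repr(replace_nontext('x\x0by', '')))
```

As in the original, the "compatibility characters" that the XML specification merely discourages are left untouched.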
History
Date                 User  Action  Args
2009-11-24 17:26:35  ods   set     recipients: + ods, effbot, strangefeatures
2009-11-24 17:26:35  ods   set     messageid: <1259083595.27.0.0857744224848.issue5166@psf.upfronthosting.co.za>
2009-11-24 17:26:33  ods   link    issue5166 messages
2009-11-24 17:26:33  ods   create