Message 95689 - Python tracker


Author	ods
Recipients	effbot, ods, strangefeatures
Date	2009-11-24.17:26:33
Message-id	<1259083595.27.0.0857744224848.issue5166@psf.upfronthosting.co.za>
In-reply-to

Content
Here is a regexp I use to clean up text. Note that I don't touch
"compatibility characters", which are also discouraged in XML; some
other developers remove them too:

import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and FFFF)
_char_tail = ''
# Only wide (UCS-4) builds can address the supplementary planes.
if sys.maxunicode > 0x10000:
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
# Negated class: matches every character that is NOT a valid XML Char.
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    return _nontext_sub(replacement, text)
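
For readers on Python 3, the same idea can be sketched as below. This is a port, not the code from the message: it assumes Python 3.3 or later, where every build covers the full code point range (so the sys.maxunicode check is unnecessary) and the re module understands \uXXXX and \UXXXXXXXX escapes in patterns. The function name and default replacement are kept from the original.

```python
import re

# XML 1.0 Char production (http://www.w3.org/TR/REC-xml/#NT-Char):
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class matches anything outside that production.
_nontext_sub = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
).sub


def replace_nontext(text, replacement='\uFFFD'):
    """Replace characters that are not valid XML 1.0 Chars."""
    return _nontext_sub(replacement, text)


if __name__ == '__main__':
    # NUL and vertical tab are not XML Chars; tab and newline are.
    print(repr(replace_nontext('a\x00b\x0bc\td\n')))
```

Passing an empty string as the replacement drops the offending characters instead of substituting U+FFFD.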

History
Date	User	Action	Args
2009-11-24 17:26:35	ods	set	recipients: + ods, effbot, strangefeatures
2009-11-24 17:26:35	ods	set	messageid: <1259083595.27.0.0857744224848.issue5166@psf.upfronthosting.co.za>
2009-11-24 17:26:33	ods	link	issue5166 messages
2009-11-24 17:26:33	ods	create