
Author wrohdewald
Recipients r.david.murray, wrohdewald
Date 2014-10-28.04:21:13
Message-id <1414470074.83.0.0100417910401.issue22746@psf.upfronthosting.co.za>
In-reply-to
Content
If you cannot offer a solution for arbitrary Unicode, you have no solution at all. After all, that is what Unicode is about: supporting ALL languages, not only your own.

I do not quite understand why you think this is not a bug.

If cgitb encodes non-ASCII characters as numeric character references such as &#xe4; (for ä), the browser does not have to guess the encoding; it will always show the correct character. This works for all of Unicode. See https://en.wikipedia.org/wiki/Unicode_and_HTML#Numeric_character_references
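
To illustrate the point about numeric character references, here is a minimal sketch. The helper name to_hex_charref is hypothetical and not part of cgitb; it only rewrites non-ASCII characters and does not escape markup characters such as & or <. It shows that the escaped text is pure ASCII and that an HTML parser recovers the original characters regardless of the declared page encoding.

```python
import html


def to_hex_charref(text):
    # Hypothetical helper: replace every non-ASCII character with a
    # hexadecimal numeric character reference; ASCII passes through.
    return "".join(
        ch if ord(ch) < 128 else "&#x{:x};".format(ord(ch)) for ch in text
    )


sample = "Grüße, 世界"
escaped = to_hex_charref(sample)

# The escaped form contains only ASCII, so this encode cannot fail.
escaped.encode("ascii")

# html.unescape resolves the references back to the original characters.
assert html.unescape(escaped) == sample
print(escaped)
```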

So this bug is fixable, and I am reopening it.

For Python 3, the fix is actually very simple: do not write doc but str(doc.encode('ascii', 'xmlcharrefreplace')), as in the attached patch. This patch works for me, but there might be code paths it does not yet cover. My source file is encoded in UTF-8; other source file encodings should be tested too. I do not know whether cgitb correctly honors a source file header like # -*- coding: utf-8 -*-
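
A minimal sketch of the xmlcharrefreplace idea follows. It is not the attached patch; the function name ascii_html and the sample string are assumptions made for illustration. One caveat: in Python 3, calling str() on a bytes object returns its repr (with a b'...' wrapper), so this sketch decodes the bytes back to str instead.

```python
def ascii_html(doc):
    # Assumption: doc is the finished HTML traceback as a str.
    # Non-ASCII characters become decimal references like &#228;,
    # then the ASCII bytes are decoded back to str for writing.
    return doc.encode("ascii", "xmlcharrefreplace").decode("ascii")


doc = "<p>NameError: Bär</p>"  # hypothetical traceback fragment

print(ascii_html(doc))
# <p>NameError: B&#228;r</p>

# For comparison, str() on the encoded bytes keeps the b'...' wrapper:
print(str(doc.encode("ascii", "xmlcharrefreplace")))
# b'<p>NameError: B&#228;r</p>'
```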

Fixing this for Python 2 is certainly doable too, but perhaps more difficult because a Python 2 str may have an unknown encoding.
History
Date User Action Args
2014-10-28 04:21:14  wrohdewald  set     recipients: + wrohdewald, r.david.murray
2014-10-28 04:21:14  wrohdewald  set     messageid: <1414470074.83.0.0100417910401.issue22746@psf.upfronthosting.co.za>
2014-10-28 04:21:14  wrohdewald  link    issue22746 messages
2014-10-28 04:21:14  wrohdewald  create