Message 230117 - Python tracker



Author	wrohdewald
Recipients	r.david.murray, wrohdewald
Date	2014-10-28.04:21:13
Message-id	<1414470074.83.0.0100417910401.issue22746@psf.upfronthosting.co.za>

Content
If you cannot offer a solution for arbitrary Unicode, you have no solution at all. After all, that is what Unicode is about: supporting ALL languages, not only your own.

I do not quite understand why you think this is not a bug.

If cgitb encodes non-ASCII characters as numeric character references such as &#xe4;, the browser does not have to guess the encoding; it will always show the correct character. This works for all of Unicode. See https://en.wikipedia.org/wiki/Unicode_and_HTML#Numeric_character_references
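As a small illustration of the point above, here is a Python 3 sketch that builds hexadecimal numeric character references by hand and checks them with the standard library's html.unescape. The helper name to_hex_charrefs is hypothetical and is not part of cgitb; it also does not escape markup characters such as & or <, which a real HTML writer would have to handle separately.

```python
import html


def to_hex_charrefs(text):
    """Replace every non-ASCII character with a hexadecimal numeric
    character reference; ASCII characters pass through unchanged.

    Hypothetical helper for illustration only, not part of cgitb.
    """
    return "".join(
        ch if ord(ch) < 128 else "&#x{:x};".format(ord(ch))
        for ch in text
    )


escaped = to_hex_charrefs("Käse 日本語")
print(escaped)  # pure ASCII, so the page encoding no longer matters

# An HTML parser resolves the references back to the original characters.
assert html.unescape(escaped) == "Käse 日本語"
```

Because the output contains only ASCII, it reads the same under any ASCII-compatible page encoding the browser might assume.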

So this bug is fixable, and I am reopening it.

For Python 3, the fix is actually very simple: instead of writing doc, write str(doc.encode('ascii', 'xmlcharrefreplace')), as in the attached patch. The patch works for me, but there may be code paths it does not cover yet. My source file is encoded in UTF-8; other source file encodings should be tested too. I do not know whether cgitb correctly honors a source file header such as # -*- coding: utf-8 -*-
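A minimal Python 3 sketch of the approach described above, not the attached patch itself. One caveat worth checking against the patch: calling str() on a bytes object without an encoding argument returns its repr (b'...'), so this sketch decodes the ASCII bytes back to text instead. The helper names escape_non_ascii and declared_source_encoding are hypothetical. The second helper uses the standard library's tokenize.detect_encoding to show one way the coding header of a source file could be read; whether cgitb does anything similar is not established here.

```python
import io
import tokenize


def escape_non_ascii(doc):
    """Return doc as an ASCII-only str, with each non-ASCII character
    replaced by a decimal numeric character reference such as &#228;.

    Hypothetical helper for illustration, not the attached patch.
    """
    # encode() yields bytes; decode them back to str. Plain str(bytes)
    # would produce the repr "b'...'" rather than the text itself.
    return doc.encode("ascii", "xmlcharrefreplace").decode("ascii")


def declared_source_encoding(source_bytes):
    """Return the encoding named by a coding header or BOM in the
    source, or Python 3's default of UTF-8 if none is declared."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines = tokenize.detect_encoding(readline)
    return encoding


print(escape_non_ascii("<p>Käse</p>"))
print(declared_source_encoding(b"# -*- coding: utf-8 -*-\nx = 1\n"))
```

Note that the 'xmlcharrefreplace' error handler emits decimal references (&#228;) rather than the hexadecimal form (&#xe4;); browsers treat both the same way.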

Fixing this for Python 2 is certainly doable too, but perhaps more difficult, because a Python 2 str may have an unknown encoding.

History
Date	User	Action	Args
2014-10-28 04:21:14	wrohdewald	set	recipients: + wrohdewald, r.david.murray
2014-10-28 04:21:14	wrohdewald	set	messageid: <1414470074.83.0.0100417910401.issue22746@psf.upfronthosting.co.za>
2014-10-28 04:21:14	wrohdewald	link	issue22746 messages
2014-10-28 04:21:14	wrohdewald	create