Issue 22746: cgitb html: wrong encoding for utf-8

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/66935

classification

Title:	cgitb html: wrong encoding for utf-8
Type:	behavior	Stage:	needs patch
Components:	Library (Lib), Unicode	Versions:	Python 3.4, Python 3.5, Python 2.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, ezio.melotti, r.david.murray, serhiy.storchaka, vstinner, wrohdewald
Priority:	normal	Keywords:	patch

Created on 2014-10-27 18:48 by wrohdewald, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
cgibug.py	wrohdewald, 2014-10-27 18:48
22746.patch	wrohdewald, 2014-10-28 04:21

Messages (11)
msg230085 - (view)	Author: Wolfgang Rohdewald (wrohdewald)	Date: 2014-10-27 18:48
The attached script displays non-ASCII characters incorrectly wherever they occur, including in the exception message and in the comment in the source code. Looking at the generated .html, I can say that cgitb simply passes the individual UTF-8 bytes through without encoding them as needed. The same happens with Python 3.4 (after applying some quick and dirty changes to cgitb.py, see bug #22745).
msg230099 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-10-27 19:54
If you look at the file, you'll find that the data is in utf-8 (at least if your locale is a utf-8 locale). However, html is by default interpreted as latin-1, so that's what the web browser displays when you pass the file on disk to it. If you add "encoding='latin-1'" to your open call, your script will work. What you do if you need to display non-latin1 characters, I don't know. (See https://bugzil.la/760050, for example).

Note: the above is for python3. I don't remember how you do the equivalent in python2... a naive codecs.open call just got me a UnicodeDecodeError.
msg230117 - (view)	Author: Wolfgang Rohdewald (wrohdewald)	Date: 2014-10-28 04:21
If you cannot offer a solution for arbitrary unicode, you have no solution at all. After all, that is what unicode is about: support ALL languages, not only your own.

I do not quite understand why you think this is not a bug. If cgitb encodes unicode like & x e 4 ; (remove spaces), the browser does not have to guess the encoding, it will always show the correct character. This works for all of unicode. See https://en.wikipedia.org/wiki/Unicode_and_HTML#Numeric_character_references

So this bug is fixable, I am reopening it.

For Python3, the fix is actually very simple: Do not write doc but str(doc.encode('ascii', 'xmlcharrefreplace')), like in the attached patch. This patch works for me, but there might be code paths it does not yet cover. And my source file is encoded in utf-8; other source file encodings should be tested too. I do not know if cgitb correctly honors the source file header like

# -*- coding: utf-8 -*-

Fixing this for Python2 is certainly doable too, but perhaps more difficult because a Python2 str() may have an unknown encoding.
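A minimal sketch of the conversion described in this message, assuming Python 3. The helper name `to_ascii_html` and the sample string are hypothetical and are not taken from the attached patch. One caveat worth checking against the patch: in Python 3, calling `str()` on a `bytes` object returns its repr (with a `b'...'` wrapper), so this sketch decodes the encoded bytes back to `str` instead.

```python
import html


def to_ascii_html(text):
    """Replace every non-ASCII character with a numeric character reference.

    Hypothetical helper for illustration; not part of cgitb.
    """
    # encode() returns bytes, so decode back to str to get plain ASCII text.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


sample = 'Fehler: \u00e4 und \u4e0e'
converted = to_ascii_html(sample)
print(converted)  # Fehler: &#228; und &#19982;

# A browser (modelled here by html.unescape) recovers the original text.
round_trip = html.unescape(converted)
print(round_trip == sample)  # True
```

The output is pure ASCII, so it reads the same under any ASCII-compatible encoding the browser might guess.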
msg230131 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2014-10-28 09:12
What about open(..., encoding='latin-1', errors='xmlcharrefreplace')
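As a rough illustration of this suggestion, assuming Python 3: the `report` string below is a hypothetical stand-in for the HTML that cgitb would produce, and the temporary file path is likewise only for the example. Characters that latin-1 can represent are written as single bytes, and everything else becomes a numeric character reference.

```python
import os
import tempfile

# Hypothetical stand-in for HTML generated by cgitb.
report = '<p>UnicodeError: \u00e4 \u4e0e</p>'

path = os.path.join(tempfile.mkdtemp(), 'traceback.html')

# The error handler only kicks in for characters outside latin-1.
with open(path, 'w', encoding='latin-1', errors='xmlcharrefreplace') as f:
    f.write(report)

with open(path, 'rb') as f:
    raw = f.read()

print(raw)  # b'<p>UnicodeError: \xe4 &#19982;</p>'
```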
msg230133 - (view)	Author: Wolfgang Rohdewald (wrohdewald)	Date: 2014-10-28 09:32
> What about
> open(..., encoding='latin-1', errors='xmlcharrefreplace')

That works fine. I tested with a Chinese character 与

But I do not think the application should work around something that cgitb is supposed to handle. More so since the documentation is dead silent about this. You need to use codecs.open instead of open and add those kw arguments. As long as this is not explained in the documentation, I guess it is a bug for everyone not using latin-1.
msg230134 - (view)	Author: Wolfgang Rohdewald (wrohdewald)	Date: 2014-10-28 09:37
correction: A bug for everyone using non-ascii characters.
msg230148 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2014-10-28 13:19
> You need to use codecs.open instead of open

No, why? In python3, open() supports the errors handler.
msg230149 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-10-28 13:43
In normal HTML utf-8 works fine, doesn't it? It's only when reading from a file (where the browser doesn't know the encoding) that it fails. Do you have a use case for xmlcharrefreplace in the HTML context (which is what cgitb is primarily targeted at)? Some place where the web page can't be declared as utf-8, perhaps?

I suppose it might be a not-unreasonable enhancement request to have a parameter to Hook that says "do xmlcharrefreplace", but since the workaround is actually simpler than that, I don't know if that is worthwhile or not. Or do people feel that doing the replacement all the time (it's only in tracebacks, after all) would be the right thing to do?
msg230159 - (view)	Author: Wolfgang Rohdewald (wrohdewald)	Date: 2014-10-28 16:01
> > You need to use codecs.open instead of open
> No, why? in python3 open() supports the errors handler.

Right, but not in python2, which has the same problem. I need my code to run with both.

> Do you have a use case for xmlcharrefreplace in the HTML context?

No, my only use case is the local file.
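One possible way to run the same file-writing code on both Python 2.7 and Python 3 is `io.open`, which accepts `encoding` and `errors` on both (in Python 3 it is the same function as the built-in `open`). This is only a sketch: the `report` string is a hypothetical stand-in for cgitb output, and on Python 2 the text passed to `write()` has to be a `unicode` object, which is why the literal uses a `u` prefix and escapes.

```python
import io
import os
import tempfile

# Hypothetical stand-in for cgitb output; must be unicode text on Python 2.
report = u'<p>ValueError: \xe4 \u4e0e</p>'

path = os.path.join(tempfile.mkdtemp(), 'traceback.html')

# With 'ascii', every non-ASCII character becomes a character reference.
with io.open(path, 'w', encoding='ascii', errors='xmlcharrefreplace') as f:
    f.write(report)

with io.open(path, 'rb') as f:
    raw = f.read()

print(raw == b'<p>ValueError: &#228; &#19982;</p>')  # True
```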
msg230361 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2014-10-31 17:49
> In normal HTML utf-8 works fine, doesn't it?

It does. In fact, as long as the encoding used by the browser matches the one used in the file, no charrefs need to be used (except > < and "). Of course, if non-Unicode encodings are used, the range of available characters that can go directly in the HTML will be more limited, but this can be solved by using charrefs -- the browser will display the corresponding character no matter what the encoding is.

This also means that if charrefs are used for all non-ASCII characters, then the browser will be able to display the page no matter what encoding is being used (as long as it's ASCII-compatible, and most encodings are). The downside is that it will make the source less readable and possibly longer, especially if there are a lot of non-ASCII characters, but if most of the characters are expected to be ASCII, using charrefs might be ok.
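The point about ASCII-compatible encodings can be checked with a small sketch, assuming Python 3. The sample text and the list of encodings are arbitrary choices for illustration, and `html.unescape` stands in for the browser resolving the references.

```python
import html

text = 'na\u00efve \u4e0e'

utf8_bytes = text.encode('utf-8')
charref_bytes = text.encode('ascii', 'xmlcharrefreplace')

# Raw UTF-8 bytes misread as latin-1 give garbled text.
mojibake = utf8_bytes.decode('latin-1')
print(mojibake == text)  # False

# The charref version is pure ASCII, so the guessed encoding does not matter.
results = {}
for guessed in ('ascii', 'latin-1', 'utf-8', 'cp1252'):
    decoded = charref_bytes.decode(guessed)
    results[guessed] = html.unescape(decoded) == text

print(results)
```

Every entry in `results` comes out `True`, while the raw UTF-8 bytes only display correctly when the guess happens to be UTF-8.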
msg232073 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-12-03 07:50
We can convert cgitb.hook to produce ASCII-compatible output with charrefs in 3.x. But there is a problem with str in 2.7: an 8-bit string can contain non-ASCII data, and its encoding is not known in the general case.
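The 2.7 difficulty can be illustrated in Python 3 by treating a `bytes` object as a stand-in for a Python 2 `str` (an assumption made only for this sketch; the `charrefs` helper is hypothetical). The same two bytes yield different character references depending on which encoding is assumed, so without knowing the encoding there is no single correct conversion.

```python
def charrefs(text):
    """Hypothetical helper: non-ASCII characters become numeric charrefs."""
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


# Stand-in for a Python 2 8-bit str whose encoding is not recorded anywhere.
raw = b'\xc3\xa4'

as_utf8 = raw.decode('utf-8')      # one character, U+00E4
as_latin1 = raw.decode('latin-1')  # two characters, U+00C3 and U+00A4

print(charrefs(as_utf8))    # &#228;
print(charrefs(as_latin1))  # &#195;&#164;
```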

History
Date	User	Action	Args
2022-04-11 14:58:09	admin	set	github: 66935
2014-12-03 07:50:08	serhiy.storchaka	set	messages: + msg232073
2014-10-31 17:49:33	ezio.melotti	set	messages: + msg230361
2014-10-28 16:01:17	wrohdewald	set	messages: + msg230159
2014-10-28 13:43:52	r.david.murray	set	resolution: remind -> messages: + msg230149 versions: + Python 3.4, Python 3.5
2014-10-28 13:32:09	vstinner	set	nosy: + vstinner components: + Unicode
2014-10-28 13:19:27	amaury.forgeotdarc	set	messages: + msg230148
2014-10-28 12:37:33	serhiy.storchaka	set	nosy: + ezio.melotti, serhiy.storchaka
2014-10-28 09:37:20	wrohdewald	set	messages: + msg230134
2014-10-28 09:32:36	wrohdewald	set	messages: + msg230133
2014-10-28 09:12:11	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg230131 stage: resolved -> needs patch
2014-10-28 04:24:31	wrohdewald	set	resolution: remind
2014-10-28 04:21:14	wrohdewald	set	status: closed -> open files: + 22746.patch messages: + msg230117 keywords: + patch resolution: not a bug -> (no value)
2014-10-27 19:54:32	r.david.murray	set	status: open -> closed nosy: + r.david.murray messages: + msg230099 resolution: not a bug stage: resolved
2014-10-27 18:48:57	wrohdewald	create