classification
Title: cgitb html: wrong encoding for utf-8
Type: behavior Stage: needs patch
Components: Library (Lib), Unicode Versions: Python 3.4, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, ezio.melotti, r.david.murray, serhiy.storchaka, vstinner, wrohdewald
Priority: normal Keywords: patch

Created on 2014-10-27 18:48 by wrohdewald, last changed 2014-12-03 07:50 by serhiy.storchaka.

Files
File name Uploaded Description Edit
cgibug.py wrohdewald, 2014-10-27 18:48
22746.patch wrohdewald, 2014-10-28 04:21
Messages (11)
msg230085 - (view) Author: Wolfgang Rohdewald (wrohdewald) Date: 2014-10-27 18:48
The attached script displays non-ASCII characters incorrectly wherever they occur, including in the exception message and in the comment in the source code.

Looking at the produced .html, I can say that cgitb simply passes the single byte utf-8 codes without encoding them as needed.

Same happens with Python3.4 (after applying some quick and dirty changes to cgitb.py, see bug #22745).
msg230099 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-27 19:54
If you look at the file, you'll find that the data is in utf-8 (at least if your locale is a utf-8 locale).  However, HTML is by default interpreted as latin-1, so that's what the web browser displays when you pass the file on disk to it.  If you add "encoding='latin-1'" to your open call, your script will work.  What you do if you need to display non-latin1 characters, I don't know.  (See https://bugzil.la/760050, for example).

Note: the above is for python3.  I don't remember how you do the equivalent in python2...a naive codecs.open call just got me a UnicodeDecodeError.
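
The mismatch described above can be reproduced without a browser. A minimal Python 3 sketch, assuming an illustrative sample string (not taken from the attached script): the text is encoded as UTF-8, and the same bytes are then decoded as Latin-1, which is roughly what a viewer does when it guesses the wrong encoding.

```python
# Illustrative sample text; not taken from cgibug.py.
text = "Grüße"

utf8_bytes = text.encode("utf-8")        # bytes written under a UTF-8 locale
misread = utf8_bytes.decode("latin-1")   # what a reader assuming Latin-1 sees

print(misread)
# Each non-ASCII character occupies two bytes in UTF-8, so the misread
# text is longer than the original.
print(len(text), len(misread))
```

Re-encoding the misread text as Latin-1 and decoding it as UTF-8 recovers the original, which suggests the bytes on disk are intact and only the interpretation differs.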
msg230117 - (view) Author: Wolfgang Rohdewald (wrohdewald) Date: 2014-10-28 04:21
If you cannot offer a solution for arbitrary unicode, you have no solution at all. After all, that is what unicode is about: support ALL languages, not only your own.

I do not quite understand why you think this is not a bug.

If cgitb encodes unicode as numeric character references like &#xe4;, the browser does not have to guess the encoding, it will always show the correct character. This works for all of unicode. See https://en.wikipedia.org/wiki/Unicode_and_HTML#Numeric_character_references

So this bug is fixable, I am reopening it.

For Python3, the fix is actually very simple: Do not write doc but str(doc.encode('ascii', 'xmlcharrefreplace')), like in the attached patch. This patch works for me but there might be yet uncovered code paths. And my source file is encoded in utf-8, other source file encodings should be tested too. I do not know if cgitb correctly honors the source file header like # -*- coding: utf-8 -*-

Fixing this for Python2 is certainly doable too but perhaps more difficult because a Python2 str() may have an unknown encoding.
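
A small Python 3 sketch of the numeric character reference approach. One caveat worth stating: calling str() on a bytes object returns its repr (text starting with b'), so this sketch decodes the ASCII bytes back to text instead. The `doc` value here is a made-up stand-in for the HTML that cgitb produces, not output from the attached patch.

```python
# Hypothetical stand-in for the HTML document cgitb produces.
doc = "<p>ValueError: ungültig 与</p>"

# Replace every non-ASCII character with a numeric character reference,
# then decode the pure-ASCII bytes back into a str that can be written out.
ascii_doc = doc.encode("ascii", "xmlcharrefreplace").decode("ascii")

# For comparison: str() on bytes gives the repr, not the decoded text.
repr_doc = str(doc.encode("ascii", "xmlcharrefreplace"))

print(ascii_doc)
print(repr_doc[:2])
```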
msg230131 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2014-10-28 09:12
What about
  open(..., encoding='latin-1', errors='xmlcharrefreplace')
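
A runnable Python 3 sketch of that call, assuming a temporary file and a made-up HTML string in place of real cgitb output. Characters that Latin-1 can represent are written as single bytes, and anything else falls back to a numeric character reference.

```python
import os
import tempfile

# Hypothetical traceback page; the real text would come from cgitb.
html_text = "<p>ä and 与</p>"

path = os.path.join(tempfile.mkdtemp(), "traceback.html")

# Latin-1 covers 'ä' directly; '与' is outside Latin-1 and is written
# as a numeric character reference by the error handler.
with open(path, "w", encoding="latin-1", errors="xmlcharrefreplace") as f:
    f.write(html_text)

with open(path, "rb") as f:
    raw = f.read()
print(raw)
```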
msg230133 - (view) Author: Wolfgang Rohdewald (wrohdewald) Date: 2014-10-28 09:32
> What about
>  open(..., encoding='latin-1', errors='xmlcharrefreplace')

That works fine. I tested with a Chinese character: 与

But I do not think the application should work around something that cgitb is supposed to handle. More so since the documentation is dead silent about this. You need to use codecs.open instead of open and add those kw arguments. As long as this is not explained in the documentation, I guess it is a bug for everyone not using latin-1.
msg230134 - (view) Author: Wolfgang Rohdewald (wrohdewald) Date: 2014-10-28 09:37
correction: A bug for everyone using non-ascii characters.
msg230148 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2014-10-28 13:19
> You need to use codecs.open instead of open
No, why? In Python 3, open() supports the errors handler.
msg230149 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-28 13:43
In normal HTML utf-8 works fine, doesn't it? It's only when reading from a file (where the browser doesn't know the encoding) that it fails.  Do you have a use case for xmlcharrefreplace in the HTML context (which is what cgitb is primarily targeted at)?  Some place where the web page can't be declared as utf-8, perhaps?

I suppose it might be a not-unreasonable enhancement request to have a parameter to Hook that says "do xmlcharrefreplace", but since the workaround is actually simpler than that, I don't know if that is worthwhile or not.  Or do people feel that doing the replacement all the time (it's only in tracebacks, after all) would be the right thing to do?
msg230159 - (view) Author: Wolfgang Rohdewald (wrohdewald) Date: 2014-10-28 16:01
> > You need to use codecs.open instead of open
> No, why? in python3 open() supports the errors handler.

Right, but not in Python 2, which has the same problem. I need my code to run with both.

> Do you have a use case for xmlcharrefreplace in the HTML context?

No, my only use case is the local file.
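
One possible way to keep a single code path for both versions, sketched under the assumption that the report is already a unicode string: io.open is available from Python 2.6 onward, is the same function as the built-in open in Python 3, and accepts the encoding and errors arguments. The helper name write_report and the sample text are hypothetical.

```python
import io
import os
import tempfile


def write_report(path, html_text):
    """Write text as Latin-1, falling back to charrefs (hypothetical helper).

    On Python 2, html_text must be a unicode object, not a byte str.
    """
    with io.open(path, "w", encoding="latin-1",
                 errors="xmlcharrefreplace") as f:
        f.write(html_text)


target = os.path.join(tempfile.mkdtemp(), "report.html")
# U+4E0E is outside Latin-1, so it is written as a character reference.
write_report(target, u"<p>\u4e0e</p>")

with io.open(target, "rb") as f:
    written = f.read()
print(written)
```

This does not address byte strings of unknown encoding on Python 2; those would still need to be decoded before being passed in.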
msg230361 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-10-31 17:49
> In normal HTML utf-8 works fine, doesn't it?

It does. In fact, as long as the encoding used by the browser matches the one used in the file, no charrefs need to be used (except for >, < and ").  Of course, if non-Unicode encodings are used, the range of available characters that can go directly in the HTML will be more limited, but this can be solved by using charrefs: the browser will display the corresponding character no matter what the encoding is.  This also means that if charrefs are used for all non-ASCII characters, then the browser will be able to display the page no matter what encoding is being used (as long as it's ASCII-compatible, and most encodings are).  The downside is that it will make the source less readable and possibly longer, especially if there are a lot of non-ASCII characters, but if most of the characters are expected to be ASCII, using charrefs might be ok.
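
The encoding-independence of charrefs can be checked with a short Python 3 sketch. The sample text and the list of encodings are illustrative assumptions, and html.unescape stands in for the browser resolving the references.

```python
import html

original = "Größe 与"  # illustrative sample text
ascii_bytes = original.encode("ascii", "xmlcharrefreplace")

# Any ASCII-compatible encoding decodes these bytes to the same markup.
decoded = {enc: ascii_bytes.decode(enc)
           for enc in ("utf-8", "latin-1", "cp1252", "koi8-r")}

# Resolving the references recovers the original text in every case.
restored = {enc: html.unescape(markup) for enc, markup in decoded.items()}
print(restored)
```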
msg232073 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-12-03 07:50
We can convert cgitb.hook to produce ASCII-compatible output with charrefs in 3.x. But there is a problem with str in 2.7. An 8-bit string can contain non-ASCII data, and the encoding is not known in the general case.
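
The 2.7 problem can be illustrated in Python 3 terms with a bytes object, which behaves like a Python 2 str in this respect. This is only a sketch with an arbitrarily chosen byte: the same data yields different characters, or fails outright, depending on which encoding is assumed.

```python
raw = b"\xe4"  # one byte with no encoding information attached

as_latin1 = raw.decode("latin-1")
as_koi8r = raw.decode("koi8-r")
try:
    as_utf8 = raw.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None  # this byte is not valid UTF-8 on its own

print(repr(as_latin1), repr(as_koi8r), as_utf8)
```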
History
Date User Action Args
2014-12-03 07:50:08 serhiy.storchaka set messages: + msg232073
2014-10-31 17:49:33 ezio.melotti set messages: + msg230361
2014-10-28 16:01:17 wrohdewald set messages: + msg230159
2014-10-28 13:43:52 r.david.murray set resolution: remind ->
    messages: + msg230149
    versions: + Python 3.4, Python 3.5
2014-10-28 13:32:09 vstinner set nosy: + vstinner
    components: + Unicode
2014-10-28 13:19:27 amaury.forgeotdarc set messages: + msg230148
2014-10-28 12:37:33 serhiy.storchaka set nosy: + ezio.melotti, serhiy.storchaka
2014-10-28 09:37:20 wrohdewald set messages: + msg230134
2014-10-28 09:32:36 wrohdewald set messages: + msg230133
2014-10-28 09:12:11 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
    messages: + msg230131
    stage: resolved -> needs patch
2014-10-28 04:24:31 wrohdewald set resolution: remind
2014-10-28 04:21:14 wrohdewald set status: closed -> open
    files: + 22746.patch
    messages: + msg230117
    keywords: + patch
    resolution: not a bug -> (no value)
2014-10-27 19:54:32 r.david.murray set status: open -> closed
    nosy: + r.david.murray
    messages: + msg230099
    resolution: not a bug
    stage: resolved
2014-10-27 18:48:57 wrohdewald create