classification
Title: http.server Header Unicode Bug
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: georg.brandl Nosy List: aronacher, benjamin.peterson, eric.araujo, ezio.melotti, georg.brandl, haypo, orsenthil
Priority: normal Keywords: patch

Created on 2011-01-22 12:45 by aronacher, last changed 2012-09-25 12:45 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
http-server-unicode.patch aronacher, 2011-01-22 12:45 review
Messages (9)
msg126832 - (view) Author: Armin Ronacher (aronacher) * (Python committer) Date: 2011-01-22 12:45
I have a critical bugfix that should make it into Python 3.2 even though it is in release candidate state.  Currently http.server.BaseHTTPServer encodes headers with the ASCII charset.  This is at least in violation of PEP 3333, which demands that latin1 is used.

Because HTTP itself suggests latin1 (iso-8859-1) I strongly recommend changing this in BaseHTTPServer and not wsgiref.

The attached patch fixes that in a backwards compatible fashion.
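
For illustration only (this is not the attached patch): a minimal sketch of the behavior difference described above, assuming a made-up header name and value. A character in the U+0080..U+00FF range cannot be encoded as ASCII, but latin-1 maps it to a single byte.

```python
# Hypothetical header line; the name and value are illustrative only.
header = "X-Author: \u00c9ric\r\n"  # U+00C9 is 'E' with an acute accent

try:
    header.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # U+00C9 is outside ASCII, so strict ASCII encoding fails.
    ascii_ok = False

# latin-1 maps U+0000..U+00FF to single bytes, so this succeeds.
latin1_bytes = header.encode("latin-1")
print(ascii_ok, latin1_bytes)
```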
msg126834 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-22 13:04
Extract of PEP 3333: << Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding. >>

What is the best choice for portability (HTTP servers and web browsers): latin1 or MIME encoding? Latin1 is a small subset of Unicode: only U+0000..U+00FF.

Maybe we should let the user choose between Latin1, MIME, or something else (e.g. UTF-8, cp1252, ...). Or at least, you should try something like:

try:
   bytes = text.encode('latin1')
except UnicodeEncodeError:
   bytes = encodeMIME(text, 'utf-8')
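
A runnable sketch of the same fallback idea, under stated assumptions: `encodeMIME` above is a placeholder, so this version substitutes the standard library's `email.header.Header` to produce RFC 2047 encoded words. The function name `encode_header_value` and the sample values are hypothetical, and this is not what was committed for this issue.

```python
from email.header import Header, decode_header


def encode_header_value(text):
    """Encode a header value as latin-1, falling back to RFC 2047 (assumed helper)."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        # Header.encode() returns an ASCII-only str made of encoded words.
        return Header(text, "utf-8").encode().encode("ascii")


# A value within U+0000..U+00FF goes out as raw latin-1 bytes.
plain = encode_header_value("caf\u00e9")

# The euro sign (U+20AC) is outside latin-1, so the fallback path is used.
raw = encode_header_value("price: 10 \u20ac")
decoded, charset = decode_header(raw.decode("ascii"))[0]
print(plain, raw, decoded.decode(charset))
```

Note that `Header.encode()` may fold long values across lines, so a real server would need to account for that before writing the bytes to the socket.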

Would it be a good idea to accept raw bytes headers? HTTP is *supposed* to be correctly encoded according to various RFCs, but in practice anyone is free to do whatever they want.

Sentence extracted randomly from the WWW (dec. 2008): "it seems that neither Tomcat 5.5 or 6 properly decodes HTTP headers as per RFC 2047! The Tomcat code assumes everywhere that header values use ISO-8859-1."

Finally, why do you consider that this issue has to be fixed before Python 3.2?
msg126835 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-22 13:14
RFC 5987 (Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters), August 2010:
http://greenbytes.de/tech/webdav/rfc5987.html#language.specification.in.encoded.words

<< 3.3 Language Specification in Encoded Words

Section 5 of [RFC2231] extends the encoding defined in [RFC2047] to also support language specification in encoded words. Although the HTTP/1.1 specification does refer to RFC 2047 ([RFC2616], Section 2.2), it's not clear to which header field exactly it applies, and whether it is implemented in practice (see <http://tools.ietf.org/wg/httpbis/trac/ticket/111> for details).

Thus, this specification does not include this feature. >>

Hmm, OK: Latin1 looks safe and sufficient.
msg126836 - (view) Author: Armin Ronacher (aronacher) * (Python committer) Date: 2011-01-22 13:16
Georg Brandl signed off the commit and Python 3.2 will ship with the HTTP server accepting latin1 bytes.
msg126840 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2011-01-22 14:50
Armin committed the patch in r88142 and followed up with r88143 for the http.client library.

Needs backporting?
msg126845 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-01-22 17:35
I think so.
msg147895 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-18 16:47
Now it's too late for 3.1, should this still go to 2.7?
msg147896 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2011-11-18 16:51
Please.
msg147908 - (view) Author: Armin Ronacher (aronacher) * (Python committer) Date: 2011-11-18 17:19
2.7 does not suffer from this since 2.7 does not support unicode in headers.
History
Date User Action Args
2012-09-25 12:45:02 pitrou set status: open -> closed; resolution: accepted -> fixed; stage: commit review -> resolved
2011-11-18 17:19:29 aronacher set messages: + msg147908
2011-11-18 16:51:57 benjamin.peterson set messages: + msg147896
2011-11-18 16:47:48 ezio.melotti set nosy: + ezio.melotti; messages: + msg147895; versions: - Python 3.1
2011-01-22 17:35:51 eric.araujo set versions: + Python 3.1, Python 2.7, Python 3.2; nosy: + orsenthil, eric.araujo, benjamin.peterson; messages: + msg126845; resolution: accepted; stage: patch review -> commit review
2011-01-22 14:50:10 georg.brandl set messages: + msg126840
2011-01-22 13:16:22 aronacher set messages: + msg126836
2011-01-22 13:14:01 haypo set messages: + msg126835
2011-01-22 13:04:28 haypo set nosy: + haypo; messages: + msg126834
2011-01-22 12:45:50 aronacher create