
classification
Title: BaseHTTPServer cannot accept Unicode data
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 2.5
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: christian.heimes, isonno
Priority: normal
Keywords:

Created on 2007-11-08 23:13 by isonno, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name: TestUnicodeHTTP.py
Uploaded: isonno, 2007-11-08 23:16
Messages (6)
msg57282 - Author: J. Peterson (isonno) Date: 2007-11-08 23:16
Within a do_GET handler for a BaseHTTPServer.BaseHTTPRequestHandler,
trying to write unicode data causes a UnicodeEncodeError exception.  It
should be possible to send Unicode data to the browser.

The enclosed Python file demonstrates the issue.
msg57294 - Author: J. Peterson (isonno) Date: 2007-11-09 02:51
The diagnostic printed is:
  File "C:\Apps\Python25\lib\socket.py", line 255, in write
    data = str(data) # XXX Should really reject non-string non-buffers

The comment indicates the developer was aware of the bug.  See also the
similar bug in writelines(), near line 267.
msg57312 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2007-11-09 15:57
Unicode text cannot be transmitted over the wire as-is. It must always
be encoded to bytes before it can be stored on the hard disk or
transmitted. Typically the encoding is UTF-8, but in your case it
depends on the client's browser and the request headers.

The simple BaseHTTPServer isn't clever enough to encode your unicode
data on the fly. You have to do it yourself.
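
As a rough illustration of "do it yourself", the handler can encode the text and declare the charset before writing to wfile. This is only a sketch under stated assumptions: it is written for Python 3, where BaseHTTPServer was renamed http.server, and the names encode_body and UnicodeHandler are hypothetical. It also hard-codes UTF-8 rather than inspecting the request headers.

```python
# Python 3 sketch. In Python 2.5 the module is BaseHTTPServer and the
# text type is `unicode`, but the encode-before-write step is the same.
from http.server import BaseHTTPRequestHandler


def encode_body(text, charset="utf-8"):
    """Hypothetical helper: return (payload_bytes, content_type)."""
    payload = text.encode(charset)  # text -> bytes, done explicitly
    content_type = "text/plain; charset=%s" % charset
    return payload, content_type


class UnicodeHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that encodes its response itself."""

    def do_GET(self):
        # Assumption: always answer in UTF-8, regardless of the request.
        payload, content_type = encode_body(u"testdata umlaut \u00f6\u00e4\u00fc")
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        # Content-Length counts bytes, not characters.
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)  # wfile only ever sees bytes
```

To try it locally, the handler could be passed to http.server.HTTPServer with an address such as ("127.0.0.1", 8000) and started with serve_forever(); the port is an arbitrary choice.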
msg57330 - Author: J. Peterson (isonno) Date: 2007-11-09 21:03
As implemented, it's not even possible to send UTF-8, because the "data =
str(data)" line only accepts seven-bit ASCII with the default encoding.
Since there's no easy way to change the encoding that "str()" uses, some
other mechanism should be available to do the encoding (as implied by
the "XXX" comment).
msg57331 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2007-11-09 21:08
Yes, it's possible to send UTF-8 data:

>>> data = u"testdata umlaut öäü".encode("utf-8")
>>> data
'testdata umlaut \xc3\xb6\xc3\xa4\xc3\xbc'
>>> type(data)
<type 'str'>
>>> data == str(data)
True
>>> data is str(data)
True

You have to encode your unicode string to a byte string.
u''.encode(encoding) always returns a string. str() on a string doesn't
alter a string. As you can clearly see it's a NOOP (no operation).
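
The difference between the failing call and the working one can be sketched as follows. In Python 2, str() on a unicode object implicitly encodes with the default codec (ASCII), which is what raises the UnicodeEncodeError. The sketch below makes that hidden step explicit so it also runs on Python 3; the helper name try_encode is hypothetical.

```python
text = u"testdata umlaut \u00f6\u00e4\u00fc"


def try_encode(value, codec):
    """Hypothetical helper: encoded bytes, or None if the codec cannot represent value."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError:
        return None


# Stands in for Python 2's implicit str(unicode_obj): ASCII cannot hold the umlauts.
ascii_result = try_encode(text, "ascii")

# Encoding explicitly with UTF-8 succeeds and yields plain bytes.
utf8_result = try_encode(text, "utf-8")
```

Here ascii_result is None while utf8_result holds the encoded bytes, so only the explicitly encoded value is safe to hand to the socket's write().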
msg57332 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2007-11-09 21:11
PS: http://www.joelonsoftware.com/articles/Unicode.html is a nice
article about unicode and character sets. Joel is amazing when it comes
to explaining complex problems in simple words.
History
Date User Action Args
2022-04-11 14:56:28 admin set github: 45751
2007-11-09 21:11:21 christian.heimes set messages: + msg57332
2007-11-09 21:08:46 christian.heimes set messages: + msg57331
2007-11-09 21:03:23 isonno set messages: + msg57330
2007-11-09 19:34:30 gvanrossum set status: open -> closed; resolution: not a bug
2007-11-09 15:57:34 christian.heimes set nosy: + christian.heimes; messages: + msg57312
2007-11-09 02:51:31 isonno set messages: + msg57294
2007-11-08 23:16:21 isonno set files: + TestUnicodeHTTP.py; messages: + msg57282
2007-11-08 23:13:40 isonno create