Author quentel
Recipients amaury.forgeotdarc, barry, eric.araujo, erob, flox, ggenellina, oopos, pebbe, pitrou, quentel, r.david.murray, tcourbon, tobias, v+python, vstinner
Date 2011-01-12.21:15:46
Message-id <1294866962.11.0.803094391738.issue4953@psf.upfronthosting.co.za>
Content
After a lot of thought and testing...

Glenn, we were both wrong: the encoding to use in FieldStorage is neither latin-1 nor sys.stdin.encoding. I tested form fields containing characters whose UTF-8 encoding includes bytes that are undefined in cp1252, and the calls to the decode() method with sys.stdin.encoding failed.
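A minimal sketch of that failure, not taken from the patch. The character "Á" is just one convenient example (its UTF-8 encoding is C3 81, and 0x81 is undefined in Python's cp1252 codec), and try_decode is a hypothetical helper introduced for the illustration:

```python
# 'Á' (U+00C1) is encoded in UTF-8 as the two bytes C3 81.
raw = "Á".encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")      # the encoding the bytes were produced with
as_cp1252 = try_decode(raw, "cp1252")   # 0x81 has no mapping in cp1252
as_latin1 = try_decode(raw, "latin-1")  # latin-1 maps every byte, so it cannot fail
```

Here as_cp1252 is None because decoding raises UnicodeDecodeError, while latin-1 "succeeds" but returns two unrelated characters instead of the one that was typed.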

The encoding used by the browser is the one declared for the page that contains the form, either in the Content-Type meta tag or in the Content-Type header. If none is declared, the default seems to vary between browsers, so it is definitely better to declare it explicitly.
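For illustration only, here is one way to read the charset parameter out of a Content-Type value with the standard email package. The helper name charset_from_content_type and its utf-8 fallback are assumptions made for this sketch, not part of the patch:

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, lowercased.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


declared = charset_from_content_type("text/html; charset=ISO-8859-1")
undeclared = charset_from_content_type("text/html")
```

When the header declares no charset, the function can only return a guess, which is the situation where browsers disagree.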

The stream_encoding argument of FieldStorage *must* be this encoding; in this version, it defaults to utf-8.

But this raises another problem when the CGI script has to print the data it received. The built-in print() function encodes the string with sys.stdout.encoding, and this fails if the string can't be encoded with it. That is the case on my PC, where sys.stdout.encoding is cp1252: it can't handle Arabic or Chinese characters.
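The problem can be reproduced without touching the real sys.stdout. The sketch below wraps an in-memory buffer in a text layer with a chosen encoding; can_print is a hypothetical helper, and the sample strings are arbitrary:

```python
import io


def can_print(text, stdout_encoding):
    """Report whether print() can write `text` to a stream using `stdout_encoding`."""
    # An in-memory stand-in for sys.stdout with the given encoding.
    fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding=stdout_encoding)
    try:
        print(text, file=fake_stdout)
        fake_stdout.flush()
    except UnicodeEncodeError:
        return False
    return True


arabic_cp1252 = can_print("مرحبا", "cp1252")
chinese_cp1252 = can_print("中文", "cp1252")
arabic_utf8 = can_print("مرحبا", "utf-8")
```

Both cp1252 calls return False, because print() raises UnicodeEncodeError on the text layer, while the utf-8 call returns True.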

The solution I have tried is to pass another argument, charset, to the FieldStorage constructor, defaulting to utf-8. It must be the same as the charset that the CGI script declares in its Content-Type header.

FieldStorage uses this argument to override the built-in print() function:
- flush the text layer of sys.stdout, in case calls to print() have been made before calling FieldStorage
- get the binary layer of stdout: out = sys.stdout.detach()
- define a function _print this way:
    def _print(*strings):
        # each item is encoded with charset and written back to back
        for item in strings:
            out.write(str(item).encode(charset))
        out.write(b'\r\n')
- override print():
    import builtins
    builtins.print = _print
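The steps above can be tried out without detaching the real sys.stdout. The following is a variation for experimenting, not the code of the patch: make_print is a hypothetical factory that takes any binary stream, and the sep and end keywords are additions of this sketch (the _print in the patch writes its items with no separator).

```python
import io


def make_print(out, charset="utf-8"):
    """Build a print-like function that writes encoded bytes to `out`."""
    def _print(*items, sep=" ", end="\r\n"):
        # Join the items as text first, then encode once with the page charset.
        text = sep.join(str(item) for item in items) + end
        out.write(text.encode(charset))
    return _print


buffer = io.BytesIO()  # stands in for the binary layer of sys.stdout
cgi_print = make_print(buffer, "utf-8")
cgi_print("Content-Type: text/html; charset=utf-8")
cgi_print()
cgi_print("name:", "مرحبا")
output = buffer.getvalue()
```

Because the bytes go straight to the binary stream, the result no longer depends on sys.stdout.encoding, only on the charset passed in.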

The print() function in the CGI script now sends the strings, encoded with "charset", to the binary layer of sys.stdout. All the tests I made with Arabic or Chinese input fields, or file names, succeed when using this patch; so do test_cgi and cgi_test (slightly modified).