Message 126065 - Python tracker


Author	v+python
Recipients	amaury.forgeotdarc, barry, eric.araujo, erob, flox, ggenellina, oopos, pebbe, pitrou, quentel, r.david.murray, tcourbon, tobias, v+python, vstinner
Date	2011-01-12.02:07:07
Message-id	<1294798034.23.0.645802705375.issue4953@psf.upfronthosting.co.za>
In-reply-to

Content

Aha!

Found a page <http://htmlpurifier.org/docs/enduser-utf8.html#whyutf8-support> which links to another page <http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html> that explains the behavior.

The synopsis is that browsers (all modern browsers) generally return form data in the same character encoding in which the form page itself was sent to the client.
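
To make that concrete, here is a small sketch (not taken from either linked page) of what the same field value looks like on the wire when the form page was served as UTF-8 versus CP-1252, and what happens when the receiver guesses the wrong encoding. The sample value "café" is just an illustration.

```python
from urllib.parse import quote, unquote

text = "café"  # illustrative field value, not from the original report

# What a browser would submit, depending on the encoding of the form page.
sent_from_utf8_page = quote(text, encoding="utf-8")
sent_from_cp1252_page = quote(text, encoding="cp1252")

# Decoding with the matching encoding recovers the original text.
ok_utf8 = unquote(sent_from_utf8_page, encoding="utf-8")
ok_cp1252 = unquote(sent_from_cp1252_page, encoding="cp1252")

# Decoding with the wrong encoding silently produces mojibake
# (unquote's default error handler is "replace").
wrong_1 = unquote(sent_from_utf8_page, encoding="cp1252")
wrong_2 = unquote(sent_from_cp1252_page, encoding="utf-8")
```

The two wire forms differ ("caf%C3%A9" versus "caf%E9"), so a server that assumes one encoding for all incoming forms will mangle data from pages served in the other.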

I suspect this explains the differences between what Pierre and I are reporting.  I suspect (but would appreciate confirmation from Pierre) that his web pages use
<meta http-equiv="Content-Type" content="text/html; charset=CP-1252" />
or else do not use such a meta tag, and his server is configured (or defaults) to send HTTP headers:
Content-Type: text/html; charset=CP-1252

By contrast, I do know that all my web pages are encoded in UTF-8 and have no meta tags, and that my CGI scripts send
Content-Type: text/html; charset=UTF-8
for all served form pages... and thus get back UTF-8 also, per the above explanation.

What does this mean for Python support for http.server and cgi?
Well, http.server, by default, sends Content-Type without a charset, except for directory listings, where it supplies charset= the result of sys.getfilesystemencoding().  So it is up to META tags to define the encoding, or for the browser to guess.  That's probably OK: for a single machine environment, it is likely that the data files are encoded in the default file system encoding, and it is likely the browser will guess that.  But it quickly breaks when going to a multiple machine or internet environment with different default encodings on different machines.  So if using http.server in such an environment, it is necessary to inform the client of the page encoding, either using META tags or by generating the Content-Type: HTTP header in the CGI script (the latter is what I'm doing for the forms and data of interest).
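
A minimal sketch of the second option, a CGI script that states the page encoding in its own Content-Type header.  The function name build_cgi_response and the sample page are made up for illustration; this is not code from my scripts or from http.server.

```python
def build_cgi_response(body_text, charset="utf-8"):
    """Return the raw bytes a CGI script would write to stdout.

    Hypothetical helper: the header names the charset explicitly and the
    body is encoded with that same charset, so the browser does not have
    to guess, and form data should come back in that encoding.
    """
    header = "Content-Type: text/html; charset={}\r\n\r\n".format(charset)
    return header.encode("ascii") + body_text.encode(charset)


page = '<form method="post"><input name="comment" value="café"></form>'
response = build_cgi_response(page)
```

In a real script the bytes would go to sys.stdout.buffer, so that the text layer of stdout cannot re-encode them with the platform default.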

What does it mean for cgi.py's FieldStorage?

Well, use of the default encoding can work in the single machine environment... so I guess there would be worse things than doing so, as Pierre has been doing.  But clearly, that isn't the complete solution.  The new parameter he proposes for FieldStorage can be used if the application can properly determine the likeliest encoding for the form data before calling it.

On a single machine system, that could be the default, as mentioned above.  On a single application web server, it could be some constant encoding used for all pages (like I use UTF-8 for all my pages).  For a multiple application web server, as long as each application uses a consistent encoding, that application could properly guess the encoding to pass to FieldStorage.  Or, if the application wishes to allow multiple encodings, as long as it can keep track of them, and use the right ones at the right time, it is welcome to.
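
As a sketch of the "keep track of them" case: an application that records which encoding each form page was served in can look that up before decoding the submission.  The table FORM_ENCODINGS, the paths in it, and the function decode_form are all hypothetical, and urllib.parse.parse_qs stands in here for whatever parser ends up receiving the encoding.

```python
from urllib.parse import parse_qs

# Hypothetical per-form record of the encoding each page was served in.
FORM_ENCODINGS = {
    "/guestbook": "utf-8",
    "/legacy-order": "cp1252",
}


def decode_form(raw_body, form_path, default="utf-8"):
    """Decode urlencoded form bytes using the encoding of the originating page."""
    encoding = FORM_ENCODINGS.get(form_path, default)
    # The urlencoded body itself is pure ASCII; only the percent-escaped
    # bytes inside it depend on the page encoding.
    return parse_qs(raw_body.decode("ascii"), encoding=encoding, errors="strict")


guestbook = decode_form(b"name=caf%C3%A9", "/guestbook")
legacy = decode_form(b"name=caf%E9", "/legacy-order")
```

Using errors="strict" is a design choice for the sketch: a submission decoded with the wrong encoding then tends to fail loudly with UnicodeDecodeError instead of quietly storing mojibake.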

How does this affect email?  Not at all, directly.

How does this affect cgi.py's use of email?
It means that cgi.py cannot use BytesFeedParser, in spite of what the standards say, so Pierre's approach of predecoding the headers is the correct one, since email doesn't offer an encoding parameter.  Since email doesn't support disk storage for file uploads, but buffers everything in memory, cgi.py can only pass headers to FeedParser.  And since email provides no feedback to indicate that end-of-headers was reached, cgi.py has to detect end-of-headers itself, which means that cgi.py must parse the MIME parts itself, so it can put the large parts on disk.  It means that the email package provides extremely little value to cgi.py.  Since web browsers and multipart/form-data use simple subsets of the full power of RFC822 headers, email could be replaced with the use of cgi.py's existing parse_header function, though that function should be deprecated; a copy could be moved inside the FieldStorage class and fixed a bit.
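
A rough sketch of that division of labor, assuming the part headers arrive on a binary stream: the caller detects the blank line that ends the headers, predecodes each header line with the form's encoding, and hands only those lines to FeedParser, leaving the body in the stream to be spooled wherever the caller likes.  The function read_part_headers and the sample part are made up for illustration; this is not Pierre's patch.

```python
import io
from email.parser import FeedParser


def read_part_headers(stream, encoding="utf-8"):
    """Read one MIME part's headers from a binary stream.

    Stops at the blank line (end-of-headers), which the caller must detect
    because FeedParser gives no signal for it.  The body is left unread.
    """
    parser = FeedParser()
    while True:
        line = stream.readline()
        if not line or line in (b"\r\n", b"\n"):
            break
        # Predecode with the form's encoding before feeding the str parser.
        parser.feed(line.decode(encoding))
    return parser.close()


part = io.BytesIO((
    'Content-Disposition: form-data; name="upload"; filename="résumé.txt"\r\n'
    "Content-Type: text/plain\r\n"
    "\r\n"
    "file body stays in the stream\r\n"
).encode("utf-8"))

headers = read_part_headers(part)
field_name = headers.get_param("name", header="content-disposition")
filename = headers.get_filename()
body = part.read()  # could instead be copied to a temporary file in chunks
```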
