Author v+python
Recipients amaury.forgeotdarc, barry, eric.araujo, erob, flox, ggenellina, gvanrossum, oopos, pebbe, pitrou, quentel, r.david.murray, tcourbon, tercero12, tobias, v+python
Date 2011-01-05.04:33:32
Message-id <1294202014.38.0.0840529690103.issue4953@psf.upfronthosting.co.za>
In-reply-to
Content
R. David said:
> From looking over the cgi code it is not clear to me whether Pierre's approach is simpler or more complex than the alternative approach of starting with binary input and decoding as appropriate.  From a consistency perspective I would prefer the latter, but I don't know if I'll have time to try it out before rc1.

I say:
I agree with R. David that an approach using the binary input seems more appropriate, as the HTTP byte stream is defined as binary.  Do the 3.2 beta email docs now include documentation for the binary input interfaces required to code that solution?  If not, could you provide appropriate guidance and review, should someone attempt such a solution?

The remaining points below are only concerns; they may be totally irrelevant, and I may simply be too ignorant of how the code works to realize it.  Hopefully someone who understands the code can comment and explain.

I believe the proper solution is to make cgi work whether or not sys.stdin has already been converted to a binary stream: if it has not, cgi should dive down to the underlying binary stream using detach().  The data should then be processed as binary and decoded once, when the proper decoding parameters are known.  The default encoding seems to differ between platforms, but the binary stream is standardized.  It looks like new code was added to preprocess the MIME data into chunks to be fed to the email parser.  I can believe such code could be written correctly (though I can't speak to whether this patch is correct), but it seems redundant, inefficient, and error-prone to do that work once outside the email parser and again inside it.
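A minimal sketch of the "get to the binary stream, then decode once" idea, not the patch's code.  The helper names as_binary_stream and read_body are hypothetical, and the sketch assumes the text stream is an io.TextIOWrapper (as the preopened sys.stdin normally is) and that the charset has already been learned from the request headers:

```python
import io


def as_binary_stream(stream):
    """Return a binary stream for `stream` (hypothetical helper).

    If `stream` is a text wrapper, detach() hands back the underlying
    binary buffer; the wrapper itself is unusable afterwards.
    """
    if isinstance(stream, io.TextIOWrapper):
        return stream.detach()
    return stream  # assumed to be binary already


def read_body(stream, content_length, charset):
    """Read the body as bytes, then decode exactly once (hypothetical helper)."""
    raw = as_binary_stream(stream).read(content_length)
    return raw.decode(charset)


# Simulate a text-mode stdin whose platform default encoding (ASCII here)
# does not match the UTF-8 request body.
payload = "name=café".encode("utf-8")
fake_stdin = io.TextIOWrapper(io.BytesIO(payload), encoding="ascii")

body = read_body(fake_stdin, len(payload), "utf-8")
print(body)
```

Because the wrapper's encoding is never used, the result does not depend on the platform default.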

I also doubt that self.fp.encoding is consistent from platform to platform.  The HTTP bytestream is binary, and is either self-describing or, for the parts that are not self-describing, declared by the HTTP or HTML standards.  The default platform encoding used for the preopened sys.stdin is not particularly relevant and may introduce mojibake-type bugs, decoding errors on some inputs, and/or platform inconsistencies; that default generally seems to be where self.fp.encoding, used in various places in this patch, comes from.
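To illustrate the kind of failure meant here, a small sketch (the sample bytes and the choice of encodings are assumptions for illustration only) showing the same UTF-8 bytes decoded under three different "platform default" encodings:

```python
# The bytes a browser might send for a UTF-8 form field.
raw = "café".encode("utf-8")

# Decoded with the declared encoding: correct.
right = raw.decode("utf-8")

# Decoded with a Latin-1 default: no error, but silent mojibake.
wrong = raw.decode("latin-1")

# Decoded with an ASCII default: an outright decoding error.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(repr(right), repr(wrong), ascii_failed)
```

The same input gives a correct string, a corrupted string, or an exception depending only on which default encoding happens to be in effect.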

Regarding the binary vs. text issue: when using both binary and text interfaces on output streams, it is necessary to flush between text and binary writes to preserve the proper sequencing of data in the output.  For input, is it possible that mixing text and binary input could result in the binary input missing data that has already been preloaded into the text buffer?  For CGI programs, no one should have done any text input before calling the CGI functions, so perhaps this is not a concern either, and there probably isn't any buffering on socket streams (the usual CGI use case).  Still, I see both binary and text input functions used in this patch, so this may be another point where someone could explain why such a mix is or isn't a problem.
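A small experiment along these lines, using an in-memory stream rather than a real CGI socket (so it says nothing definite about the patch itself).  It assumes CPython's io.TextIOWrapper, which reads ahead in chunks from the underlying binary stream:

```python
import io

data = b"header: x\nBODYBYTES"

# Mixed access: one text-mode readline(), then a binary read on the
# same underlying stream.
text = io.TextIOWrapper(io.BytesIO(data), encoding="ascii")
first_line = text.readline()
# The text layer has already pulled more than one line into its own
# buffer, so the binary layer no longer sees the bytes after the header.
rest_mixed = text.buffer.read()

# Binary-only access: detach before any text read, then split manually.
binary = io.TextIOWrapper(io.BytesIO(data), encoding="ascii").detach()
header = binary.readline()
rest_binary = binary.read()

print(repr(first_line), repr(rest_mixed))
print(repr(header), repr(rest_binary))
```

In this setup the mixed read loses the body bytes to the text buffer, while the binary-only read sees all of them; whether a real CGI input stream behaves the same way depends on how it is buffered.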