Message 97071 - Python tracker


Author	loewis
Recipients	loewis, pitrou, r.david.murray
Date	2009-12-30.23:49:23
SpamBayes Score	5.5406613e-09
Marked as misclassified	No
Message-id	<1262216965.73.0.255021516016.issue7606@psf.upfronthosting.co.za>
In-reply-to

Content
David: I think it's a little bit more complicated. RFC 2616 says that
the value of a header is *TEXT, which is defined as

   The TEXT rule is only used for descriptive field contents and values 
   that are not intended to be interpreted by the message parser. Words 
   of *TEXT MAY contain characters from character sets other than 
   ISO-8859-1 only when encoded according to the rules of RFC 2047

So I think send_header should change in the following way:

a) if isinstance(value, bytes): send value as-is
b) if value can be encoded in latin-1: encode in latin-1, then send as-is
c) otherwise: MIME-encode as UTF-8, using the following algorithm
   1. count the number of non-ASCII characters, by encoding with
      ("ascii", "ignore") and comparing the lengths of input and result
   2. if fewer than 10% of the characters are non-ASCII, use the Q encoding
   3. otherwise, use the B encoding

The purpose of the algorithm in c) would be that text containing only a
few non-Latin-1 characters stays largely readable (the Q encoding leaves
ASCII letters and digits as they are) even if the receiver fails to decode
the header.
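
To make steps a) to c) concrete, here is a rough sketch of the sending side. The name encode_header_value and the two helpers are made up for illustration and are not existing http.server API. The sketch also ignores the 75-character limit that RFC 2047 places on a single encoded-word, so long values would still need to be split.

```python
import base64

# Bytes that may appear literally in a Q-encoded word (RFC 2047, section 5).
_Q_SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/"
)


def _q_encode(data):
    """Hypothetical helper: wrap UTF-8 bytes in a Q encoded-word."""
    out = []
    for byte in data:
        if byte in _Q_SAFE:
            out.append(chr(byte))
        elif byte == 0x20:
            out.append("_")          # space is written as underscore
        else:
            out.append("=%02X" % byte)
    return "=?utf-8?q?" + "".join(out) + "?="


def _b_encode(data):
    """Hypothetical helper: wrap UTF-8 bytes in a B (base64) encoded-word."""
    return "=?utf-8?b?" + base64.b64encode(data).decode("ascii") + "?="


def encode_header_value(value):
    """Return the bytes to put on the wire for a header value."""
    # a) bytes are sent as-is
    if isinstance(value, bytes):
        return value
    # b) anything that fits into latin-1 is sent as latin-1
    try:
        return value.encode("latin-1")
    except UnicodeEncodeError:
        pass
    # c) otherwise MIME-encode as UTF-8
    # c.1) count non-ASCII characters by comparing lengths
    non_ascii = len(value) - len(value.encode("ascii", "ignore"))
    raw = value.encode("utf-8")
    # c.2) fewer than 10% non-ASCII: Q encoding keeps the text readable
    if non_ascii * 10 < len(value):
        return _q_encode(raw).encode("ascii")
    # c.3) otherwise B encoding is more compact
    return _b_encode(raw).encode("ascii")
```

With this sketch, "café" goes out as plain latin-1, a mostly-ASCII string with a single euro sign gets a Q encoded-word, and an all-CJK string gets a B encoded-word.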

The same change would also apply to the client-side of sending headers.
On the receiving side, we should offer an option to decode headers (both
for client and server); this should be an option because senders may not
comply with RFC 2616. Reading should then proceed as follows:
1. check whether there are MIME markers in the text
2. if so, MIME-decode
3. if not, decode as latin-1
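
The reading steps could look roughly like the sketch below. It leans on email.header.decode_header for the MIME part; decode_header_value and the marker regex are hypothetical names chosen for this example, and error handling for malformed encoded-words or unknown charsets is left out.

```python
import re
from email.header import decode_header

# Loose check for an RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
_MIME_MARKER = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def decode_header_value(raw):
    """Turn received header bytes into str (opt-in, as proposed above)."""
    # latin-1 maps every byte to a character, so this step cannot fail.
    text = raw.decode("latin-1")
    # 1. check whether there are MIME markers in the text
    if not _MIME_MARKER.search(text):
        # 3. if not, the latin-1 decoding is the result
        return text
    # 2. if so, MIME-decode each chunk
    parts = []
    for chunk, charset in decode_header(text):
        if isinstance(chunk, bytes):
            # Chunks without a charset are the unencoded stretches.
            parts.append(chunk.decode(charset or "latin-1", "replace"))
        else:
            parts.append(chunk)
    return "".join(parts)
```

Because senders may not comply with RFC 2616, a caller would only invoke something like this when decoding was explicitly requested.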

History
Date	User	Action	Args
2009-12-30 23:49:25	loewis	set	recipients: + loewis, pitrou, r.david.murray
2009-12-30 23:49:25	loewis	set	messageid: <1262216965.73.0.255021516016.issue7606@psf.upfronthosting.co.za>
2009-12-30 23:49:24	loewis	link	issue7606 messages
2009-12-30 23:49:23	loewis	create