Title: email.Header ignores maxlinelen when wrapping encoded words
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 3.3, Python 2.7
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: dandre, r.david.murray
Priority: normal Keywords: easy

Created on 2011-07-28 13:15 by dandre, last changed 2012-03-26 23:26 by r.david.murray. This issue is now closed.

File name Uploaded Description Edit dandre, 2011-07-28 13:15
Messages (10)
msg141290 - (view) Author: (dandre) Date: 2011-07-28 13:15
Hello there, first of all, thank you all for Python. I didn't know it was so great; otherwise I'd have checked it out before.

Using 2.7.2 [MSC v.1500 32 bit (Intel)] for now.

Playing with email.header, I came across an odd behaviour.

Attached please find a script which demonstrates that
1) maxlinelen is ignored and
2) header fields are split in a manner not suitable for all systems involved in email processing.

The script will print the headers. They're all the same and extend over two lines; neither should probably be the case, although it doesn't hurt in itself.

If you uncomment the SMTP part of the script and send that email to yourself, you'll probably see that the From: and To: headers are misinterpreted by your email client; I tested this with two different email providers. Looking at the raw data as received, it appears that in at least one case a system along the way added a comma between the two "To:" lines. This is something one should easily be able to avoid, if only maxlinelen were obeyed...

Having taken a look at, it appears to me that the semantics of _encode_chunks() does not exactly match its documentation due to the results of (at least some) charset.header_encode() calls. What seems to happen is that charset.header_encode() can return several lines already, and it will apparently split the line without any deeper knowledge. As a result, the Header module will not apply its sophisticated maxlinelen/splitchars logic. The header is split at some pretty arbitrary point and not all systems appear to be happy with that, although the relevant RFC apparently only reads "SHOULD" in this regard.
msg141293 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-07-28 13:57
You are using Header incorrectly.  It should look more like this:

    th = _e_header.Header(maxlinelen=200, header_name='To')
    th.append(wtc[:-1], charset='utf-8')

This results in:

  To: ABCDEFGH =?utf-8?b?0ILYgeC5hOC8kuGPiuGauw==?= <>

This is valid per the RFC, whereas encoding the address itself is not.  A compliant mailer should be able to handle the Subject line from your version correctly, but not the To or From lines.
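
For readers on Python 3, the append-per-token pattern can be sketched roughly as follows. The display name and the address `user@example.com` are placeholders, not values from the attached script, and whether the library picks base64 or quoted-printable for the encoded word is left to it.

```python
from email.header import Header, decode_header, make_header

# Placeholder values for illustration; not taken from the attached script.
display_name = 'Jürgen Müller'
address = '<user@example.com>'

header = Header(maxlinelen=200, header_name='To')
# Only the display name is encoded; the address itself stays plain ASCII.
header.append(display_name, charset='utf-8')
header.append(address, charset='us-ascii')

encoded = header.encode()
print(encoded)

# Decoding the result should give back the original display name.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

Note that no space is added by hand between the two tokens; Header inserts one at the transition from the encoded to the unencoded chunk.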

The fact that you don't want the trailing spaces is an artifact of the API.  Using this API requires more knowledge of the RFCs than anyone should want to have.  In Python 3.3 we will be introducing a new API in the email package that will make all of this *much* simpler.

The maxlinelen issue does appear to be a bug, though.
msg141298 - (view) Author: (dandre) Date: 2011-07-28 14:16
Thank you for pointing out my wrong usage of Header.

Does this mean I should call Header.append() for each token, with tokens being separated by WS, or probably rather COMMASPACE in the case of To:? Or does it mean I should call Header.append() for each "logical" token of From: and To:, let's say, for the two parts returned by email.utils.parseaddr()?

Please excuse me if this is not the right place to discuss this, but I'm unaware of any place on the Web where these questions are addressed.
msg141302 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-07-28 14:47
They probably ought to be discussed in our docs :(

The only thing that may be encoded in an address is the "display name" (the first part returned by parseaddr).  (Actually the domain name could be IDNA encoded, but we don't support that directly in email.  3.3 will.)  So the easiest way to code this is probably to take a list of parseaddr'ed addresses, and append display_name as utf-8, and the address formatted as '<%s>,' as ascii.  Omitting the comma for the last one, of course.  Not very elegant, but I believe it should work.
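
A minimal sketch of that recipe for Python 3, assuming the input is a list of "Display Name <addr>" strings. The helper name `build_address_header` and the sample addresses are made up for illustration and are not part of the email package.

```python
from email.header import Header
from email.utils import parseaddr


def build_address_header(addresses, header_name='To', maxlinelen=78):
    """Hypothetical helper: encode display names, keep addresses ASCII."""
    header = Header(header_name=header_name, maxlinelen=maxlinelen)
    last = len(addresses) - 1
    for index, raw in enumerate(addresses):
        display_name, addr = parseaddr(raw)
        if display_name:
            header.append(display_name, charset='utf-8')
        # A comma separates addresses; the last one gets none.
        separator = '' if index == last else ','
        header.append('<%s>%s' % (addr, separator), charset='us-ascii')
    return header


# Sample addresses are placeholders.
to_header = build_address_header([
    'Jürgen Müller <juergen@example.com>',
    'Zoë <zoe@example.org>',
])
encoded_to = to_header.encode()
print(encoded_to)
```

Because every display name is appended as utf-8, pure-ASCII names end up as encoded words too. That is valid, just less readable in the raw header.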

If you want to get fancy you can split out the domain and run it through the IDNA codec to encode it before passing it in as part of the ASCII token.
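
One possible way to do the IDNA step, using Python's built-in `idna` codec (which, as far as I know, implements the older IDNA 2003 rules). The helper name `idna_encode_address` and the sample domain are assumptions for illustration.

```python
def idna_encode_address(addr):
    """Hypothetical helper: IDNA-encode only the domain part of addr."""
    local_part, at_sign, domain = addr.rpartition('@')
    if not at_sign:
        return addr  # nothing that looks like a domain
    ascii_domain = domain.encode('idna').decode('ascii')
    return local_part + '@' + ascii_domain


# Placeholder address with a non-ASCII domain.
ascii_addr = idna_encode_address('user@bücher.example')
print(ascii_addr)
```

The result is pure ASCII, so it can be passed to Header.append() as part of the us-ascii token.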

Header puts spaces between ASCII and non-ASCII tokens automatically, so you don't have to add them to either the encoded or unencoded tokens.
msg141303 - (view) Author: (dandre) Date: 2011-07-28 15:29
Thanks again for the clarification.

I may not look like it ;), but my fanciness has to go even further. So, for the sake of completeness, it appears that the world is now up to UTF-8 local parts of email addresses, and punycode for the domain?

But then there's RFC 5335 which seems to go further, although, frankly speaking, I'd love to see examples in RFCs every now and then, and it sounds like it's not exactly supported by too many mailers along the way.

Either way, if the Mozilla example is something to live up to, I hope I'll be allowed to have WS between a non-UTF-8 '<', the UTF-8 local part and the '@', because email.Headers will always create that, right?

Is there a place I can register such a fancy email address (AND understand the website and webmailer's UI) for testing purposes?

Hats off to you who are dealing with these ugly compromises, keeping an outdated underlying standard on life support.
msg141306 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-07-28 15:53
Interesting thread.  I have my eye on supporting 5335 in the revised email package, but it is a secondary goal to getting an improved API for the already-accepted RFCs.

You will note that the encoded word local part is *not* standard.  I think the email package may decode them anyway, but just like TB it provides no mechanism for creating them in the first place since it is not RFC compliant.  You could open a feature request for adding support for doing so (as an *optional* feature :), which I would then try to get in to 3.3.  (It puzzles me why it *isn't* allowed by the RFC, by the way).

To do it yourself now, you will probably have to create a temporary Header, pass it just the local part, call its encode or __str__ to get the encoded word (which won't have any spaces since it will be the only token), and then format that in to your rebuilt address string.
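
A rough sketch of that workaround on Python 3. Keep in mind the result is *not* RFC compliant; the helper name `encode_local_part` and the sample address are invented for illustration.

```python
from email.header import Header


def encode_local_part(local_part, charset='utf-8'):
    """Hypothetical helper: return local_part as one RFC 2047 encoded word."""
    # A temporary Header holding a single short token yields a single
    # encoded word with no surrounding spaces.
    return Header(local_part, charset=charset).encode()


# Placeholder local part and domain.
word = encode_local_part('jürgen')
rebuilt = '<%s@example.com>' % word
print(rebuilt)
```

The rebuilt string is plain ASCII, so it can then be appended to the real header as a us-ascii token.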
msg141312 - (view) Author: (dandre) Date: 2011-07-28 19:01
I made a test and, interestingly, I /can/ send an email to myself setting up the header like this:

    h.append(b'My Name',         charset='utf-8')
    h.append(b' < ',             charset='us-ascii')
    h.append(b'my',              charset='utf-8')
    h.append(b'@email.address>', charset='us-ascii')

The message in my Inbox will then have a To: header along the lines of
"=?utf-8?q?My Name?= <=?utf-8?q?my?=@email.address>
so the mailers are sure nice to me.

The startling part of it all seems to be that such email addresses are already out there and seem to be supported by enough mailers, albeit not by enough client-side systems.

With this non-standard approach and RFC 5335, I feel tempted to hope for a helper method which finds "the" ("an") canonical form of an email address...
msg141313 - (view) Author: (dandre) Date: 2011-07-28 19:11
Erm, sorry.
The header, of course, does not have much to do with the address the email is to be delivered to.
With my provider's setup, the mailer will reply that =?utf-8?q?my?= is not a known user.
Which could change, of course...
msg141314 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-07-28 19:18
Yes, exactly.  It is a valid ascii token so MTAs pass it through.  It's the receiving system that needs to be willing to decode it.
msg156881 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-26 23:26
Looking at this again, as_string defaults to maxheaderlen=0, which means no wrapping.  In Python 3.2 you can pass it a maxheaderlen of 78 to get the correct behavior for passing the message to smtp.
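
A small sketch of the difference on Python 3, assuming a bare Message with a long non-ASCII Subject. The subject text and body are placeholders.

```python
from email.header import Header
from email.message import Message

# Placeholder subject, long enough to need folding at 78 columns.
subject_text = ' '.join(['Grüße aus München'] * 10)

msg = Message()
msg['Subject'] = Header(subject_text, 'utf-8', header_name='Subject')
msg.set_payload('body text')

# Default: maxheaderlen=0, so headers are not wrapped.
unwrapped = msg.as_string()
# Explicit limit: long headers are folded onto continuation lines.
wrapped = msg.as_string(maxheaderlen=78)

print(len(unwrapped.splitlines()), len(wrapped.splitlines()))
print(max(len(line) for line in wrapped.splitlines()))
```

The wrapped form is the one to hand to smtplib, for example via sendmail().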
Date  User  Action  Args
2012-03-26 23:26:15  r.david.murray  set  status: open -> closed; resolution: not a bug; messages: + msg156881; stage: resolved
2011-07-28 19:18:14  r.david.murray  set  messages: + msg141314
2011-07-28 19:11:43  dandre  set  messages: + msg141313
2011-07-28 19:01:20  dandre  set  messages: + msg141312
2011-07-28 15:53:20  r.david.murray  set  messages: + msg141306
2011-07-28 15:29:38  dandre  set  messages: + msg141303
2011-07-28 14:47:53  r.david.murray  set  messages: + msg141302
2011-07-28 14:16:58  dandre  set  messages: + msg141298
2011-07-28 13:57:56  r.david.murray  set  assignee: r.david.murray
2011-07-28 13:57:00  r.david.murray  set  nosy: + r.david.murray; versions: + Python 3.2, Python 3.3; messages: + msg141293; keywords: + easy; title: email.Header corupts international email header and ignores maxlinelen in some cases -> email.Header ignores maxlinelen when wrapping encoded words
2011-07-28 13:15:06  dandre  create