classification
Title: Getting an email's subject is error-prone
Type: behavior Stage: resolved
Components: email Versions: Python 3.8, Python 3.7, Python 3.6, Python 3.5, Python 3.4, Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Alex Corcoles, barry, r.david.murray, xtreak
Priority: normal Keywords:

Created on 2018-10-10 16:55 by Alex Corcoles, last changed 2018-10-11 15:47 by Alex Corcoles. This issue is now closed.

Messages (8)
msg327481 - (view) Author: Alex Corcoles (Alex Corcoles) Date: 2018-10-10 16:55
Hi,

This is something that has hit us a few times, as we write a significant quantity of software which parses email messages.

The thing is, we use email.header.decode_header to decode the Subject: header and it is pretty common for headers to be word-wrapped. If they are, decode_header will return a string with newlines in it.

This is something which is unexpected for many people, and can cause bugs which are very difficult to detect in code review or testing, as it's easy to not trigger wordwrapping if not done deliberately.

We would humbly suggest providing a friendly way to get an email's subject in the expected fashion (i.e. with no newlines), or pointing out this caveat in the docs (or maybe changing decode_header to remove newlines itself).

Kind regards,

Álex
msg327487 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-10-10 18:05
Use the new email policies in python3.  It handles all the decoding for you.  I'm afraid you are on your own for python2.
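
A minimal sketch of what "handles all the decoding for you" can look like in practice, assuming Python 3.6 or later. The sample message and its RFC 2047 encoded subject are made up for illustration. The manual decode_header path is shown only for comparison.

```python
import email
from email import policy
from email.header import decode_header

# Hypothetical message whose subject is an RFC 2047 encoded word.
raw = b"Subject: =?utf-8?q?Caf=C3=A9?=\n\nbody\n"

# Legacy behaviour (no policy given, i.e. compat32): the header comes back
# undecoded, so the encoded word has to be decoded by hand.
legacy = email.message_from_bytes(raw)
manual_subject = "".join(
    part.decode(charset or "ascii") if isinstance(part, bytes) else part
    for part, charset in decode_header(legacy["Subject"])
)

# With policy.default the header is decoded on access.
modern = email.message_from_bytes(raw, policy=policy.default)
policy_subject = modern["Subject"]

print(repr(legacy["Subject"]))
print(manual_subject)
print(policy_subject)
```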
msg327488 - (view) Author: Alex Corcoles (Alex Corcoles) Date: 2018-10-10 18:21
To clarify (and maybe help someone who might come across this), you mean:

In [1]: message_text = """To: alex@corcoles.net
   ...: Subject: ** ACKNOWLEDGEMENT Host Alert: archerc7.bcn.int.pdp7.net is DOWN
   ...:  **
   ...: User-Agent: Heirloom mailx 12.5 7/5/10
   ...: MIME-Version: 1.0
   ...: Content-Type: text/plain; charset=us-ascii
   ...: Content-Transfer-Encoding: 7bit
   ...: 
   ...: ***** Nagios *****
   ...: """
In [2]: import email
In [4]: message = email.message_from_string(message_text)
In [5]: message.get('Subject')
Out[5]: '** ACKNOWLEDGEMENT Host Alert: archerc7.bcn.int.pdp7.net is DOWN\n **'

In [7]: from email import policy
In [8]: message = email.message_from_string(message_text, policy=policy.HTTP)
In [9]: message.get('Subject')
Out[9]: '** ACKNOWLEDGEMENT Host Alert: archerc7.bcn.int.pdp7.net is DOWN **'

Yeah, there's a bundled policy that does what I need, but I think it's not very intuitive.

I get that the stdlib is deliberately low level in these parts, and that it's more of a building block for creating higher level libraries on top of it. Still, I feel that getting an email's subject in a friendly fashion should be easy and intuitive in the stdlib, or the stdlib's docs should clearly tell you to go and look for a higher level library because email is hard.

OTOH, working with mail sucks and should be discouraged, so if you want to close this definitely I won't complain.
msg327499 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-10-10 22:05
The new policies *make* the email library that higher level library, that was pretty much the whole point :)  I don't know how to make getting the fully decoded subject more intuitive than:

  msg['subject']

The fact that you have to specify a policy is due to backward compatibility concerns, and there's not really any way around that.  That's the only difference between your two examples (other than the fact that the second one does what you want :).

Note that you *really* want to be using message_from_bytes, and for email either policy.default or policy.SMTP.  This *is* documented in the python3 docs.  If you don't find them clear, then an issue to improve the docs would be welcome.
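
A short sketch of that recommendation, reusing the folded Nagios subject from the earlier session (with some headers trimmed for brevity). It parses the same bytes twice, once with no policy and once with policy.default, so the difference in the returned subject is visible.

```python
import email
from email import policy

# The subject line is folded onto a second line, as in the session above.
raw = (
    b"To: alex@corcoles.net\n"
    b"Subject: ** ACKNOWLEDGEMENT Host Alert: archerc7.bcn.int.pdp7.net is DOWN\n"
    b" **\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"\n"
    b"***** Nagios *****\n"
)

# No policy given: the legacy compat32 policy keeps the fold in the value.
legacy_subject = email.message_from_bytes(raw)["Subject"]

# policy.default unfolds (and decodes) the header when it is accessed.
modern_subject = email.message_from_bytes(raw, policy=policy.default)["Subject"]

print(repr(legacy_subject))
print(repr(modern_subject))
```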

Since python2 is approaching EOL, we could also start transitioning to policy.default actually being the *default*.  That will take two release cycles (one that will generate a deprecation notice that the default is going to change, and another that will actually make the change).
msg327525 - (view) Author: Alex Corcoles (Alex Corcoles) Date: 2018-10-11 07:59
Well, I think that having to choose the "HTTP" policy to get a message's subject without newlines goes against the expectations of anyone who is not very knowledgeable about email.

It's not very easy to deduce that, out of all the available policies, HTTP is the one that has this effect (short of writing your own).

It's not obvious that a subject can have newlines, as I don't think I've ever seen a MUA that does not hide them.

You can be bitten quite easily by that (we have, more than once).

It's the stdlib maintainers' prerogative to decide that they are going to provide low-level libraries (and in general, I agree with that, high-level stdlibs have a lot of problems), but at least I'd include some warning like:

"Email is an old and annoying protocol, and parsing email is full of annoyances and exceptions. email provides low-level building blocks to handle email in detail. If you want high-level processing we advise you to look at libraries that build on it".

In any case, email.policy provides more hints as to headers being wordwrapped, and while it's not ideal, it certainly is an improvement WRT Python 2. This bug has helped me, and I hope someone will read it when Googling for the same problem. So while I think some more could be done, if you close this I won't complain.

Thanks,

Álex
msg327537 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-10-11 15:31
Can you demonstrate that policy.default and policy.SMTP produce a subject with newlines?  If they do, that is a serious bug.

Please don't reopen the issue.  I'll reopen it if you convince me there is a bug :)

The statement you suggest we add is not appropriate[*], since the python3 email library *is* a high level library now.  If it isn't handling something for you when you use policy.default or policy.SMTP, then that is a bug.  (Well, its MIME Multipart handling still leaves something to be desired...you still have to know more than is optimal about multiparts, but the hooks are there for someone to improve that aspect further.)

[*] The part about the protocol is certainly true, though :)
msg327538 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-10-11 15:33
I'm guessing you got confused by the fact that the HTTP policy doesn't *add* new lines when *serializing*.  If you can point to the part of the docs you read that produced that confusion, maybe we can improve it.
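
A minimal sketch of that distinction, assuming Python 3.6 or later and a made-up long subject. The policy's line length only affects how the header is written out on serialization. Reading the subject back gives a string without newlines either way.

```python
from email import policy
from email.message import EmailMessage

# Hypothetical subject, well over the 78-character default line length.
subject = " ".join(["word"] * 40)

msg = EmailMessage()  # EmailMessage uses policy.default unless told otherwise
msg["Subject"] = subject

# policy.default wraps long headers when serializing...
folded = msg.as_string(policy=policy.default)
# ...whereas policy.HTTP has no maximum line length, so nothing is wrapped.
unfolded = msg.as_string(policy=policy.HTTP)

print(folded)
print(unfolded)

# Accessing the header is unaffected by how it would be serialized.
print(msg["Subject"] == subject)
```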
msg327539 - (view) Author: Alex Corcoles (Alex Corcoles) Date: 2018-10-11 15:47
Duh, I'm an idiot, I only tested policy.HTTP and *NOT* supplying a policy (which I believed was equivalent to using policy.default).

policy.default and policy.SMTP do indeed produce a newline-less subject.

I only tested policy.HTTP because the docs talk about unlimited line-length, but that's not a problem of the docs, but rather, a problem of my idiocy.

Given this, I agree with everything you said. Personally I'd prefer if policy.default was the default, but I guess that won't change due to backwards compatibility reasons and I guess it'd be excessive to create a new set of function calls and deprecate the old, so I'm happy if this remains closed.

Apologies for my stupidity,

Álex
History
Date                 User            Action  Args
2018-10-11 15:47:30  Alex Corcoles   set     messages: + msg327539
2018-10-11 15:33:43  r.david.murray  set     messages: + msg327538
2018-10-11 15:31:21  r.david.murray  set     status: open -> closed; messages: + msg327537
2018-10-11 07:59:17  Alex Corcoles   set     status: closed -> open; messages: + msg327525
2018-10-10 22:05:12  r.david.murray  set     status: open -> closed; messages: + msg327499
2018-10-10 18:21:25  Alex Corcoles   set     status: closed -> open; messages: + msg327488
2018-10-10 18:05:41  r.david.murray  set     status: open -> closed; resolution: out of date; messages: + msg327487; stage: resolved
2018-10-10 17:41:15  xtreak          set     nosy: + xtreak
2018-10-10 16:55:29  Alex Corcoles   create