Issue34954
Created on 2018-10-10 16:55 by Alex Corcoles, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (8) | |||
---|---|---|---|
msg327481 - (view) | Author: Alex Corcoles (Alex Corcoles) | Date: 2018-10-10 16:55 | |
Hi, This is something that has hit us a few times, as we write a significant quantity of software which parses email messages. The thing is, we use email.header.decode_header to decode the Subject: header, and it is pretty common for headers to be word-wrapped. If they are, decode_header will return a string with newlines in it. This is unexpected for many people, and can cause bugs which are very difficult to detect in code review or testing, as it's easy not to trigger word-wrapping unless you do it deliberately. We would humbly suggest providing a friendly way to get an email's subject in the expected fashion (i.e. with no newlines), or pointing out this caveat in the docs (or maybe changing decode_header to remove newlines itself). Kind regards, Álex |
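A minimal sketch of the behaviour described in the report, assuming a plain ASCII subject that was folded onto a second line. The subject text and host name are made up for illustration; only `email.header.decode_header` comes from the report.

```python
from email.header import decode_header

# Hypothetical folded Subject value: the line break plus leading space is
# how a word-wrapped header looks before it is unfolded.
folded_subject = "** ACKNOWLEDGEMENT Host Alert: host.example.com is DOWN\n **"

# decode_header returns a list of (value, charset) pairs. With no RFC 2047
# encoded words in the input, the string comes back unchanged, newline included.
parts = decode_header(folded_subject)

contains_newline = any(
    "\n" in value for value, _charset in parts if isinstance(value, str)
)
print(parts)
print(contains_newline)
```

Note that `decode_header` only deals with encoded words; unfolding the header is left to the caller.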
|||
msg327487 - (view) | Author: R. David Murray (r.david.murray) | Date: 2018-10-10 18:05 | |
Use the new email policies in python3. It handles all the decoding for you. I'm afraid you are on your own for python2. |
|||
msg327488 - (view) | Author: Alex Corcoles (Alex Corcoles) | Date: 2018-10-10 18:21 | |
To clarify (and maybe help someone who might come across this), you mean:

    In [1]: message_text = """To: alex@corcoles.net
       ...: Subject: ** ACKNOWLEDGEMENT Host Alert: archerc7.bcn.int.pdp7.net is DOWN
       ...:  **
       ...: User-Agent: Heirloom mailx 12.5 7/5/10
       ...: MIME-Version: 1.0
       ...: Content-Type: text/plain; charset=us-ascii
       ...: Content-Transfer-Encoding: 7bit
       ...:
       ...: ***** Nagios *****
       ...: """

    In [2]: import email

    In [4]: message = email.message_from_string(message_text)

    In [5]: message.get('Subject')
    Out[5]: '** ACKNOWLEDGEMENT Host Alert: archerc7.bcn.int.pdp7.net is DOWN\n **'

    In [7]: from email import policy

    In [8]: message = email.message_from_string(message_text, policy=policy.HTTP)

    In [9]: message.get('Subject')
    Out[9]: '** ACKNOWLEDGEMENT Host Alert: archerc7.bcn.int.pdp7.net is DOWN **'

Yeah, there's a bundled policy that does what I need, but I think it's not very intuitive. I get that the stdlib is deliberately low level in these parts, and that it's more of a building block for creating higher level libraries on top of it, but I still feel that getting an email's subject in a friendly fashion should be easy and intuitive in the stdlib, or the stdlib's docs should clearly tell you to go and look for a higher level library because email is hard. OTOH, working with mail sucks and should be discouraged, so if you want to close this I definitely won't complain. |
|||
msg327499 - (view) | Author: R. David Murray (r.david.murray) | Date: 2018-10-10 22:05 | |
The new policies *make* the email library that higher level library, that was pretty much the whole point :)

I don't know how to make getting the fully decoded subject more intuitive than:

    msg['subject']

The fact that you have to specify a policy is due to backward compatibility concerns, and there's not really any way around that. That's the only difference between your two examples (other than the fact that the second one does what you want :).

Note that you *really* want to be using message_from_bytes, and for email either policy.default or policy.SMTP. This *is* documented in the python3 docs. If you don't find them clear, then an issue to improve the docs would be welcome.

Since python2 is approaching EOL, we could also start transitioning to policy.default actually being the *default*. That will take two release cycles (one that will generate a deprecation notice that the default is going to change, and another that will actually make the change). |
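As a rough illustration of the advice above, the following sketch parses the same raw bytes twice: once without a policy and once with `policy.default`. The message content is a shortened, made-up stand-in for the one quoted earlier in the thread.

```python
from email import message_from_bytes, policy

# Hypothetical raw message with a Subject folded onto a second line.
RAW = (
    b"To: someone@example.com\n"
    b"Subject: ** ACKNOWLEDGEMENT Host Alert: host.example.com is DOWN\n"
    b" **\n"
    b"\n"
    b"***** Nagios *****\n"
)

# No policy given: the backward-compatible behaviour keeps the fold.
legacy = message_from_bytes(RAW)
legacy_subject = legacy["subject"]

# policy.default (or policy.SMTP): the header is unfolded and decoded.
modern = message_from_bytes(RAW, policy=policy.default)
modern_subject = modern["subject"]

print(repr(legacy_subject))
print(repr(str(modern_subject)))
```

The only difference between the two calls is the `policy` argument, which matches the point made in the message.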
|||
msg327525 - (view) | Author: Alex Corcoles (Alex Corcoles) | Date: 2018-10-11 07:59 | |
Well, I think that having to choose the "HTTP" policy to get a message's subject without newlines goes against the expectations of anyone who is not very knowledgeable about email. It's not very easy to deduce that, out of all the available policies, HTTP is the one that has this effect (short of writing your own). It's not obvious that a subject can have newlines, as I don't think I've ever seen a MUA that does not hide them. You can be bitten quite easily by that (we have, more than once). It's the stdlib maintainers' prerogative to decide that they are going to provide low-level libraries (and in general, I agree with that, high-level stdlibs have a lot of problems), but at least I'd include some warning like: "Email is an old and annoying protocol, and parsing email is full of annoyances and exceptions. email provides low-level building blocks to handle email in detail. If you want high-level processing we advise you to look at libraries that build on it". In any case, email.policy provides more hints as to headers being word-wrapped, and while it's not ideal, it certainly is an improvement WRT Python 2, so this bug has helped me and I hope maybe someone will read it when Googling for the same problem. So while I think some more could be done, if you close this I won't complain. Thanks, Álex |
|||
msg327537 - (view) | Author: R. David Murray (r.david.murray) | Date: 2018-10-11 15:31 | |
Can you demonstrate that policy.default and policy.SMTP produce a subject with newlines? If they do, that is a serious bug. Please don't reopen the issue. I'll reopen it if you convince me there is a bug :) The statement you suggest we add is not appropriate[*], since the python3 email library *is* a high level library now. If it isn't handling something for you when you use policy.default or policy.SMTP, then that is a bug. (Well, its MIME Multipart handling still leaves something to be desired...you still have to know more than is optimal about multiparts, but the hooks are there for someone to improve that aspect further.) [*] The part about the protocol is certainly true, though :) |
|||
msg327538 - (view) | Author: R. David Murray (r.david.murray) | Date: 2018-10-11 15:33 | |
I'm guessing you got confused by the fact that the HTTP policy doesn't *add* new lines when *serializing*. If you can point to the part of the docs you read that produced that confusion, maybe we can improve it. |
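The serialization difference mentioned here can be sketched as follows. This is an illustrative example, not taken from the thread: the long subject and the helper `header_line_count` are made up, and the message has a single header and no body to keep the output easy to inspect.

```python
from email import policy
from email.message import EmailMessage

# Hypothetical subject that is well over the usual 78-character line limit.
long_subject = " ".join(["word"] * 30)

msg = EmailMessage()
msg["Subject"] = long_subject


def header_line_count(message, pol):
    """Serialize with the given policy and count physical header lines."""
    text = message.as_string(policy=pol)
    header_block = text.split(pol.linesep * 2, 1)[0]
    return len(header_block.split(pol.linesep))


smtp_lines = header_line_count(msg, policy.SMTP)  # folded when written out
http_lines = header_line_count(msg, policy.HTTP)  # no line length limit

# Reading the header back is unaffected by how it would be serialized.
subject_when_read = str(msg["Subject"])

print(smtp_lines, http_lines)
```

In other words, the policy's line length setting governs how headers are written out, while the value read back through `msg["Subject"]` has no line breaks either way.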
|||
msg327539 - (view) | Author: Alex Corcoles (Alex Corcoles) | Date: 2018-10-11 15:47 | |
Duh, I'm an idiot, I only tested policy.HTTP and *NOT* supplying a policy (which I believed was equivalent to using policy.default). policy.default and policy.SMTP do indeed produce a newline-less subject. I only tested policy.HTTP because the docs talk about unlimited line-length, but that's not a problem of the docs, but rather a problem of my idiocy. Given this, I agree with everything you said. Personally I'd prefer it if policy.default was the default, but I guess that won't change due to backwards compatibility reasons, and I guess it'd be excessive to create a new set of function calls and deprecate the old ones, so I'm happy if this remains closed. Apologies for my stupidity, Álex |
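The misunderstanding described in this last message, that omitting the policy is the same as passing `policy.default`, can be checked with a short sketch. The sample message text is made up for illustration.

```python
import email
from email import policy
from email.message import EmailMessage

raw = "Subject: hello\n\nbody\n"  # hypothetical minimal message

# Without a policy argument the parser falls back to the
# backward-compatible policy.compat32, not policy.default.
implicit = email.message_from_string(raw)
explicit = email.message_from_string(raw, policy=policy.default)

implicit_uses_compat32 = implicit.policy is policy.compat32
explicit_is_email_message = isinstance(explicit, EmailMessage)

print(implicit_uses_compat32, explicit_is_email_message)
```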
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:07 | admin | set | github: 79135 |
2018-10-11 15:47:30 | Alex Corcoles | set | messages: + msg327539 |
2018-10-11 15:33:43 | r.david.murray | set | messages: + msg327538 |
2018-10-11 15:31:21 | r.david.murray | set | status: open -> closed messages: + msg327537 |
2018-10-11 07:59:17 | Alex Corcoles | set | status: closed -> open messages: + msg327525 |
2018-10-10 22:05:12 | r.david.murray | set | status: open -> closed messages: + msg327499 |
2018-10-10 18:21:25 | Alex Corcoles | set | status: closed -> open messages: + msg327488 |
2018-10-10 18:05:41 | r.david.murray | set | status: open -> closed resolution: out of date messages: + msg327487 stage: resolved |
2018-10-10 17:41:15 | xtreak | set | nosy: + xtreak |
2018-10-10 16:55:29 | Alex Corcoles | create |