classification
Title: email.utils.parsedate_to_datetime() should return None when date cannot be parsed
Type: behavior Stage:
Components: email Versions: Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, ncoghlan, r.david.murray, timb07
Priority: normal Keywords:

Created on 2017-06-15 23:12 by timb07, last changed 2017-08-08 01:34 by ncoghlan.

Pull Requests
URL Status Linked Edit
PR 2229 closed timb07, 2017-06-16 00:52
PR 2254 open timb07, 2017-06-17 01:38
Messages (9)
msg296137 - (view) Author: Tim Bell (timb07) * Date: 2017-06-15 23:12
Python 3.6 documentation for email.utils.parsedate_to_datetime() says "Performs the same function as parsedate(), but on success returns a datetime." The docs for parsedate() say "If it succeeds in parsing the date...; otherwise None will be returned." By implication, parsedate_to_datetime() should return None when the date can't be parsed.

There are two different failure modes for parsedate_to_datetime():

1. When _parsedate_tz() fails to parse the date and returns None:

>>> from email.utils import parsedate_to_datetime
>>> parsedate_to_datetime('0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/email/utils.py", line 210, in parsedate_to_datetime
    *dtuple, tz = _parsedate_tz(data)
TypeError: 'NoneType' object is not iterable


2. When _parsedate_tz() succeeds, but conversion to datetime.datetime fails:

>>> parsedate_to_datetime('Tue, 06 Jun 2017 27:39:33 +0600')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/email/utils.py", line 214, in parsedate_to_datetime
    tzinfo=datetime.timezone(datetime.timedelta(seconds=tz)))
ValueError: hour must be in 0..23


Note that this second case is the one that led me to this issue. I am using the email package to parse spam emails for subsequent analysis, and a certain group of spam emails contain invalid hour fields in their Date header. I don't require the invalid Date header to be converted to a datetime.datetime, but accessing email_message['date'] to access the header value as a string triggers the ValueError exception. I can work around this with a custom email policy, but the observed behaviour does seem to contradict the documented behaviour.

Also, in relation to https://bugs.python.org/issue15925, r.david.murray commented "Oh, and I'm purposely allowing parsedate_to_datetime throw exceptions.  I suppose that should be documented, but that's a separate issue." However, no argument for why parsedate_to_datetime throwing exceptions is desired was given.
msg296141 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-06-16 01:25
The problem is that if it returns None on parse failure, then you can't tell the difference between the header not existing and the date not being parseable.  I don't have a solution for this problem.  Suggestions welcome.  (Note that this is only a problem in the new policy, where the parsing is done automatically; in the compat32 policy you have to apply parsedate yourself, so you can tell the difference between a non-existent header and a failed date parse).
msg296153 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-06-16 03:17
I'm not sure it would be any better, but what about defining something like a DateFormatDefect and returning that?
msg296154 - (view) Author: Tim Bell (timb07) * Date: 2017-06-16 03:19
My proposed solution (in https://github.com/python/cpython/pull/2229) is two-part:

1. change parsedate_to_datetime() to return None rather than raising an exception; and

2. change headerregistry.DateHeader.parse() to check for None being returned from parsedate_to_datetime(), and to add a defect; the datetime attribute is set to None (as if the Date header were missing), but the header still evaluates as a string to the supplied header value.

I'm not sure what the use case is for distinguishing between a missing Date header and an invalid date value, but can't that be distinguished by the different defects added to the header?

In any case, if I'm not fully grasping the context and parsedate_to_datetime() should continue to throw exceptions, then a slightly different modification to DateHeader to catch those exceptions would seem sensible, and would address my use case.
msg296215 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-06-16 17:44
OK, I think I've reloaded my brain at least partially on this topic.

I think my original reason for having prasedate_to_datetime raise errors is that it was my opinion that that is the way it should work, and that parsedate should work the same way (raise errors, not return None).  The logic is that parsedate is not itself part of the *parser* and it is the parser that has a contract to not raise errors but instead register defects.  When you call parsedate from your code (that is, not as part of the parser), it ought to raise an error, IMO, and so I made parsedate_to_datetime do that.

I think I understand the logic behind the original behavior: None as the 'error value', thus being consistent with the parser in not raising errors.  But I think our understanding of Python best practices has evolved (solidified?) since the days when the parsedate API was designed, and raising errors is better.

*However*, consistency is also important, so if the consensus is that parsedate_to_datetime should parallel the parsedate API, I'm not going to argue with it.

Regardless of that, however, I think your notion, Tim, that the *string* value of a date header with an invalid date in it should be the invalid string is a good one.  One can check the validity by inspecting the datetime argument.  Regardless of whether errors are reported via None or an exception, the headerregistry method should catch the error and set the value accordingly (to the invalid string on error, to the normalized string if valid).

A couple of notes on the PR you submitted. (1) this change affects only the new policies, so the test should go somewhere in the new tests, not in test_email, which means you don't need to muck with the test support infrastructure in that file.  There are already date header tests in test_headerregistry, so add the new test there. (2) I'm moving us away from putting 'test emails' in separate files, so include the text under test in the test file.  You only need the date string in the date header test, but you can add your sample (changed to meet Brett's child filter, although I bet any children who will be looking at the python source code will already have seen many such spam emails) to test_inversion (which currently only contains one test message in msg_params, add yours  to that list and make it two :)

As for the decision on the return value vs exception, let's see which side Barry comes down on :)
msg296226 - (view) Author: Tim Bell (timb07) * Date: 2017-06-17 01:52
Thanks for the feedback. I've made a new pull request which addresses the points raised.
msg296231 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-06-17 03:53
Thanks for all the great detailed background, and the suggested approaches.  I think there are a couple of constraints that would be good to resolve.

* parsedate_to_datetime() is documented as "performing the same function as parsedate()" with an explicit difference in the good path return value, but no explicit difference in the bad path.  So the implication is pretty strong that it should return None when the date cannot be parsed.  Have a consistent API with parsedate() is important, and documented, so I think it's reasonable that the implementation should match.

* Clearly, header parsing can't raise exceptions.

* It should be easy to tell the difference between a missing Date header and a bogus date header.  Yes, this is an important use case.  For example, Mailman may do certain things when the Date header is missing (e.g. it could reject the message, or it could clobber the header with the server's time, etc.).  Yet if the header exists and is bogus, then you might want to preserve the bogus header for forensic or idempotency reasons.

It seems to me that the way to resolve this is to fix parsedate_to_datetime() so that it returns None on failure, but to add a (new) defect in DateHeader.parse() when that happens, e.g. InvalidDateDefect.  Then, as Tim suggestions and it seems like RDM agrees, that the invalid string value be used as the string value of the header in that case.

Thoughts?
msg296235 - (view) Author: Tim Bell (timb07) * Date: 2017-06-17 07:44
I've updated the pull request to incorporate Barry's suggestion of a new defect for this situation, InvalidDateDefect.
msg296361 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-06-19 17:33
I'll make one argument in favor of retaining the exception, and if that doesn't fly then I agree to the solution and will try to review the PR soon.

The argument is this: if parsedate_to_datetime raises an error, you get information about *why* the date was invalid, which you don't get from a 'None' return.  It is my thought that this would be the most useful behavior for the cases where you call it directly (otherwise, why call it directly?)

(And as far as the doc issue goes, you are correct Barry that the current docs don't document the difference in the error case; I noted in another issue that that "should be fixed"...which is only the case now if you agree to my argument above :)
History
Date User Action Args
2017-08-08 01:34:07ncoghlansetnosy: + ncoghlan
2017-06-19 17:33:22r.david.murraysetmessages: + msg296361
2017-06-17 07:44:10timb07setmessages: + msg296235
2017-06-17 03:53:48barrysetmessages: + msg296231
2017-06-17 01:52:49timb07setmessages: + msg296226
2017-06-17 01:38:04timb07setpull_requests: + pull_request2304
2017-06-16 17:44:50r.david.murraysetmessages: + msg296215
2017-06-16 03:19:14timb07setmessages: + msg296154
2017-06-16 03:17:23barrysetmessages: + msg296153
2017-06-16 01:25:02r.david.murraysetmessages: + msg296141
2017-06-16 00:52:19timb07setpull_requests: + pull_request2274
2017-06-15 23:12:43timb07create