classification
Title: email.utils.parsedate_to_datetime() should return None when date cannot be parsed
Type: behavior Stage: resolved
Components: email Versions: Python 3.10
process
Status: open Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: barry, lukasz.langa, miss-islington, ncoghlan, r.david.murray, serhiy.storchaka, sim0n, timb07
Priority: normal Keywords: patch

Created on 2017-06-15 23:12 by timb07, last changed 2020-10-27 07:37 by serhiy.storchaka.

Pull Requests
URL Status Linked Edit
PR 2229 closed timb07, 2017-06-16 00:52
PR 2254 closed timb07, 2017-06-17 01:38
PR 10783 closed timb07, 2018-11-29 00:28
PR 22090 merged sim0n, 2020-09-04 16:22
Messages (18)
msg296137 - (view) Author: Tim Bell (timb07) * Date: 2017-06-15 23:12
Python 3.6 documentation for email.utils.parsedate_to_datetime() says "Performs the same function as parsedate(), but on success returns a datetime." The docs for parsedate() say "If it succeeds in parsing the date...; otherwise None will be returned." By implication, parsedate_to_datetime() should return None when the date can't be parsed.

There are two different failure modes for parsedate_to_datetime():

1. When _parsedate_tz() fails to parse the date and returns None:

>>> from email.utils import parsedate_to_datetime
>>> parsedate_to_datetime('0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/email/utils.py", line 210, in parsedate_to_datetime
    *dtuple, tz = _parsedate_tz(data)
TypeError: 'NoneType' object is not iterable


2. When _parsedate_tz() succeeds, but conversion to datetime.datetime fails:

>>> parsedate_to_datetime('Tue, 06 Jun 2017 27:39:33 +0600')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/email/utils.py", line 214, in parsedate_to_datetime
    tzinfo=datetime.timezone(datetime.timedelta(seconds=tz)))
ValueError: hour must be in 0..23


Note that this second case is the one that led me to this issue. I am using the email package to parse spam emails for subsequent analysis, and a certain group of spam emails contain invalid hour fields in their Date header. I don't require the invalid Date header to be converted to a datetime.datetime, but accessing email_message['date'] to access the header value as a string triggers the ValueError exception. I can work around this with a custom email policy, but the observed behaviour does seem to contradict the documented behaviour.

Also, in relation to https://bugs.python.org/issue15925, r.david.murray commented "Oh, and I'm purposely allowing parsedate_to_datetime throw exceptions.  I suppose that should be documented, but that's a separate issue." However, no argument for why parsedate_to_datetime throwing exceptions is desired was given.
msg296141 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-06-16 01:25
The problem is that if it returns None on parse failure, then you can't tell the difference between the header not existing and the date not being parseable.  I don't have a solution for this problem.  Suggestions welcome.  (Note that this is only a problem in the new policy, where the parsing is done automatically; in the compat32 policy you have to apply parsedate yourself, so you can tell the difference between a non-existent header and a failed date parse).
msg296153 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-06-16 03:17
I'm not sure it would be any better, but what about defining something like a DateFormatDefect and returning that?
msg296154 - (view) Author: Tim Bell (timb07) * Date: 2017-06-16 03:19
My proposed solution (in https://github.com/python/cpython/pull/2229) is two-part:

1. change parsedate_to_datetime() to return None rather than raising an exception; and

2. change headerregistry.DateHeader.parse() to check for None being returned from parsedate_to_datetime(), and to add a defect; the datetime attribute is set to None (as if the Date header were missing), but the header still evaluates as a string to the supplied header value.

I'm not sure what the use case is for distinguishing between a missing Date header and an invalid date value, but can't that be distinguished by the different defects added to the header?

In any case, if I'm not fully grasping the context and parsedate_to_datetime() should continue to throw exceptions, then a slightly different modification to DateHeader to catch those exceptions would seem sensible, and would address my use case.
msg296215 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-06-16 17:44
OK, I think I've reloaded my brain at least partially on this topic.

I think my original reason for having prasedate_to_datetime raise errors is that it was my opinion that that is the way it should work, and that parsedate should work the same way (raise errors, not return None).  The logic is that parsedate is not itself part of the *parser* and it is the parser that has a contract to not raise errors but instead register defects.  When you call parsedate from your code (that is, not as part of the parser), it ought to raise an error, IMO, and so I made parsedate_to_datetime do that.

I think I understand the logic behind the original behavior: None as the 'error value', thus being consistent with the parser in not raising errors.  But I think our understanding of Python best practices has evolved (solidified?) since the days when the parsedate API was designed, and raising errors is better.

*However*, consistency is also important, so if the consensus is that parsedate_to_datetime should parallel the parsedate API, I'm not going to argue with it.

Regardless of that, however, I think your notion, Tim, that the *string* value of a date header with an invalid date in it should be the invalid string is a good one.  One can check the validity by inspecting the datetime argument.  Regardless of whether errors are reported via None or an exception, the headerregistry method should catch the error and set the value accordingly (to the invalid string on error, to the normalized string if valid).

A couple of notes on the PR you submitted. (1) this change affects only the new policies, so the test should go somewhere in the new tests, not in test_email, which means you don't need to muck with the test support infrastructure in that file.  There are already date header tests in test_headerregistry, so add the new test there. (2) I'm moving us away from putting 'test emails' in separate files, so include the text under test in the test file.  You only need the date string in the date header test, but you can add your sample (changed to meet Brett's child filter, although I bet any children who will be looking at the python source code will already have seen many such spam emails) to test_inversion (which currently only contains one test message in msg_params, add yours  to that list and make it two :)

As for the decision on the return value vs exception, let's see which side Barry comes down on :)
msg296226 - (view) Author: Tim Bell (timb07) * Date: 2017-06-17 01:52
Thanks for the feedback. I've made a new pull request which addresses the points raised.
msg296231 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-06-17 03:53
Thanks for all the great detailed background, and the suggested approaches.  I think there are a couple of constraints that would be good to resolve.

* parsedate_to_datetime() is documented as "performing the same function as parsedate()" with an explicit difference in the good path return value, but no explicit difference in the bad path.  So the implication is pretty strong that it should return None when the date cannot be parsed.  Have a consistent API with parsedate() is important, and documented, so I think it's reasonable that the implementation should match.

* Clearly, header parsing can't raise exceptions.

* It should be easy to tell the difference between a missing Date header and a bogus date header.  Yes, this is an important use case.  For example, Mailman may do certain things when the Date header is missing (e.g. it could reject the message, or it could clobber the header with the server's time, etc.).  Yet if the header exists and is bogus, then you might want to preserve the bogus header for forensic or idempotency reasons.

It seems to me that the way to resolve this is to fix parsedate_to_datetime() so that it returns None on failure, but to add a (new) defect in DateHeader.parse() when that happens, e.g. InvalidDateDefect.  Then, as Tim suggestions and it seems like RDM agrees, that the invalid string value be used as the string value of the header in that case.

Thoughts?
msg296235 - (view) Author: Tim Bell (timb07) * Date: 2017-06-17 07:44
I've updated the pull request to incorporate Barry's suggestion of a new defect for this situation, InvalidDateDefect.
msg296361 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-06-19 17:33
I'll make one argument in favor of retaining the exception, and if that doesn't fly then I agree to the solution and will try to review the PR soon.

The argument is this: if parsedate_to_datetime raises an error, you get information about *why* the date was invalid, which you don't get from a 'None' return.  It is my thought that this would be the most useful behavior for the cases where you call it directly (otherwise, why call it directly?)

(And as far as the doc issue goes, you are correct Barry that the current docs don't document the difference in the error case; I noted in another issue that that "should be fixed"...which is only the case now if you agree to my argument above :)
msg301618 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-09-07 19:34
So, while we do have a conflict between consistency and utility, I think @r.david.murry 's last comment has convinced me that raising the exception is more helpful.  I think we should do that, fixing the documentation and giving up on the consistency issue.
msg330628 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-11-28 18:23
Reported again in issue #35342.

The existing PR is close to complete, but needs adjusted for the fact that we want (and want to document) that the utility raises errors (ie: catch the error in the header parser rather than having the utility return None).
msg330648 - (view) Author: Tim Bell (timb07) * Date: 2018-11-29 00:52
I've addressed the points in the last few comments and created a new PR (10783).
msg376384 - (view) Author: Georges (sim0n) * Date: 2020-09-04 16:24
As I think it is still important to have this fixed and considering the original PR was closed, I have created a new PR based on the original one while implementing the requested changes.

https://github.com/python/cpython/pull/22090
msg379381 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2020-10-22 23:27
@sim0n - I added a comment to your open PR.

My main question for the rest of the group is whether we can and should backport this.  Given the new defect class being introduced, it seems like this should only land in 3.10.  Thoughts?
msg379412 - (view) Author: Georges (sim0n) * Date: 2020-10-23 08:39
@barry Thank you for your input on the PR.

From what I understood this PR was nearly ready and only missing a small addition to the documentation which I added. So it took me a bit to go through it all :-).

I actually don't see how *parsedate_to_datetime* would ever return None. It is *_parsedate_tz* which returns None on bogus input, in which case *parsedate_to_datetime* raises a TypeError.
This is also covered in the tests, so those should be fine.

In order to continue I suggest to fix the documentation on *parsedate_to_datetime*, remove the mention of it returning None and replacing it with it possibly returning TypeError in case of an invalid date input.

Does that make sense ?


Regarding the backporting, as a user of this I must admit that it would be much appreciated if this could be backported :-).
msg379464 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2020-10-23 18:42
Aside: I noticed that on _parseaddr.py:68, there's a bare `return`.  That should really be `return None` (EIBTI).  Can you fix that in your PR?

I think it's confusing to raise both TypeError and ValueError.  I suggest we check the `None` return from _parsedate_tz() and raise ValueError explicitly in that case, avoiding the implicit TypeError on the failing tuple unpack.

+1 on removing the mention of returning None from the documentation.  Then with the above, it would document raising ValueError on invalid date input.

As for backporting, I'm nosing Ned and Ɓukasz to weigh in.  Given that the patch is adding a new defect class (which it should), this won't break existing code, but it does mean that existing code would have different semantics depending on the patch version release of 3.9, 3.8, and 3.7.  I'm not completely comfortable with that, but let's see what the RMs say.  I guess I'm currently -0 on backporting.
msg379706 - (view) Author: miss-islington (miss-islington) Date: 2020-10-27 00:31
New changeset 303aac8c56609290e122eecc14c038e9b1e4174a by Georges Toth in branch 'master':
bpo-30681: Support invalid date format or value in email Date header (GH-22090)
https://github.com/python/cpython/commit/303aac8c56609290e122eecc14c038e9b1e4174a
msg379742 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-10-27 07:37
>>> email.utils.parsedate_to_datetime(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/email/utils.py", line 200, in parsedate_to_datetime
    raise ValueError('Invalid date value or format "%s"' % str(data))
ValueError: Invalid date value or format "None"

First, the date value is None, not "None".

Second, why not just return None? parsedate() can be used in code like:

   parsedata(headers.get('Date'))

None is an expected argument if the header "Date" is absent. parsedate_to_datetime() is not compatible with parsedata() in this case.

It was a regression introduced in issue16714. Before that parsedate_to_datetime(None) returned None.
History
Date User Action Args
2020-10-27 07:37:48serhiy.storchakasetstatus: closed -> open
nosy: + serhiy.storchaka
messages: + msg379742

2020-10-27 00:31:37barrysetstatus: open -> closed
stage: patch review -> resolved
resolution: fixed
versions: - Python 3.8, Python 3.9
2020-10-27 00:31:14miss-islingtonsetnosy: + miss-islington
messages: + msg379706
2020-10-26 16:32:29barrylinkissue42155 superseder
2020-10-26 04:45:41ned.deilysetnosy: - ned.deily

versions: - Python 3.7
2020-10-23 18:42:32barrysetnosy: + ned.deily, lukasz.langa
messages: + msg379464
2020-10-23 08:39:29sim0nsetmessages: + msg379412
2020-10-22 23:27:09barrysetmessages: + msg379381
2020-09-04 16:24:46sim0nsetmessages: + msg376384
versions: + Python 3.9, Python 3.10
2020-09-04 16:22:46sim0nsetnosy: + sim0n
pull_requests: + pull_request21176
2018-11-29 00:52:08timb07setmessages: + msg330648
2018-11-29 00:28:45timb07setkeywords: + patch
stage: patch review
pull_requests: + pull_request10025
2018-11-28 18:23:33r.david.murraysetmessages: + msg330628
versions: + Python 3.7, Python 3.8, - Python 3.5, Python 3.6
2018-11-28 18:17:59r.david.murraylinkissue35342 superseder
2017-09-07 19:34:06barrysetmessages: + msg301618
2017-08-08 01:34:07ncoghlansetnosy: + ncoghlan
2017-06-19 17:33:22r.david.murraysetmessages: + msg296361
2017-06-17 07:44:10timb07setmessages: + msg296235
2017-06-17 03:53:48barrysetmessages: + msg296231
2017-06-17 01:52:49timb07setmessages: + msg296226
2017-06-17 01:38:04timb07setpull_requests: + pull_request2304
2017-06-16 17:44:50r.david.murraysetmessages: + msg296215
2017-06-16 03:19:14timb07setmessages: + msg296154
2017-06-16 03:17:23barrysetmessages: + msg296153
2017-06-16 01:25:02r.david.murraysetmessages: + msg296141
2017-06-16 00:52:19timb07setpull_requests: + pull_request2274
2017-06-15 23:12:43timb07create