Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

email.utils.parsedate_to_datetime() should return None when date cannot be parsed #74866

Open
timb07 mannequin opened this issue Jun 15, 2017 · 19 comments
Open

email.utils.parsedate_to_datetime() should return None when date cannot be parsed #74866

timb07 mannequin opened this issue Jun 15, 2017 · 19 comments
Labels
3.10 only security fixes stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

@timb07
Copy link
Mannequin

timb07 mannequin commented Jun 15, 2017

BPO 30681
Nosy @warsaw, @ncoghlan, @bitdancer, @ambv, @serhiy-storchaka, @sim0nx, @timb07, @miss-islington
PRs
  • bpo-30681: Change error handling to return None in case of invalid date #2229
  • bpo-30681: Support invalid date format or value #2254
  • bpo-30681: Support invalid date format or value in email Date header #10783
  • bpo-30681: Support invalid date format or value in email Date header #22090
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2017-06-15.23:12:43.396>
    labels = ['type-bug', 'expert-email', '3.10']
    title = 'email.utils.parsedate_to_datetime() should return None when date cannot be parsed'
    updated_at = <Date 2020-10-27.07:37:48.458>
    user = 'https://github.com/timb07'

    bugs.python.org fields:

    activity = <Date 2020-10-27.07:37:48.458>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['email']
    creation = <Date 2017-06-15.23:12:43.396>
    creator = 'timb07'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 30681
    keywords = ['patch']
    message_count = 18.0
    messages = ['296137', '296141', '296153', '296154', '296215', '296226', '296231', '296235', '296361', '301618', '330628', '330648', '376384', '379381', '379412', '379464', '379706', '379742']
    nosy_count = 8.0
    nosy_names = ['barry', 'ncoghlan', 'r.david.murray', 'lukasz.langa', 'serhiy.storchaka', 'sim0n', 'timb07', 'miss-islington']
    pr_nums = ['2229', '2254', '10783', '22090']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue30681'
    versions = ['Python 3.10']

    @timb07
    Copy link
    Mannequin Author

    timb07 mannequin commented Jun 15, 2017

    Python 3.6 documentation for email.utils.parsedate_to_datetime() says "Performs the same function as parsedate(), but on success returns a datetime." The docs for parsedate() say "If it succeeds in parsing the date...; otherwise None will be returned." By implication, parsedate_to_datetime() should return None when the date can't be parsed.

    There are two different failure modes for parsedate_to_datetime():

    1. When _parsedate_tz() fails to parse the date and returns None:
    >>> from email.utils import parsedate_to_datetime
    >>> parsedate_to_datetime('0')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.6/email/utils.py", line 210, in parsedate_to_datetime
        *dtuple, tz = _parsedate_tz(data)
    TypeError: 'NoneType' object is not iterable
    1. When _parsedate_tz() succeeds, but conversion to datetime.datetime fails:
    >>> parsedate_to_datetime('Tue, 06 Jun 2017 27:39:33 +0600')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.6/email/utils.py", line 214, in parsedate_to_datetime
        tzinfo=datetime.timezone(datetime.timedelta(seconds=tz)))
    ValueError: hour must be in 0..23

    Note that this second case is the one that led me to this issue. I am using the email package to parse spam emails for subsequent analysis, and a certain group of spam emails contain invalid hour fields in their Date header. I don't require the invalid Date header to be converted to a datetime.datetime, but accessing email_message['date'] to access the header value as a string triggers the ValueError exception. I can work around this with a custom email policy, but the observed behaviour does seem to contradict the documented behaviour.

    Also, in relation to https://bugs.python.org/issue15925, r.david.murray commented "Oh, and I'm purposely allowing parsedate_to_datetime throw exceptions. I suppose that should be documented, but that's a separate issue." However, no argument for why parsedate_to_datetime throwing exceptions is desired was given.

    @timb07 timb07 mannequin added topic-email type-bug An unexpected behavior, bug, or error labels Jun 15, 2017
    @bitdancer
    Copy link
    Member

    The problem is that if it returns None on parse failure, then you can't tell the difference between the header not existing and the date not being parseable. I don't have a solution for this problem. Suggestions welcome. (Note that this is only a problem in the new policy, where the parsing is done automatically; in the compat32 policy you have to apply parsedate yourself, so you can tell the difference between a non-existent header and a failed date parse).

    @warsaw
    Copy link
    Member

    warsaw commented Jun 16, 2017

    I'm not sure it would be any better, but what about defining something like a DateFormatDefect and returning that?

    @timb07
    Copy link
    Mannequin Author

    timb07 mannequin commented Jun 16, 2017

    My proposed solution (in #2229) is two-part:

    1. change parsedate_to_datetime() to return None rather than raising an exception; and

    2. change headerregistry.DateHeader.parse() to check for None being returned from parsedate_to_datetime(), and to add a defect; the datetime attribute is set to None (as if the Date header were missing), but the header still evaluates as a string to the supplied header value.

    I'm not sure what the use case is for distinguishing between a missing Date header and an invalid date value, but can't that be distinguished by the different defects added to the header?

    In any case, if I'm not fully grasping the context and parsedate_to_datetime() should continue to throw exceptions, then a slightly different modification to DateHeader to catch those exceptions would seem sensible, and would address my use case.

    @bitdancer
    Copy link
    Member

    OK, I think I've reloaded my brain at least partially on this topic.

    I think my original reason for having prasedate_to_datetime raise errors is that it was my opinion that that is the way it should work, and that parsedate should work the same way (raise errors, not return None). The logic is that parsedate is not itself part of the *parser* and it is the parser that has a contract to not raise errors but instead register defects. When you call parsedate from your code (that is, not as part of the parser), it ought to raise an error, IMO, and so I made parsedate_to_datetime do that.

    I think I understand the logic behind the original behavior: None as the 'error value', thus being consistent with the parser in not raising errors. But I think our understanding of Python best practices has evolved (solidified?) since the days when the parsedate API was designed, and raising errors is better.

    *However*, consistency is also important, so if the consensus is that parsedate_to_datetime should parallel the parsedate API, I'm not going to argue with it.

    Regardless of that, however, I think your notion, Tim, that the *string* value of a date header with an invalid date in it should be the invalid string is a good one. One can check the validity by inspecting the datetime argument. Regardless of whether errors are reported via None or an exception, the headerregistry method should catch the error and set the value accordingly (to the invalid string on error, to the normalized string if valid).

    A couple of notes on the PR you submitted. (1) this change affects only the new policies, so the test should go somewhere in the new tests, not in test_email, which means you don't need to muck with the test support infrastructure in that file. There are already date header tests in test_headerregistry, so add the new test there. (2) I'm moving us away from putting 'test emails' in separate files, so include the text under test in the test file. You only need the date string in the date header test, but you can add your sample (changed to meet Brett's child filter, although I bet any children who will be looking at the python source code will already have seen many such spam emails) to test_inversion (which currently only contains one test message in msg_params, add yours to that list and make it two :)

    As for the decision on the return value vs exception, let's see which side Barry comes down on :)

    @timb07
    Copy link
    Mannequin Author

    timb07 mannequin commented Jun 17, 2017

    Thanks for the feedback. I've made a new pull request which addresses the points raised.

    @warsaw
    Copy link
    Member

    warsaw commented Jun 17, 2017

    Thanks for all the great detailed background, and the suggested approaches. I think there are a couple of constraints that would be good to resolve.

    • parsedate_to_datetime() is documented as "performing the same function as parsedate()" with an explicit difference in the good path return value, but no explicit difference in the bad path. So the implication is pretty strong that it should return None when the date cannot be parsed. Have a consistent API with parsedate() is important, and documented, so I think it's reasonable that the implementation should match.

    • Clearly, header parsing can't raise exceptions.

    • It should be easy to tell the difference between a missing Date header and a bogus date header. Yes, this is an important use case. For example, Mailman may do certain things when the Date header is missing (e.g. it could reject the message, or it could clobber the header with the server's time, etc.). Yet if the header exists and is bogus, then you might want to preserve the bogus header for forensic or idempotency reasons.

    It seems to me that the way to resolve this is to fix parsedate_to_datetime() so that it returns None on failure, but to add a (new) defect in DateHeader.parse() when that happens, e.g. InvalidDateDefect. Then, as Tim suggestions and it seems like RDM agrees, that the invalid string value be used as the string value of the header in that case.

    Thoughts?

    @timb07
    Copy link
    Mannequin Author

    timb07 mannequin commented Jun 17, 2017

    I've updated the pull request to incorporate Barry's suggestion of a new defect for this situation, InvalidDateDefect.

    @bitdancer
    Copy link
    Member

    I'll make one argument in favor of retaining the exception, and if that doesn't fly then I agree to the solution and will try to review the PR soon.

    The argument is this: if parsedate_to_datetime raises an error, you get information about *why* the date was invalid, which you don't get from a 'None' return. It is my thought that this would be the most useful behavior for the cases where you call it directly (otherwise, why call it directly?)

    (And as far as the doc issue goes, you are correct Barry that the current docs don't document the difference in the error case; I noted in another issue that that "should be fixed"...which is only the case now if you agree to my argument above :)

    @warsaw
    Copy link
    Member

    warsaw commented Sep 7, 2017

    So, while we do have a conflict between consistency and utility, I think @r.david.murry 's last comment has convinced me that raising the exception is more helpful. I think we should do that, fixing the documentation and giving up on the consistency issue.

    @bitdancer
    Copy link
    Member

    Reported again in issue bpo-35342.

    The existing PR is close to complete, but needs adjusted for the fact that we want (and want to document) that the utility raises errors (ie: catch the error in the header parser rather than having the utility return None).

    @bitdancer bitdancer added 3.7 (EOL) end of life 3.8 only security fixes labels Nov 28, 2018
    @timb07
    Copy link
    Mannequin Author

    timb07 mannequin commented Nov 29, 2018

    I've addressed the points in the last few comments and created a new PR (10783).

    @sim0nx
    Copy link
    Mannequin

    sim0nx mannequin commented Sep 4, 2020

    As I think it is still important to have this fixed and considering the original PR was closed, I have created a new PR based on the original one while implementing the requested changes.

    #22090

    @sim0nx sim0nx mannequin added 3.9 only security fixes 3.10 only security fixes labels Sep 4, 2020
    @warsaw
    Copy link
    Member

    warsaw commented Oct 22, 2020

    @sim0n - I added a comment to your open PR.

    My main question for the rest of the group is whether we can and should backport this. Given the new defect class being introduced, it seems like this should only land in 3.10. Thoughts?

    @sim0nx
    Copy link
    Mannequin

    sim0nx mannequin commented Oct 23, 2020

    @barry Thank you for your input on the PR.

    From what I understood this PR was nearly ready and only missing a small addition to the documentation which I added. So it took me a bit to go through it all :-).

    I actually don't see how *parsedate_to_datetime* would ever return None. It is *_parsedate_tz* which returns None on bogus input, in which case *parsedate_to_datetime* raises a TypeError.
    This is also covered in the tests, so those should be fine.

    In order to continue I suggest to fix the documentation on *parsedate_to_datetime*, remove the mention of it returning None and replacing it with it possibly returning TypeError in case of an invalid date input.

    Does that make sense ?

    Regarding the backporting, as a user of this I must admit that it would be much appreciated if this could be backported :-).

    @warsaw
    Copy link
    Member

    warsaw commented Oct 23, 2020

    Aside: I noticed that on _parseaddr.py:68, there's a bare return. That should really be return None (EIBTI). Can you fix that in your PR?

    I think it's confusing to raise both TypeError and ValueError. I suggest we check the None return from _parsedate_tz() and raise ValueError explicitly in that case, avoiding the implicit TypeError on the failing tuple unpack.

    +1 on removing the mention of returning None from the documentation. Then with the above, it would document raising ValueError on invalid date input.

    As for backporting, I'm nosing Ned and Łukasz to weigh in. Given that the patch is adding a new defect class (which it should), this won't break existing code, but it does mean that existing code would have different semantics depending on the patch version release of 3.9, 3.8, and 3.7. I'm not completely comfortable with that, but let's see what the RMs say. I guess I'm currently -0 on backporting.

    @ned-deily ned-deily removed the 3.7 (EOL) end of life label Oct 26, 2020
    @miss-islington
    Copy link
    Contributor

    New changeset 303aac8 by Georges Toth in branch 'master':
    bpo-30681: Support invalid date format or value in email Date header (GH-22090)
    303aac8

    @warsaw warsaw removed 3.8 only security fixes 3.9 only security fixes labels Oct 27, 2020
    @warsaw warsaw closed this as completed Oct 27, 2020
    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Oct 27, 2020

    >>> email.utils.parsedate_to_datetime(None)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/serhiy/py/cpython/Lib/email/utils.py", line 200, in parsedate_to_datetime
        raise ValueError('Invalid date value or format "%s"' % str(data))
    ValueError: Invalid date value or format "None"

    First, the date value is None, not "None".

    Second, why not just return None? parsedate() can be used in code like:

       parsedata(headers.get('Date'))

    None is an expected argument if the header "Date" is absent. parsedate_to_datetime() is not compatible with parsedata() in this case.

    It was a regression introduced in bpo-15925 (#60129). Before that parsedate_to_datetime(None) returned None.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @serhiy-storchaka
    Copy link
    Member

    @warsaw, @bitdancer, what is your opinion? Is it worth to make parsedate_to_datetime(None) returning None, or it is too late for this?

    @iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 23, 2023
    dhvcc added a commit to dhvcc/rss-parser that referenced this issue Feb 15, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    Successfully merging a pull request may close this issue.

    6 participants