classification
Title: MessageIDHeader is too strict for message-id
Type:
Stage:
Components: email
Versions:
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, bpoaugust, eric.smith, r.david.murray
Priority: normal
Keywords:

Created on 2022-01-16 00:16 by bpoaugust, last changed 2022-04-11 14:59 by admin.

Messages (8)
msg410665 - (view) Author: (bpoaugust) Date: 2022-01-16 00:16
The email headerregistry class MessageIDHeader is too strict when parsing existing Message-Ids. It can truncate Message-Ids that are valid according to the obsolete rules.

As the saying has it: 
"Be liberal in what you accept, and conservative in what you send."

I think the parser should be much closer to the UnstructuredHeader.
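
As a rough illustration of what "closer to UnstructuredHeader" could look like from the caller's side, the sketch below remaps Message-ID to UnstructuredHeader through a custom header registry. This is only a possible workaround, not a proposed fix for the parser; the names `lenient_policy` and `read_message_id` are made up for this example.

```python
import email
import email.policy
from email.headerregistry import HeaderRegistry, UnstructuredHeader

# Start from the default header map, then override only Message-ID so it is
# parsed as free text instead of by the structured msg-id parser.
registry = HeaderRegistry()
registry.map_to_type("message-id", UnstructuredHeader)

# Hypothetical name: a copy of the default policy that uses the registry above.
lenient_policy = email.policy.default.clone(header_factory=registry)


def read_message_id(raw_value):
    """Parse a one-header message and return the Message-ID as a string."""
    source = f"Message-ID: {raw_value}\n\n"
    msg = email.message_from_string(source, policy=lenient_policy)
    return str(msg["Message-ID"])


print(read_message_id("<A@A.A.A A.A>"))
```

The trade-off is that the header no longer gets any msg-id specific validation or defect reporting, so this only suits code that treats the id as an opaque string.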
msg410693 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2022-01-16 10:13
In what way is it too strict? What "obsolete rules" are you referring to? What are some example Message-Ids that should be considered valid but instead get truncated? What changes are you proposing?
msg410697 - (view) Author: (bpoaugust) Date: 2022-01-16 16:00
The easiest might be for me to provide some test cases, but I have not been able to work out where the existing unit tests are.

One failure which I believe should be permitted under current rules is:

<alphanum@aphanum > - i.e. trailing space
The space gets added AFTER the >

However the following is parsed correctly:

<alphanum@()aphanum > - i.e. trailing space but with previous comment

The obsolete rules I referred to are here:
https://datatracker.ietf.org/doc/html/rfc5322#section-4
msg410826 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2022-01-17 22:31
Note that the parser does attempt to accept obsolete syntax (registering defects for it), so if there is a bug in the implementation of the obsolete syntax handling it should be fixed.  And yes, there have been other bugs with whitespace handling in the parser, unfortunately.

Examples would be most helpful, even if you don't write unit tests.  Most of the tests, by the way, are in test__header_value_parser (search for message_id).  There aren't very many, so more would be good.
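
For anyone collecting examples, the defects mentioned above can be read from the `defects` attribute of the parsed header object. A minimal sketch, assuming the default policy; the helper name `header_defects` is invented here, and the exact defects reported will depend on the Python version:

```python
import email
import email.policy


def header_defects(raw_value, name="Message-ID"):
    """Return (defect class name, message) pairs registered for one header."""
    source = f"{name}: {raw_value}\n\n"
    msg = email.message_from_string(source, policy=email.policy.default)
    header = msg[name]
    # header.defects holds the defects found while parsing this header value.
    return [(type(defect).__name__, str(defect)) for defect in header.defects]


for candidate in ("<alphanum@aphanum >", "<A.A.A.A(A-A)@A.A.A>"):
    print(repr(candidate), header_defects(candidate))
```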
msg410854 - (view) Author: (bpoaugust) Date: 2022-01-18 11:38
When the library is being used to parse existing emails, I think it needs to do the minimum validation and canonicalisation.

It may be useful in some circumstances to report where the input is not syntactically correct, but I'm not sure it is helpful to truncate the input at the first syntax error.

When the library is used to generate emails, validation should be very strict.
msg410878 - (view) Author: (bpoaugust) Date: 2022-01-18 17:17
I think an id of the form

<A.A.A.A(A-A)@A.A.A>

should be allowed, but it generates

<A.A.A.A

i.e. stops at the '('

I read the syntax from RFC5322 as follows:
id-left => obs-id-left => local-part => obs-local-part => word *("." word)
word => atom => [CFWS] 1*atext [CFWS]

'<A.A.A@A.A (A A)>' should also be allowed but generates '<A.A.A@A.A> (A A)'
and '<A@A.A.A A.A>' gives '<A@A.A.A> '
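
To make that reading of the grammar concrete, here is a deliberately simplified recognizer written for this discussion. It is not the stdlib parser. Assumptions: comments are not nested and contain no backslashes or '@', quoted-strings and domain literals are ignored, and both sides of the '@' are treated as atom *("." atom). All names are invented for the sketch.

```python
import re

# atext per RFC 5322 section 3.2.3
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# Simplified CFWS: spaces, tabs, or flat (non-nested) comments
CFWS = r"(?:[ \t]|\([^()\\]*\))*"
# atom = [CFWS] 1*atext [CFWS]
ATOM = rf"{CFWS}{ATEXT}+{CFWS}"
# atom *("." atom), used here for both obs-id-left and obs-id-right
DOTTED_WORDS = re.compile(rf"{ATOM}(?:\.{ATOM})*")


def looks_like_obs_msg_id(value):
    """Check a candidate against the simplified obsolete msg-id grammar."""
    if not (value.startswith("<") and value.endswith(">")):
        return False
    left, sep, right = value[1:-1].partition("@")
    if not sep:
        return False
    return bool(DOTTED_WORDS.fullmatch(left) and DOTTED_WORDS.fullmatch(right))


for candidate in ("<A.A.A.A(A-A)@A.A.A>", "<A.A.A@A.A (A A)>", "<A@A.A.A A.A>"):
    print(candidate, looks_like_obs_msg_id(candidate))
```

Under these assumptions the first two ids are accepted and the third is rejected, because a space between two atoms with no "." separating them does not fit word *("." word).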
msg410893 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2022-01-18 21:03
The general idea is that the string version of the header should contain all of the original information, but the parsed elements (the things returned by special header attributes) will contain the valid data, if any.  So if the string version of the header is being truncated or transformed (other than whitespace changes during re-folding), that is a bug.

Your examples involve comment fields, and I'm afraid that my development of the parser stopped before I did very much with comments.  Therefore I am not surprised that comments are handled incorrectly :( :(  They aren't very common in the wild, as far as I was able to tell, which is why they were my last priority.
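
The invariant described above (the string form of the header keeps all of the original information) can be turned into a small check. This is only a sketch: the helper name `round_trips` is invented, and it is meant for single-line values, since it makes no allowance for whitespace changes from re-folding.

```python
import email
import email.policy


def round_trips(name, raw_value):
    """Return True if str(header) equals the value that was parsed."""
    source = f"{name}: {raw_value}\n\n"
    msg = email.message_from_string(source, policy=email.policy.default)
    return str(msg[name]) == raw_value


# Any value for which this prints False would be a candidate bug report.
for candidate in ("<A.A.A.A(A-A)@A.A.A>", "<A.A.A@A.A (A A)>"):
    print(candidate, round_trips("Message-ID", candidate))
```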
msg410917 - (view) Author: (bpoaugust) Date: 2022-01-18 23:49
Sorry, I think '<A@A.A.A A.A>' is not valid, as spaces are not allowed between words.

However I am not seeing the original unfolded source if there is an error, unless I am misunderstanding the API.

For example:

--- cut here ---
import email
import email.policy

def test(value):
    # Build a minimal message containing only a Message-ID header.
    msg_string = f"Message-id: {value}"
    message = email.message_from_string(msg_string, policy=email.policy.default)
    out = message['Message-id']
    print(value)  # the original input
    print(out)    # the header value as returned by the parser

test('<A@A.A.A A.A>') # invalid
test('<A@A.A.AA.A>') # valid
--- cut here ---

This produces:

<A@A.A.A A.A>
<A@A.A.A> # truncated at error
<A@A.A.AA.A>
<A@A.A.AA.A>

i.e. the invalid input is truncated
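
One way to get at the source text regardless of what the header parser does may be `Message.raw_items()`, which yields the stored (name, value) pairs without passing them through the header registry. The documentation describes it as an internal API intended for generators, so treat this as a sketch and not a recommended interface; the helper name `raw_header_value` is invented here.

```python
import email
import email.policy


def raw_header_value(msg, name):
    """Return the stored source value of the first header called `name`."""
    for key, value in msg.raw_items():
        if key.lower() == name.lower():
            return value
    return None


message = email.message_from_string(
    "Message-id: <A@A.A.A A.A>\n\n", policy=email.policy.default
)
print(raw_header_value(message, "Message-id"))
```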
History
Date                 User            Action  Args
2022-04-11 14:59:54  admin           set     github: 90550
2022-01-18 23:49:58  bpoaugust       set     messages: + msg410917
2022-01-18 21:03:11  r.david.murray  set     messages: + msg410893
2022-01-18 17:17:32  bpoaugust       set     messages: + msg410878
2022-01-18 11:38:18  bpoaugust       set     messages: + msg410854
2022-01-17 22:31:21  r.david.murray  set     messages: + msg410826
2022-01-16 16:00:35  bpoaugust       set     messages: + msg410697
2022-01-16 10:13:10  eric.smith      set     nosy: + eric.smith; messages: + msg410693
2022-01-16 00:16:30  bpoaugust       create