Issue 46392: MessageIDHeader is too strict for message-id


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90550

classification

Title:	MessageIDHeader is too strict for message-id
Type:		Stage:
Components:	email	Versions:

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, bpoaugust, eric.smith, r.david.murray
Priority:	normal	Keywords:

Created on 2022-01-16 00:16 by bpoaugust, last changed 2022-04-11 14:59 by admin.

Messages (8)
msg410665 - (view)	Author: (bpoaugust)	Date: 2022-01-16 00:16
The email headerregistry class MessageIDHeader is too strict when parsing existing Message-Ids. It can truncate Message-Ids that are valid according to the obsolete rules. As the saying has it: "Be liberal in what you accept, and conservative in what you send." I think the parser should be much closer to the UnstructuredHeader.
msg410693 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2022-01-16 10:13
In what way is it too strict? What "obsolete rules" are you referring to? What are some example Message-Ids that should be considered valid but instead get truncated? What changes are you proposing?
msg410697 - (view)	Author: (bpoaugust)	Date: 2022-01-16 16:00
The easiest might be for me to provide some test cases, but I have not been able to work out where the existing unit tests are.
One failure which I believe should be permitted under current rules is:
<alphanum@aphanum > - i.e. trailing space
The space gets added AFTER the >
However the following is parsed correctly:
<alphanum@()aphanum > - i.e. trailing space but with previous comment
The obsolete rules I referred to are here:
https://datatracker.ietf.org/doc/html/rfc5322#section-4
msg410826 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2022-01-17 22:31
Note that the parser does attempt to accept obsolete syntax (registering defects for it), so if there is a bug in the implementation of the obsolete syntax handling it should be fixed. And yes, there have been other bugs with whitespace handling in the parser, unfortunately. Examples would be most helpful, even if you don't write unit tests. Most of the tests, by the way, are in test__header_value_parser (search for message_id). There aren't very many, so more would be good.
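One way to see which defects the parser registers for a given Message-ID is to read the `defects` attribute of the parsed header object. The following is a minimal sketch, not part of the original report: the helper name `parse_message_id` is hypothetical, and the output for the non-standard inputs is deliberately not stated because it may differ between Python versions.

```python
import email
import email.policy


def parse_message_id(raw):
    """Return (string form, defect class names) for a raw Message-ID value.

    Hypothetical helper for illustration; it only uses the public email API.
    """
    msg = email.message_from_string(
        f"Message-ID: {raw}\n\n", policy=email.policy.default
    )
    header = msg["Message-ID"]
    # header.defects is a tuple of defect instances registered while parsing.
    return str(header), [type(defect).__name__ for defect in header.defects]


# The first input is a plain, valid id; the other two come from this thread.
for raw in ("<alphanum@alphanum>", "<alphanum@aphanum >", "<alphanum@()aphanum >"):
    text, defects = parse_message_id(raw)
    print(repr(raw), "->", repr(text), defects)
```

Comparing the printed string form with the input shows whether anything was lost, and the defect names indicate how the parser classified the problem.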
msg410854 - (view)	Author: (bpoaugust)	Date: 2022-01-18 11:38
When the library is being used to parse existing emails, I think it needs to do the minimum validation and canonicalisation. It may be useful in some circumstances to report where the input is not syntactically correct, but I'm not sure it is helpful to truncate the input at the first syntax error. When the library is used to generate emails, validation should be very strict.
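For readers who only need the header text exactly as received, one possible workaround (an assumption on my part, not something proposed in this thread) is to parse with `email.policy.compat32`, which returns header values as plain strings without running the structured Message-ID parser. The helper name `raw_message_id` is hypothetical.

```python
import email
import email.policy


def raw_message_id(raw):
    """Return the Message-ID value as stored, using the compat32 policy.

    Hypothetical helper: compat32 does no structured header parsing, so the
    value is not validated and no defects are reported for it.
    """
    msg = email.message_from_string(
        f"Message-ID: {raw}\n\n", policy=email.policy.compat32
    )
    return msg["Message-ID"]


print(repr(raw_message_id("<A@A.A.A A.A>")))
print(repr(raw_message_id("<A.A.A.A(A-A)@A.A.A>")))
```

The trade-off is that nothing is validated at all, so this suits reading existing mail rather than generating new messages.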
msg410878 - (view)	Author: (bpoaugust)	Date: 2022-01-18 17:17
I think an id of the form <A.A.A.A(A-A)@A.A.A> should be allowed, but it generates <A.A.A.A, i.e. it stops at the '('.
I read the syntax from RFC5322 as follows:
id-left => obs-id-left => local-part => obs-local-part => word *("." word)
word => atom => [CFWS] 1*atext [CFWS]
'<A.A.A@A.A (A A)>' should also be allowed but generates '<A.A.A@A.A> (A A)'
and '<A@A.A.A A.A>' gives '<A@A.A.A> '
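The grammar chain quoted above can be sketched as a small recogniser for `obs-local-part`. This is a simplified illustration under stated assumptions, not the standard library's parser: CFWS is reduced to spaces, tabs and flat comments (no nested comments, quoted-pairs or folding), `word` is limited to `atom` (no quoted-string), and the names `ATEXT`, `CFWS`, `ATOM` and `is_obs_local_part` are hypothetical.

```python
import re

# atext characters from RFC 5322 section 3.2.3.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
# Simplified CFWS: spaces, tabs and flat comments such as "(A-A)".
# Assumption: nested comments, quoted-pairs and folding are not handled.
CFWS = r"(?:[ \t]|\([^()\\]*\))*"
# atom = [CFWS] 1*atext [CFWS]
ATOM = CFWS + ATEXT + CFWS
# obs-local-part = word *("." word), with word restricted to atom here.
OBS_LOCAL_PART = re.compile(rf"{ATOM}(?:\.{ATOM})*")


def is_obs_local_part(text):
    """Return True if text matches the simplified obs-local-part grammar."""
    return OBS_LOCAL_PART.fullmatch(text) is not None


for candidate in ("A.A.A.A(A-A)", "A.A.A", "A.A.A A.A"):
    print(repr(candidate), is_obs_local_part(candidate))
```

Under this reading a comment may follow an atom, so `A.A.A.A(A-A)` is accepted, while two atoms separated only by a space are rejected because a "." is required between words.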
msg410893 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2022-01-18 21:03
The general idea is that the string version of the header should contain all of the original information, but the parsed elements (the things returned by special header attributes) will contain the valid data, if any. So if the string version of the header is being truncated or transformed (other than whitespace changes during re-folding), that is a bug. Your examples involve comment fields, and I'm afraid that my development of the parser stopped before I did very much with comments. Therefore I am not surprised that comments are handled incorrectly :( They aren't very common in the wild, as far as I was able to tell, which is why they were my last priority.
msg410917 - (view)	Author: (bpoaugust)	Date: 2022-01-18 23:49
Sorry, I think '<A@A.A.A A.A>' is not valid, as spaces are not allowed between words.
However I am not seeing the original unfolded source if there is an error, unless I am misunderstanding the API.
For example:
--- cut here ---
import email.header
import email.utils
import email.policy

def test(test):
    msg_string = f"Message-id: {test}"
    message = email.message_from_string(msg_string, policy=email.policy.default)
    out = message['Message-id']
    print(test)
    print(out)

test('<A@A.A.A A.A>') # invalid
test('<A@A.A.AA.A>') # valid
--- cut here ---
This produces:
<A@A.A.A A.A>
<A@A.A.A> # truncated at error
<A@A.A.AA.A>
<A@A.A.AA.A>
i.e. the invalid input is truncated

History
Date	User	Action	Args
2022-04-11 14:59:54	admin	set	github: 90550
2022-01-18 23:49:58	bpoaugust	set	messages: + msg410917
2022-01-18 21:03:11	r.david.murray	set	messages: + msg410893
2022-01-18 17:17:32	bpoaugust	set	messages: + msg410878
2022-01-18 11:38:18	bpoaugust	set	messages: + msg410854
2022-01-17 22:31:21	r.david.murray	set	messages: + msg410826
2022-01-16 16:00:35	bpoaugust	set	messages: + msg410697
2022-01-16 10:13:10	eric.smith	set	nosy: + eric.smith messages: + msg410693
2022-01-16 00:16:30	bpoaugust	create