Message 329376 - Python tracker


Author	r.david.murray
Recipients	barry, cnicodeme, jwilk, msapiro, r.david.murray
Date	2018-11-06.19:23:24
Message-id	<1541532204.68.0.788709270274.issue34155@psf.upfronthosting.co.za>

Content
>>> from email import message_from_string
>>> from email.policy import default
>>> m = message_from_string("From: John Doe jdoe@example.com <other@example.net>\n\n", policy=default)
>>> m['From'].addresses
(Address(display_name='', username='John Doe jdoe', domain='example.com'),)

The new policies have more error recovery for non-RFC-compliant addresses than decode_header, but the two agree in this case.  What is happening here is that (1) an unquoted/unencoded '@' is not allowed in a display name, (2) if the address is not '<>' quoted, then everything before the @ is the username, and (3) in the absence of a comma after the end of the FQDN (which is not allowed to contain blanks), any additional tokens are discarded.
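To see the three rules side by side, here is a minimal sketch that parses the malformed value next to two well-formed variants. The helper name parse_from and the two extra header values are only illustrations, not part of the original report, and the exact defect list may vary between Python versions.

```python
from email import message_from_string
from email.policy import default


def parse_from(value):
    """Parse a From header value (hypothetical helper); return (addresses, defects)."""
    msg = message_from_string(f"From: {value}\n\n", policy=default)
    header = msg["From"]
    return header.addresses, header.defects


# The malformed value from this message: unquoted '@' before the '<...>' part.
malformed, malformed_defects = parse_from(
    "John Doe jdoe@example.com <other@example.net>"
)

# Quoting the display name makes the '@' legal there (rule 1).
quoted, _ = parse_from('"John Doe jdoe@example.com" <other@example.net>')

# A comma after the FQDN yields two separate addresses (rule 3).
separated, _ = parse_from("jdoe@example.com, other@example.net")

for label, addresses in [
    ("malformed", malformed),
    ("quoted", quoted),
    ("separated", separated),
]:
    print(label, addresses)

# The parser records what it had to recover from as defects on the header.
print("defects:", malformed_defects)
```

The defects attribute is worth checking in calling code: the parser still returns an address for the malformed value, but it also notes that the header did not conform.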

One could argue that we could treat the blank after the FQDN as a "missing comma", and there would be some merit to that argument.  You could also argue that a "<>" quoted string should trump the occurrence of the @ earlier in the token list.  However, the RFC 822 grammar is designed to be parsed character by character, so that would not be a typical way for an RFC 822 parser to try to do Postel-style error recovery.
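For illustration only, the "<> trumps the earlier @" idea could be sketched as a pre-check that looks at the whole value before any character-by-character parsing. The function angle_addr_wins and its regular expression are hypothetical; nothing like this exists in the email package, and the need to look ahead to the end of the string is exactly what makes it atypical for an RFC 822 parser.

```python
import re
from email.headerregistry import Address

# Hypothetical pattern: anything, then a single <local@domain> at the very end.
_TRAILING_ANGLE_ADDR = re.compile(
    r"^(?P<name>.*?)\s*<(?P<spec>[^<>@\s]+@[^<>@\s]+)>\s*$"
)


def angle_addr_wins(value):
    """Hypothetical recovery: prefer a trailing <addr-spec> over an earlier '@'.

    Returns an Address, or None when the value has no trailing <...> part.
    """
    match = _TRAILING_ANGLE_ADDR.match(value)
    if match is None:
        return None
    # Everything before the angle brackets is taken as the display name,
    # even though it contains an unquoted '@'.
    return Address(display_name=match["name"], addr_spec=match["spec"])


guess = angle_addr_wins("John Doe jdoe@example.com <other@example.net>")
print(repr(guess))
```

Note that this guess picks other@example.net, while the current parser picks the example.com address, so the two recovery strategies disagree about which mailbox the sender meant.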

So, I don't think there is a bug here, but I'd be curious what other email address parsing libraries do, and that could influence whether extensions to the "make a guess when the string doesn't conform to the RFC" code would be acceptable.
