classification
Title: email.utils.parseaddr returns garbage for invalid input
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: barry, eric.araujo, r.david.murray
Priority: normal Keywords: patch

Created on 2010-07-17 14:35 by eric.araujo, last changed 2010-12-19 23:53 by eric.araujo. This issue is now closed.

Files
File name Uploaded Description
preserve_unquoted_white_space_in_local_part.diff r.david.murray, 2010-12-13 02:21
Messages (12)
msg110561 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-07-17 14:35
This behavior does not seem right to me:

parsing 'merwok'
 expected ('merwok', '')
 got      ('', 'merwok')

parsing 'merwok wok@rusty'
 expected ('', 'wok@rusty')
 got      ('', 'merwokwok@rusty')

(Generated with a small script just doing a loop and prints, not attached because boring.)

Are my expectations wrong? I don’t know if a string like “merwok” in my first example is a legal address in the relevant RFCs; Mark Sapiro replied in msg110556 that it could be consistent with most MUAs/MTAs.

I don’t know either if the folding done in the second example is okay; I’d like an exception here, or if parseaddr is designed to never fail, empty strings to indicate failure. I’m also okay with “garbage in, garbage out” as answer.
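
A minimal sketch of the kind of loop described above (a reconstruction, not the original script; the `cases` and `results` names are made up for illustration). The printed "got" values depend on the Python version it is run on:

```python
from email.utils import parseaddr

# Inputs from this report, paired with the results the reporter expected.
cases = [
    ("merwok", ("merwok", "")),
    ("merwok wok@rusty", ("", "wok@rusty")),
]

results = {}
for raw, expected in cases:
    results[raw] = parseaddr(raw)
    print("parsing %r" % raw)
    print(" expected %r" % (expected,))
    print(" got      %r" % (results[raw],))
```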
msg117813 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-01 16:21
In the first of your examples, parseaddr is correct (a lone token is considered a 'local' address per RFC).

The second one is possibly wrong, but if so the correct way to interpret it is not clear.  If you read the RFC carefully (http://tools.ietf.org/html/rfc5322#section-4.4), spaces are allowed between the 'local part' and the domain in obsolete syntax (which must be accepted).  However, the space being elided here is between pieces of the local part.  Note that because the address is not in '<>', the whole string is the address, there's no name field.  The "correct" parse could be:

('', '"merwok wok"@rusty')

That is, we apply a 'be generous in what you accept' rule and assume the "s were forgotten.  However, perhaps a more sensible 'generous' rule would be to assume the '<>' were forgotten and return

('merwok', 'wok@rusty')

However, it is quite possible that the reason the space is being elided here has to do with handling the obsolete 'route' syntax.  If that is the case then parseaddr is probably correct.  It may be a while before I get around to understanding that part of the spec well enough to render a judgement, so in the meantime I'll assume parseaddr is correct.  Feel free to read the spec and render your own opinion :)
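
For comparison, the two 'generous' readings above correspond to inputs that spell out the forgotten punctuation. A small sketch (the variable names are only illustrative) that feeds both explicit forms to parseaddr; these forms should parse to the two tuples shown above:

```python
from email.utils import parseaddr

# Reading 1: the double quotes around the local part were forgotten.
quoted = parseaddr('"merwok wok"@rusty')

# Reading 2: the angle brackets around the address were forgotten.
bracketed = parseaddr('merwok <wok@rusty>')

print(quoted)
print(bracketed)
```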
msg120983 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-11 23:11
Having no time to read email RFCs, I’ll defer to you here.  Please reject this report or save it for later as you prefer.
msg120985 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-11 23:24
In connection with another bug report I found a rather basic error in parseaddr, so I'm going to eventually dig far enough into the RFC to have a real opinion on the elided-space issue.
msg123856 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-13 02:21
OK, I've studied this more, and it looks to me like the legacy address format allows multiple atoms separated by white space in the local part of the address.  This means that the correct parse would be

  ('', 'merwok wok@rusty.com')

How useful this parse is is a good question.  It is arguably better than losing the white space; however, the fact that it represents a behavior change and there's no actual user bug against this argues against backport.  I do think it is better to conform to the RFC as much as possible, though, so I'd like to fix this in 3.2.

Attached is a patch to the parser that preserves whitespace runs in between unquoted atoms in the local part.

It would be interesting to know what other email programs do with such addresses.
msg124072 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-12-15 21:25
I have not read email RFCs, so I will defer to you.  One suggestion for the patch, though: Use example.org instead of rusty.com (see RFC 2606).

I tried the examples in Icedove (free Thunderbird): either it finds a matching contact or it refuses to send the message.
msg124078 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-15 22:24
I don't see any reason to use example.com in tests that are not talking to the network and aren't documentation.

The interesting question about the other mailers is, if you *receive* an email with such an address (1) what does it show you and (2) what does it put into the To: field when you do a 'reply'?  How you arrange to receive such a broken email, I'm not sure :)
msg124080 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-15 22:32
On the other hand, putting a real domain name that belongs to somebody else into our code base even as a test string is probably impolite without asking, so I'll change it when I commit.
msg124086 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-12-15 22:56
Yes, it’s either impolite or free advertisement.

Ideas to receive such a malformed email: Use a valid email in From but not in Reply-To; write it by hand and put it in your maildir.
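
The maildir idea could be sketched with the standard-library mailbox module. This is only an illustration: the maildir path is a throwaway temporary directory standing in for a real one, and the addresses and subject are placeholders.

```python
import mailbox
import os
import tempfile
from email.message import Message

# Hand-written message: valid From, malformed Reply-To.
msg = Message()
msg["From"] = "sender@example.com"
msg["To"] = "me@example.com"
msg["Reply-To"] = "merwok wok@example.com"
msg["Subject"] = "malformed Reply-To test"
msg.set_payload("Hit 'reply' and see what ends up in To:.\n")

# Assumption: a fresh path, so that Maildir creates tmp/, new/ and cur/.
# Point this at a real maildir to test an actual mail client.
maildir_path = os.path.join(tempfile.mkdtemp(), "Maildir")
box = mailbox.Maildir(maildir_path, create=True)
key = box.add(msg)

# Read the message back to confirm the header was stored untouched.
stored = box[key]
print(stored["Reply-To"])
```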
msg124304 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-18 18:31
Committed in r87384.

Barry, I've added you as nosy in case you disagree with this fix.  The essential point is that before, parseaddr would turn 'merwok wok@example.com' into 'merwokwok@example.com', and now it preserves the whitespace.  My theory is that this loses data, that the obsolete syntax allows it, and that dropping the whitespace denies the application program the chance to apply its own heuristics.  However applications might currently be depending on the parsed local part being a single token, so I don't plan to backport this.
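
As an illustration of the kind of application-level heuristic that preserved whitespace makes possible, here is a hedged sketch. The helper name `split_spaced_local_part` and its "assume the '<>' were forgotten" rule are hypothetical and not part of the email package; it assumes the post-fix behavior where the whitespace survives parsing.

```python
from email.utils import parseaddr


def split_spaced_local_part(raw):
    """Hypothetical heuristic: treat 'name local@domain' as 'name <local@domain>'."""
    name, addr = parseaddr(raw)
    local, sep, domain = addr.rpartition("@")
    # Leave alone anything that already has a name, has no domain,
    # or has an explicitly quoted local part.
    if name or not sep or local.startswith('"'):
        return name, addr
    parts = local.split()
    if len(parts) < 2:
        return name, addr
    # The last whitespace-separated token becomes the local part,
    # everything before it becomes the display name.
    return " ".join(parts[:-1]), parts[-1] + "@" + domain


print(split_spaced_local_part("merwok wok@example.com"))
print(split_spaced_local_part("wok@example.com"))
```

With the old behavior the local part arrives as a single token, so the helper has nothing to split and simply returns the parse unchanged.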
msg124323 - Author: Barry A. Warsaw (barry) * (Python committer) Date: 2010-12-18 22:05
On Dec 18, 2010, at 06:31 PM, R. David Murray wrote:

>Barry, I've added you as nosy in case you disagree with this fix.  The
>essential point is that before, parseaddr would turn 'merwok wok@example.com'
>into 'merwokwok@example.com', and now it preserves the whitespace.  My theory
>is that this loses data, that the obsolete syntax allows it, and that
>dropping the whitespace denies the application program the chance to apply
>its own heuristics.  However applications might currently be depending on the
>parsed local part being a single token, so I don't plan to backport this.

Thanks.  I agree with the fix, and not back porting it.
msg124371 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-12-19 23:53
Thanks for the explanations and fix!
History
Date                 User            Action  Args
2010-12-19 23:53:17  eric.araujo     set     messages: + msg124371
2010-12-18 22:05:31  barry           set     messages: + msg124323
2010-12-18 18:31:42  r.david.murray  set     status: open -> closed
                                             nosy: + barry
                                             messages: + msg124304
                                             resolution: fixed
                                             stage: patch review -> resolved
2010-12-15 22:56:52  eric.araujo     set     messages: + msg124086
2010-12-15 22:32:18  r.david.murray  set     messages: + msg124080
2010-12-15 22:24:34  r.david.murray  set     messages: + msg124078
2010-12-15 21:25:35  eric.araujo     set     messages: + msg124072
2010-12-13 02:21:34  r.david.murray  set     files: + preserve_unquoted_white_space_in_local_part.diff
                                             versions: - Python 3.1, Python 2.7
                                             messages: + msg123856
                                             keywords: + patch
                                             stage: patch review
2010-11-11 23:24:57  r.david.murray  set     messages: + msg120985
2010-11-11 23:11:46  eric.araujo     set     messages: + msg120983
                                             versions: - Python 2.6
2010-10-01 16:21:01  r.david.murray  set     messages: + msg117813
2010-07-17 22:47:09  eric.araujo     set     assignee: r.david.murray
2010-07-17 14:35:58  eric.araujo     create