Title: email.utils.parseaddr mistakenly parse an email
Type: behavior Stage: patch review
Components: email Versions: Python 3.8, Python 3.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Cyril Nicodème, Dain Dwarf, Windson Yang, barry, bortzmeyer, jpic, kal.sze, msapiro, ned.deily, nicoe, r.david.murray, vstinner, xtreak
Priority: deferred blocker Keywords: patch, security_issue

Created on 2018-07-19 14:53 by Cyril Nicodème, last changed 2019-05-04 13:02 by jpic.

File name Uploaded Description Edit
Screen Shot 2019-05-02 at 22.07.27.png barry, 2019-05-03 02:09
Pull Requests
URL Status Linked Edit
PR 13079 open python-dev, 2019-05-03 21:27
Messages (23)
msg321956 - (view) Author: Cyril Nicodème (Cyril Nicodème) Date: 2018-07-19 14:53

I'm trying to parse some emails, and I discovered that email.utils.parseaddr wrongly parse an email.

Here's the corresponding header:

> From: =?utf-8?Q?

Once this has been parsed via `decode_header`, we obtain this value:

> From:ゆ↑ゆゃゅぇぺぽぼ"\どづぢlだばとくKLいれるゆ>KLらJF <>

(I agree, not really a nice looking From email ...)

Then, when this value is given to parseaddr, here's the result:

> ('', 'ゆ↑ゆゃゅぇぺぽぼ')

But it should be:

> ('ゆ↑ゆゃゅぇぺぽぼ"\どづぢlだばとくKLいれるゆ>KLらJF', '')

(Note that the email in the "name" part is not the same as the email in the "email" part!)
msg321957 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-07-19 15:18
That does appear to be a bug.  Note that the new email API handles it correctly:

    >>> x = """
    ... > From: =?utf-8?Q?
    ...  =?utf-8?Q?=E3=82=83=E3=82=85=E3=81=87=E3=81=BA=E3=81=BD=E3=81=BC"\=E3?=
    ...  =?utf-8?Q?=81=A9=E3=81=A5=E3=81=A2l=E3=81=A0=E3=81=B0=E3=81=A8=E3=81?=
    ...  =?utf-8?Q?=8FKL=E3=81=84=E3=82=8C=E3=82=8B=E3=82=86>KL=E3=82=89JF?=
    ...  <>
    ... """
    >>> from email import message_from_string
    >>> from email.policy import default
    >>> m = message_from_string(x+'\n\ntest', policy=default)
    >>> m['from']
    '"ゆ↑ゆ ゃゅぇぺぽぼ\\"\\\\� ��づぢlだばと� �KLいれるゆ>KLらJF" <>'
    >>> m['from'].addresses[0].addr_spec
    >>> m['from'].addresses[0].display_name
    'ゆ↑ゆ ゃゅぇぺぽぼ"\\\udce3 \udc81\udca9づぢlだばと\udce3\udc81 \udc8fKLいれるゆ>KLらJF'

I'm not particularly interested myself in fixing parseaddr to handle this case correctly, since it is the legacy API, but if someone else wants to I'll review the patch.
msg321958 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-07-19 15:19
Oops, I left out a step in that cut and paste.  For completeness:

    >>> x = x[3:]
msg321959 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-07-19 15:21
Ah, maybe it doesn't handle it completely correctly; that decode looks different now that I look at it in detail.
msg321967 - (view) Author: Jakub Wilk (jwilk) Date: 2018-07-19 21:03
You should not use decode_header() on the whole From header, because that loses
information. You should parse the header first, then decode the parts that
could be RFC2047-encoded.

Quoting <>:

> NOTE: Decoding and display of encoded-words occurs *after* a
> structured field body is parsed into tokens.  It is therefore
> possible to hide 'special' characters in encoded-words which, when
> displayed, will be indistinguishable from 'special' characters in the
> surrounding text.  For this and other reasons, it is NOT generally
> possible to translate a message header containing 'encoded-word's to
> an unencoded form which can be parsed by an RFC 822 mail reader.

So I don't see a bug in parseaddr() here, except that the API is a bit of a
msg329372 - (view) Author: Mark Sapiro (msapiro) * (Python triager) Date: 2018-11-06 18:14
The issue is illustrated much more simply as follows:

email.utils.parseaddr('John Doe <>')


('', 'John Doe')

whereas it should return

('John Doe', '')

I'll look at developing a patch.
msg329376 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-11-06 19:23
>>> m = message_from_string("From: John Doe <>\n\n", policy=default)
    >>> m['From'].addresses(Address(display_name='', username='John Doe jdoe', domain=''),)

The new policies have more error recovery for non-RFC compliant addresses than decode_header, but the two agree in this case.  What is happening here is that (1) an unquoted/unencoded '@' is not allowed in a display name (2) if the address is not '<>' quoted, then everything before the @ is the username and (3) in the absence of a comma after the end of the fqdn (which is not allowed to contain blanks) any additional tokens are discarded.

One could argue that we could treat the blank after the FQDN as a "missing comma", and there would be some merit to that argument.  You could also argue that a "<>" quoted string would trump the occurrence of the @ earlier in the token list.  However, the RFC822 grammar is designed to be parsed character by character, so that would not be a typical way for an RFC822 parser to try to do postel-style error recovery.

So, I don't think there is a bug here, but I'd be curious what other email address parsing libraries do, and that could influence whether extensions to the "make a guess when the string doesn't conform to the RFC" code would be acceptable.
msg329377 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-11-06 19:24
The formatting of that doctest paragraph got messed up.  Let me try again:

    >>> m = message_from_string("From: John Doe <>\n\n", policy=default)
    >>> m['From'].addresses
    (Address(display_name='', username='John Doe jdoe', domain=''),)
msg329379 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-11-06 19:48
Is this a case of realname having @ inside an unquoted string? As I can see from the RFC the acceptable characters of an atom other than alphabets and digits that comprises a phrase are ['!', '#', '$', '%', '&', "'", '*', '+', '-', '/', '=', '?', '^', '_', '`', '{', '|', '}', '~'] . So just curious if it's a case of @ inside unquoted string as name?

>>> for char in accepted:
...     print(parseaddr(f'John Doe jdoe{char} <>'))
('John Doe jdoe!', '')
('John Doe', '')
('John Doe jdoe$', '')
('John Doe', '')
('John Doe jdoe&', '')
("John Doe jdoe'", '')
('John Doe jdoe*', '')
('John Doe', '')
('John Doe', '')
('John Doe jdoe/', '')
('John Doe', '')
('John Doe jdoe?', '')
('John Doe jdoe^', '')
('John Doe', '')
('John Doe jdoe`', '')
('John Doe jdoe{', '')
('John Doe jdoe|', '')
('John Doe jdoe}', '')
('John Doe', '')

>>> parseaddr('"John Doe" <>')
('John Doe', '')

>>> parseaddr('John Doe <>')
('', 'John Doe')
msg329380 - (view) Author: Mark Sapiro (msapiro) * (Python triager) Date: 2018-11-06 19:55
I agree that my example with an @ in the 'display name', although actually seen in the wild, is non-compliant, and that the behavior of parseaddr() in this case is not a bug.

Sorry for the noise.
msg329382 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-11-06 20:27
Ah sorry, I was typing so long and had an idle session that I didn't realize @r.david.murray added a comment with the explanation. Just to add I tried using Perl module ( that uses regex for parsing that returns me two addresses and the regex is also not much comprehensible.

use v5.14;
use Email::Address;

my $line = 'John Doe <>';
my @addresses = Email::Address->parse($line);
say $addresses[0];
say $addresses[1];

say "Angle address regex";
say $Email::Address::angle_addr;
Angle address regex

msg329463 - (view) Author: Kal Sze (kal.sze) Date: 2018-11-08 08:23
Another failure case:

>>> from email.utils import parseaddr
>>> parseaddr('')
('', 'fo@o')

If I understand the RFC correctly, the correct results should be ('', '') because there are two '@' signs. The first '@' would need to be quoted for the address to be valid.
msg340534 - (view) Author: Stéphane Bortzmeyer (bortzmeyer) Date: 2019-04-19 10:16
Note that this bug was used in an actual security attack so it is serious
msg340535 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2019-04-19 10:28
Relevant attack from matrix blog post.

> sydent uses python's email.utils.parseaddr function to parse the input email address before sending validation mail to it, but it turns out that if you hand parseaddr an malformed email address of form, it silently discards the prefix without error. The result of this is that if one requested a validation token for '', the token would be sent to '', but the address '' would be marked as validated. This release fixes this behaviour by asserting that the parsed email address is the same as the input email address.

I am marking this as a security issue.
msg340933 - (view) Author: jpic (jpic) * Date: 2019-04-26 16:42
Given the situation, could raising a SecurityWarning and a DeprecationWarning fix this issue ?
msg341069 - (view) Author: Dain Dwarf (Dain Dwarf) Date: 2019-04-29 11:42
Hello, kind of new here.

I just wanted to note that the issue that lead to Tchap's security attack still exists in the non-deprecated message_from_string function:

email.message_from_string('From:', policy=email.policy.default)['from'].addresses

(Address(display_name='', username='a', domain=''),)

So, deprecating parseaddr is not enough for security purpose, unless there is another ticket for the new email API.
msg341294 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2019-05-02 18:07
@barry, @r.david.murray, With the additional info about attacks in the wild, should we now consider this a security issue?  If so, someone needs to provide an actual PR.  (Raising the priority to "deferred blocker" pending evaluation.)
msg341320 - (view) Author: Windson Yang (Windson Yang) * Date: 2019-05-03 01:45
I found the issue located in

elif self.field[self.pos] in '.@':
    # email address is just an addrspec
    # this isn't very efficient since we start over
    self.pos = oldpos
    self.commentlist = oldcl
    addrspec = self.getaddrspec()
    returnlist = [(SPACE.join(self.commentlist), addrspec)]

The parseaddr function runs a for in loop over the input string, when it meets '.@' it will do something. That is why when the input string is '' will return ('', ''). One possible solution will be to check the string in the reverse order then we can always get the last '@' in the string.
msg341322 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2019-05-03 02:09
Well, at least we're not alone.  Here's a screen capture from Version 12.4 (3445.104.8).
msg341362 - (view) Author: jpic (jpic) * Date: 2019-05-03 23:57
I haven't found this specific case in an RFC, but checked Go's net/mail
library behavior and it just considers it broken:

$ cat mail.go
package main
import "fmt"
import "net/mail"
func main() {

$ go run mail.go
<> <nil>
<nil> mail: expected single address, got ""

That would fix the security issue but not the whole ticket.
msg341367 - (view) Author: jpic (jpic) * Date: 2019-05-04 01:10
The pull request has been updated to mimic net/mail's behavior rather than
trying to workaround user input.


    >>> email.message_from_string('From:',
    (Address(display_name='', username='a', domain=''),)

    >>> parseaddr('')
    ('', '')


    >>> email.message_from_string('From:',
    (Address(display_name='', username='', domain=''),)

    >>> parseaddr('')
    ('', 'a@')

I like what I saw under the hood, please feel free to hack me for other
tasks in the email stdlib.
msg341370 - (view) Author: Windson Yang (Windson Yang) * Date: 2019-05-04 02:16
Frome the answer from Alnitak ( Maybe we should raise an error when the address has multiple @ in it.
msg341381 - (view) Author: jpic (jpic) * Date: 2019-05-04 13:02
At is allowed in the local part if quoted, the proposed patch acts within
get_domain() and getdomain() and does not affect local part parsing. This
still works:

>>> parseaddr('"fo@bar"')
('', '"fo@bar"')

>>> email.message_from_string('From: "a@b"
(Address(display_name='', username='a@b', domain=''),)

I'm not against raising an exception in parseaddr, but you should know that
currently nothing in parseaddr seems to raise an exception:

jpic@ci:~/cpython$ grep raise Lib/email/

For example:

>>> parseaddr('aoeu')
('', 'aoeu')
>>> parseaddr('a@')
('', 'a@')

None of the above calls raised an exception. That is the reason why I did
not introduce a new Exception in the getdomain() change: I thought it would
be more consistent with the rest of the API as such.

As for the new API, the patch does raise a parse error:

 # this detect that the next caracter right after the parsed domain is
another @
if value and value[0] == '@':
      raise errors.HeaderParseError('Multiple domains')

But that's in the lower level API that is planned for going public later on
(probably when it will have unit tests), it's just the higher level API
that the user faces that swallows it. As a user you can still know about
that parse problem using the defects attribute:

>>> email.message_from_string('From:',
InvalidHeaderDefect('invalid address in address-list')
Date User Action Args
2019-05-04 13:02:09jpicsetmessages: + msg341381
2019-05-04 02:16:20Windson Yangsetmessages: + msg341370
2019-05-04 01:10:58jpicsetmessages: + msg341367
2019-05-03 23:57:59jpicsetmessages: + msg341362
2019-05-03 21:27:43python-devsetkeywords: + patch
stage: patch review
pull_requests: + pull_request12994
2019-05-03 02:09:13barrysetfiles: + Screen Shot 2019-05-02 at 22.07.27.png

messages: + msg341322
2019-05-03 01:45:21Windson Yangsetnosy: + Windson Yang
messages: + msg341320
2019-05-02 18:07:09ned.deilysetpriority: normal -> deferred blocker
nosy: + ned.deily
messages: + msg341294

2019-04-29 12:03:35jwilksetnosy: - jwilk
2019-04-29 11:42:55Dain Dwarfsetnosy: + Dain Dwarf
messages: + msg341069
2019-04-26 16:42:43jpicsetnosy: + jpic
messages: + msg340933
2019-04-23 07:45:20vstinnersetnosy: + vstinner
2019-04-23 07:13:33vstinnersetnosy: - vstinner
2019-04-19 12:02:10nicoesetnosy: + nicoe
2019-04-19 10:28:43xtreaksetkeywords: + security_issue
nosy: + vstinner
messages: + msg340535

2019-04-19 10:16:21bortzmeyersetnosy: + bortzmeyer
messages: + msg340534
2018-11-08 08:23:11kal.szesetnosy: + kal.sze
messages: + msg329463
2018-11-06 20:27:34xtreaksetmessages: + msg329382
2018-11-06 19:55:26msapirosetmessages: + msg329380
2018-11-06 19:48:48xtreaksetnosy: + xtreak
messages: + msg329379
2018-11-06 19:24:59r.david.murraysetmessages: + msg329377
2018-11-06 19:23:24r.david.murraysetmessages: + msg329376
2018-11-06 18:14:44msapirosetnosy: + msapiro
messages: + msg329372
2018-07-19 21:03:38jwilksetnosy: + jwilk
messages: + msg321967
2018-07-19 15:21:25r.david.murraysetmessages: + msg321959
2018-07-19 15:19:13r.david.murraysetmessages: + msg321958
2018-07-19 15:18:03r.david.murraysetmessages: + msg321957
versions: + Python 3.7, Python 3.8, - Python 3.6
2018-07-19 14:53:43Cyril Nicodèmecreate