classification
Title: email.utils.parseaddr mistakenly parse an email
Type: behavior
Stage:
Components: email
Versions: Python 3.8, Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Cyril Nicodème, Kal Sze2, barry, jwilk, msapiro, r.david.murray, xtreak
Priority: normal
Keywords:

Created on 2018-07-19 14:53 by Cyril Nicodème, last changed 2018-11-08 08:23 by Kal Sze2.

Messages (12)
msg321956 - (view) Author: Cyril Nicodème (Cyril Nicodème) Date: 2018-07-19 14:53
Hi!

I'm trying to parse some emails, and I discovered that email.utils.parseaddr parses one From header incorrectly.

Here's the corresponding header:

> From: =?utf-8?Q?zq@redacted.com.cn=E3=82=86=E2=86=91=E3=82=86?=
 =?utf-8?Q?=E3=82=83=E3=82=85=E3=81=87=E3=81=BA=E3=81=BD=E3=81=BC"\=E3?=
 =?utf-8?Q?=81=A9=E3=81=A5=E3=81=A2l=E3=81=A0=E3=81=B0=E3=81=A8=E3=81?=
 =?utf-8?Q?=8FKL=E3=81=84=E3=82=8C=E3=82=8B=E3=82=86>KL=E3=82=89JF?=
 <mxvu@redacted2.com>

Once this has been parsed via `decode_header`, we obtain this value:

> From: zq@redacted.com.cnゆ↑ゆゃゅぇぺぽぼ"\どづぢlだばとくKLいれるゆ>KLらJF <mxvu@redacted2.com>

(I agree, not really a nice looking From email ...)

Then, when this value is given to parseaddr, here's the result:

> ('', 'zq@redacted.com.cnゆ↑ゆゃゅぇぺぽぼ')

But it should be:

> ('zq@redacted.com.cnゆ↑ゆゃゅぇぺぽぼ"\どづぢlだばとくKLいれるゆ>KLらJF', 'mxvu@redacted2.com')

(Note that the email in the "name" part is not the same as the email in the "email" part!)
msg321957 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-07-19 15:18
That does appear to be a bug.  Note that the new email API handles it correctly:

    >>> x = """
    ... > From: =?utf-8?Q?zq@redacted.com.cn=E3=82=86=E2=86=91=E3=82=86?=
    ...  =?utf-8?Q?=E3=82=83=E3=82=85=E3=81=87=E3=81=BA=E3=81=BD=E3=81=BC"\=E3?=
    ...  =?utf-8?Q?=81=A9=E3=81=A5=E3=81=A2l=E3=81=A0=E3=81=B0=E3=81=A8=E3=81?=
    ...  =?utf-8?Q?=8FKL=E3=81=84=E3=82=8C=E3=82=8B=E3=82=86>KL=E3=82=89JF?=
    ...  <mxvu@redacted2.com>
    ... """
    >>> from email import message_from_string
    >>> from email.policy import default
    >>> m = message_from_string(x+'\n\ntest', policy=default)
    >>> m['from']
    '"zq@redacted.com.cnゆ↑ゆ ゃゅぇぺぽぼ\\"\\\\� ��づぢlだばと� �KLいれるゆ>KLらJF" <mxvu@redacted2.com>'
    >>> m['from'].addresses[0].addr_spec
    'mxvu@redacted2.com'
    >>> m['from'].addresses[0].display_name
    'zq@redacted.com.cnゆ↑ゆ ゃゅぇぺぽぼ"\\\udce3 \udc81\udca9づぢlだばと\udce3\udc81 \udc8fKLいれるゆ>KLらJF'

I'm not particularly interested myself in fixing parseaddr to handle this case correctly, since it is the legacy API, but if someone else wants to I'll review the patch.
msg321958 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-07-19 15:19
Oops, I left out a step in that cut and paste.  For completeness:

    >>> x = x[3:]
msg321959 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-07-19 15:21
Ah, maybe it doesn't handle it completely correctly; that decode looks different now that I look at it in detail.
msg321967 - (view) Author: Jakub Wilk (jwilk) Date: 2018-07-19 21:03
You should not use decode_header() on the whole From header, because that loses
information. You should parse the header first, then decode the parts that
could be RFC2047-encoded.

Quoting <https://tools.ietf.org/html/rfc2047#section-6.2>:

> NOTE: Decoding and display of encoded-words occurs *after* a
> structured field body is parsed into tokens.  It is therefore
> possible to hide 'special' characters in encoded-words which, when
> displayed, will be indistinguishable from 'special' characters in the
> surrounding text.  For this and other reasons, it is NOT generally
> possible to translate a message header containing 'encoded-word's to
> an unencoded form which can be parsed by an RFC 822 mail reader.

So I don't see a bug in parseaddr() here, except that the API is a bit of a
footgun.
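
To make the parse-first order concrete, here is a minimal sketch. It assumes an RFC 2047 compliant header, i.e. the '@' inside the encoded-word is written as =40; the header from the original report leaves it raw, so this does not rescue that input. The helper name parse_then_decode and the sample header are made up for illustration.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr


def parse_then_decode(raw_value):
    """Split the raw header value first, then decode only the display name."""
    # Step 1: parse the still-encoded value, so any specials hidden inside
    # encoded-words cannot confuse the address parser.
    realname, addr = parseaddr(raw_value)
    # Step 2: decode the RFC 2047 encoded-words in the display name only.
    decoded_name = str(make_header(decode_header(realname)))
    return decoded_name, addr


# Hypothetical header value: the display name decodes to "jdoe@example.com".
raw = "=?utf-8?Q?jdoe=40example.com?= <other@example.net>"
print(parse_then_decode(raw))
# ('jdoe@example.com', 'other@example.net')
```

Decoding the whole value first would instead hand parseaddr() a string with a bare '@' in the display name, which is the situation described in the report.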
msg329372 - (view) Author: Mark Sapiro (msapiro) * Date: 2018-11-06 18:14
The issue is illustrated much more simply as follows:

email.utils.parseaddr('John Doe jdoe@example.com <other@example.net>')

returns

('', 'John Doe jdoe@example.com')

whereas it should return

('John Doe jdoe@example.com', 'other@example.net')

I'll look at developing a patch.
msg329376 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-11-06 19:23
>>> m = message_from_string("From: John Doe jdoe@example.com <other@example.net>\n\n", policy=default)
    >>> m['From'].addresses(Address(display_name='', username='John Doe jdoe', domain='example.com'),)

The new policies have more error recovery for non-RFC compliant addresses than decode_header, but the two agree in this case.  What is happening here is that (1) an unquoted/unencoded '@' is not allowed in a display name, (2) if the address is not '<>' quoted, then everything before the @ is the username, and (3) in the absence of a comma after the end of the FQDN (which is not allowed to contain blanks), any additional tokens are discarded.

One could argue that we could treat the blank after the FQDN as a "missing comma", and there would be some merit to that argument.  You could also argue that a "<>" quoted string would trump the occurrence of the @ earlier in the token list.  However, the RFC822 grammar is designed to be parsed character by character, so that would not be a typical way for an RFC822 parser to try to do Postel-style error recovery.

So, I don't think there is a bug here, but I'd be curious what other email address parsing libraries do, and that could influence whether extensions to the "make a guess when the string doesn't conform to the RFC" code would be acceptable.
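
As an illustration of rule (1), a display name containing '@' survives a round trip once it is quoted. The sketch below uses email.utils.formataddr to do the quoting; the variable names are only for illustration, and the sample values are the ones used earlier in this thread.

```python
from email.utils import formataddr, parseaddr

display_name = "John Doe jdoe@example.com"
addr_spec = "other@example.net"

# formataddr() wraps the display name in double quotes because it
# contains RFC 822 specials ('@' and '.').
header_value = formataddr((display_name, addr_spec))
print(header_value)
# "John Doe jdoe@example.com" <other@example.net>

# With the quotes in place, the legacy parser splits the value as intended.
print(parseaddr(header_value))
# ('John Doe jdoe@example.com', 'other@example.net')
```

This only helps when you control how the header is generated; it does nothing for non-compliant input received from elsewhere.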
msg329377 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-11-06 19:24
The formatting of that doctest paragraph got messed up.  Let me try again:

    >>> m = message_from_string("From: John Doe jdoe@example.com <other@example.net>\n\n", policy=default)
    >>> m['From'].addresses
    (Address(display_name='', username='John Doe jdoe', domain='example.com'),)
msg329379 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-11-06 19:48
Is this a case of the realname having an @ inside an unquoted string? As far as I can see from the RFC, the characters allowed in an atom (the building block of a phrase), other than letters and digits, are ['!', '#', '$', '%', '&', "'", '*', '+', '-', '/', '=', '?', '^', '_', '`', '{', '|', '}', '~'] . So I'm just curious whether this is a case of an @ inside an unquoted string used as the name.

>>> from email.utils import parseaddr
>>> accepted = "!#$%&'*+-/=?^_`{|}~"
>>> for char in accepted:
...     print(parseaddr(f'John Doe jdoe{char}example.com <other@example.net>'))
...
('John Doe jdoe!example.com', 'other@example.net')
('John Doe jdoe#example.com', 'other@example.net')
('John Doe jdoe$example.com', 'other@example.net')
('John Doe jdoe%example.com', 'other@example.net')
('John Doe jdoe&example.com', 'other@example.net')
("John Doe jdoe'example.com", 'other@example.net')
('John Doe jdoe*example.com', 'other@example.net')
('John Doe jdoe+example.com', 'other@example.net')
('John Doe jdoe-example.com', 'other@example.net')
('John Doe jdoe/example.com', 'other@example.net')
('John Doe jdoe=example.com', 'other@example.net')
('John Doe jdoe?example.com', 'other@example.net')
('John Doe jdoe^example.com', 'other@example.net')
('John Doe jdoe_example.com', 'other@example.net')
('John Doe jdoe`example.com', 'other@example.net')
('John Doe jdoe{example.com', 'other@example.net')
('John Doe jdoe|example.com', 'other@example.net')
('John Doe jdoe}example.com', 'other@example.net')
('John Doe jdoe~example.com', 'other@example.net')

>>> parseaddr('"John Doe jdoe@example.com" <other@example.net>')
('John Doe jdoe@example.com', 'other@example.net')

>>> parseaddr('John Doe jdoe@example.com <other@example.net>')
('', 'John Doe jdoe@example.com')
msg329380 - (view) Author: Mark Sapiro (msapiro) * Date: 2018-11-06 19:55
I agree that my example with an @ in the 'display name', although actually seen in the wild, is non-compliant, and that the behavior of parseaddr() in this case is not a bug.

Sorry for the noise.
msg329382 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-11-06 20:27
Ah sorry, I was typing for so long in an idle session that I didn't realize @r.david.murray had already added a comment with the explanation. Just to add: I tried the Perl module Email::Address (https://metacpan.org/release/Email-Address), which uses a regex for parsing. It returns two addresses for this input, and the regex itself is not very comprehensible.

use v5.14;
use Email::Address;

my $line = 'John Doe jdoe@example.com <other@example.net>';
my @addresses = Email::Address->parse($line);
say $addresses[0];
say $addresses[1];

say "Angle address regex";
say $Email::Address::angle_addr;


jdoe@example.com
other@example.net
Angle address regex
(?^:(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*<(?^:(?^:(?^:(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*(?^:[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*)(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*)|(?^:(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))*"(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*))\@(?^:(?^:(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*(?^:[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*)(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*)|(?^:(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*\[(?:\s*(?^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\](?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*)))>(?^:(?^:\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*)|\s+)*)


Thanks
msg329463 - (view) Author: Kal Sze (Kal Sze2) Date: 2018-11-08 08:23
Another failure case:

>>> from email.utils import parseaddr
>>> parseaddr('fo@o@bar.com')
('', 'fo@o')

If I understand the RFC correctly, the correct result should be ('', '') because there are two '@' signs. The first '@' would need to be quoted for the address to be valid.
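
For callers who want a stricter answer than parseaddr() gives, one option is to let email.headerregistry.Address parse the addr-spec and treat any parse error as "invalid". This is only a sketch: the helper name is_strict_addr_spec is made up, and which exception type is raised for the double-'@' case is assumed to vary between Python versions, so both are caught.

```python
from email.errors import HeaderParseError
from email.headerregistry import Address


def is_strict_addr_spec(candidate):
    """Return True if Address() accepts candidate as a complete addr-spec."""
    try:
        # Address() raises when the string cannot be parsed in full,
        # for example when text is left over after the domain.
        Address(addr_spec=candidate)
    except (ValueError, HeaderParseError):
        return False
    return True


print(is_strict_addr_spec("fo@o@bar.com"))  # False
print(is_strict_addr_spec("foo@bar.com"))   # True
```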
History
Date                 User            Action  Args
2018-11-08 08:23:11  Kal Sze2        set     nosy: + Kal Sze2; messages: + msg329463
2018-11-06 20:27:34  xtreak          set     messages: + msg329382
2018-11-06 19:55:26  msapiro         set     messages: + msg329380
2018-11-06 19:48:48  xtreak          set     nosy: + xtreak; messages: + msg329379
2018-11-06 19:24:59  r.david.murray  set     messages: + msg329377
2018-11-06 19:23:24  r.david.murray  set     messages: + msg329376
2018-11-06 18:14:44  msapiro         set     nosy: + msapiro; messages: + msg329372
2018-07-19 21:03:38  jwilk           set     nosy: + jwilk; messages: + msg321967
2018-07-19 15:21:25  r.david.murray  set     messages: + msg321959
2018-07-19 15:19:13  r.david.murray  set     messages: + msg321958
2018-07-19 15:18:03  r.david.murray  set     messages: + msg321957; versions: + Python 3.7, Python 3.8, - Python 3.6
2018-07-19 14:53:43  Cyril Nicodème  create