Title: email.utils.parseaddr fails on odd double quotes in multiline header
Type: behavior
Stage: resolved
Components: email
Versions: Python 3.5, Python 2.7
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, r.david.murray, robertus
Priority: normal
Keywords:

Created on 2017-07-31 14:43 by robertus, last changed 2017-08-01 15:34 by r.david.murray. This issue is now closed.

Messages (5)
msg299558 - Author: Robert (robertus) Date: 2017-07-31 14:43
email.utils.parseaddr() does not successfully parse a
field value into a (comment, address) pair if the
From header spans two or more lines and each line contains an odd number of double quotes.
The address in the resulting tuple is not an e-mail address but part of the comment.

For example:

"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=
 =?UTF-8?Q?omo=C5=9Bci?=" <>

is parsed into:

('', '=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=')

Full example on Python 2.7.12, email 4.0.2:

Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.utils import parseaddr
>>> parseaddr('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\r\n =?UTF-8?Q?omo=C5=9Bci?=" <>')
('', '=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=')

When the double quotes or the \r\n are removed, the header is parsed without problems.

The same issue exists on Python 3.5.2 with email 6.0.0a1.

From analysis of the headers I know that the e-mail was composed in Outlook 14.0 and then sent through Exim 4.87 to the servers.
msg299568 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-07-31 15:32
parseaddr does what you expect if the message has been read using universal newline mode (i.e. the linesep is \n):

>>> parseaddr('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\n =?UTF-8?Q?omo=C5=9Bci?=" <>"')
('=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\n =?UTF-8?Q?omo=C5=9Bci?=', '')

I suppose this wouldn't be *that* hard to fix.  If it isn't too complex and you want to propose a patch I'll take a look.

In any case it works fine in python3 using the new policies:

>>> from email import message_from_string as mfs
>>> from email.policy import default
>>> m = mfs('From: "=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\r\n =?UTF-8?Q?omo=C5=9Bci?=" <>"\r\n\r\ntest', policy=default)
>>> m['from'].addresses
(Address(display_name='Anita =W4\udc86iecklińska | PATO Nieruch omości', username='anita.wiecklinska', domain=''),)
msg299569 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-07-31 15:41
Ah, I take it back.  With \n it retains the \n in the decoded name field.

There is a bug of some sort here (\r\n should be treated the same as \n, I think, whatever way it is treated).  I don't think this is worth addressing, given that the new policies provide a much better API for interacting with Messages, and you can in fact easily unfold the line before parsing it if you need to do it in 2.7:

  >>> parseaddr(''.join(m['from'].splitlines()))
  ('=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?= =?UTF-8?Q?omo=C5=9Bci?=', '')
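A minimal sketch of the unfold-then-parse workaround, written to run on both Python 2.7 and Python 3. It removes only a CRLF that is followed by a space or tab (the RFC 5322 folding sequence) and leaves every other character alone. The helper name parseaddr_unfolded and the address user@example.com are assumptions made for illustration; the address in the report was redacted.

```python
import re
from email.utils import parseaddr

# CRLF followed by a space or tab marks a folded header line (RFC 5322).
_FOLD = re.compile(r'\r\n(?=[ \t])')


def parseaddr_unfolded(value):
    """Unfold a raw header value, then hand it to parseaddr().

    Hypothetical helper, not part of the email package.
    """
    return parseaddr(_FOLD.sub('', value))


# Placeholder address: the real one was redacted in the report.
raw = ('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\r\n'
       ' =?UTF-8?Q?omo=C5=9Bci?=" <user@example.com>')

name, address = parseaddr_unfolded(raw)
print(name)
print(address)
```

The name is still RFC 2047 encoded at this point; decoding it is a separate step.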
msg299616 - Author: Robert (robertus) Date: 2017-08-01 09:29
The RFC on this topic looks quite complicated to me, but I know that \r\n is used for line breaks in e-mail headers and \n is not. So in my opinion \r\n shouldn't be treated the same as \n. The \r\n should be removed from the parsed text, but \n should be preserved like any other character. So I don't think "universal newline mode" is the correct approach for reading raw e-mails.

I have tested the policies in Python 3. You are right, it works. But I cannot use them because my application is incompatible with Python 3.

I was hoping this would be easy to fix for someone more experienced than me. If not, you can close the issue and I will stay with my present solution (removing \r\n).

Thanks for all your help!
msg299621 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-08-01 15:34
Yes, that is most likely why parseaddr operates the way it does.  The old email package does not do very much hand-holding; it expects you to understand the RFCs, which as you note is a rather daunting task.  The new email package (the new policies) in python3 aims to incorporate as much understanding of the RFCs into the library as possible and "do the right thing" automatically so you don't have to worry about it (it can't hide 100%, though...).

As for universal newline mode, you are correct that technically \n by itself is data per the RFC (and illegal in the middle of a quoted string like that), but the way Python handles "text" is to convert \r\n into \n internally.  So while parseaddr is doing the "right thing" per the RFC, the input parsing parts of the email package in fact accept \n or even mixed line endings to accommodate the difference between unix/python line endings and RFC line endings.
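The tolerance for both line endings can be checked with a short sketch, assuming Python 3 and the default policy. The header text, the helper from_address, and the address user@example.com are made up for illustration and do not come from the report.

```python
from email import message_from_string
from email.policy import default

# Hypothetical folded From header; {eol} is filled with the line ending under test.
TEMPLATE = 'From: "Anita{eol} Example" <user@example.com>{eol}{eol}test'


def from_address(raw):
    """Parse a raw message and return the first address of its From header."""
    msg = message_from_string(raw, policy=default)
    return msg['from'].addresses[0]


crlf = from_address(TEMPLATE.format(eol='\r\n'))  # RFC line endings
lf = from_address(TEMPLATE.format(eol='\n'))      # unix/python line endings

print(crlf.display_name, crlf.addr_spec)
print(lf.display_name == crlf.display_name, lf.addr_spec == crlf.addr_spec)
```

If the parser treats both forms alike, as described above, the two parsed addresses agree field by field.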
Date                 User            Action  Args
2017-08-01 15:34:23  r.david.murray  set     messages: + msg299621
2017-08-01 09:29:36  robertus        set     status: open -> closed; resolution: wont fix; messages: + msg299616; stage: resolved
2017-07-31 15:41:58  r.david.murray  set     messages: + msg299569
2017-07-31 15:32:47  r.david.murray  set     messages: + msg299568
2017-07-31 14:43:59  robertus        set     nosy: + barry, r.david.murray; type: behavior; components: + email
2017-07-31 14:43:18  robertus        create