Issue 31089: email.utils.parseaddr fails on odd double quotes in multiline header


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/75272

classification

Title:	email.utils.parseaddr fails on odd double quotes in multiline header
Type:	behavior	Stage:	resolved
Components:	email	Versions:	Python 3.5, Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, r.david.murray, robertus
Priority:	normal	Keywords:

Created on 2017-07-31 14:43 by robertus, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (5)
msg299558 - (view)	Author: Robert (robertus)	Date: 2017-07-31 14:43
email.utils.parseaddr() fails to parse a field value into a (comment, address) pair when the From header is folded over two or more lines and each of those lines contains an odd number of double quotes. The address in the returned tuple is then not the e-mail address but a part of the comment. For example:

"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_\|_PATO_Nieruch?= =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>

is parsed into:

('', '=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_\|_PATO_Nieruch?=')

Full example on Python 2.7.12, email 4.0.2:

Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.utils import parseaddr
>>> parseaddr('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_\|_PATO_Nieruch?=\r\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>')
('', '=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_\|_PATO_Nieruch?=')

When the double quotes or the \r\n are removed, the header is parsed without problems. The same issue exists on Python 3.5.2 with email 6.0.0a1.

From analysing the headers I know that the e-mail was composed in Outlook 14.0 and then sent through Exim 4.87 to outlook.com servers.
msg299568 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2017-07-31 15:32
parseaddr does what you expect if the message has been read using universal newline mode (i.e. the linesep is \n):

>>> parseaddr('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_\|_PATO_Nieruch?=\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>"')
('=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_\|_PATO_Nieruch?=\n =?UTF-8?Q?omo=C5=9Bci?=', 'anita.wiecklinska@pato.com.pl')

I suppose this wouldn't be that hard to fix. If it isn't too complex and you want to propose a patch I'll take a look.

In any case it works fine in python3 using the new policies:

>>> from email import message_from_string as mfs
>>> from email.policy import default
>>> m = mfs('From: "=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_\|_PATO_Nieruch?=\r\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>"\r\n\r\ntest', policy=default)
>>> m['from'].addresses
(Address(display_name='Anita =W4\udc86iecklińska \| PATO Nieruch omości', username='anita.wiecklinska', domain='pato.com.pl'),)
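
For readers who want to try the policy-based approach on Python 3, here is a minimal sketch. It uses a simplified folded From header; the name and address are made-up placeholders, not the header from this report, and the snippet only illustrates that the default policy unfolds the header before the address is parsed.

```python
from email import message_from_string
from email.policy import default

# Hypothetical folded header: the quoted display name is split across two
# physical lines with CRLF followed by a space (RFC-style folding).
raw = (
    'From: "Anita Example\r\n'
    ' Company" <anita@example.com>\r\n'
    '\r\n'
    'test'
)

# With policy=default the From header is returned as a structured header
# object that exposes the parsed addresses.
msg = message_from_string(raw, policy=default)
address = msg['from'].addresses[0]

print(address.display_name)
print(address.username, address.domain)
```

The `addresses` attribute is a tuple of `Address` objects, so a header with several mailboxes can be handled by iterating over it.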
msg299569 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2017-07-31 15:41
Ah, I take it back. With \n it retains the \n in the decoded name field. There is a bug of some sort here (\r\n should be treated the same as \n, I think, whatever way it is treated).

I don't think this is worth addressing, given that the new policies provide a much better API for interacting with Messages, and you can in fact easily unfold the line before parsing it if you need to do it in 2.7:

>>> parseaddr(''.join(m['from'].splitlines()))
('=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_\|_PATO_Nieruch?= =?UTF-8?Q?omo=C5=9Bci?=', 'anita.wiecklinska@pato.com.pl')
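
The unfold-then-parse workaround can be wrapped in a small helper. This is only a sketch: the function name `parse_folded_address` and the sample header are hypothetical, and the helper treats \r\n and bare \n alike, as suggested in the message above.

```python
from email.utils import parseaddr


def parse_folded_address(value):
    """Unfold a header value, then hand it to parseaddr().

    splitlines() drops the line breaks; the whitespace that starts each
    continuation line is kept, so words stay separated.
    """
    unfolded = ''.join(value.splitlines())
    return parseaddr(unfolded)


# Hypothetical folded header value (placeholder name and address).
folded = '"Anita Example\r\n Company" <anita@example.com>'
name, addr = parse_folded_address(folded)
print(name)
print(addr)
```

Note that `str.splitlines()` also splits on a few characters other than \r and \n (for example form feed), so this helper is slightly more aggressive than strict header unfolding.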
msg299616 - (view)	Author: Robert (robertus)	Date: 2017-08-01 09:29
The RFC on this topic looks quite complicated to me, but I know that \r\n is used for line breaking in e-mail headers and \n is not. So in my opinion it shouldn't be treated the same as \n. The \r\n should be removed from the parsed text, but \n should be preserved like any other character. So I don't think "universal newline mode" is the correct approach for reading raw e-mails.

I have tested the policies in python3 and you are right, it works. But I cannot use them because my application is incompatible with python3. I was hoping it would be easy to fix for someone more experienced than me... If not, you can close the issue and I will stay with my present solution (removing \r\n).

Thanks for all your help!
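
A stricter variant of the "removing \r\n" workaround, which leaves a bare \n untouched, could look like the sketch below. The helper name `unfold_crlf_only`, the regular expression and the sample header are assumptions made for illustration; this is not code from the email package.

```python
import re
from email.utils import parseaddr

# A fold is a CRLF that is immediately followed by a space or a tab.
# Only the CRLF is removed; the following whitespace is kept.
_FOLD = re.compile(r'\r\n(?=[ \t])')


def unfold_crlf_only(value):
    """Remove CRLF folds but keep any bare \\n as ordinary data."""
    return _FOLD.sub('', value)


# Hypothetical folded header value (placeholder name and address).
folded = '"Anita Example\r\n Company" <anita@example.com>'
unfolded = unfold_crlf_only(folded)
print(unfolded)
print(parseaddr(unfolded))
```

Compared with joining the result of `splitlines()`, this version only touches line breaks that are actually folds, which matches the view expressed above that \n on its own is data.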
msg299621 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2017-08-01 15:34
Yes, that is most likely why parseaddr operates the way it does. The old email package does not do very much hand-holding; it expects you to understand the RFCs, which as you note is a rather daunting task. The new email package (the new policies) in python3 aims to incorporate as much understanding of the RFCs into the library as possible and "do the right thing" automatically so you don't have to worry about it (it can't hide 100%, though...).

As for universal newline mode, you are correct that technically \n by itself is data per the RFC (and illegal in the middle of a quoted string like that), but the way Python handles "text" is to convert \r\n into \n internally. So while parseaddr is doing the "right thing" per the RFC, the input parsing parts of the email package in fact accept \n or even mixed line endings to accommodate the difference between unix/python line endings and RFC line endings.
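
The point about the input parser accepting either line ending can be checked with a short sketch. It assumes Python 3 with the default policy; the header template and the helper name `from_addresses` are hypothetical.

```python
from email import message_from_string
from email.policy import default

# Hypothetical message template; {eol} is replaced by the line ending
# under test. The From header is folded inside the quoted display name.
TEMPLATE = 'From: "Anita Example{eol} Company" <anita@example.com>{eol}{eol}test'


def from_addresses(raw):
    """Parse a raw message and return the addresses of its From header."""
    return message_from_string(raw, policy=default)['from'].addresses


crlf_result = from_addresses(TEMPLATE.format(eol='\r\n'))
lf_result = from_addresses(TEMPLATE.format(eol='\n'))

# Both variants should yield the same parsed address.
print(crlf_result == lf_result)
```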

History
Date	User	Action	Args
2022-04-11 14:58:49	admin	set	github: 75272
2017-08-01 15:34:23	r.david.murray	set	messages: + msg299621
2017-08-01 09:29:36	robertus	set	status: open -> closed resolution: wont fix messages: + msg299616 stage: resolved
2017-07-31 15:41:58	r.david.murray	set	messages: + msg299569
2017-07-31 15:32:47	r.david.murray	set	messages: + msg299568
2017-07-31 14:43:59	robertus	set	nosy: + barry, r.david.murray type: behavior components: + email
2017-07-31 14:43:18	robertus	create