Issue 35547: email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/79728

classification

Title:	email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers
Type:	behavior	Stage:
Components:	email	Versions:	Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Jeffrey.Kintscher, barry, era, mjpieters, r.david.murray
Priority:	normal	Keywords:

Created on 2018-12-20 17:54 by mjpieters, last changed 2022-04-11 14:59 by admin.

Messages (6)
msg332243 - (view)	Author: Martijn Pieters (mjpieters) *	Date: 2018-12-20 17:54
The From header in the following email headers is not correctly decoded. Both the Subject and From headers contain UTF-8 data encoded as RFC2047 encoded-words, and in both cases a multi-byte UTF-8 codepoint has been split between the two encoded-word tokens:

>>> msgdata = '''\
From: =?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=
 =?utf-8?b?seGbiw==?= <martijn@example.com>
Subject: =?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=
 =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?=
'''
>>> from io import StringIO
>>> from email.parser import Parser
>>> from email import policy
>>> msg = Parser(policy=policy.default).parse(StringIO(msgdata))
>>> print(msg['Subject'])  # correct
sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ
>>> print(msg['From'])  # incorrect
ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖ� �ᛋ <martijn@example.com>

Note the two FFFD placeholders in the From line.

The issue is that the raw values of the From and Subject headers contain the folding space at the start of the continuation lines:

>>> for name, value in msg.raw_items():
...     if name in {'Subject', 'From'}:
...         print(name, repr(value))
...
From '=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n =?utf-8?b?seGbiw==?= <martijn@example.com>'
Subject '=?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=\n =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?='

For the Subject header, _header_value_parser.get_unstructured is used, which expects there to be spaces between encoded words; it inserts EWWhiteSpaceTerminal tokens in between, which are turned into empty strings. But for the From header, the AddressHeader parser does not; the space at the start of the line is retained, and the surrogate escapes at the end of one encoded-word and the start of the next encoded-word never adjoin, so the later handling that turns surrogates back into proper data fails.
Since unstructured header parsing doesn't mind if a space is missing between encoded-word atoms, the work-around is to explicitly remove the space at the start of every line; this can be done in a custom policy:

import re
from email.policy import EmailPolicy

class UnfoldingHeaderEmailPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove any leading whitespace from header lines
        # before further processing
        value = re.sub(r'(?<=[\n\r])([\t ])', '', value)
        return super().header_fetch_parse(name, value)

custom_policy = UnfoldingHeaderEmailPolicy()

after which the From header comes out without placeholders:

>>> msg = Parser(policy=custom_policy).parse(StringIO(msgdata))
>>> msg['from']
'ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖᚱᛋ <martijn@example.com>'
>>> msg['subject']
'sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ'

This issue was found by way of https://stackoverflow.com/q/53868584/100297
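To make the split codepoint visible, here is a minimal sketch that base64-decodes the two payloads from the From header above, first separately and then joined. It uses only the standard library; the helper name decodes_cleanly and the variable names are illustrative and not part of the email package.

```python
import base64

# Base64 payloads of the two encoded-words in the From header above.
first = '4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo='
second = 'seGbiw=='

first_bytes = base64.b64decode(first)
second_bytes = base64.b64decode(second)


def decodes_cleanly(data):
    """Return True if data is valid UTF-8 on its own (illustrative helper)."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# Each payload alone holds only part of a three-byte character.
print(decodes_cleanly(first_bytes), decodes_cleanly(second_bytes))

# Joined at the byte level, the sequence is valid UTF-8 again.
joined = (first_bytes + second_bytes).decode('utf-8')
print(joined)
```

The first payload ends with the two bytes e1 9a and the second starts with b1, so the character can only be recovered if the bytes of both encoded-words are concatenated before decoding.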
msg332276 - (view)	Author: Martijn Pieters (mjpieters) *	Date: 2018-12-21 00:52
Right, re-educating myself on the MIME RFCs, I found https://bugs.python.org/issue1372770 where the same issue is being discussed for previous incarnations of the email library.

Removing the FWS after CRLF is the wrong thing to do, unless it separates RFC2047 encoded-word tokens. The work-around regex is a bit more complicated, but ideally the EW handling should use a specialist FWS token to delimit encoded-word sections that renders to '', as is done in unstructured headers, but everywhere. Because in practice, there are email clients out there that use EW in structured headers, regardless.

Regex to work around this:

# crude CRLF-FWS-between-encoded-word matching
value = re.sub(r'(?<=\?=(\r\n|\n|\r))([\t ]+)(?==\?)', '', value)
msg332277 - (view)	Author: Martijn Pieters (mjpieters) *	Date: 2018-12-21 01:11
That regex is incorrect, I should not post untested code from a mobile phone. Corrected workaround with more context:

import re
from email.policy import EmailPolicy

class UnfoldingEncodedStringHeaderPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove any leading whitespace from header lines
        # that separates apparent encoded-word tokens before further processing
        # using somewhat crude CRLF-FWS-between-encoded-word matching
        value = re.sub(r'(?<=\?=)((?:\r\n|[\r\n])[\t ]+)(?==\?)', '', value)
        return super().header_fetch_parse(name, value)
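The difference between the first workaround regex and the corrected one can be checked on raw header values without involving the parser at all. The following is a small sketch; raw_from is the raw From value shown earlier, while raw_plain and the pattern names strip_all and strip_ew are made up for illustration.

```python
import re

# First workaround: strips one whitespace character after every line break.
strip_all = re.compile(r'(?<=[\n\r])([\t ])')
# Corrected workaround: strips the fold only between two encoded-words.
strip_ew = re.compile(r'(?<=\?=)((?:\r\n|[\r\n])[\t ]+)(?==\?)')

raw_from = ('=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n'
            ' =?utf-8?b?seGbiw==?= <martijn@example.com>')
# Hypothetical folded header value that contains no encoded-words.
raw_plain = 'a long plain\n subject line'

# The fold between the two encoded-words is removed.
print(repr(strip_ew.sub('', raw_from)))
# A fold between ordinary words is left alone by the corrected regex...
print(repr(strip_ew.sub('', raw_plain)))
# ...but loses its space with the first regex.
print(repr(strip_all.sub('', raw_plain)))
```

With the first regex, the space that separates "plain" and "subject" is dropped from the raw value, which is why removing all folding whitespace is too aggressive.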
msg332279 - (view)	Author: (era)	Date: 2018-12-21 05:52
I don't think this is a bug. My impression is that encoded words should be decodable in isolation.
msg332290 - (view)	Author: Martijn Pieters (mjpieters) *	Date: 2018-12-21 11:34
While RFC2047 clearly states that an encoder MUST NOT split multi-byte encodings in the middle of a character (section 5, "Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's."), it also states that to fit length restrictions, CRLF SPACE is used as a delimiter between encoded words (section 2, "If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used."). In section 6.2 it states:

    When displaying a particular header field that contains multiple 'encoded-word's, any 'linear-white-space' that separates a pair of adjacent 'encoded-word's is ignored. (This is to allow the use of multiple 'encoded-word's to represent long strings of unencoded text, without having to separate 'encoded-word's where spaces occur in the unencoded text.)

(linear-white-space is the RFC822 term for foldable whitespace).

The parser is leaving spaces between two encoded-word tokens in place, where it must remove them instead. It does remove them correctly for unstructured headers, just not in get_bare_quoted_string, get_atom and get_dot_atom.

Then there is Postel's law (be liberal in what you accept from others), and the email package already applies that principle to RFC2047 elsewhere; RFC2047 also states that "An 'encoded-word' MUST NOT appear within a 'quoted-string'." yet email._header_value_parser's handling of quoted-string will process EW sections.
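As a point of comparison for the section 6.2 rule, the legacy email.header.decode_header function (not mentioned in this thread, brought in here only for illustration) drops the whitespace between adjacent encoded-words and joins their payloads as bytes when the charsets match. A minimal sketch, using only the two encoded-words from the From header and assuming a Python 3 version where decode_header collapses adjacent words this way:

```python
from email.header import decode_header

# The two folded encoded-words from the From header, without the address.
folded = ('=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n'
          ' =?utf-8?b?seGbiw==?=')

parts = decode_header(folded)

# Each part is a (bytes or str, charset) pair; decode the bytes parts.
text = ''.join(
    chunk.decode(charset or 'ascii') if isinstance(chunk, bytes) else chunk
    for chunk, charset in parts
)
print(text)
```

Because the bytes are concatenated before the charset is applied, the split character survives, which is the same effect the unstructured parser achieves with EWWhiteSpaceTerminal.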
msg332296 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2018-12-21 17:49
Here's a patch that makes the example work correctly. This is not a fix, a real fix will be more complicated. This just demonstrates the kind of thing that needs fixing and where.

The existing parser produces a sub-optimal parse tree as its result...the parse tree is hard to inspect and manipulate because there are so many special cases. A good fix here would create some sort of function that could be passed an existing TokenList, the new token to add to that list, and the function would check all the special cases and do the EWWhiteSpaceTerminal substitution when and as appropriate. This could then be used in the unstructured parser as well as Phrase...and some thought should be given to where else it might be needed. It has been long enough since I've held the RFCs in my head that I don't remember if there is anywhere else.

I haven't looked at the actual character string, so I don't know if we need to also be detecting and posting a defect about a split character or not, but we don't have to answer that question to fix this.

diff --git a/Lib/email/_header_value_parser.py b/Lib/email/_header_value_parser.py
index e805a75..d5d5986 100644
--- a/Lib/email/_header_value_parser.py
+++ b/Lib/email/_header_value_parser.py
@@ -199,6 +199,10 @@ class CFWSList(WhiteSpaceTokenList):

 class Atom(TokenList):

+    @property
+    def has_encoded_word(self):
+        return any(t.token_type=='encoded-word' for t in self)
+
     token_type = 'atom'


@@ -1382,6 +1386,12 @@ def get_phrase(value):
                         "comment found without atom"))
                 else:
                     raise
+            if token.has_encoded_word:
+                assert phrase[-1].token_type == 'atom', phrase[-1]
+                assert phrase[-1][-1].token_type == 'cfws'
+                assert phrase[-1][-1][-1].token_type == 'fws'
+                if phrase[-1].has_encoded_word:
+                    phrase[-1][-1] = EWWhiteSpaceTerminal(phrase[-1][-1][-1], 'fws')
             phrase.append(token)
     return phrase, value
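The helper described above (take the existing token list and the new token, and substitute the whitespace when it sits between two encoded-words) can be sketched on a toy model. This is not the email package's TokenList or EWWhiteSpaceTerminal; the Token class, the kind names and append_token are all hypothetical and only illustrate the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str  # hypothetical kinds: 'encoded-word', 'fws', 'ew-fws', 'text'
    text: str  # the text this token renders to


def append_token(tokens, new):
    """Append new; blank out FWS sandwiched between two encoded-words."""
    if (new.kind == 'encoded-word' and len(tokens) >= 2
            and tokens[-1].kind == 'fws'
            and tokens[-2].kind == 'encoded-word'):
        # Stand-in for the EWWhiteSpaceTerminal substitution: renders to ''.
        tokens[-1] = Token('ew-fws', '')
    tokens.append(new)


def render(tokens):
    return ''.join(t.text for t in tokens)


tokens = []
for tok in [Token('encoded-word', 'ab'), Token('fws', ' '),
            Token('encoded-word', 'cd'), Token('fws', ' '),
            Token('text', '<x@example.com>')]:
    append_token(tokens, tok)

print(render(tokens))
```

Only the whitespace between the two encoded-words is suppressed; the whitespace before the plain text token is kept. A real helper would also have to deal with the nested atom/cfws/fws structure that the asserts in the patch spell out.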

History
Date	User	Action	Args
2022-04-11 14:59:09	admin	set	github: 79728
2019-05-31 06:45:27	Jeffrey.Kintscher	set	nosy: + Jeffrey.Kintscher
2018-12-21 17:49:59	r.david.murray	set	messages: + msg332296
2018-12-21 11:34:21	mjpieters	set	messages: + msg332290 title: email.parser / email.policy does correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers -> email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers
2018-12-21 05:52:06	era	set	nosy: + era messages: + msg332279
2018-12-21 01:11:19	mjpieters	set	messages: + msg332277
2018-12-21 00:52:03	mjpieters	set	messages: + msg332276
2018-12-21 00:15:26	mjpieters	set	nosy: + barry, r.david.murray type: behavior components: + email versions: + Python 3.7
2018-12-20 17:54:07	mjpieters	create