Issue 1079: decode_header does not follow RFC 2047

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/45420

classification

Title:	decode_header does not follow RFC 2047
Type:	behavior	Stage:	resolved
Components:	email	Versions:	Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	r.david.murray	Nosy List:	barry, ggenellina, ishimoto, jafo, kael, leromarinvit, phahn, python-dev, r.david.murray, runtux, tkikuchi, tlynn, tony_nelson
Priority:	normal	Keywords:	patch

Created on 2007-09-01 08:56 by kael, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
header_encwd_nows.patch	tony_nelson, 2009-04-04 02:08	encoded words don't always require whitespace
issue1079-test.patch	r.david.murray, 2009-04-04 22:21	test case for encoded word followed immediately by a 'special'
rfc2047_embed.patch	leromarinvit, 2010-04-10 14:46	embed encoded words in surrounding text if there's no whitespace around them
u2u_decode.py	tkikuchi, 2010-09-27 01:18	a recipe for decoding non-compliant RFC2047 header
python.patch	runtux, 2012-01-03 15:25		review
python.patch.2	runtux, 2012-01-03 16:11		review
python.patch3	runtux, 2012-01-03 21:44

Messages (23)
msg55555 - (view)	Author: Mickaël Guérin (kael)	Date: 2007-09-01 08:56
email.header.decode_header expect a space or end of line after the end of an encoded word ("?="). There is nothing about that thing in RFC 2047. Python 2.5.1 ChangeLog seems to indicate that this bug has been solved. Unfortunately, the function still don't work. A visible effet of the bad regex used has the consequence found in Issue 1467619 it seems there are 2 different regex with the same purpose in two different files (ecre in header.py & ecre in utils.py). the one in utils.py seems to give better results.
msg55951 - (view)	Author: Sean Reifschneider (jafo) *	Date: 2007-09-17 06:46
Can you provide an example of an address that triggers this? Preferably in a code sample that can be used to reproduce it? Uber-ideally, a patch to the email module test suite would be great.
msg81068 - (view)	Author: Tom Lynn (tlynn)	Date: 2009-02-03 17:01
The only difference between the two regexps is that the email/header.py version looks for:: (?=[ \t]\|$) # whitespace or the end of the string at the end (with re.MULTILINE, so $ also matches '\n'). To expand on "There is nothing about that thing in RFC 2047", it says:: IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's by an RFC 822 parser. RFC 822 says:: atom = 1*<any CHAR except specials, SPACE and CTLs> ... specials = "(" / ")" / "<" / ">" / "@" ; Must be in quoted- / "," / ";" / ":" / "\" / <"> ; string, to use / "." / "[" / "]" ; within a word. So an example of mis-parsing is:: >>> import email.header >>> h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)' >>> email.header.decode_header(h) [('=?utf-8?q?=E2=98=BA?=(unicode white smiling face)', None)] The correct result would be:: >>> email.header.decode_header(h) [('\xe2\x98\xba', 'utf-8'), ('(unicode white smiling face)', None)] which is what you get if you insert a space before the '(' in h.
msg85366 - (view)	Author: Tony Nelson (tony_nelson)	Date: 2009-04-04 02:08
I think the problem is best viewed as headers are not being parsed according to RFC2822 and decoded after that, so the recognition of encoded words should be looser, and not require whitespace around them, as it is not required in all contexts. Patch and test, tested on 2.6.1, 2.7trunk. The test mostly just reverses the sense of test_rfc2047_without_whitespace().
msg85441 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2009-04-04 22:21
Tony, I don't think I agree with your reading of the RFC. IMO, your inversion of test_rfc2047_without_whitespace is not correct. '=' is not a 'special' in RFC[2]822 terms, so the atom does not end at the apparent end of the encoded word. I say apparent because if I'm interpreting the RFC correctly it isn't a valid encoded word. I presume you are thinking that once you've got an atom composed of several encoded words, there's no reason not to parse them into individual words (and I'm inclined to agree with you), but the RFC BNF doesn't support that interpretation as far as I can see. That is, there is no indication I could find that an atom can be composed of multiple encoded words. The encoded word followed by a 'special' is more subtle. In section 5, the RFC says: An 'encoded-word' that appears within a 'phrase' MUST be separated from any adjacent 'word', 'text' or 'special' by 'linear-white-space'. This would apply to encoded words in names in To and From headers. But in other places where an encoded word can appear the requirement of white-space separation from specials is not asserted. It's not clear how to make this RFC compliant without implementing a full BNF parser :( It would probably be reasonable to fix this case so that a word followed by a special with no intervening white space was allowed. I've attached a test case for the 'special' example based on that idea.
msg85448 - (view)	Author: Tony Nelson (tony_nelson)	Date: 2009-04-04 23:17
The email package does not follow the RFCs in anything to do with header parsing or decoding. This is a known deficiency. So no, I am not thinking of atoms at all -- and neither is email.header.decode_header()! :-( Until email.header actually parses headers into atoms and then decodes atoms, it doesn't matter what parsed atoms would look like. Currently, email.header.decode_header() just stumbles through raw text, and doesn't know if it is looking at atoms or not, or usually even what header the text came from. In order to interpret the RFC correctly, email.header.decode_header() needs either a parser and the name of the header it is decoding, or parsed header data. I think the latter is being considered for a redesign of the email package for 3.1 or 3.2 (3 months to a year or so, and not for 2.x at all), but until then, it is better to decode every likely encoded-word than to skip encoded-words that, for example, have a parenthesis on one side or the other.
msg85763 - (view)	Author: Atsuo Ishimoto (ishimoto) *	Date: 2009-04-08 07:34
+1 for Tony's patch. This patch reverts fix for Issue1582282 filed by tkikuchi. I cannot understand the rationale for solution proposed in Issue1582282. How does the fix make easier to read mails from Entourage?
msg102781 - (view)	Author: Oliver Martin (leromarinvit)	Date: 2010-04-10 14:46
I got bitten by this too. In addition to not decoding encoded words without whitespace after them, it throws an exception if there is a valid encoded word later in the string and the first encoded word is followed by something that isn't a hex number: >>> decode_header('aaa=?iso-8859-1?q?bbb?=xxx asdf =?iso-8859-1?q?jkl?=') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python2.5/email/header.py", line 93, in decode_header dec = email.quoprimime.header_decode(encoded) File "/usr/lib/python2.5/email/quoprimime.py", line 336, in header_decode return re.sub(r'=\w{2}', _unquote_match, s) File "/usr/lib/python2.5/re.py", line 150, in sub return _compile(pattern, 0).sub(repl, string, count) File "/usr/lib/python2.5/email/quoprimime.py", line 324, in _unquote_match return unquote(s) File "/usr/lib/python2.5/email/quoprimime.py", line 106, in unquote return chr(int(s[1:3], 16)) ValueError: invalid literal for int() with base 16: 'xx' I think it should join the encoded words with the surrounding text if there's no whitespace in between. That seems to be consistent with what the non-RFC-compliant MUAs out there mean when they send such things. Reverting the change from Issue 1582282 doesn't seem to be a good idea, since it was introduced in response to problems with mailman (see https://bugs.launchpad.net/mailman/+bug/266370). Instead of leaving "Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord" as it is, my patch converts it to [('Sm\xf6rg\xe5sbord', 'iso-8859-1')]. This shouldn't reintroduce the problem mailman was having while fixing ours.
msg117434 - (view)	Author: Tokio Kikuchi (tkikuchi)	Date: 2010-09-27 01:18
Hi, all I am against applying these patches because they will insert space separations in re-composed header (with str() function). Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord -> [('Sm', None), ('\xf6', 'iso-8859-1'), ('rg', None), ('\xe5', 'iso-8859-1'), ('sbord', None)] -> Sm =?iso-8859-1?q?=F6?= rg =?iso-8859-1?q?=E5?= sbord Instead, I submit a small recipe for decoding non-compliant RFC2047 header where space separation is not properly inserted between encoded_word and us-ascii characters. Please try!
msg150461 - (view)	Author: Ralf Schlatterbeck (runtux) *	Date: 2012-01-02 16:09
maybe it would be a good start to include the examples at the end of RFC2047 into the regression tests? These examples at least support the case that a '?' may immediately follow an encoded string: encoded form displayed as (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=) (ab) when trying this in python 2.7: >>> decode_header ('(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)') [('(', None), ('a', 'iso-8859-1'), ('=?ISO-8859-1?Q?b?=)', None)] this fails. So I consider this a bug. Note that although RFC2047 is vague concerning the interpretation if two encoded strings could follow each other without a whitespace, these are seen in the wild and are interpreted correctly by the mailers I've tested: mutt, thunderbird, exchange in various versions, even lotus notes seems to get this right. So I guess python should be "liberal in what you accept" and parse something like '(=?ISO-8859-1?Q?a?==?ISO-8859-1?Q?b?=)' into [ ('(', None) , ('a', 'iso-8859-1') , ('b', 'iso-8859-1') , (')', None) ]
msg150471 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-01-02 18:14
The RFC isn't at all vague about encoded words not separated by white space. That isn't allowed by the BNF. As you say, though, they occur in the wild and should be parsed correctly. In your other point I think you mean "immediately followed by a )", right? Yes, that is allowed and no, we don't currently parse that correctly. Adding the RFC tests would be great (patches gladly accepted). Fixes for ones we fail would be great, too, but at the very least we can mark them as expected failures. I don't usually like adding tests that we expect to fail, but in the case of externally defined tests such as the RFC examples I think it is worthwhile, so that we can check in a complete test set.
msg150501 - (view)	Author: Ralf Schlatterbeck (runtux) *	Date: 2012-01-03 15:25
Fine, I see what you mean, this involves very careful reading of the RFC and could have been a little more verbose ... Right. Should have been a ')' > Adding the RFC tests would be great (patches gladly accepted). Fixes > for ones we fail would be great, too, but at the very least we can > mark them as expected failures. I don't usually like adding tests > that we expect to fail, but in the case of externally defined tests > such as the RFC examples I think it is worthwhile, so that we can > check in a complete test set. Patch attached (against current tip, 74241:120a79b8bb11). We currently fail all of the tests in the RFC due to the same problem, the closing ')', I've marked them accordingly. I've made the 5th test (with newline in the string) two cases, one with \r\n for the newline, one with only \n. They fail differently. I plan to look into this a little more, my current plan is to make the outer regex non-greedy (if possible) and remove the trailing whitespace. That would involve parsing (and ignoring) additional whitespace between encoded words but not at the boundary to a non-encoded word. Any objections/further infos? Ralf -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting www: http://www.runtux.com Reichergasse 131, A-3411 Weidling email: office@runtux.com osAlliance member email: rsc@osalliance.com
msg150506 - (view)	Author: Ralf Schlatterbeck (runtux) *	Date: 2012-01-03 16:11
enclosed please find a fixed patch -- decode_header consolidates multiple encoded strings with the same encoding into a single entry in the returned parts. -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting www: http://www.runtux.com Reichergasse 131, A-3411 Weidling email: office@runtux.com osAlliance member email: rsc@osalliance.com
msg150508 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-01-03 16:17
Well, a caution that tweaking the regex can have unexpected consequences as past issues have proven (but by all means go for it), and a note that the parsing strategy is going to change completely in email6 (see http://pypi.python.org/email and http://hg.python.org/features/email6). I think your tests should pass on that branch; I'll be interested to try it when I get some time. (Note: I'm removing 3.1 from versions since it doesn't get bug fixes any more.) Also, I'm not sure the (non-essential) change to consolidate like-charset encoded words is appropriate for a bug fix. It's hard to see how it would break anything, but why take the risk if it isn't needed to fix the bug.
msg150509 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-01-03 16:18
Gah, that's what I get for not reading carefully (or looking at the patch first). Your test change is fine, of course.
msg150535 - (view)	Author: Ralf Schlatterbeck (runtux) *	Date: 2012-01-03 21:44
Attached please find a patch that - keeps all spaces between non-encoded and encoded parts - doesn't create spaces between non-encoded and encoded parts in case these are already there or not needed (because they are non-ctext characters of RFC822 like ')') in the methods "encode" and "__str__" of class Header. in all other cases spaces are still inserted, this keeps many tests happy and probably won't break too much existing code. I've re-read RFC2047 (and parts of 822) and now share your opinion that it requires that encoded parts must be followed by a 'linear-white-space' if the following (or preceding) token is text or ctext. (p.7 5. Use of encoded-words in message headers) With the special-casing of ctext characters mentioned above, roundtripping is now possible, so if you parse a normalized string consisting of encoded and non-encoded parts, (even multiple) whitespace is preserved. I still think we should do it like everyone else and not automatically insert whitespace at boundaries between encoded and non-encoded words, even if the RFC requires it. Someone wanting to create headers consisting of mixed encoded/non-encoded parts without whitespace must know what to do anyway -- the previous implementation also didn't check for all border cases. I've not yet tested this against the email6 branch you mentioned. Note that I didn't have to make the regex non-greedy, it already was. I've just removed the whitespace at the end of the regex. I've changed all the tests that test for removal of whitespace between non-encoded and encoded parts. Obviously I've also changed a test that relied on failing to parse adjacent encoded strings. But please look at my changes of the tests. The rfc2047 tests now all pass. The patch also fixes issue1467619 "Header.decode_header eats up spaces" Ralf -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting www: http://www.runtux.com Reichergasse 131, A-3411 Weidling email: office@runtux.com osAlliance member email: rsc@osalliance.com
msg161803 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-05-28 20:15
Ralf, thanks very much for this patch. I'm considering applying it. Given that the current code breaks on parsing various legitimate constructs, it seems like the behavior change (preserving whitespace in the non-EW parts...which IMO is correct) should be an acceptable tradeoff. Could you please submit a contributor agreement? (http://www.python.org/psf/contrib/)
msg161861 - (view)	Author: Ralf Schlatterbeck (runtux) *	Date: 2012-05-29 09:03
On Mon, May 28, 2012 at 08:15:05PM +0000, R. David Murray wrote: > > R. David Murray <rdmurray@bitdance.com> added the comment: > > Ralf, thanks very much for this patch. I'm considering applying it. > Given that the current code breaks on parsing various legitimate > constructs, it seems like the behavior change (preserving whitespace > in the non-EW parts...which IMO is correct) should be an acceptable > tradeoff. > > Could you please submit a contributor agreement? > (http://www.python.org/psf/contrib/) Thanks for considering my patch. I've just sent the agreement. Ralf
msg162181 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-06-02 21:57
New changeset 8c03fe231877 by R David Murray in branch 'default': #1079: Fix parsing of encoded words. http://hg.python.org/cpython/rev/8c03fe231877
msg162183 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-06-02 21:59
I've applied this to 3.3. Because the preservation of spaces around the ascii parts is a visible behavior change that could cause working programs to break, I don't think I can backport it. I'm going to leave this open until I can consult with Barry to see if he thinks a backport is justified. Anyone else can feel free to chime in with an opinion as well :)
msg162216 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2012-06-03 16:06
On Jun 02, 2012, at 09:59 PM, R. David Murray wrote: >I've applied this to 3.3. Because the preservation of spaces around the >ascii parts is a visible behavior change that could cause working programs to >break, I don't think I can backport it. I'm going to leave this open until I >can consult with Barry to see if he thinks a backport is justified. Anyone >else can feel free to chime in with an opinion as well :) I think a backport is risky.
msg162218 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-06-03 16:10
OK, I'm closing this, then, and will close the related issues as well. Thanks again for the patch, Ralf.
msg162221 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-06-03 16:27
New changeset 0808cb8c60fd by R David Murray in branch 'default': #2658: Add test for issue fixed by fix for #1079. http://hg.python.org/cpython/rev/0808cb8c60fd

History
Date	User	Action	Args
2022-04-11 14:56:26	admin	set	github: 45420
2012-06-03 22:11:21	r.david.murray	link	issue10574 superseder
2012-06-03 16:28:41	r.david.murray	link	issue2658 superseder
2012-06-03 16:27:21	python-dev	set	messages: + msg162221
2012-06-03 16:14:50	r.david.murray	link	issue1467619 superseder
2012-06-03 16:10:24	r.david.murray	set	status: open -> closed versions: - Python 2.7, Python 3.2 messages: + msg162218 resolution: fixed stage: patch review -> resolved
2012-06-03 16:06:32	barry	set	messages: + msg162216
2012-06-02 21:59:26	r.david.murray	set	messages: + msg162183
2012-06-02 21:57:11	python-dev	set	nosy: + python-dev messages: + msg162181
2012-05-29 09:03:27	runtux	set	messages: + msg161861
2012-05-28 20:15:04	r.david.murray	set	messages: + msg161803 components: + email, - Library (Lib)
2012-04-20 14:43:03	phahn	set	nosy: + phahn
2012-01-03 21:44:13	runtux	set	files: + python.patch3 messages: + msg150535
2012-01-03 16:18:26	r.david.murray	set	messages: + msg150509
2012-01-03 16:17:11	r.david.murray	set	messages: + msg150508 versions: - Python 3.1
2012-01-03 16:11:58	runtux	set	files: + python.patch.2 messages: + msg150506
2012-01-03 15:25:30	runtux	set	files: + python.patch messages: + msg150501
2012-01-02 18:14:27	r.david.murray	set	messages: + msg150471
2012-01-02 16:09:52	runtux	set	nosy: + runtux messages: + msg150461
2011-03-13 23:34:04	r.david.murray	set	nosy: barry, jafo, ishimoto, tlynn, ggenellina, tkikuchi, tony_nelson, kael, r.david.murray, leromarinvit versions: + Python 3.3
2010-11-30 16:50:46	r.david.murray	set	assignee: r.david.murray
2010-09-27 01:18:19	tkikuchi	set	files: + u2u_decode.py messages: + msg117434
2010-09-18 13:19:47	BreamoreBoy	set	stage: needs patch -> patch review versions: + Python 3.2, - Python 2.6, Python 3.0
2010-04-10 14:46:07	leromarinvit	set	files: + rfc2047_embed.patch nosy: + leromarinvit messages: + msg102781
2009-12-30 02:32:51	r.david.murray	link	issue4491 superseder
2009-04-08 07:34:18	ishimoto	set	nosy: + tkikuchi, ishimoto messages: + msg85763
2009-04-04 23:17:22	tony_nelson	set	messages: + msg85448
2009-04-04 22:21:04	r.david.murray	set	files: + issue1079-test.patch type: behavior versions: + Python 3.0, Python 3.1, - Python 2.5 nosy: + r.david.murray messages: + msg85441 stage: needs patch
2009-04-04 02:08:50	tony_nelson	set	files: + header_encwd_nows.patch versions: + Python 2.6, Python 2.7 nosy: + barry, tony_nelson messages: + msg85366 keywords: + patch
2009-03-27 20:50:57	amaury.forgeotdarc	link	issue4958 superseder
2009-02-04 22:02:59	ggenellina	set	nosy: + ggenellina
2009-02-03 17:02:00	tlynn	set	nosy: + tlynn messages: + msg81068
2007-09-17 06:46:32	jafo	set	priority: normal nosy: + jafo messages: + msg55951
2007-09-01 08:56:59	kael	create