classification
Title: email.header.decode_header fails if the string contains multiple directives
Type: behavior
Stage: resolved
Components: email, Library (Lib)
Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: decode_header does not follow RFC 2047 (issue 1079)
Assigned To:
Nosy List: barry, invisibleroads, r.david.murray
Priority: normal
Keywords:

Created on 2010-11-29 04:58 by invisibleroads, last changed 2012-07-04 10:11 by invisibleroads. This issue is now closed.

Messages (8)
msg122772 - Author: Roy Hyunjin Han (invisibleroads) Date: 2010-11-29 04:58
email.header.decode_header fails for the following message subject:
::

    email.header.decode_header('=?UTF-8?B?MjAxMSBBVVRNIENBTEwgZm9yIE5PTUlO?==?UTF-8?B?QVRJT05TIG9mIFZQIGZvciBNZW1iZXJz?==?UTF-8?B?aGlw?=')


If the repeated `=?UTF-8?B?` prefixes and `?=` terminators are stripped so that the payload forms a single encoded word, and padding is appended, the subject parses correctly.
::

    email.header.decode_header('=?UTF-8?B?%s==?=' % '=?UTF-8?B?MjAxMSBBVVRNIENBTEwgZm9yIE5PTUlO?==?UTF-8?B?QVRJT05TIG9mIFZQIGZvciBNZW1iZXJz?==?UTF-8?B?aGlw?='.replace('=?UTF-8?B?', '').replace('?', '').replace('=', ''))
msg122773 - Author: Roy Hyunjin Han (invisibleroads) Date: 2010-11-29 05:12
Currently using the following workaround.

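import re
import email.header

def decodeSafely(x):
    # Find the first encoded-word prefix, e.g. '=?UTF-8?B?'
    match = re.search(r'(=\?.*?\?B\?)', x)
    if not match:
        return x
    encoding = match.group(1)
    # Strip every prefix and all '?' and '=' delimiters, then rewrap the
    # remaining base64 payload as one encoded word with '==' padding appended
    return email.header.decode_header('%s%s==?=' % (encoding, x.replace(encoding, '').replace('?', '').replace('=', '')))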
msg122774 - Author: Roy Hyunjin Han (invisibleroads) Date: 2010-11-29 05:59
Improved workaround to handle another degenerate case where the encoded string is in between non-encoded strings.

import re
import email.header

pattern_ecre = re.compile(r'((=\?.*?\?[qb]\?).*\?=)', re.VERBOSE | re.IGNORECASE | re.MULTILINE)

def decodeSafely(x):
    match = pattern_ecre.search(x)
    if not match:
        return x
    string, encoding = match.groups()
    stringBefore, string, stringAfter = x.partition(string)
    return stringBefore + email.header.decode_header('%s%s==?=' % (encoding, string.replace(encoding, '').replace('?', '').replace('=', '')))[0][0] + stringAfter

print decodeSafely('=?UTF-8?B?MjAxMSBBVVRNIENBTEwgZm9yIE5PTUlO?==?UTF-8?B?QVRJT05TIG9mIFZQIGZvciBNZW1iZXJz?==?UTF-8?B?aGlw?=')
print decodeSafely('"=?UTF-8?B?QVVUTSBIZWFkcXVhcnRlcnM=?="<info@autm.net>')
msg122776 - Author: Roy Hyunjin Han (invisibleroads) Date: 2010-11-29 06:45
The following code seems to solve the first case just as well. The problem appears to be missing whitespace between the encoded words.

email.header.decode_header('=?UTF-8?B?MjAxMSBBVVRNIENBTEwgZm9yIE5PTUlO?==?UTF-8?B?QVRJT05TIG9mIFZQIGZvciBNZW1iZXJz?==?UTF-8?B?aGlw?='.replace('?==?', '?= =?'))
msg122918 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2010-11-30 17:05
Note that none of your examples are valid encoded words, so given that email currently does strict parsing, the fact that it is not attempting to decode those words is technically correct.  

However, I agree that it would be better for it to do a "best guess" decoding of the invalid encoded words.

It should be possible to "fix" this case by simply replacing '?==?' with '?= =?' before decoding (blanks between encoded words are ignored when decoding, per the RFC, which the author of the package producing these invalid headers probably didn't realize).
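
The replacement described above can be sketched as follows. This is only an illustration, assuming Python 3; `decode_lenient` is a hypothetical helper name and not part of the email package. It inserts a blank between adjacent encoded words, calls `email.header.decode_header`, and joins the decoded pieces into one string.

```python
from email.header import decode_header


def decode_lenient(value):
    """Decode a header whose encoded words may be run together.

    Hypothetical helper for illustration; not part of the email package.
    """
    # Insert the blank that RFC 2047 expects between adjacent encoded
    # words; that blank is ignored when the decoded words are joined.
    spaced = value.replace('?==?', '?= =?')
    pieces = []
    for chunk, charset in decode_header(spaced):
        # On Python 3, decode_header yields bytes for decoded words and
        # may yield str when the header contains no encoded words at all.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or 'ascii', errors='replace')
        pieces.append(chunk)
    return ''.join(pieces)


# The subject from the original report, split across lines for readability.
subject = ('=?UTF-8?B?MjAxMSBBVVRNIENBTEwgZm9yIE5PTUlO?='
           '=?UTF-8?B?QVRJT05TIG9mIFZQIGZvciBNZW1iZXJz?='
           '=?UTF-8?B?aGlw?=')
print(decode_lenient(subject))
```

The sketch does not guard against unknown charset names, which would raise LookupError, and a plain string replace could in principle touch a '?==?' sequence that is not a word boundary, so treat it as a starting point rather than a robust parser.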

See also #1079 and #8132.

I have to think about whether or not all of these can be considered fixes (based on Postel's law) or if tolerant parsing should be considered a feature request.  I'll probably combine these into a single issue at some point.

Out of curiosity, which email program is it that is producing these invalid headers?
msg126838 - Author: Roy Hyunjin Han (invisibleroads) Date: 2011-01-22 14:31
2010/11/30 R. David Murray <report@bugs.python.org>:
> Out of curiosity, which email program is it that is producing these invalid headers?

I lost the headers for the original email, so I don't know which email
program created the invalid headers.

On searching for messages from the same address, it seems most of the
messages originate from a marketing company called informz.net, but in
rare instances there is a non-standard X-Mailer header:
- ColdFusion 8 Application Server (via JavaMail)
- IBM Lotus Domino Access for MS Outlook (2003) Release 7.0.2 September 26, 2006

Messages sent via informz.net generally parse correctly, so I am
guessing it might have been one of the X-Mailers above.
msg162237 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-06-03 22:11
This is fixed by the fix to issue 1079, but we have decided that fix can't be backported because it is a behavior change that might break existing working programs.
msg164640 - Author: Roy Hyunjin Han (invisibleroads) Date: 2012-07-04 10:11
> This is fixed by the fix to issue 1079, but we have decided that fix can't be backported because it is a behavior change that might break existing working programs.

Thanks for this update.
History
Date User Action Args
2012-07-04 10:11:38  invisibleroads  set  messages: + msg164640
2012-06-03 22:11:21  r.david.murray  set  status: open -> closed
    superseder: decode_header does not follow RFC 2047
    messages: + msg162237
    resolution: duplicate
    stage: resolved
2012-05-16 02:00:06  r.david.murray  set  assignee: r.david.murray ->
    nosy: + barry
    components: + email
    versions: - Python 3.1
2011-03-14 03:41:36  r.david.murray  set  versions: + Python 3.2, Python 3.3, - Python 2.6
2011-01-22 14:31:54  invisibleroads  set  messages: + msg126838
2010-11-30 17:05:53  r.david.murray  set  assignee: r.david.murray
    messages: + msg122918
    nosy: + r.david.murray
2010-11-29 06:45:58  invisibleroads  set  messages: + msg122776
2010-11-29 05:59:32  invisibleroads  set  messages: + msg122774
2010-11-29 05:12:24  invisibleroads  set  messages: + msg122773
2010-11-29 04:58:57  invisibleroads  create