classification
Title: The decode_header() function decodes raw part to bytes or str, depending on encoded part
Type: behavior
Stage: patch review
Components: email, Library (Lib)
Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Jelle Zijlstra, barry, dlenski, iritkatriel, py.user, r.david.murray
Priority: normal
Keywords: patch

Created on 2014-11-10 02:42 by py.user, last changed 2022-01-21 01:48 by Jelle Zijlstra.

Pull Requests
URL | Status | Linked
PR 30548 | open | dlenski, 2022-01-11 21:29
Messages (8)
msg230932 - (view) Author: py.user (py.user) * Date: 2014-11-10 02:42
What email.header.decode_header() returns depends on whether the header contains an encoded part.
If the header has both a raw part and an encoded part, the function returns (bytes, None) for the raw part. But if the header has only a raw part, the function returns (str, None) for it.

>>> import email.header
>>> 
>>> s = 'abc=?koi8-r?q?\xc1\xc2\xd7?='
>>> email.header.decode_header(s)
[(b'abc', None), (b'\xc1\xc2\xd7', 'koi8-r')]
>>> 
>>> s = 'abc'
>>> email.header.decode_header(s)
[('abc', None)]
>>>

The function should return (bytes, None) for the raw part in both cases.
msg230962 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-11-10 14:50
This is a duplicate of issue 6302.  Re-reading that issue (again), I'm not quite sure why we didn't fix it, but it may be too late to fix it now for backward compatibility reasons.

Since that issue strayed off into other topics, I'm going to leave this one open to consider whether or not we can/should fix this.  The new email API does avoid this problem, though.  Is there a reason you are choosing not to use the new API?
msg230976 - (view) Author: py.user (py.user) * Date: 2014-11-10 21:17
R. David Murray wrote:
"Is there a reason you are choosing not to use the new API?"

My program targets Python 3.x. I need to decode headers found in the wild into clean Unicode strings. Currently I do this with decode_header() plus a try...except for AttributeError, since a Unicode string has no .decode() method.

I don't know what the "new API" is, but I guess it's not compatible with Python 3.0.
msg230981 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-11-10 22:59
Certainly not with 3.0, but nobody in their right mind should be using that version any more :).

The new API for decoding headers is available as of Python 3.3, with additional new API features in 3.4.  See https://docs.python.org/3/library/email-examples.html#examples-using-the-provisional-api for an example.

Note that although the API is 'provisional', I anticipate no non-trivial changes when it becomes final in 3.5.  (The only API change so far was made in such a way that you get warnings if you use it "wrong" in 3.4, and it is in a relatively obscure method, is_attachment.)
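
As a rough illustration of the newer API mentioned above, here is a minimal sketch, assuming the message is parsed with email.policy.default. The sample message text and variable names are made up for this example. Under that policy, looking up a header returns a str subclass with any RFC 2047 encoded words already decoded, so the caller never has to branch on bytes versus str.

```python
import email
from email import policy

# Hypothetical sample message: the Subject mixes a raw part ("abc") with a
# koi8-r encoded word, similar in spirit to the header in the original report.
raw_message = "Subject: abc =?koi8-r?q?=C1=C2=D7?=\n\nbody\n"

# policy.default selects the newer header machinery instead of compat32.
msg = email.message_from_string(raw_message, policy=policy.default)
subject = msg["Subject"]

print(isinstance(subject, str))
# ascii() keeps the output printable even on a non-UTF-8 terminal.
print(ascii(str(subject)))
```

With the legacy compat32 policy (the default for message_from_string), the same lookup would return the header text with the encoded word still undecoded, which is where decode_header() normally comes in.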
msg408376 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-12-12 11:30
Reproduced on 3.11.
msg409391 - (view) Author: Daniel Lenski (dlenski) * Date: 2021-12-30 22:30
I recently ran into this bug as well.

For those looking for a reliable workaround, here's an implementation of a 'decode_header_to_string' function which should Just Work™ in all possible cases:

    #!/usr/bin/python3
    import email.header

    # Workaround for https://bugs.python.org/issue22833
    def decode_header_to_string(header):
        '''Decodes an email message header (possibly RFC2047-encoded)
        into a string, while working around https://bugs.python.org/issue22833'''

        return ''.join(
            alleged_string if isinstance(alleged_string, str) else alleged_string.decode(
                alleged_charset or 'ascii')
            for alleged_string, alleged_charset in email.header.decode_header(header))


    for header in ('=?utf-8?B?ZsOzbw==',
                   '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
                   'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
                   'plain string',
                   'foo=?blah?Q??=',):
        print("Header value: %r" % header)
        print("email.header.decode_header(...) -> %r" % email.header.decode_header(header))
        print("decode_header_to_string(...)    -> %r" % decode_header_to_string(header))
        print("-------")

Outputs:

    Header value: '=?utf-8?B?ZsOzbw=='
    email.header.decode_header(...) -> [('=?utf-8?B?ZsOzbw==', None)]
    decode_header_to_string(...)    -> '=?utf-8?B?ZsOzbw=='
    -------
    Header value: '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='
    email.header.decode_header(...) -> [(b'hello', 'ascii'), (b'f\xc3\xb3o', 'utf-8')]
    decode_header_to_string(...)    -> 'hellofóo'
    -------
    Header value: 'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='
    email.header.decode_header(...) -> [(b'bar', None), (b'hello', 'ascii'), (b'f\xc3\xb3o', 'utf-8')]
    decode_header_to_string(...)    -> 'barhellofóo'
    -------
    Header value: 'plain string'
    email.header.decode_header(...) -> [('plain string', None)]
    decode_header_to_string(...)    -> 'plain string'
    -------
    Header value: 'foo=?blah?Q??='
    email.header.decode_header(...) -> [(b'foo', None), (b'', 'blah')]
    decode_header_to_string(...)    -> 'foo'
    -------
msg409392 - (view) Author: Daniel Lenski (dlenski) * Date: 2021-12-30 23:08
Due to this bug, any user of this function in Python 3.0+ *already* has to be able to handle all of the following outputs in order to use it reliably:

   decode_header(...) -> [(str, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]

== Fix str/bytes inconsistency ==

We could eliminate the inconsistency, and make the function only ever return bytes instead of str, with the following changes to https://github.com/python/cpython/blob/3.10/Lib/email/header.py.

```
diff --git a/Lib/email/header.py.orig b/Lib/email/header.py
index 4ab0032..41e91f2 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -61,7 +61,7 @@ _max_append = email.quoprimime._max_append
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
+    Returns a list of (bytes, charset) pairs containing each of the decoded
     parts of the header.  Charset is None for non-encoded parts of the header,
     otherwise a lower-case string containing the name of the character set
     specified in the encoded string.
@@ -78,7 +78,7 @@ def decode_header(header):
                     for string, charset in header._chunks]
     # If no encoding, just return the header with no charset.
     if not ecre.search(header):
-        return [(header, None)]
+        return [(header.encode(), None)]
     # First step is to parse all the encoded parts into triplets of the form
     # (encoded_string, encoding, charset).  For unencoded strings, the last
     # two parts will be None.
```

With these changes, decode_header() would return one of the following:

   decode_header(...) -> [(bytes, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]


== Ensure that charset is always str, never None ==

A couple more small changes:

```
@@ -92,7 +92,7 @@ def decode_header(header):
                 unencoded = unencoded.lstrip()
                 first = False
             if unencoded:
-                words.append((unencoded, None, None))
+                words.append((unencoded, None, 'ascii'))
             if parts:
                 charset = parts.pop(0).lower()
                 encoding = parts.pop(0).lower()
@@ -133,7 +133,8 @@ def decode_header(header):
     # Now convert all words to bytes and collapse consecutive runs of
     # similarly encoded words.
     collapsed = []
-    last_word = last_charset = None
+    last_word = None
+    last_charset = 'ascii'
     for word, charset in decoded_words:
         if isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
```

With these changes, decode_header() would return only:

   decode_header(...) -> List[(bytes, str)]
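
The return shape proposed above can be approximated outside the library without patching anything. The following is a minimal sketch, assuming a hypothetical wrapper named decode_header_as_bytes (it is not part of the email package). It encodes a str part as UTF-8, mirroring header.encode() in the diff, and substitutes 'ascii' for a None charset. Unlike the patched function, it does not re-collapse adjacent parts that end up with the same charset.

```python
import email.header


def decode_header_as_bytes(header):
    """Hypothetical wrapper: always return a list of (bytes, str) pairs."""
    pairs = []
    for part, charset in email.header.decode_header(header):
        if isinstance(part, str):
            # decode_header() returns str only when the header contains
            # no encoded words at all; encode it as UTF-8 (an assumption
            # that mirrors header.encode() in the proposed diff).
            part = part.encode()
        # Replace a missing charset with 'ascii', as the second diff does.
        pairs.append((part, charset or 'ascii'))
    return pairs


print(decode_header_as_bytes('plain string'))
print(decode_header_as_bytes('bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='))
```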
msg411069 - (view) Author: Jelle Zijlstra (Jelle Zijlstra) * (Python triager) Date: 2022-01-21 01:48
This behavior is definitely unfortunate, but by now it's also been baked into more than a decade of Python 3 releases, so backward compatibility constraints make it difficult to fix.

How can we be sure this change won't break users' code?

For reference, here are a few uses of the function I found in major open-source packages:
- https://github.com/httplib2/httplib2/blob/cde9e87d8b2c4c5fc966431965998ed5f45d19c7/python3/httplib2/__init__.py#L1608 - this assumes it only ever hits the (bytes, encoding) case.
- https://github.com/cherrypy/cherrypy/blob/98929b519fbca003cbf7b14a6b370a3cabc9c412/cherrypy/lib/httputil.py#L258 - this assumes it only gets (str, None) or (bytes, encoding) pairs, which seems unsafe. But if it currently sees (str, None) and would see (bytes, None) with this change, it would break.

An alternative solution could be a new function with a sane return type.

Even if we decide to not change anything, we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html.
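
To make the compatibility concern concrete, here is a minimal sketch of a caller that handles only (str, None) and (bytes, encoding) pairs. The function naive_decode is a simplified, hypothetical stand-in written for this example, not code copied from either project linked above. It works today for plain and fully encoded headers, already fails on a header that mixes a raw part with an encoded part, and would also fail on plain headers if the raw part became bytes.

```python
import email.header


def naive_decode(value):
    """Hypothetical caller: decode only the parts that carry a charset."""
    parts = []
    for atom, charset in email.header.decode_header(value):
        if charset is not None:
            atom = atom.decode(charset)
        # A (bytes, None) part slips through here undecoded.
        parts.append(atom)
    return ''.join(parts)


# (str, None): works with the current behavior.
print(naive_decode('plain string'))

# (bytes, 'utf-8'): works; ascii() keeps the output terminal-safe.
print(ascii(naive_decode('=?utf-8?B?ZsOzbw==?=')))

# (bytes, None) followed by (bytes, 'ascii'): join() sees bytes and str.
try:
    naive_decode('bar=?ascii?Q?hello?=')
except TypeError as exc:
    print('TypeError:', exc)
```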
History
Date | User | Action | Args
2022-01-21 01:48:16 | Jelle Zijlstra | set | nosy: + Jelle Zijlstra; messages: + msg411069
2022-01-11 21:29:49 | dlenski | set | keywords: + patch; stage: patch review; pull_requests: + pull_request28748
2021-12-30 23:08:21 | dlenski | set | messages: + msg409392
2021-12-30 22:30:20 | dlenski | set | nosy: + dlenski; messages: + msg409391
2021-12-12 11:30:42 | iritkatriel | set | nosy: + iritkatriel; messages: + msg408376; versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.4, Python 3.5
2014-11-10 22:59:23 | r.david.murray | set | messages: + msg230981
2014-11-10 21:17:38 | py.user | set | messages: + msg230976
2014-11-10 14:50:46 | r.david.murray | set | messages: + msg230962; versions: - Python 3.3
2014-11-10 02:42:56 | py.user | set | type: behavior
2014-11-10 02:42:40 | py.user | create |