Author dlenski
Recipients barry, dlenski, iritkatriel, py.user, r.david.murray
Date 2021-12-30.23:08:21
Message-id <1640905701.53.0.857462247688.issue22833@roundup.psfhosted.org>
In-reply-to
Content
Due to this bug, any user of this function in Python 3.0+ *already* has to be able to handle all of the following outputs in order to use it reliably:

   decode_header(...) -> [(str, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]
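In practice, a caller who just wants text ends up writing something like the sketch below to cope with all three shapes. `header_to_text` is a hypothetical helper name, not part of the email package, and falling back to ASCII with `errors="replace"` when the charset is None is an assumption on my part.

```python
from email.header import decode_header


def header_to_text(header):
    """Hypothetical helper: join decode_header() output into a single str."""
    chunks = []
    for word, charset in decode_header(header):
        if isinstance(word, bytes):
            # Encoded words, and unencoded words in a mixed header, arrive
            # as bytes. Assumption: a charset of None means plain ASCII.
            word = word.decode(charset or "ascii", errors="replace")
        # A header with no encoded words arrives as (str, None), so the
        # word is already text and needs no decoding.
        chunks.append(word)
    return "".join(chunks)
```

An unknown charset name would still raise LookupError here; a real caller would have to decide how to handle that too.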

== Fix str/bytes inconsistency ==

We could eliminate the inconsistency, and make the function only ever return bytes instead of str, with the following changes to https://github.com/python/cpython/blob/3.10/Lib/email/header.py.

```
diff --git a/Lib/email/header.py.orig b/Lib/email/header.py
index 4ab0032..41e91f2 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -61,7 +61,7 @@ _max_append = email.quoprimime._max_append
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
+    Returns a list of (bytes, charset) pairs containing each of the decoded
     parts of the header.  Charset is None for non-encoded parts of the header,
     otherwise a lower-case string containing the name of the character set
     specified in the encoded string.
@@ -78,7 +78,7 @@ def decode_header(header):
                     for string, charset in header._chunks]
     # If no encoding, just return the header with no charset.
     if not ecre.search(header):
-        return [(header, None)]
+        return [(header.encode(), None)]
     # First step is to parse all the encoded parts into triplets of the form
     # (encoded_string, encoding, charset).  For unencoded strings, the last
     # two parts will be None.
```

With these changes, decode_header() would return one of the following:

   decode_header(...) -> [(bytes, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]


== Ensure that charset is always str, never None ==

A couple more small changes:

```
@@ -92,7 +92,7 @@ def decode_header(header):
                 unencoded = unencoded.lstrip()
                 first = False
             if unencoded:
-                words.append((unencoded, None, None))
+                words.append((unencoded, None, 'ascii'))
             if parts:
                 charset = parts.pop(0).lower()
                 encoding = parts.pop(0).lower()
@@ -133,7 +133,8 @@ def decode_header(header):
     # Now convert all words to bytes and collapse consecutive runs of
     # similarly encoded words.
     collapsed = []
-    last_word = last_charset = None
+    last_word = None
+    last_charset = 'ascii'
     for word, charset in decoded_words:
         if isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
```

With these changes, decode_header() would return only:

   decode_header(...) -> List[(bytes, str)]
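
For comparison, the same contract can be approximated today without patching header.py. This is only a sketch: `decode_header_uniform` is a hypothetical wrapper name, and it assumes the early-return path for headers with no encoded words would report 'ascii' as well, which the hunks above do not change by themselves.

```python
from email.header import decode_header


def decode_header_uniform(header):
    """Hypothetical wrapper: always return a list of (bytes, str) pairs."""
    pairs = []
    for word, charset in decode_header(header):
        if isinstance(word, str):
            # Only a header with no encoded words yields str today;
            # encode it the way the first diff does (header.encode()).
            word = word.encode()
        # Assumption: report unencoded parts as 'ascii' instead of None.
        pairs.append((word, charset if charset is not None else "ascii"))
    return pairs
```

With this wrapper, a caller can run `word.decode(charset)` on every pair without checking types first.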