Message 409392 - Python tracker

Author	dlenski
Recipients	barry, dlenski, iritkatriel, py.user, r.david.murray
Date	2021-12-30.23:08:21
Message-id	<1640905701.53.0.857462247688.issue22833@roundup.psfhosted.org>

Content

Due to this bug, any user of this function in Python 3.0+ *already* has to be able to handle all of the following outputs in order to use it reliably:

   decode_header(...) -> [(str, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]
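
The three shapes can be observed directly. This is a small sketch rather than part of the proposed patch: the sample header values and the `shapes` dictionary are illustrative names chosen here, and the result reflects CPython versions that still have the behaviour described in this issue.

```python
from email.header import decode_header

# Illustrative header values (not taken from the issue).
samples = {
    "plain": "Hello world",                 # no encoded-word at all
    "encoded": "=?utf-8?q?h=C3=A9llo?=",    # a single encoded-word
    "mixed": "Re: =?utf-8?q?h=C3=A9llo?=",  # unencoded text plus an encoded-word
}

# Record the Python type of each decoded word next to its charset.
shapes = {}
for name, value in samples.items():
    parts = decode_header(value)
    shapes[name] = [(type(word).__name__, charset) for word, charset in parts]
    print(name, shapes[name])
```

The "plain" case is the odd one out: it is the only one where the word comes back as str rather than bytes.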

== Fix str/bytes inconsistency ==

We could eliminate the inconsistency, and make the function only ever return bytes instead of str, with the following changes to https://github.com/python/cpython/blob/3.10/Lib/email/header.py.

```
diff --git a/Lib/email/header.py.orig b/Lib/email/header.py
index 4ab0032..41e91f2 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -61,7 +61,7 @@ _max_append = email.quoprimime._max_append
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
+    Returns a list of (bytes, charset) pairs containing each of the decoded
     parts of the header.  Charset is None for non-encoded parts of the header,
     otherwise a lower-case string containing the name of the character set
     specified in the encoded string.
@@ -78,7 +78,7 @@ def decode_header(header):
                     for string, charset in header._chunks]
     # If no encoding, just return the header with no charset.
     if not ecre.search(header):
-        return [(header, None)]
+        return [(header.encode(), None)]
     # First step is to parse all the encoded parts into triplets of the form
     # (encoded_string, encoding, charset).  For unencoded strings, the last
     # two parts will be None.
```

With these changes, decode_header() would return one of the following:

   decode_header(...) -> [(bytes, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]


== Ensure that charset is always str, never None ==

Two more small changes replace the None charset on unencoded words with 'ascii':

```
@@ -92,7 +92,7 @@ def decode_header(header):
                 unencoded = unencoded.lstrip()
                 first = False
             if unencoded:
-                words.append((unencoded, None, None))
+                words.append((unencoded, None, 'ascii'))
             if parts:
                 charset = parts.pop(0).lower()
                 encoding = parts.pop(0).lower()
@@ -133,7 +133,8 @@ def decode_header(header):
     # Now convert all words to bytes and collapse consecutive runs of
     # similarly encoded words.
     collapsed = []
-    last_word = last_charset = None
+    last_word = None
+    last_charset = 'ascii'
     for word, charset in decoded_words:
         if isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
```

With these changes, decode_header() would return only:

   decode_header(...) -> List[(bytes, str)]
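
Until a change like this lands, calling code can approximate the proposed return type with a small wrapper. The sketch below is not part of the email package: `decode_header_bytes` and its `default_charset` parameter are hypothetical names, and using 'raw-unicode-escape' for the str case is an assumption that mirrors the conversion visible in the diff context above (the patch itself uses `header.encode()` for the unencoded-only case).

```python
from email.header import decode_header


def decode_header_bytes(header, default_charset="ascii"):
    """Hypothetical wrapper: always return a list of (bytes, str) pairs."""
    normalized = []
    for word, charset in decode_header(header):
        if isinstance(word, str):
            # Header without any encoded-word: decode_header() returned
            # the str unchanged, so convert it to bytes here.
            word = word.encode("raw-unicode-escape")
        # Unencoded parts carry charset None; substitute the default.
        normalized.append((word, charset or default_charset))
    return normalized


for value in ("Hello world", "Re: =?utf-8?q?h=C3=A9llo?="):
    print(decode_header_bytes(value))
```

With a uniform (bytes, str) result, callers no longer need an isinstance check or a None check before decoding each part.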
