Issue 22833: The decode_header() function decodes raw part to bytes or str, depending on encoded part


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67022

classification

Title:	The decode_header() function decodes raw part to bytes or str, depending on encoded part
Type:	behavior	Stage:	patch review
Components:	email, Library (Lib)	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	JelleZijlstra, barry, dlenski, iritkatriel, py.user, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2014-11-10 02:42 by py.user, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 30548	open	dlenski, 2022-01-11 21:29

Messages (8)
msg230932 - (view)	Author: py.user (py.user) *	Date: 2014-11-10 02:42
What email.header.decode_header() returns depends on whether the header contains an encoded part. If the header has both a raw part and an encoded part, the function returns (bytes, None) for the raw part. But if the header has only a raw part, the function returns (str, None) for it.

>>> import email.header
>>>
>>> s = 'abc=?koi8-r?q?\xc1\xc2\xd7?='
>>> email.header.decode_header(s)
[(b'abc', None), (b'\xc1\xc2\xd7', 'koi8-r')]
>>>
>>> s = 'abc'
>>> email.header.decode_header(s)
[('abc', None)]
>>>

There should be (bytes, None) for both cases.
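The inconsistency can be checked with a short script. This is a minimal sketch that assumes the behavior reported in this issue still applies to the interpreter in use; the ASCII-only header value and the variable names are illustrative and not taken from the report.

```python
import email.header

# Raw text followed directly by an RFC 2047 encoded word.
mixed = 'abc=?utf-8?q?def?='
# Raw text only, no encoded word.
raw_only = 'abc'

mixed_parts = email.header.decode_header(mixed)
raw_parts = email.header.decode_header(raw_only)

# The raw part 'abc' has a different type in each result.
print(type(mixed_parts[0][0]).__name__, mixed_parts)
print(type(raw_parts[0][0]).__name__, raw_parts)
```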
msg230962 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-11-10 14:50
This is a duplicate of issue 6302. Re-reading that issue (again), I'm not quite sure why we didn't fix it, but it may be too late to fix it now for backward compatibility reasons. Since that issue strayed off into other topics, I'm going to leave this one open to consider whether or not we can/should fix this. The new email API does avoid this problem, though. Is there a reason you are choosing not to use the new API?
msg230976 - (view)	Author: py.user (py.user) *	Date: 2014-11-10 21:17
R. David Murray wrote: "Is there a reason you are choosing not to use the new API?" My program is for Python 3.x. I need to decode wild headers into pretty unicode strings. Currently I do it with decode_header() and a try...except for AttributeError, since a unicode string has no .decode() method. I don't know what the "new API" is, but I guess it's not compatible with Python 3.0.
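The try...except approach described here might look roughly like the following. This is only a sketch: the helper name decode_to_text is hypothetical, and falling back to 'ascii' when the charset is None is an assumption, not something stated in the message.

```python
import email.header


def decode_to_text(header):
    """Hypothetical helper: join all decoded parts into a single str."""
    pieces = []
    for value, charset in email.header.decode_header(header):
        try:
            # bytes parts have .decode(); charset is None for raw parts,
            # so 'ascii' is assumed as the fallback.
            pieces.append(value.decode(charset or 'ascii'))
        except AttributeError:
            # A header without encoded words comes back as str already.
            pieces.append(value)
    return ''.join(pieces)


print(decode_to_text('plain string'))
print(decode_to_text('bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='))
```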
msg230981 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-11-10 22:59
Certainly not with 3.0, but nobody in their right mind should be using that version any more :). The new API for decoding headers is available as of Python 3.3, with additional new API features in 3.4. See https://docs.python.org/3/library/email-examples.html#examples-using-the-provisional-api for an example. Note that although the API is 'provisional', I anticipate no non-trivial changes when it becomes final in 3.5. (The only API change so far was made in such a way that you get warnings if you use it "wrong" in 3.4, and it affects a relatively obscure method, is_attachment.)
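For readers unfamiliar with the newer API, a minimal sketch of header decoding through email.policy.default follows. The sample message text is made up for illustration, and a current Python 3 is assumed.

```python
import email
from email import policy

# A tiny made-up message whose Subject is an RFC 2047 encoded word.
raw = "Subject: =?utf-8?B?ZsOzbw==?=\n\nbody\n"

# With policy.default the parser returns header objects that are already
# decoded to str, so decode_header() is not needed at all.
msg = email.message_from_string(raw, policy=policy.default)
subject = str(msg['Subject'])

print(repr(subject))  # -> 'fóo'
```

The header value is a str subclass, so callers never see the bytes/str split that decode_header() exposes.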
msg408376 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-12-12 11:30
Reproduced on 3.11.
msg409391 - (view)	Author: Daniel Lenski (dlenski) *	Date: 2021-12-30 22:30
I recently ran into this bug as well. For those looking for a reliable workaround, here's an implementation of a 'decode_header_to_string' function which should Just Work™ in all possible cases:

    #!/usr/bin/python3
    import email.header

    # Workaround for https://bugs.python.org/issue22833
    def decode_header_to_string(header):
        '''Decodes an email message header (possibly RFC2047-encoded)
        into a string, while working around https://bugs.python.org/issue22833'''

        return ''.join(
            alleged_string if isinstance(alleged_string, str) else alleged_string.decode(
                alleged_charset or 'ascii')
            for alleged_string, alleged_charset in email.header.decode_header(header))

    for header in ('=?utf-8?B?ZsOzbw==', '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=', 'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=', 'plain string',):
        print("Header value: %r" % header)
        print("email.header.decode_header(...) -> %r" % email.header.decode_header(header))
        print("decode_header_to_string(...) -> %r" % decode_header_to_string(header))
        print("-------")

Outputs:

    Header value: '=?utf-8?B?ZsOzbw=='
    email.header.decode_header(...) -> [('=?utf-8?B?ZsOzbw==', None)]
    decode_header_to_string(...) -> '=?utf-8?B?ZsOzbw=='
    -------
    Header value: '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='
    email.header.decode_header(...) -> [(b'hello', 'ascii'), (b'f\xc3\xb3o', 'utf-8')]
    decode_header_to_string(...) -> 'hellofóo'
    -------
    Header value: 'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='
    email.header.decode_header(...) -> [(b'bar', None), (b'hello', 'ascii'), (b'f\xc3\xb3o', 'utf-8')]
    decode_header_to_string(...) -> 'barhellofóo'
    -------
    Header value: 'plain string'
    email.header.decode_header(...) -> [('plain string', None)]
    decode_header_to_string(...) -> 'plain string'
    -------
    Header value: 'foo=?blah?Q??='
    email.header.decode_header(...) -> [(b'foo', None), (b'', 'blah')]
    decode_header_to_string(...) -> 'foo'
    -------
msg409392 - (view)	Author: Daniel Lenski (dlenski) *	Date: 2021-12-30 23:08
Due to this bug, any user of this function in Python 3.0+ already has to be able to handle all of the following outputs in order to use it reliably:

    decode_header(...) -> [(str, None)]
or
    decode_header(...) -> [(bytes, str)]
or
    decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]

== Fix str/bytes inconsistency ==

We could eliminate the inconsistency, and make the function only ever return bytes instead of str, with the following changes to https://github.com/python/cpython/blob/3.10/Lib/email/header.py.

```
diff --git a/Lib/email/header.py.orig b/Lib/email/header.py
index 4ab0032..41e91f2 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -61,7 +61,7 @@ _max_append = email.quoprimime._max_append
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
+    Returns a list of (bytes, charset) pairs containing each of the decoded
     parts of the header.  Charset is None for non-encoded parts of the header,
     otherwise a lower-case string containing the name of the character set
     specified in the encoded string.
@@ -78,7 +78,7 @@ def decode_header(header):
                     for string, charset in header._chunks]
     # If no encoding, just return the header with no charset.
     if not ecre.search(header):
-        return [(header, None)]
+        return [(header.encode(), None)]
     # First step is to parse all the encoded parts into triplets of the form
     # (encoded_string, encoding, charset).  For unencoded strings, the last
     # two parts will be None.
```

With these changes, decode_header() would return one of the following:

    decode_header(...) -> [(bytes, None)]
or
    decode_header(...) -> [(bytes, str)]
or
    decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]

== Ensure that charset is always str, never None ==

A couple more small changes:

```
@@ -92,7 +92,7 @@ def decode_header(header):
                 unencoded = unencoded.lstrip()
                 first = False
             if unencoded:
-                words.append((unencoded, None, None))
+                words.append((unencoded, None, 'ascii'))
             if parts:
                 charset = parts.pop(0).lower()
                 encoding = parts.pop(0).lower()
@@ -133,7 +133,8 @@ def decode_header(header):
     # Now convert all words to bytes and collapse consecutive runs of
     # similarly encoded words.
     collapsed = []
-    last_word = last_charset = None
+    last_word = None
+    last_charset = 'ascii'
     for word, charset in decoded_words:
         if isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
```

With these changes, decode_header() would return only:

    decode_header(...) -> List[(bytes, str)]
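The proposed List[(bytes, str)] shape can be approximated today without patching the standard library. The wrapper below is a rough sketch, not the patch itself: the name decode_header_bytes is hypothetical, it encodes str parts with 'raw-unicode-escape' (an assumption, mirroring the conversion visible in the diff context), and it does not merge adjacent 'ascii' parts the way the patched collapsing loop presumably would.

```python
import email.header


def decode_header_bytes(header):
    """Hypothetical wrapper: always return a list of (bytes, str) pairs."""
    normalized = []
    for value, charset in email.header.decode_header(header):
        if isinstance(value, str):
            # Headers without encoded words currently come back as str.
            value = value.encode('raw-unicode-escape')
        # Report 'ascii' instead of None for unencoded parts.
        normalized.append((value, charset or 'ascii'))
    return normalized


print(decode_header_bytes('abc'))
print(decode_header_bytes('bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='))
```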
msg411069 - (view)	Author: Jelle Zijlstra (JelleZijlstra) *	Date: 2022-01-21 01:48
This behavior is definitely unfortunate, but by now it's also been baked into more than a decade of Python 3 releases, so backward compatibility constraints make it difficult to fix. How can we be sure this change won't break users' code?

For reference, here are a few uses of the function I found in major open-source packages:

- https://github.com/httplib2/httplib2/blob/cde9e87d8b2c4c5fc966431965998ed5f45d19c7/python3/httplib2/__init__.py#L1608 - this assumes it only ever hits the (bytes, encoding) case.
- https://github.com/cherrypy/cherrypy/blob/98929b519fbca003cbf7b14a6b370a3cabc9c412/cherrypy/lib/httputil.py#L258 - this assumes it only gets (str, None) or (bytes, encoding) pairs, which seems unsafe. But if it currently sees (str, None) and would see (bytes, None) with this change, it would break.

An alternative solution could be a new function with a sane return type. Even if we decide to not change anything, we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html.
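The compatibility risk can be illustrated with a made-up caller. The function join_parts below is hypothetical and is not the code from either project linked above; it only models a caller that decodes a part when a charset is present, and the two input lists are written out by hand rather than produced by decode_header().

```python
def join_parts(parts):
    """Hypothetical caller: decodes a part only when a charset is given."""
    out = []
    for value, charset in parts:
        if charset is not None:
            value = value.decode(charset)
        out.append(value)
    return ''.join(out)


current = [('abc', None)]     # shape returned today for a raw-only header
proposed = [(b'abc', None)]   # shape the first patch would return instead

print(join_parts(current))
try:
    join_parts(proposed)
except TypeError as exc:
    # str.join() rejects the undecoded bytes part.
    print("breaks:", exc)
```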

History
Date	User	Action	Args
2022-04-11 14:58:10	admin	set	github: 67022
2022-01-21 01:48:16	JelleZijlstra	set	nosy: + JelleZijlstra messages: + msg411069
2022-01-11 21:29:49	dlenski	set	keywords: + patch stage: patch review pull_requests: + pull_request28748
2021-12-30 23:08:21	dlenski	set	messages: + msg409392
2021-12-30 22:30:20	dlenski	set	nosy: + dlenski messages: + msg409391
2021-12-12 11:30:42	iritkatriel	set	nosy: + iritkatriel messages: + msg408376 versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.4, Python 3.5
2014-11-10 22:59:23	r.david.murray	set	messages: + msg230981
2014-11-10 21:17:38	py.user	set	messages: + msg230976
2014-11-10 14:50:46	r.david.murray	set	messages: + msg230962 versions: - Python 3.3
2014-11-10 02:42:56	py.user	set	type: behavior
2014-11-10 02:42:40	py.user	create