Issue22833
Created on 2014-11-10 02:42 by py.user, last changed 2022-04-11 14:58 by admin.
Pull Requests | | |
---|---|---|
URL | Status | Linked |
PR 30548 | open | dlenski, 2022-01-11 21:29 |
Messages (8) | |||
---|---|---|---|
msg230932 | Author: py.user (py.user) | Date: 2014-11-10 02:42 | |
What email.header.decode_header() returns depends on whether the header contains an encoded part. If the header has both a raw part and an encoded part, the function returns (bytes, None) for the raw part. But if the header has only a raw part, the function returns (str, None) for it.

>>> import email.header
>>>
>>> s = 'abc=?koi8-r?q?\xc1\xc2\xd7?='
>>> email.header.decode_header(s)
[(b'abc', None), (b'\xc1\xc2\xd7', 'koi8-r')]
>>>
>>> s = 'abc'
>>> email.header.decode_header(s)
[('abc', None)]
>>>

There should be (bytes, None) for both cases. |
|||
msg230962 | Author: R. David Murray (r.david.murray) | Date: 2014-11-10 14:50 | |
This is a duplicate of issue 6302. Re-reading that issue (again), I'm not quite sure why we didn't fix it, but it may be too late to fix it now for backward compatibility reasons. Since that issue strayed off into other topics, I'm going to leave this one open to consider whether or not we can/should fix this. The new email API does avoid this problem, though. Is there a reason you are choosing not to use the new API? |
|||
msg230976 | Author: py.user (py.user) | Date: 2014-11-10 21:17 | |
R. David Murray wrote: "Is there a reason you are choosing not to use the new API?"

My program is for Python 3.x. I need to decode wild headers to pretty unicode strings. Currently I do it with decode_header() and a try...except for AttributeError, since a unicode string has no .decode() method. I don't know what the "new API" is, but I guess it's not compatible with Python 3.0. |
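For readers following along, the try...except approach described in this message might look roughly like the sketch below. The function name `header_to_text` is hypothetical, and falling back to 'ascii' when the charset is None is an assumption, not something stated in the message.

```python
import email.header


def header_to_text(raw_header):
    """Hypothetical helper: join the parts of a header into one str."""
    pieces = []
    for value, charset in email.header.decode_header(raw_header):
        try:
            # Usual case: value is bytes and can be decoded.
            # Assumption: treat a missing charset as ASCII.
            pieces.append(value.decode(charset or "ascii"))
        except AttributeError:
            # decode_header() returned str (header with no encoded part),
            # and str has no .decode() method.
            pieces.append(value)
    return "".join(pieces)
```

This sketch does not handle unknown charset names or undecodable bytes; those would still raise LookupError or UnicodeDecodeError.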
|||
msg230981 | Author: R. David Murray (r.david.murray) | Date: 2014-11-10 22:59 | |
Certainly not with 3.0, but nobody in their right mind should be using that version any more :). The new API for decoding headers is available as of Python 3.3, with additional new API features in 3.4. See https://docs.python.org/3/library/email-examples.html#examples-using-the-provisional-api for an example. Note that although the API is 'provisional', I anticipate no non-trivial changes when it becomes final in 3.5. (The only API change that has happened has been done such that you get warnings if you use it "wrong" in 3.4, and is in a relatively obscure method (is_attachment).) |
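As a rough illustration of the new API mentioned here, the sketch below parses a message with `email.policy.default`, under which header values come back as already-decoded strings. The sample message text is made up for the example and is not taken from this issue.

```python
import email
from email import policy

# Made-up sample message: a Subject with a plain part and an
# RFC 2047 encoded word, then a blank line and a body.
raw_message = "Subject: abc =?utf-8?B?ZsOzbw==?=\n\nbody\n"

# With policy.default the parser returns header values as str
# (a str subclass), so no bytes/str juggling is needed.
msg = email.message_from_string(raw_message, policy=policy.default)
subject = msg["Subject"]

print(type(subject).__mro__)  # str appears among the base classes
print(str(subject))
```

Unlike decode_header(), this path never exposes (bytes, charset) pairs to the caller, which is why it sidesteps the inconsistency reported in this issue.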
|||
msg408376 | Author: Irit Katriel (iritkatriel) | Date: 2021-12-12 11:30 | |
Reproduced on 3.11. |
|||
msg409391 | Author: Daniel Lenski (dlenski) | Date: 2021-12-30 22:30 | |
I recently ran into this bug as well.

For those looking for a reliable workaround, here's an implementation of a 'decode_header_to_string' function which should Just Work™ in all possible cases:

#!/usr/bin/python3
import email.header

# Workaround for https://bugs.python.org/issue22833
def decode_header_to_string(header):
    '''Decodes an email message header (possibly RFC2047-encoded)
    into a string, while working around https://bugs.python.org/issue22833'''

    return ''.join(
        alleged_string if isinstance(alleged_string, str) else alleged_string.decode(
            alleged_charset or 'ascii')
        for alleged_string, alleged_charset in email.header.decode_header(header))

for header in ('=?utf-8?B?ZsOzbw==',
               '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
               'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
               'plain string',
               'foo=?blah?Q??=',):
    print("Header value: %r" % header)
    print("email.header.decode_header(...) -> %r" % email.header.decode_header(header))
    print("decode_header_to_string(...) -> %r" % decode_header_to_string(header))
    print("-------")

Outputs:

Header value: '=?utf-8?B?ZsOzbw=='
email.header.decode_header(...) -> [('=?utf-8?B?ZsOzbw==', None)]
decode_header_to_string(...) -> '=?utf-8?B?ZsOzbw=='
-------
Header value: '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='
email.header.decode_header(...) -> [(b'hello', 'ascii'), (b'f\xc3\xb3o', 'utf-8')]
decode_header_to_string(...) -> 'hellofóo'
-------
Header value: 'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='
email.header.decode_header(...) -> [(b'bar', None), (b'hello', 'ascii'), (b'f\xc3\xb3o', 'utf-8')]
decode_header_to_string(...) -> 'barhellofóo'
-------
Header value: 'plain string'
email.header.decode_header(...) -> [('plain string', None)]
decode_header_to_string(...) -> 'plain string'
-------
Header value: 'foo=?blah?Q??='
email.header.decode_header(...) -> [(b'foo', None), (b'', 'blah')]
decode_header_to_string(...) -> 'foo'
------- |
|||
msg409392 | Author: Daniel Lenski (dlenski) | Date: 2021-12-30 23:08 | |
Due to this bug, any user of this function in Python 3.0+ *already* has to be able to handle all of the following outputs in order to use it reliably:

decode_header(...) -> [(str, None)]
or
decode_header(...) -> [(bytes, str)]
or
decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]

== Fix str/bytes inconsistency ==

We could eliminate the inconsistency, and make the function only ever return bytes instead of str, with the following changes to https://github.com/python/cpython/blob/3.10/Lib/email/header.py.

```
diff --git a/Lib/email/header.py.orig b/Lib/email/header.py
index 4ab0032..41e91f2 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -61,7 +61,7 @@ _max_append = email.quoprimime._max_append
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
+    Returns a list of (bytes, charset) pairs containing each of the decoded
     parts of the header. Charset is None for non-encoded parts of the header,
     otherwise a lower-case string containing the name of the character set
     specified in the encoded string.
@@ -78,7 +78,7 @@ def decode_header(header):
                     for string, charset in header._chunks]
     # If no encoding, just return the header with no charset.
     if not ecre.search(header):
-        return [(header, None)]
+        return [(header.encode(), None)]
     # First step is to parse all the encoded parts into triplets of the form
     # (encoded_string, encoding, charset). For unencoded strings, the last
     # two parts will be None.
```

With these changes, decode_header() would return one of the following:

decode_header(...) -> [(bytes, None)]
or
decode_header(...) -> [(bytes, str)]
or
decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]
== Ensure that charset is always str, never None ==

A couple more small changes:

```
@@ -92,7 +92,7 @@ def decode_header(header):
                 unencoded = unencoded.lstrip()
                 first = False
             if unencoded:
-                words.append((unencoded, None, None))
+                words.append((unencoded, None, 'ascii'))
             if parts:
                 charset = parts.pop(0).lower()
                 encoding = parts.pop(0).lower()
@@ -133,7 +133,8 @@ def decode_header(header):
     # Now convert all words to bytes and collapse consecutive runs of
     # similarly encoded words.
     collapsed = []
-    last_word = last_charset = None
+    last_word = None
+    last_charset = 'ascii'
     for word, charset in decoded_words:
         if isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
```

With these changes, decode_header() would return only:

decode_header(...) -> List[(bytes, str)] |
|||
msg411069 | Author: Jelle Zijlstra (JelleZijlstra) | Date: 2022-01-21 01:48 | |
This behavior is definitely unfortunate, but by now it's also been baked into more than a decade of Python 3 releases, so backward compatibility constraints make it difficult to fix.

How can we be sure this change won't break users' code?

For reference, here are a few uses of the function I found in major open-source packages:
- https://github.com/httplib2/httplib2/blob/cde9e87d8b2c4c5fc966431965998ed5f45d19c7/python3/httplib2/__init__.py#L1608 - this assumes it only ever hits the (bytes, encoding) case.
- https://github.com/cherrypy/cherrypy/blob/98929b519fbca003cbf7b14a6b370a3cabc9c412/cherrypy/lib/httputil.py#L258 - this assumes it only gets (str, None) or (bytes, encoding) pairs, which seems unsafe. But if it currently sees (str, None) and would see (bytes, None) with this change, it would break.

An alternative solution could be a new function with a sane return type.

Even if we decide to not change anything, we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html. |
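One way the "new function with a sane return type" idea could look is sketched below. The name `decode_header_as_bytes` is hypothetical and is not part of the standard library or of PR 30548. Encoding str parts with 'raw-unicode-escape' is an assumption, chosen to mirror what decode_header() itself does for unencoded words in mixed headers.

```python
import email.header


def decode_header_as_bytes(header):
    """Hypothetical wrapper: always return (bytes, charset-or-None) pairs."""
    normalized = []
    for part, charset in email.header.decode_header(header):
        if isinstance(part, str):
            # Only happens when the header has no encoded words at all.
            # Assumption: use the same codec decode_header() applies
            # internally to unencoded words.
            part = part.encode("raw-unicode-escape")
        normalized.append((part, charset))
    return normalized


print(decode_header_as_bytes("plain string"))
print(decode_header_as_bytes("bar=?ascii?Q?hello?="))
```

Because it is a separate function, existing callers that rely on the (str, None) result of decode_header() would be unaffected.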
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:10 | admin | set | github: 67022 |
2022-01-21 01:48:16 | JelleZijlstra | set | nosy: + JelleZijlstra; messages: + msg411069 |
2022-01-11 21:29:49 | dlenski | set | keywords: + patch; stage: patch review; pull_requests: + pull_request28748 |
2021-12-30 23:08:21 | dlenski | set | messages: + msg409392 |
2021-12-30 22:30:20 | dlenski | set | nosy: + dlenski; messages: + msg409391 |
2021-12-12 11:30:42 | iritkatriel | set | nosy: + iritkatriel; messages: + msg408376; versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.4, Python 3.5 |
2014-11-10 22:59:23 | r.david.murray | set | messages: + msg230981 |
2014-11-10 21:17:38 | py.user | set | messages: + msg230976 |
2014-11-10 14:50:46 | r.david.murray | set | messages: + msg230962; versions: - Python 3.3 |
2014-11-10 02:42:56 | py.user | set | type: behavior |
2014-11-10 02:42:40 | py.user | create |