classification
Title: email.header.decode_header sometimes returns bytes, sometimes str
Type: enhancement
Stage: needs patch
Components: Documentation, email, Library (Lib)
Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: open
Resolution:
Dependencies: 6302
Superseder:
Assigned To: docs@python
Nosy List: barry, berker.peksag, docs@python, ezio.melotti, jim_minter, louis.abraham@yahoo.fr, r.david.murray
Priority: normal
Keywords:

Created on 2014-05-13 09:08 by jim_minter, last changed 2019-06-07 18:44 by terry.reedy.

Messages (5)
msg218419 - Author: Jim Minter (jim_minter) Date: 2014-05-13 09:08
Python 3.3.2 (default, Mar  5 2014, 08:21:05) 
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.header
>>> email.header.decode_header("foo")
[('foo', None)]
>>> email.header.decode_header("foo=?windows-1252?Q?bar?=")
[(b'foo', None), (b'bar', 'windows-1252')]

I may well be wrong, but I believe it's erroneous that in the second example above, b'foo' is returned instead of the expected 'foo'.
msg218442 - Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2014-05-13 11:44
> >>> email.header.decode_header("foo")
> [('foo', None)]

email.header.decode_header() implements RFC 2047, and the "foo" header contains nothing that matches the encoded-word syntax described there (see "2. Syntax of encoded-words").

See the code for more information:

* http://hg.python.org/cpython/file/default/Lib/email/header.py#l34
* http://hg.python.org/cpython/file/default/Lib/email/header.py#l81

See the "6.1. Recognition of 'encoded-word's in message headers" section at http://www.rfc-base.org/txt/rfc-2047.txt
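
To illustrate the recognition step, here is a rough sketch of checking a header for something shaped like an RFC 2047 encoded-word (=?charset?encoding?text?=). The ENCODED_WORD pattern and the has_encoded_word() helper are hypothetical names introduced for this example; the regex is a simplified stand-in and not necessarily the exact pattern used in Lib/email/header.py.

```python
import re

# Simplified, illustrative pattern for an RFC 2047 encoded-word:
#   =?charset?encoding?encoded-text?=
# This is an assumption for demonstration, not the library's own regex.
ENCODED_WORD = re.compile(r"=\?([^?]*?)\?([qQbB])\?(.*?)\?=")


def has_encoded_word(header):
    """Return True if the header contains text shaped like an encoded-word."""
    return ENCODED_WORD.search(header) is not None


for h in (
    "foo",
    "foo=?windows-1252?Q?bar?=",
    "=??b?SOKCrGxsbw=====?= <hello@example.com>",
):
    print(repr(h), has_encoded_word(h))
```

Under this sketch, "foo" has no encoded-word, while the other two headers do (the last one with an empty charset).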
msg218450 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-05-13 13:11
The error is actually the first case returning string rather than bytes.  See issue 6302.
msg218451 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-05-13 13:15
Hmm.  It looks like we decided that we couldn't fix the behavior for backward compatibility reasons.  In 3.4 you can use the new email policies to get automatic, correct stringification of headers.
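
A minimal sketch of the policy-based approach, assuming the message is parsed with email.policy.default (available from 3.4 on). The sample header text is made up for illustration; with this policy, header values are expected to come back as str with encoded-words already decoded.

```python
from email import message_from_string
from email.policy import default

# Illustrative raw message; the Subject contains a base64 encoded-word.
raw = "Subject: =?utf-8?b?SOKCrGxsbw==?= world\n\nbody\n"

# Parsing with the "default" policy yields header objects that behave
# like str, so no manual handling of bytes vs. str is needed.
msg = message_from_string(raw, policy=default)
subject = msg["Subject"]

print(type(subject).__name__, str(subject))
```

This avoids calling decode_header() at all, which is why it sidesteps the mixed return types discussed here.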
msg344522 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2019-06-04 04:20
If we can't fix the behavior, it should at least be documented.

Currently the docs say "This function returns a list of (decoded_string, charset) pairs containing each of the decoded parts of the header.".  One could assume that this means that a Unicode string is returned, but as far as I can tell, "decoded_string" means decoded from the format used by the header, not from bytes -- in fact the example below it shows a byte string.
#24797 suggests an alternative solution, but there is no indication of it in the docs except an easy-to-miss note about the new API at the top.

Coincidentally as I was reporting this issue I also found the recently opened #37139.  There are also a few other reports: #24797, #37139, #32975, #6302, #4661.

If this method is not actually deprecated, I would document the current behavior (i.e. sometimes it returns bytes, sometimes unicode -- bonus points if there's a simple rule to predict which one), explain that it exists for legacy/backward-compatibility reasons, and point to the alternatives.
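
For code that has to keep using decode_header(), one possible workaround is to normalize its output to str regardless of which type comes back. The helper below is a hypothetical sketch (decode_header_to_str and its fallback parameter are not part of the email package); it assumes that bytes parts with no usable charset can be decoded as ASCII with replacement characters.

```python
from email.header import decode_header


def decode_header_to_str(header, fallback="ascii"):
    """Join the parts returned by decode_header() into a single str.

    Hypothetical helper: parts may be str or bytes, so each bytes part is
    decoded with its reported charset, or with `fallback` if the charset
    is missing, empty, or unknown to Python.
    """
    parts = []
    for value, charset in decode_header(header):
        if isinstance(value, bytes):
            try:
                value = value.decode(charset or fallback, errors="replace")
            except LookupError:
                # Charset name not recognized by Python's codec registry.
                value = value.decode(fallback, errors="replace")
        parts.append(value)
    return "".join(parts)


print(decode_header_to_str("foo"))
print(decode_header_to_str("=?utf8?b?SOKCrGxsbw==?= <hello@example.com>"))
```

The result is always a str, at the cost of replacing bytes that cannot be decoded; whether that trade-off is acceptable depends on the caller.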


FWIW here are 3 more samples that show the inconsistency.

>>> from email.header import decode_header
>>> # str + None
>>> h = '\x80SOKCrGxsbw===== <hello@example.com>'; decode_header(h)
[('\x80SOKCrGxsbw===== <hello@example.com>', None)]
>>> # bytes + '', bytes + None
>>> h = '=??b?SOKCrGxsbw=====?= <hello@example.com>'; decode_header(h)
[(b'H\xe2\x82\xacllo', ''), (b' <hello@example.com>', None)]
>>> # bytes + 'utf8', bytes + None
>>> h = '=?utf8?b?SOKCrGxsbw==?= <hello@example.com>'; decode_header(h)
[(b'H\xe2\x82\xacllo', 'utf8'), (b' <hello@example.com>', None)]
History
Date                 User            Action  Args
2019-06-07 18:44:38  terry.reedy     set     versions: + Python 3.7, Python 3.8, Python 3.9, - Python 3.3
2019-06-04 04:20:03  ezio.melotti    set     status: closed -> open
                                             type: behavior -> enhancement
                                             assignee: docs@python
                                             components: + Documentation
                                             nosy: + ezio.melotti, louis.abraham@yahoo.fr, docs@python
                                             messages: + msg344522
                                             resolution: duplicate ->
                                             stage: resolved -> needs patch
2015-08-05 15:49:44  r.david.murray  link    issue24797 superseder
2014-05-13 13:15:12  r.david.murray  set     messages: + msg218451
2014-05-13 13:11:38  r.david.murray  set     status: open -> closed
                                             messages: + msg218450
                                             dependencies: + Add decode_header_as_string method to email.utils
                                             resolution: duplicate
                                             stage: resolved
2014-05-13 11:44:48  berker.peksag   set     nosy: + barry, berker.peksag, r.david.murray
                                             messages: + msg218442
                                             components: + email
2014-05-13 09:08:05  jim_minter      create