This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: [doc] email.header.Header.__unicode__ does not decode header
Type: behavior Stage: patch review
Components: Documentation, email Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: JelleZijlstra, barry, docs@python, hniksic, r.david.murray, vidhya
Priority: normal Keywords: easy, patch

Created on 2013-03-21 07:46 by hniksic, last changed 2022-04-11 14:57 by admin.

Pull Requests
URL Status Linked Edit
PR 31765 open vidhya, 2022-03-08 15:20
Messages (12)
msg184856 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2013-03-21 07:46
The __unicode__ method is documented to "return the header as a Unicode string". For this to be useful, I would expect it to decode a string such as "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=" into a Unicode string that can be displayed to the user, in this case u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'.

However, unicode(header) returns the not so useful u"=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=". Looking at the code of __unicode__, it appears that the code does attempt to decode the header into Unicode, but this fails for Headers initialized from a single MIME-quoted string, as is done by the message parser. In other words, __unicode__ is failing to call decode_header.

Here is a minimal example demonstrating the problem:

>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> unicode(msg['subject'])
u'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

Expected output of the last line:
u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

To get the fully decoded Unicode string, one must use something like:
>>> u''.join(unicode(s, c or 'us-ascii') for s, c in email.header.decode_header(msg['subject']))

which is unintuitive and hard to teach to new users of the email package. (And looking at the source of __unicode__, it's not even obvious that it's correct: it appears that a space must be added before us-ascii-coded chunks. The joining is non-trivial.)

The same problem occurs in Python 3.3 with str(msg['subject']).
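
For readers on Python 3, the manual join described above could be sketched roughly as follows. This is only an illustration, not the email package's own logic: the helper name decode_manually is hypothetical, the fallback to ASCII for chunks that carry no charset is an assumption, and chunks are simply concatenated, so the whitespace subtlety mentioned above is not handled.

```python
from email.header import decode_header

RAW = '=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='


def decode_manually(raw):
    """Join the chunks returned by decode_header() into one str.

    Hypothetical helper for illustration only.
    """
    parts = []
    for chunk, charset in decode_header(raw):
        # decode_header() yields bytes for decoded chunks, but a plain str
        # when the input contains no encoded words at all.
        if isinstance(chunk, bytes):
            # Assumption: chunks without a charset are treated as ASCII.
            chunk = chunk.decode(charset or 'ascii', errors='replace')
        parts.append(chunk)
    return ''.join(parts)


subject = decode_manually(RAW)
```

Having to write such a helper by hand is exactly the kind of friction the report is about.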
msg184894 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2013-03-21 18:01
The confusion created by the lack of a clear "convert to unicode" method is illustrated by this StackOverflow question: http://stackoverflow.com/q/15516958/1600898
msg184896 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-21 18:41
I agree that this is not the world's best API.  However, it is the API that we have in 2.7/3.2, and we can't change how Header.__unicode__ behaves without breaking backward compatibility.

What we could do is add an example of how to use this API to get unicode strings to the top of the docs:

   >>> unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))
   u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

But you already know about that.

In Python 3.3 you get this:

   >>> msg = message_from_string("subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\n", policy=default)
   >>> msg['subject']
   '这是中文测试！'

So, I'll make this a doc bug.
msg184897 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-21 18:49
Erg, somehow I failed to read the second half of your message before writing mine...clearly you *didn't* know about that idiom, so the doc patch is obviously an important thing to do.

To clarify the 3.3 example: policy=default is key. It tells the email package to use the new (currently provisional) policy code to provide improved handling of header decoding and encoding.
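
The difference the policy makes can be seen side by side in a short sketch. It assumes a Python 3 version in which email.policy.default is available; the variable names are illustrative only.

```python
import email
from email import policy

RAW_MESSAGE = 'Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n'

# No policy argument: the parser uses the compat32 policy and the
# header value is returned still RFC 2047-encoded.
legacy_msg = email.message_from_string(RAW_MESSAGE)
legacy_subject = legacy_msg['Subject']

# Explicit policy.default: the header value is decoded on access.
modern_msg = email.message_from_string(RAW_MESSAGE, policy=policy.default)
modern_subject = str(modern_msg['Subject'])
```

Here legacy_subject keeps the encoded word, while modern_subject holds the decoded text.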
msg185028 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2013-03-23 06:48
Thanks for pointing out the make_header(decode_header(...)) idiom, which I was indeed not aware of.  It solves the problem perfectly.

I agree that it is a doc bug.  While make_header is documented in the same place as decode_header and Header itself, it is not explained *why* I should call it if I already have a perfectly valid Header instance in hand.  Specifically, it is not at all clear that unicode(h) and unicode(make_header(decode_header(h))) will return different things -- I would have expected make_header(decode_header(h)) to return an object indistinguishable from h.

Also, the policy=default parameter in Python 3 sounds great; it's exactly what one would expect.
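
A Python 3 rendering of the same observation might look like the sketch below, with str() standing in for unicode(). The names RAW, h, as_is and decoded are illustrative, and the Header is converted to str before being handed to decode_header, which is a choice made for this sketch rather than something taken from the thread.

```python
from email.header import Header, decode_header, make_header

RAW = '=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

# A Header built from a single MIME-encoded string keeps it verbatim.
h = Header(RAW)
as_is = str(h)

# Running the text through decode_header() and make_header() yields a
# Header whose str() is the decoded subject.
decoded = str(make_header(decode_header(str(h))))
```

So the two Header objects are not interchangeable: as_is is still the encoded word, while decoded is the readable text.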
msg414228 - (view) Author: Vidhya (vidhya) * Date: 2022-03-01 01:11
[Entry level contributor seeking guidance] If this is still open, I would like to work on this.

I am also planning to add the following (if no PR has been created yet) to the make_header entry at https://docs.python.org/3/library/email.header.html:

To get unicode strings use the API as shown below:
 unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))

If the email policy parameter is set to 'policy.default', then the default policy for that Python version is used for header encoding and decoding.

Please correct me if anything is wrong.
msg414234 - (view) Author: Jelle Zijlstra (JelleZijlstra) * (Python committer) Date: 2022-03-01 02:59
The messages above are very old and seem to be discussing Python 2. There is no `__unicode__` method any more, for example, though there is a `__str__` method which presumably does what `__unicode__` used to do. It is documented now at https://docs.python.org/3.10/library/email.header.html#email.header.Header.__str__ . You'll have to do some more digging to figure out whether the OP's complaint still applies.
msg414273 - (view) Author: Vidhya (vidhya) * Date: 2022-03-01 15:33
The issue's version list was updated to 3.9, 3.10 and 3.11, so I assumed it still applies.

@irit: Any suggestions on what needs to be done for current revisions?
msg414508 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2022-03-04 08:27
> Any suggestions on what needs to be done for current revisions?

Hi! I'm the person who submitted this issue back in 2013. Let's take a look at how things are in Python 3.10:

Python 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> msg['Subject']
'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

So the headers are still not decoded by default. The `unicode()` invocation in the original description was just an attempt to get a Unicode string out of a byte string (assuming it was correctly decoded from MIME, which it wasn't). Since Python 3 strings are Unicode already, I'd expect to just get the decoded subject - but that still doesn't happen.

The correct way to make it happen is to specify `policy=email.policy.default`:

>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n', policy=email.policy.default)
>>> msg['Subject']
'这是中文测试！'

The docs should point out that you really _want_ to specify the "default" policy (strangely named, since it's not in fact the default). The current docs only say that `message_from_string()` is "equivalent to Parser().parsestr(s)." and that `policy` is interpreted "as with the Parser class constructor". The docs of the Parser constructor don't document `policy` at all, except to note the version in which it was added.

So, if you want to work on this, my suggestion would be to improve the docs in the following ways:

* in message_from_string() docs, explain that `policy=email.policy.default` is what you want to pass to get the headers decoded.

* in Parser docs, explain what the _class and policy arguments do in the constructor, which policies are possible, etc. (These things seem to be explained in the BytesFeedParser docs, so you might want to just link to that, or include a shortened version.)
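
As a rough illustration of what such Parser documentation could demonstrate, the sketch below passes a policy to the constructor and compares the resulting message classes. It assumes Python 3.6 or later, where the message class follows from the policy when _class is not given; the variable names are illustrative.

```python
from email import policy
from email.message import EmailMessage, Message
from email.parser import Parser

RAW_MESSAGE = 'Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n'

# Default constructor: compat32 policy, legacy Message objects,
# header values left encoded.
legacy = Parser().parsestr(RAW_MESSAGE)

# Explicit policy: EmailMessage objects with decoded header values.
modern = Parser(policy=policy.default).parsestr(RAW_MESSAGE)

legacy_class = type(legacy)
modern_class = type(modern)
modern_subject = str(modern['Subject'])
```

With this input, legacy_class is Message and modern_class is EmailMessage, and only the latter gives a decoded subject.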
msg414530 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2022-03-04 14:32
The policy is named 'default' because it was intended to become the default two feature releases after the new email code became non-provisional (first: deprecate not specifying an explicit policy, next release make default the default policy and make the deprecation only cover compat32).  However, for various reasons that switchover did not happen (one big factor being my reduced time spent doing python development).  It can happen any time someone steps forward to guide it through the release process.
msg414542 - (view) Author: Vidhya (vidhya) * Date: 2022-03-04 18:04
@hniksic: Thanks for your suggestions. I will look into the BytesFeedParser documentation.
@david.murray: I can help with switching the default policy to 'default' and updating the deprecation as well. It would be good if someone could guide me on this. Being a beginner, I am not sure whether we are allowed to change the Python code.
msg414758 - (view) Author: Vidhya (vidhya) * Date: 2022-03-08 15:23
The PR for the email parser doc update is: 
https://github.com/python/cpython/pull/31765

Can someone please review it?
History
Date User Action Args
2022-04-11 14:57:43 admin set github: 61707
2022-03-08 15:23:43 vidhya set messages: + msg414758
2022-03-08 15:20:22 vidhya set keywords: + patch
    stage: patch review
    pull_requests: + pull_request29874
2022-03-04 18:04:58 vidhya set messages: + msg414542
2022-03-04 14:32:51 r.david.murray set messages: + msg414530
2022-03-04 08:27:09 hniksic set messages: + msg414508
2022-03-01 15:33:20 vidhya set messages: + msg414273
2022-03-01 02:59:16 JelleZijlstra set nosy: + JelleZijlstra
    messages: + msg414234
2022-03-01 01:11:45 vidhya set nosy: + vidhya
    messages: + msg414228
2021-12-13 18:40:21 iritkatriel set keywords: + easy
    title: email.header.Header.__unicode__ does not decode header -> [doc] email.header.Header.__unicode__ does not decode header
    versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.7, Python 3.2, Python 3.3, Python 3.4
2013-03-23 06:48:13 hniksic set messages: + msg185028
2013-03-21 18:49:17 r.david.murray set messages: + msg184897
2013-03-21 18:41:46 r.david.murray set versions: + Python 3.2, Python 3.3, Python 3.4
    nosy: + docs@python
    messages: + msg184896
    assignee: docs@python
    components: + Documentation
2013-03-21 18:01:08 hniksic set messages: + msg184894
2013-03-21 07:47:28 hniksic set type: behavior
2013-03-21 07:46:53 hniksic create