Issue 17505: [doc] email.header.Header.__unicode__ does not decode header

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/61707

classification

Title:	[doc] email.header.Header.__unicode__ does not decode header
Type:	behavior	Stage:	patch review
Components:	Documentation, email	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	JelleZijlstra, barry, docs@python, hniksic, r.david.murray, vidhya
Priority:	normal	Keywords:	easy, patch

Created on 2013-03-21 07:46 by hniksic, last changed 2022-04-11 14:57 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 31765	open	vidhya, 2022-03-08 15:20

Messages (12)
msg184856 - (view)	Author: Hrvoje Nikšić (hniksic) *	Date: 2013-03-21 07:46
The __unicode__ method is documented to "return the header as a Unicode string". For this to be useful, I would expect it to decode a string such as "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=" into a Unicode string that can be displayed to the user, in this case u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'. However, unicode(header) returns the not so useful u"=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=".

Looking at the code of __unicode__, it appears that the code does attempt to decode the header into Unicode, but this fails for Headers initialized from a single MIME-quoted string, as is done by the message parser. In other words, __unicode__ is failing to call decode_header.

Here is a minimal example demonstrating the problem:

>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> unicode(msg['subject'])
u'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

Expected output of the last line:

u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

To get the fully decoded Unicode string, one must use something like:

>>> u''.join(unicode(s, c) for s, c in email.header.decode_header(msg['subject']))

which is unintuitive and hard to teach to new users of the email package. (And looking at the source of __unicode__ it's not even obvious that it's correct: it appears that a space must be added before us-ascii-coded chunks. The joining is non-trivial.)

The same problem occurs in Python 3.3 with str(msg['subject']).
msg184894 - (view)	Author: Hrvoje Nikšić (hniksic) *	Date: 2013-03-21 18:01
An example of the confusion that lack of a clear "convert to unicode" method creates is illustrated by this StackOverflow question: http://stackoverflow.com/q/15516958/1600898
msg184896 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-03-21 18:41
I agree that this is not the world's best API. However, it is the API that we have in 2.7/3.2, and we can't change how Header.__unicode__ behaves without breaking backward compatibility. What we could do is add an example of how to use this API to get unicode strings to the top of the docs:

>>> unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))
u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

But you already know about that. In Python 3.3 you get this:

>>> msg = message_from_string("subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\n", policy=default)
>>> msg['subject']
'这是中文测试！'

So, I'll make this a doc bug.
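For readers on Python 3, the idiom above can be sketched with str() in place of unicode(). This is a minimal sketch, not an official recipe: decode_mime_header is a hypothetical helper name that does not exist in the email package, and the sample value is the one from this report.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Decode an RFC 2047 encoded header value into a plain str.

    decode_mime_header is a hypothetical helper name used only for
    illustration; it is not part of the email package.
    """
    # decode_header splits the value into (bytes_or_str, charset) pairs;
    # make_header rebuilds a Header whose str() is the decoded text.
    return str(make_header(decode_header(raw)))


decoded = decode_mime_header("=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=")
print(decoded)
```

Letting make_header do the joining avoids hand-written concatenation of the decoded chunks, which the original report describes as non-trivial.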
msg184897 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-03-21 18:49
Erg, somehow I failed to read the second half of your message before writing mine...clearly you didn't know about that idiom, so the doc patch is obviously an important thing to do. To clarify about the 3.3 example: the policy=default is key, it tells the email package to use the new (currently provisional) policy code to provide improved handling of header decoding and encoding.
msg185028 - (view)	Author: Hrvoje Nikšić (hniksic) *	Date: 2013-03-23 06:48
Thanks for pointing out the make_header(decode_header(...)) idiom, which I was indeed not aware of. It solves the problem perfectly.

I agree that it is a doc bug. While make_header is documented in the same place as decode_header and Header itself, it is not explained why I should call it if I already have in hand a perfectly valid Header instance. Specifically, it is not at all clear that unicode(h) and unicode(make_header(decode_header(h))) will return different things; I would have expected make_header(decode_header(h)) to return an object indistinguishable from h.

Also, the policy=default parameter in Python 3 sounds great, it's exactly what one would expect.
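The difference between a Header built directly from the encoded text and one rebuilt through decode_header can be made visible by looking at the intermediate values. The following is a small Python 3 sketch using the sample value from this report; the variable names are illustrative only.

```python
from email.header import Header, decode_header, make_header

raw = "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?="

# A Header built directly from the encoded text keeps it as a single
# undecoded chunk, so str() hands the same text back.
undecoded = Header(raw)
print(str(undecoded) == raw)

# decode_header performs the RFC 2047 decoding and yields
# (bytes, charset) pairs for the encoded words it finds.
pairs = decode_header(raw)
print(pairs)

# make_header rebuilds a Header from those pairs; this one stringifies
# to readable text.
rebuilt = make_header(pairs)
print(str(rebuilt))
```

Both objects are valid Header instances; only the second was constructed from decoded chunks, which is why the two stringify differently.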
msg414228 - (view)	Author: Vidhya (vidhya) *	Date: 2022-03-01 01:11
[Entry level contributor seeking guidance] If this is still open, I would like to work on it. If no PR has been created yet, I am also planning to add the following to the make_header API docs at https://docs.python.org/3/library/email.header.html :

To get unicode strings use the API as shown below:

unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))

If the email policy parameter is set to 'policy.default', then the default policy for that Python version is used for header encoding and decoding.

Please correct me if anything is wrong.
msg414234 - (view)	Author: Jelle Zijlstra (JelleZijlstra) *	Date: 2022-03-01 02:59
The messages above are very old and seem to be discussing Python 2. There is no `__unicode__` method any more, for example, though there is a `__str__` method which presumably does what `__unicode__` used to do. It is documented now at https://docs.python.org/3.10/library/email.header.html#email.header.Header.__str__ . You'll have to do some more digging to figure out whether the OP's complaint still applies.
msg414273 - (view)	Author: Vidhya (vidhya) *	Date: 2022-03-01 15:33
The latest versions 3.9, 3.10 and 3.11 were added to the issue, so I thought it still applies. @irit: Any suggestions on what needs to be done for current revisions?
msg414508 - (view)	Author: Hrvoje Nikšić (hniksic) *	Date: 2022-03-04 08:27
> Any suggestions on what needs to be done for current revisions?

Hi! I'm the person who submitted this issue back in 2013. Let's take a look at how things are in Python 3.10:

Python 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> msg['Subject']
'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

So the headers are still not decoded by default. The `unicode()` invocation in the original description was just an attempt to get a Unicode string out of a byte string (assuming it was correctly decoded from MIME, which it wasn't). Since Python 3 strings are Unicode already, I'd expect to just get the decoded subject, but that still doesn't happen. The correct way to make it happen is to specify `policy=email.policy.default`:

>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n', policy=email.policy.default)
>>> msg['Subject']
'这是中文测试！'

The docs should point out that you really _want_ to specify the "default" policy (strangely named, since it's not in fact the default). The current docs only say that `message_from_string()` is "equivalent to Parser().parsestr(s)." and that `policy` is interpreted "as with the Parser class constructor". The docs of the Parser constructor don't document `policy` at all, except for the version when it was added.

So, if you want to work on this, my suggestion would be to improve the docs in the following ways:

* in message_from_string() docs, explain that `policy=email.policy.default` is what you want to pass to get the headers decoded.
* in Parser docs, explain what the _class and policy arguments do in the constructor, which policies are possible, etc. (These things seem to be explained in the BytesFeedParser docs, so you might want to just link to that, or include a shortened version.)
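The two behaviours described in this message can be compared side by side in a standalone script. This is a sketch based on the session above; the names legacy and modern are illustrative, and "legacy" refers to parsing without an explicit policy argument.

```python
import email
import email.policy

raw_message = "Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n"

# Without an explicit policy the header value comes back still
# MIME-encoded.
legacy = email.message_from_string(raw_message)
print(repr(legacy["Subject"]))

# With policy=email.policy.default the header value is decoded on access.
modern = email.message_from_string(raw_message, policy=email.policy.default)
print(repr(modern["Subject"]))
```

Passing the policy at parse time means every header read from the message is decoded, with no per-header make_header(decode_header(...)) call.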
msg414530 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2022-03-04 14:32
The policy is named 'default' because it was intended to become the default two feature releases after the new email code became non-provisional (first: deprecate not specifying an explicit policy, next release make default the default policy and make the deprecation only cover compat32). However, for various reasons that switchover did not happen (one big factor being my reduced time spent doing python development). It can happen any time someone steps forward to guide it through the release process.
msg414542 - (view)	Author: Vidhya (vidhya) *	Date: 2022-03-04 18:04
@hniksic: Thanks for your suggestions. I will look into the BytesFeedParser documentation. @david.murray: I can help with switching the default policy over to 'default' and with updating the deprecation as well. It would be good if someone could guide me on this. Being a beginner, I am not sure if we are allowed to change the Python code.
msg414758 - (view)	Author: Vidhya (vidhya) *	Date: 2022-03-08 15:23
The PR for the email parser doc update is: https://github.com/python/cpython/pull/31765 Could someone review it, please?

History
Date	User	Action	Args
2022-04-11 14:57:43	admin	set	github: 61707
2022-03-08 15:23:43	vidhya	set	messages: + msg414758
2022-03-08 15:20:22	vidhya	set	keywords: + patch stage: patch review pull_requests: + pull_request29874
2022-03-04 18:04:58	vidhya	set	messages: + msg414542
2022-03-04 14:32:51	r.david.murray	set	messages: + msg414530
2022-03-04 08:27:09	hniksic	set	messages: + msg414508
2022-03-01 15:33:20	vidhya	set	messages: + msg414273
2022-03-01 02:59:16	JelleZijlstra	set	nosy: + JelleZijlstra messages: + msg414234
2022-03-01 01:11:45	vidhya	set	nosy: + vidhya messages: + msg414228
2021-12-13 18:40:21	iritkatriel	set	keywords: + easy title: email.header.Header.__unicode__ does not decode header -> [doc] email.header.Header.__unicode__ does not decode header versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.7, Python 3.2, Python 3.3, Python 3.4
2013-03-23 06:48:13	hniksic	set	messages: + msg185028
2013-03-21 18:49:17	r.david.murray	set	messages: + msg184897
2013-03-21 18:41:46	r.david.murray	set	versions: + Python 3.2, Python 3.3, Python 3.4 nosy: + docs@python messages: + msg184896 assignee: docs@python components: + Documentation
2013-03-21 18:01:08	hniksic	set	messages: + msg184894
2013-03-21 07:47:28	hniksic	set	type: behavior
2013-03-21 07:46:53	hniksic	create