classification
Title: email.header.Header.__unicode__ does not decode header
Type: behavior Stage:
Components: Documentation, email Versions: Python 3.2, Python 3.3, Python 3.4, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: barry, docs@python, hniksic, r.david.murray
Priority: normal Keywords:

Created on 2013-03-21 07:46 by hniksic, last changed 2013-03-23 06:48 by hniksic.

Messages (5)
msg184856 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2013-03-21 07:46
The __unicode__ method is documented to "return the header as a Unicode string". For this to be useful, I would expect it to decode a string such as "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=" into a Unicode string that can be displayed to the user, in this case u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'.

However, unicode(header) returns the not so useful u"=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=". Looking at the code of __unicode__, it appears that the code does attempt to decode the header into Unicode, but this fails for Headers initialized from a single MIME-quoted string, as is done by the message parser. In other words, __unicode__ is failing to call decode_header.

Here is a minimal example demonstrating the problem:

>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> unicode(msg['subject'])
u'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

Expected output of the last line:
u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

To get the fully decoded Unicode string, one must use something like:
>>> u''.join(unicode(s, c or 'us-ascii') for s, c in email.header.decode_header(msg['subject']))

which is unintuitive and hard to teach to new users of the email package. (And looking at the source of __unicode__ it's not even obvious that it's correct — it appears that a space must be added before us-ascii-coded chunks. The joining is non-trivial.)

The same problem occurs in Python 3.3 with str(msg['subject']).
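For readers on Python 3, the following is a minimal sketch of the same reproduction, assuming the default (compat32) policy that email.message_from_string uses when no policy argument is given. The variable names RAW and subject are only illustrative.

```python
import email

# Same message as in the report, parsed with the default (compat32) policy.
RAW = "Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n"
msg = email.message_from_string(RAW)

# The header value comes back still RFC 2047-encoded.
subject = str(msg["subject"])
print(subject)
```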
msg184894 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2013-03-21 18:01
An example of the confusion that lack of a clear "convert to unicode" method creates is illustrated by this StackOverflow question: http://stackoverflow.com/q/15516958/1600898
msg184896 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-21 18:41
I agree that this is not the world's best API.  However, it is the API that we have in 2.7/3.2, and we can't change how Header.__unicode__ behaves without breaking backward compatibility.

What we could do is add an example of how to use this API to get unicode strings to the top of the docs:

   >>> unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))
   u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

But you already know about that.
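
A rough Python 3 counterpart of the idiom above, where str() takes the place of unicode(). The helper name decode_mime_header is hypothetical and is not part of the email package; it only wraps the two documented functions.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw):
    """Hypothetical helper: decode an RFC 2047-encoded header value to str."""
    # decode_header splits the value into (bytes, charset) chunks;
    # make_header rebuilds a Header from them, and str() joins the result.
    return str(make_header(decode_header(raw)))


decoded = decode_mime_header("=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=")
print(decoded)
```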

In Python 3.3 you get this:

   >>> msg = message_from_string("subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\n", policy=default)
   >>> msg['subject']
   '这是中文测试！'

So, I'll make this a doc bug.
msg184897 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-21 18:49
Erg, somehow I failed to read the second half of your message before writing mine...clearly you *didn't* know about that idiom, so the doc patch is obviously an important thing to do.

To clarify about the 3.3 example: the policy=default argument is key. It tells the email package to use the new (currently provisional) policy code, which provides improved handling of header decoding and encoding.
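
For completeness, a self-contained version of the 3.3 example with its imports spelled out. This is a sketch that assumes Python 3.3 or later, where email.policy.default is available; the variable names are illustrative.

```python
from email import message_from_string
from email.policy import default

# With policy=default the parser hands back decoded header values.
msg = message_from_string(
    "subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\n", policy=default
)
subject = msg["subject"]
print(subject)
```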
msg185028 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2013-03-23 06:48
Thanks for pointing out the make_header(decode_header(...)) idiom, which I was indeed not aware of.  It solves the problem perfectly.

I agree that it is a doc bug.  While make_header is documented in the same place as decode_header and Header itself, it is not explained *why* I should call it if I already have in hand a perfectly valid Header instance.  Specifically, it is not at all clear that unicode(h) and unicode(make_header(decode_header(h))) will return different things. I would have expected make_header(decode_header(h)) to return an object indistinguishable from h.
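
The difference described above can be sketched in Python 3 as follows, with str() standing in for unicode(). The sketch passes the raw encoded string (not a Header instance) to decode_header, and the names raw, h, and rebuilt are illustrative.

```python
from email.header import Header, decode_header, make_header

raw = "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?="

# A Header built directly from the encoded text treats it as plain ASCII.
h = Header(raw)

# Decoding first yields (bytes, charset) chunks, which make_header
# turns into a Header that knows the real charset.
rebuilt = make_header(decode_header(raw))

print(str(h))        # still the RFC 2047-encoded form
print(str(rebuilt))  # the decoded text
```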

Also, the policy=default parameter in Python 3 sounds great; it's exactly what one would expect.
History
Date                 User            Action  Args
2013-03-23 06:48:13  hniksic         set     messages: + msg185028
2013-03-21 18:49:17  r.david.murray  set     messages: + msg184897
2013-03-21 18:41:46  r.david.murray  set     versions: + Python 3.2, Python 3.3, Python 3.4
                                             nosy: + docs@python
                                             messages: + msg184896
                                             assignee: docs@python
                                             components: + Documentation
2013-03-21 18:01:08  hniksic         set     messages: + msg184894
2013-03-21 07:47:28  hniksic         set     type: behavior
2013-03-21 07:46:53  hniksic         create