Message 184856 - Python tracker

Author	hniksic
Recipients	barry, hniksic, r.david.murray
Date	2013-03-21.07:46:53
Message-id	<1363852013.63.0.253070231302.issue17505@psf.upfronthosting.co.za>

Content
The __unicode__ method is documented to "return the header as a Unicode string". For this to be useful, I would expect it to decode a string such as "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=" into a Unicode string that can be displayed to the user, in this case u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'.

However, unicode(header) returns the not-so-useful u"=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=". Looking at the code of __unicode__, it appears that the code does attempt to decode the header into Unicode, but this fails for Headers initialized from a single MIME-quoted string, as is done by the message parser. In other words, __unicode__ is failing to call decode_header.
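
The failure mode described above can also be seen directly on email.header.Header. What follows is a minimal sketch for Python 3, where str() plays the role of unicode(); the variable names are illustrative and not taken from the email package.

```python
import base64
from email.header import Header

ENCODED = '=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

# A Header built from the already-encoded text treats it as plain us-ascii,
# so converting it to text hands the encoded word back unchanged.
from_encoded_text = Header(ENCODED)
unchanged = str(from_encoded_text)

# A Header built from the raw bytes plus their charset does decode them.
raw_bytes = base64.b64decode('1eLKx9bQzsSy4srUo6E=')
from_raw_bytes = Header(raw_bytes, 'gb2312')
decoded = str(from_raw_bytes)

print(unchanged)
# ascii() keeps the output printable on terminals without CJK support.
print(ascii(decoded))
```

The contrast suggests that the decoding logic itself works, and that the gap is only in recognizing encoded words inside a string passed in as ordinary text.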

Here is a minimal example demonstrating the problem:

>>> import email, email.header
>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> unicode(msg['subject'])
u'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

Expected output of the last line:
u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

To get the fully decoded Unicode string, one must use something like:
>>> u''.join(unicode(s, c) for s, c in email.header.decode_header(msg['subject']))

which is unintuitive and hard to teach to new users of the email package. (Looking at the source of __unicode__, it's not even obvious that this expression is correct: it appears that a space must be added before us-ascii-coded chunks, so the joining is non-trivial.)
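
One way to avoid hand-written joining is to pass the decoded chunks back through email.header.make_header and let the Header object assemble the text. The sketch below assumes Python 3; decode_header_text is a hypothetical helper name, not something the email package provides.

```python
from email.header import decode_header, make_header


def decode_header_text(raw):
    """Return a header value as text, decoding any RFC 2047 encoded words.

    Hypothetical helper for illustration only.
    """
    # decode_header() splits the value into (chunk, charset) pairs.
    # make_header() rebuilds a Header from those pairs, and str() on that
    # Header joins the chunks, including the spacing between them.
    return str(make_header(decode_header(raw)))


subject = decode_header_text('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')
plain = decode_header_text('plain ascii subject')

# ascii() keeps the output printable on terminals without CJK support.
print(ascii(subject))
print(plain)
```

This still requires knowing about two helper functions, so it does not remove the usability concern raised above; it only moves the joining rules out of user code.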

The same problem occurs in Python 3.3 with str(msg['subject']).
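
For reference, a Python 3 version of the reproduction might look like the following sketch. It assumes the message is parsed with no extra arguments, mirroring the Python 2 example; the constant and variable names are illustrative.

```python
import email

RAW_MESSAGE = 'Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n'

# Parse with the default settings, as in the Python 2 example.
msg = email.message_from_string(RAW_MESSAGE)

# str() returns the header value in its encoded form rather than
# the decoded Chinese text.
subject_text = str(msg['subject'])
print(subject_text)
```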

History
Date	User	Action	Args
2013-03-21 07:46:53	hniksic	set	recipients: + hniksic, barry, r.david.murray
2013-03-21 07:46:53	hniksic	set	messageid: <1363852013.63.0.253070231302.issue17505@psf.upfronthosting.co.za>
2013-03-21 07:46:53	hniksic	link	issue17505 messages
2013-03-21 07:46:53	hniksic	create