
Author r.david.murray
Recipients barry, mlalic, r.david.murray, serhiy.storchaka
Date 2013-06-20.19:32:39
Message-id <1371756760.04.0.455880862597.issue18271@psf.upfronthosting.co.za>
Content
The python3 email package's handling of 8bit definitely has quirks.  (So did the python2 email package's, but they were different quirks. :)

You can't correctly handle 8bit unless you use message_from_bytes and take the input from a byte string.  It is a good question what should be done with a unicode string that claims its payload is 8bit.  Since that situation can't arise on the wire (or in a disk file), perhaps it should produce an exception ("message must be parsed as binary data"?).  The problem with that idea is that the email parser promises never to raise errors, and instead always to produce *some* sort of model from the input, possibly with defects attached.
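
To make the "never raise, attach defects" promise concrete, here is a minimal sketch using the default policy.  The malformed input is made up for illustration (it declares a multipart boundary that never appears), and the names raw, broken and defect_names are not from the original report.

```python
from email import message_from_string

# Hypothetical malformed input: the header declares a multipart message,
# but the body contains no boundary lines at all.
raw = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="XXX"\n'
    '\n'
    'no boundary lines anywhere in this body\n'
)

# With the default policy the parser does not raise; it still returns a
# message object built from whatever it could make of the input.
broken = message_from_string(raw)

# What went wrong is recorded on the message object as a list of defects.
defect_names = [type(d).__name__ for d in broken.defects]
print(defect_names)
```

Checking message.defects after parsing is how a caller finds out that the model was built from questionable input.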

All that aside, here is what you want to be doing:

>>> from email import message_from_bytes
>>> message = message_from_bytes(b"""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
... 
... \xc3\xbcnic\xc3\xb6de data..""")
>>> message.get_content_charset()
'utf-8'
>>> message.get_payload(decode=True)
b'\xc3\xbcnic\xc3\xb6de data..'
>>> message.get_payload(decode=True).decode('utf-8')
'ünicöde data..'
>>> message.get_payload()
'ünicöde data..'

You will note that get_payload without the decode automatically does the charset decode.  I know this is counter-intuitive, but we are dealing with a legacy API that I had to retrofit.  Think of decode=True as "produce binary from the wire content transfer encoding", and decode=False as "produce the string representation of the payload".  For ASCII content-transfer-encodings, this is more intuitive (the raw quoted printable, for example), but for 8bit we can only produce a python string if we do the unicode decode...so that's what we do.
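
For comparison, here is a sketch of the same payload sent as quoted-printable, where the decode=True / decode=False split is easier to see.  The message is a made-up variant of the one above, and the variable names are only illustrative.

```python
from email import message_from_bytes

# Same text as in the 8bit example, but with a quoted-printable
# content transfer encoding, so the wire form is pure ASCII.
qp_message = message_from_bytes(
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"=C3=BCnic=C3=B6de data.."
)

# decode=False (the default): the string representation of the payload,
# which here is the raw quoted-printable text.
raw_text = qp_message.get_payload()

# decode=True: undo the content transfer encoding and return bytes.
wire_bytes = qp_message.get_payload(decode=True)

# The charset decode is a separate step the caller does on those bytes.
text = wire_bytes.decode(qp_message.get_content_charset())

print(repr(raw_text))
print(repr(wire_bytes))
print(text)
```

Here decode=False hands back the ASCII wire form untouched, while for 8bit there is no ASCII wire form to hand back, which is why the charset decode happens there.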

You will also note that the payload in this case really *is* utf-8, whereas in your example it was unicode...and what the python3 email package does with a unicode payload is not well defined and is definitely buggy.
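
If all you have is a unicode string, one possible workaround is to turn it back into bytes before parsing.  The sketch below assumes the string was originally produced by decoding the wire bytes with a single known charset (utf-8 here); parse_from_text is a hypothetical helper, not part of the email package.

```python
from email import message_from_bytes


def parse_from_text(text, charset="utf-8"):
    """Hypothetical helper: re-encode a str and parse it as bytes.

    Assumption: `text` came from decoding the original wire bytes with
    `charset`, so encoding it again reproduces those bytes.
    """
    return message_from_bytes(text.encode(charset))


# A unicode string that claims an 8bit payload.
source = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "\u00fcnic\u00f6de data.."
)

reparsed = parse_from_text(source)
payload_bytes = reparsed.get_payload(decode=True)
payload_text = reparsed.get_payload()
```

This only sidesteps the string-input path; it does nothing for input whose original encoding is unknown or mixed.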

I'm going to close this issue, because dealing with the vagaries of 8bit with string input is on my master list of things to tackle this summer, and will be dealt with in the context of other changes.
History
Date                 User            Action  Args
2013-06-20 19:32:40  r.david.murray  set     recipients: + r.david.murray, barry, serhiy.storchaka, mlalic
2013-06-20 19:32:40  r.david.murray  set     messageid: <1371756760.04.0.455880862597.issue18271@psf.upfronthosting.co.za>
2013-06-20 19:32:40  r.david.murray  link    issue18271 messages
2013-06-20 19:32:39  r.david.murray  create