Issue 18271: get_payload method returns bytes which cannot be decoded using the message's charset

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62471

classification

Title:	get_payload method returns bytes which cannot be decoded using the message's charset
Type:	behavior	Stage:	resolved
Components:	email	Versions:	Python 3.4

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, mlalic, r.david.murray, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2013-06-20 15:39 by mlalic, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (6)
msg191526 - (view)	Author: Marko Lalic (mlalic)	Date: 2013-06-20 15:39
When the message's Content-Transfer-Encoding is set to 8bit, the get_payload(decode=True) method returns the payload encoded using raw-unicode-escape. This means that it is impossible to decode the returned bytes using the content charset obtained by the get_content_charset method. It seems this should be fixed so that get_payload returns the bytes as found in the payload when Content-Transfer-Encoding is 8bit, exactly like Python2.7 handles it.

>>> from email import message_from_string
>>> message = message_from_string("""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
...
... ünicöde data..""")
>>> message.get_content_charset()
'utf-8'
>>> message.get_payload(decode=True)
b'\xfcnic\xf6de data..'
>>> message.get_payload(decode=True).decode(message.get_content_charset())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte
>>> message.get_payload(decode=True).decode('raw-unicode-escape')
'ünicöde data..'
msg191532 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-06-20 18:59
>>> message.get_payload(decode=True).decode('latin1') 'ünicöde data..'
msg191534 - (view)	Author: Marko Lalic (mlalic)	Date: 2013-06-20 19:15
That will work fine as long as the characters are actually latin. We cannot forget the rest of the unicode character planes. Consider::

>>> message = message_from_string("""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
...
... 한글ᥡ╥ສए""")
>>> message.get_payload(decode=True).decode('latin1')
'\\ud55c\\uae00\\u1961\\u2565\\u0eaa\\u090f'
>>> message.get_payload(decode=True).decode('raw-unicode-escape')
'한글ᥡ╥ສए'

However, even if latin1 did work, the main point is that a different encoding than the one the message specifies must be used in order to decode the bytes to a unicode string.
msg191536 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-06-20 19:32
The python3 email package's handling of 8bit definitely has quirks. (So did the python2 email package's, but they were different quirks. :)

You can't correctly handle 8bit unless you use message_from_bytes and take the input from a byte string. It is a good question what should be done with a unicode string that claims its payload is 8bit...since that situation can't arise on the wire (or in a disk file), perhaps it should produce an exception ("message must be parsed as binary data"?) The problem with that idea is that the email parser promises to never raise errors, but always produce some sort of model from the input, possibly with defects attached.

All that aside, here is what you want to be doing:

>>> from email import message_from_bytes
>>> message = message_from_bytes(b"""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
...
... \xc3\xbcnic\xc3\xb6de data..""")
>>> message.get_content_charset()
'utf-8'
>>> message.get_payload(decode=True)
b'\xc3\xbcnic\xc3\xb6de data..'
>>> message.get_payload(decode=True).decode('utf-8')
'ünicöde data..'
>>> message.get_payload()
'ünicöde data..'

You will note that get_payload without the decode automatically does the charset decode. I know this is counter-intuitive, but we are dealing with a legacy API that I had to retrofit. Think of decode=True as "produce binary from the wire content transfer encoding", and decode=False as "produce the string representation of the payload". For ASCII content-transfer-encodings, this is more intuitive (the raw quoted printable, for example), but for 8bit we can only produce a python string if we do the unicode decode...so that's what we do.

You will also note that the payload in this case really is utf-8, whereas in your example it was unicode...and what the python3 email package does with a unicode payload is not well defined and is definitely buggy.
I'm going to close this issue, because dealing with the vagaries of 8bit with string input is on my master list of things to tackle this summer, and will be dealt with in the context of other changes.
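
Editorial sketch (not part of the original thread): the distinction between decode=True and decode=False described in the message above can be wrapped in a small helper. The name payload_text and its fallback parameter are hypothetical, introduced only for illustration; the sketch assumes a non-multipart message parsed with message_from_bytes.

```python
from email import message_from_bytes

# Same sample message as in the thread, supplied as bytes (as on the wire).
RAW = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Disposition: inline\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xc3\xbcnic\xc3\xb6de data.."
)


def payload_text(message, fallback="ascii"):
    """Hypothetical helper: decode a single-part payload to str.

    decode=True undoes the content transfer encoding and yields bytes;
    the declared charset is then applied as a separate step.
    """
    data = message.get_payload(decode=True)
    if data is None:  # multipart messages have no single byte payload
        return ""
    charset = message.get_content_charset() or fallback
    return data.decode(charset, errors="replace")


message = message_from_bytes(RAW)
wire_bytes = message.get_payload(decode=True)  # bytes as found in the body
text = payload_text(message)                   # str decoded with the charset
```

With bytes input, the declared charset decodes the payload cleanly, which is the behavior the original report expected from the string-based parse.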
msg191540 - (view)	Author: Marko Lalic (mlalic)	Date: 2013-06-20 21:25
Thank you for your reply. Unfortunately, I have a use case where message_from_bytes has a pretty great disadvantage. I have to parse the received message and then forward it completely unchanged, apart from possibly adding a few new headers. The problem with message_from_bytes is that it changes the Content-Transfer-Encoding header to base64 (and consequently base64 encodes the content). Do you possibly have a suggestion how to currently go about solving this problem? A possible solution I can spot from your answer is to check the Content-Transfer-Encoding before getting the payload and use the version without decode=True when it is 8bit. Maybe there is something more elegant? Thank you in advance.
msg191542 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-06-20 22:37
If all you are changing is headers (and you don't change the CTE), then when you use BytesGenerator to re-serialize the message, it is supposed to preserve the existing CTE/payload. (Whether or not you call get_payload, regardless of arguments, does not matter; get_payload does not modify the Message object...though set_payload does, of course). If you have a case where the payload is being re-encoded even though you have not changed the content-type or content-transfer-encoding headers or the payload, then that is a bug. Of course, if you use just Generator (which is what str uses), the output message must be in ASCII, so in that case it does indeed transcode 8bit payloads to base64.
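
Editorial sketch (not part of the original thread): the forwarding use case might look roughly like this, assuming the default policy and a message parsed from bytes. The header name X-Forwarded-By and its value are hypothetical placeholders.

```python
from email import message_from_bytes
from email.generator import BytesGenerator
from io import BytesIO

# Sample 8bit message, supplied as bytes.
RAW = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Disposition: inline\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xc3\xbcnic\xc3\xb6de data.."
)

message = message_from_bytes(RAW)

# Add a header only; the payload and the CTE header are left alone.
message["X-Forwarded-By"] = "example-relay"  # hypothetical header

# BytesGenerator writes bytes, so the 8bit payload can pass through as-is.
buffer = BytesIO()
BytesGenerator(buffer).flatten(message)
forwarded = buffer.getvalue()

# str() goes through Generator, whose output must be ASCII; per the reply
# above, the 8bit payload is transcoded to base64 in that case.
as_text = str(message)
```

Serializing through BytesGenerator rather than str() is the design choice that matters here: only the bytes path can carry the original 8bit body unchanged.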

History
Date	User	Action	Args
2022-04-11 14:57:47	admin	set	github: 62471
2013-06-20 22:37:27	r.david.murray	set	messages: + msg191542
2013-06-20 21:25:10	mlalic	set	messages: + msg191540
2013-06-20 19:32:40	r.david.murray	set	status: open -> closed versions: - Python 3.3 messages: + msg191536 resolution: not a bug stage: resolved
2013-06-20 19:15:23	mlalic	set	messages: + msg191534
2013-06-20 18:59:19	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg191532 versions: + Python 3.4
2013-06-20 15:39:02	mlalic	create