Issue 25545: email parsing docs: clarify that only ASCII strings are supported


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69731

classification

Title:	email parsing docs: clarify that only ASCII strings are supported
Type:	behavior	Stage:	needs patch
Components:	Library (Lib)	Versions:	Python 3.6, Python 3.4, Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, immerrr again, jaraco, jayvdb, r.david.murray, tanzer@swing.co.at
Priority:	normal	Keywords:

Created on 2015-11-03 14:43 by tanzer@swing.co.at, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description
email_get_payload__test.py	tanzer@swing.co.at, 2015-11-03 14:43
parse-text.py	jaraco, 2018-12-05 20:36

Messages (19)
msg253994 - (view)	Author: Christian Tanzer (tanzer@swing.co.at)	Date: 2015-11-03 14:43
For an email message with `Content-type: text/plain; charset=utf-8`, in Python 3.5, get_payload returns a bytes object encoded with `latin-1`. Python 2.7 returns a str object encoded with `utf-8` as expected.

Running the attached test script `email_get_payload__test.py` with Python 2.7 and 3.5 shows the difference.

Python 2.7::

    2.7.10.final.0
    * utf8 *
    From: Christian Tanzer <tanzer@swing.co.at>
    To: Christian Tanzer <tanzer@swing.co.at>
    Content-type: text/plain; charset=utf-8

    Sehr geehrte Damen und Herren,
    ...
    Danke und mit freundlichen Grüssen,
    --
    Christian Tanzer http://www.c-tanzer.at/

Python 3.5::

    3.5.0.final.0
    * latin-1 *
    From: Christian Tanzer <tanzer@swing.co.at>
    To: Christian Tanzer <tanzer@swing.co.at>
    Content-type: text/plain; charset=utf-8

    Sehr geehrte Damen und Herren,
    ...
    Danke und mit freundlichen Grüssen,
    --
    Christian Tanzer http://www.c-tanzer.at/

In both Python versions, `msg.get_content_charset()` returns None, which is not correct, either.
msg254014 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-11-03 19:59
Your problem is that your input email is a unicode string. A unicode string has no RFC definition as an email, so things do not work right, as you observed. Whether or not email should throw an error when fed a non-ascii unicode string is an interesting question, but it hasn't in the past and so for backward compatibility reasons we won't change that.

If you add an "encode('utf-8')" to the end of your email string, and then use message_from_bytes, you will get the correct result.

You might also be interested in the newer email API, currently documented in the 'contentmanager' and 'policy' chapters of the documentation. It says it is provisional, but the changes (other than bug fixes) between the current API and what will be final in 3.6 are trivial.

get_content_charset is None because you don't have any actual headers in your message, just body. This is because of the leading newline in your triple quoted string, which the email package takes as the end of the headers.
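The two points in this reply can be illustrated with a minimal sketch. The addresses and body text below are made up for illustration; only `email.message_from_bytes`, `email.message_from_string`, `get_payload`, and `get_content_charset` come from the discussion. The first part encodes the text to UTF-8 before parsing, and the second part shows how a leading newline leaves the message without any headers.

```python
import email

# Illustrative message text (not the reporter's attached script).
text = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "Danke und mit freundlichen Grüssen\n"
)

# Encode first, then parse as bytes, as suggested above.
msg = email.message_from_bytes(text.encode("utf-8"))
charset = msg.get_content_charset()
body = msg.get_payload(decode=True).decode(charset)
print(charset)
print(body)

# A leading blank line ends the header section before it starts,
# so every "header" line becomes part of the body.
headerless = email.message_from_string("\n" + text.replace("ü", "ue"))
print(headerless.keys())
print(headerless.get_content_charset())
```

In the second case the parser sees the blank first line as the header/body separator, which is why `get_content_charset()` has nothing to read.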
msg254041 - (view)	Author: Christian Tanzer (tanzer@swing.co.at)	Date: 2015-11-04 09:04
R. David Murray wrote at Tue, 03 Nov 2015 19:59:53 +0000:

> Your problem is that your input email is a unicode string. A unicode
> string has no RFC definition as an email, so things do not work right,
> as you observed. Whether or not email should throw an error when fed
> a non-ascii unicode string is an interesting question, but it hasn't
> in the past and so for backward compatibility reasons we won't change
> that.

Excuse me, I am using `email.message_from_string` which is documented to convert a unicode string to an email object. If you are serious `message_from_string` should not even exist!

As long as it is there and documented as::

    email.message_from_string(s, _class=email.message.Message, *, policy=policy.compat32)

    Return a message object structure from a string. This is exactly equivalent to Parser().parsestr(s). _class and policy are interpreted as with the Parser class constructor.

    Changed in version 3.3: Removed the strict argument. Added the policy keyword.

your argument is unfounded and this is definitely a serious bug!

> You might also be interested in the newer email API, currently
> documented in the 'contentmanager' and 'policy' chapters of the
> documentation. It says it is provisional, but the changes (other than
> bug fixes) between the current API and what will be final in 3.6 are
> trivial.

I'm using Python 2.7 and only just exploring 3.5. Unfortunately, there are many bugs and your response is a typical example why moving from 2.7 to 3.x is hard. There is gratuitous breakage but the reaction is::

    resolution: -> not a bug

I would ask you to reconsider that stance.

As long as my code needs to support 2.7, use of any new API doesn't fly. After an eventual switch to 3.5 (probably years in the future), I might use new APIs for new code but changing existing code that used to work won't be in the cards.

> get_content_charset is None because you don't have any actual headers
> in your message, just body. This is because of the leading newline in
> your triple quoted string, which the email package takes as the end of
> the headers.

Thanks for the hint.

BTW, removing the leading newline doesn't change the buggy behavior of `message_from_string`!
msg254058 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-11-04 15:36
There is no problem with supporting both 2.7 and python3 with the same email API as long as your input strings are ASCII only, which is what is required by the email RFCs (as I said, they do not support unicode...even the new one only supports utf8 (a unicode encoding) not unicode itself). So if your input is RFC compliant (using content transfer encoding to encode non-ASCII characters), things will work fine. Just think of unicode as a 7-bit transmission channel (which is what it is from email's perspective). Otherwise the bytes/string issues are no different than they are for any other shared-code-base application.

I have an extensive doc rewrite in process, but I'm not sure when it will land. I thought I had already added the note about ASCII-only to the parser docs, but I see that I did not. I'll reopen this issue to remind myself to do that, since the doc rewrite will only apply to 3.6 (when the new API will no longer be provisional).
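As a rough illustration of what "RFC compliant, ASCII-only input" means here, the sketch below base64-encodes a non-ASCII body so that the whole message is 7-bit clean before it is handed to `message_from_string`. The sender address and body are assumptions made up for the example.

```python
import base64
import email

body = "Danke und mit freundlichen Grüssen"

# Content transfer encoding turns the UTF-8 bytes into pure ASCII.
encoded_body = base64.b64encode(body.encode("utf-8")).decode("ascii")

text = (
    "From: sender@example.com\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + encoded_body + "\n"
)

# Raises UnicodeEncodeError if any non-ASCII character slipped in.
text.encode("ascii")

msg = email.message_from_string(text)
# decode=True undoes the base64 layer and returns bytes;
# the charset parameter says how to turn those bytes into text.
decoded = msg.get_payload(decode=True).decode(msg.get_content_charset())
print(decoded)
```

Because the string never carries anything outside ASCII, the same input should behave the same way whether it is parsed as text or encoded and parsed as bytes.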
msg254066 - (view)	Author: Christian Tanzer (tanzer@swing.co.at)	Date: 2015-11-04 17:41
R. David Murray wrote at Wed, 04 Nov 2015 15:36:27 +0000:

> There is no problem with supporting both 2.7 and python3 with the same
> email API as long as your input strings are ASCII only, which is what
> is required by the email RFCs (as I said, they do not support
> unicode...even the new one only supports utf8 (a unicode encoding) not
> unicode itself).

You are talking about byte strings. And of course the email RFCs only talk about byte strings.

But the email package offers the use of unicode strings for various functions, including `email.message_from_string`, `email.Message.as_string`, and `email.Message.__str__`. These functions could be useful (and were useful in Python 2) but aren't in Python 3.

Assume I load an email satisfying all relevant RFCs from a file. Say that email contains three MIMEText parts with content-transfer-encoding "8bit", all with different encodings:

* I don't see any use for `as_string` to obfuscate that by re-encoding each of the three to content-transfer-encoding "base64", which is completely unreadable when it could be converted painlessly to a real unicode string.

  One of my usage scenarios is something of the form::

      >>> print(msg)

  Of course, in this case I'd better use `utf-8` as my output encoding, otherwise the print might fail.

  If I wanted to output an RFC-compliant byte string, I should have used `as_bytes`, not `as_string`. But that would be a different usage scenario.

* The same argument applies in reverse to `message_from_string`. If one wants RFC compliance one should use `message_from_bytes`. But if one builds up a unicode string for an email in Python, it should be possible to convert that to an `email.Message` instance via `message_from_string`.

I have several use cases where I want to convert an `email.Message` to a unicode string without any embedded content-transfer-encodings like "base64", do some transformations on that string and then convert that back into an `email.Message` instance.

> I have an extensive doc rewrite in process, but I'm not sure when it
> will land. I thought I had already added the note about ASCII-only to
> the parser docs, but I see that I did not. I'll reopen this issue to
> remind myself to do that, since the doc rewrite will only apply to 3.6
> (when the new API will no longer be provisional).

I don't see any point in the semantics of the string-functions as they are currently implemented; after all, one can easily do things like `message_from_string(...).decode("latin-1")` or `msg.as_bytes().encode("latin-1")` if one really wants to convert an RFC-compatible byte-string to/from unicode strings as-is. But this as-is conversion normally isn't very useful because it isn't

* human-readable

* well suited to search and replace operations or any other text transformations

So documenting the current situation would improve the situation slightly but it's more like putting lipstick on a pig.
msg254067 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-11-04 18:28
Yes, the port from python2 to python3 of the email package was...suboptimal. (I wasn't a contributor when that happened, and the person who did it simply did not have time to do the needed rewrite...he had to settle for just making it more-or-less work.) The whole concept of using unicode as a 7bit data channel only is just...weird. But, we are now stuck with maintaining that API for backward compatibility reasons.

To fix it, I rewrote significant parts of the email package, which is the new API. And even with that the internals are more than a bit hackish and I'd love to make further changes. I probably won't have time, though, since what we have now works and I'm not (currently) getting paid to work on it.

It also is...fraught with the danger of bugs...to talk about serializing an email message as a string, transforming it, and then trying to re-parse it as an email message. If your transformations are simple, it will probably work, but anything at all complex runs the risk of breaking the message. And having non-ascii bodies counts as non-trivial. The whole point of the Message model is to allow you to transform an email message and be able to produce an RFC valid serialization as the output after you are done.

You do have to conditionalize your 2/3 code to use the bytes parser and generator if you are dealing with 8-bit messages. There's just no way around that.
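One way the "conditionalize your 2/3 code" advice might look in practice is sketched below. The helper names `parse_raw_message` and `serialize_message` are hypothetical, not part of the email package; the sketch assumes the input is the raw byte content of a message (a `str` on Python 2, `bytes` on Python 3).

```python
import email
import sys

PY3 = sys.version_info[0] >= 3


def parse_raw_message(raw):
    """Hypothetical helper: parse raw message bytes on either Python version."""
    if PY3:
        # Python 3: 8-bit data has to go through the bytes parser.
        return email.message_from_bytes(raw)
    # Python 2: str already is a byte string.
    return email.message_from_string(raw)


def serialize_message(msg):
    """Hypothetical helper: produce the byte serialization of a message."""
    if PY3:
        return msg.as_bytes()
    return msg.as_string()


raw = (
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"Gr\xc3\xbc\xc3\x9fe\n"
)
msg = parse_raw_message(raw)
out = serialize_message(msg)
print(msg.get_content_charset())
```

Keeping the version check inside two small functions confines the bytes/string difference to one place, so the rest of the code can work with message objects only.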
msg254095 - (view)	Author: Christian Tanzer (tanzer@swing.co.at)	Date: 2015-11-05 09:58
> Yes, the port from python2 to python3 of the email package
> was...suboptimal.
> ...
> The whole concept of using unicode as a 7bit data channel only is
> just...weird.

+100 to both.

> But, we are now stuck with maintaining that API for backward
> compatibility reasons.

That's a weird definition of backward compatibility, though. The API breaks backward compatibility to Python 2. Any Python 3 user shouldn't use the broken API anyway, IMHO.

> To fix it, I rewrote significant parts of the email package, which
> is the new API.

Which unfortunately isn't any help if one needs to stay compatible to 2.7.

> It also is...fraught with the danger of bugs...to talk about
> serializing an email message as a string, transforming it, and then
> trying to re-parse it as an email message. If your transformations
> are simple, it will probably work, but anything at all complex runs
> the risk of breaking the message.

One of Python's mottos used to be: We are all consenting adults here.

But there are other uses for converting a message instance to a unicode string. Display, printing, and grepping come to mind.

> And having non-ascii bodies counts as non-trivial.

For anybody living in a non-ascii country that statement sounds very strange. To start with, I have many friends with names that contain non-ascii characters.

> You do have to conditionalize your 2/3 code to use the bytes parser
> and generator if you are dealing with 8-bit messages. There's just no
> way around that.

I did that yesterday. There are problems with that though:

* Recognizing the problem for what it is. Trying to run Python 2.7 code that should run under 3.5 but breaks with weird errors wastes a lot of time. Multiply with the number of Python programmers that want to migrate and you get a problem. If `message_from_string` and `as_string` just weren't there in 3.x it would be much less of a problem (clear documentation would also help but not as much).

* Lots of ugly workarounds for the same problem. Most of them (mine certainly included) are done quick and ad-hoc and probably break in many ways.

The question then arises: why should one use the email package at all. But of course that way lies madness.

Just more roadblocks for the move to Python 3.
msg254131 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-11-05 17:35
I agree that the situation is not the best, but it is the one we have. I can't delete those methods now, they've existed in Python3 for too long, and initially were the only thing that worked (albeit only with ASCII only strings).

If you can suggest ways of improving the string support without breaking existing python3 code that may be using it (most likely wrongly, but working for them), then I will happily review them.

As for "that sounds strange" about non-ascii bodies being non-trivial, remember that the context is the byte-string serialization protocol defined in RFC 5322. This is the evolution of a protocol that started out ascii only, learned something about 8-bit data, then learned something about using bytes for handling other languages. It is an evolutionary mess that has lots of pitfalls. You can't simply serialize a message to unicode, preserving the RFC 5322/MIME markup, and have a valid email, unless you make it a 7-bit clean (ascii only) representation. And that is what the email package does. So, conversely, email can only parse (as a string) a 7-bit, ASCII only, representation.

To do what you appear to want, to be able to represent non-ascii as the equivalent unicode cannot work, because email messages may contain binary data which cannot be represented in printable unicode.

So, it is unfortunate that a non-ascii body is non-trivial in email, but there's no getting around the fact that it is. The new API in python3 aims to make it as simple as possible, but of course that doesn't help python2 users. But, making unicode easier is one big reason python3 exists (the biggest one, in practice).
msg254179 - (view)	Author: Christian Tanzer (tanzer@swing.co.at)	Date: 2015-11-06 09:59
> If you can suggest ways of improving the string support without
> breaking existing python3 code that may be using it (most likely
> wrongly, but working for them), then I will happily review them.

At the moment, I'm mainly interested in having code that runs correctly in both python2.7 and python3. Having the same method behave totally differently in the two versions is what triggered this bug report. Adding new methods won't help with 2.7.

> To do what you appear to want, to be able to represent non-ascii as
> the equivalent unicode cannot work, because email messages may
> contain binary data which cannot be represented in printable
> unicode.

I have no problem whatsoever if, and would actually expect that, binary message parts are encoded as necessary for RFC compliance. My beef is with message parts that are text and are naturally represented as unicode not as charset- and transfer-encoded 7-bit strings!

I also don't see how such a representation would break existing python3 code but that might just be another example of famous last words.

> But, making unicode easier is one big reason python3 exists (the
> biggest one, in practice).

From what I have seen up to now, that has failed (spectacularly, in my opinion, if you consider things like unpickling python2-created pickles with binary strings, e.g., datetime instances).

Using unicode in python2 worked well enough although there was the problem that one couldn't specify which strings were supposed to be binary. Exactly those strings are a big problem for code that wants to run in both python2 and python3.

python3 solves the problem of binary strings, though badly because of the various missing string functions. But there seem to be bugs all over the standard library and in third party modules. That library APIs still haven't settled down yet in python3 is even worse!

Maybe python3 would work well if one threw away all existing code and started with completely new code but I don't think that was the intention.
msg254189 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-11-06 13:20
Python3 is easier to do unicode in for programs that start with a clear bytes/string split. Yes, porting from python2 has bumps arising from the places where bytes and string are blurred. Yes if we could redo python3 knowing what we know now we could improve matters. But IMO we did a pretty good job given that we didn't know what we know now. This is not the forum to discuss such matters further :)
msg254262 - (view)	Author: Christian Tanzer (tanzer@swing.co.at)	Date: 2015-11-07 08:24
Terry J. Reedy wrote at Fri, 06 Nov 2015 22:49:57 +0000:

> email parsing docs: clarify that only ASCII strings are supported

If that is the decision, `message_from_string` should raise an exception if it gets a non-ASCII argument!
msg254315 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-11-08 00:56
Except that that might break code that is currently working, so I can't do that, even though I'd like to.
msg254322 - (view)	Author: John Mark Vandenberg (jayvdb) *	Date: 2015-11-08 05:20
Could it issue a UnicodeWarning?
msg254325 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-11-08 05:55
Issuing a warning is an interesting idea. Basically, deprecate using a non-ASCII string with message_from_string etc formally by issuing a deprecation warning as well as the doc note.
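The warning idea could be prototyped outside the standard library along the following lines. This is only a sketch of the proposal, not something the email package does: `message_from_string_checked` is a hypothetical wrapper name, and the warning text is invented for the example.

```python
import email
import warnings


def message_from_string_checked(text, **kwargs):
    """Hypothetical wrapper: warn on non-ASCII input, then parse as usual."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        warnings.warn(
            "non-ASCII input to message_from_string; "
            "encode the text and use message_from_bytes instead",
            DeprecationWarning,
            stacklevel=2,
        )
    return email.message_from_string(text, **kwargs)


# Capture the warning so the behavior is visible regardless of filters.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    msg = message_from_string_checked("Subject: hi\n\nGrüße\n")

print([str(w.message) for w in caught])
```

Since the wrapper still returns the parsed message, existing callers keep working, which is the backward compatibility concern raised earlier in the thread.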
msg331159 - (view)	Author: Jason R. Coombs (jaraco) *	Date: 2018-12-05 20:36
I don't think this ticket should be implemented as described.

Consider the use-case in importlib_metadata, which loads metadata from a package, metadata known to be of a specified encoding. It already knows the encoding and has decoded the full message to text and now wants to parse it. It seems very much in the remit of something like email.parser to parse already-decoded content. Yes, the RFCs describe how to decode bytes content, but that shouldn't preclude the e-mail module from supporting parsing from Unicode text.

And in fact, it does seem that the library is able to parse non-ascii Unicode text, especially on Python 3. Consider 'parse-text.py', attached. It illustrates that the parser currently mostly meets my expectation: on Python 2.7 and 3.7, e-mail messages are parsed from unicode text without any indication of an encoding, and unicode text is returned on both Python 2 and Python 3.

Python 2 is deficient in that message_from_string will get a UnicodeEncodeError constructing a bytes-oriented StringIO from the input, which is easily worked around by using the text-oriented io.StringIO.

Still, I would argue the current behavior is desirable and shouldn't be deprecated.
msg331183 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2018-12-05 22:21
The problem comes from thinking you can parse an arbitrary email message if it is in unicode form. YOU CANNOT DO THAT in the general case (ie: non-ascii attachments).

That said, the new email package API is designed to facilitate "off label" uses. I would have no problem with the definition of a policy object[*] that was basically "use this to parse messages in unicode form as long as they don't use MIME". As soon as you start parsing MIME headers, the input had better be binary or pure ascii, or the headers won't make sense. You break the MIME API contract if you use MIME with a non-ascii unicode string.

[*] that policy might be a clone of one of the existing policies and not actually do anything to prevent the input having mime headers...ideally it would, but I just don't want to say it is OK to use the standard email policies to do this and expect it to continue to work in the future. It probably will, but we should not document it that way! :)
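The "off label" case under discussion, header-style metadata that is already decoded text and uses no MIME, might look like the sketch below. The field names and values are invented for illustration and are not taken from the attached 'parse-text.py'. As noted above, this relies on current behavior of the default parser rather than on a documented guarantee.

```python
import email

# Already-decoded, non-ASCII text in header/body form, with no MIME headers.
metadata_text = (
    "Name: example-package\n"
    "Author: Jörg Müller\n"
    "\n"
    "Long description with ünïcode characters.\n"
)

msg = email.message_from_string(metadata_text)
author = msg["Author"]
description = msg.get_payload()
print(author)
print(description)
```

Once a `Content-Type` or `Content-Transfer-Encoding` header enters the picture, the text form stops being a faithful stand-in for the bytes, which is where the caveat in the reply applies.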
msg340841 - (view)	Author: immerrr again (immerrr again)	Date: 2019-04-25 14:01
Hi everyone,

It's the first time I'm using this bugtracker, so apologies in advance if I manage to break something from the first go.

Not sure if it's the right place to report this, but I have the following repro that involves email.message_from_bytes:

    In [128]: import email
         ...: msg_bytes = (
         ...:     b'MIME-Version: 1.0\r\n'
         ...:     b'Content-Type: text/plain;\r\n'
         ...:     b' charset=utf-8\r\n'
         ...:     b'Content-Transfer-Encoding: 8bit\r\n'
         ...:     b'Content-Disposition: attachment;\r\n'
         ...:     b' filename="camper_store.csv"\r\n\r\n'
         ...: ) + 'Beyoğlu-İst'.encode('utf8')
         ...: email.message_from_bytes(msg_bytes).get_payload(decode=True)
    Out[128]: b'Beyo\xc4\x9flu-\xc4\xb0st'

I have read this and some previous bug reports where it was clearly explained that message_from_string has its limitations and message_from_bytes should be used for better results. And if I'm not mistaken my repro should have it all set up correctly: CTE=8bit, body encoded in utf8 which is explicitly indicated as the content charset, yet the result is still encoded with 'raw-unicode-escape'.

Is there something wrong with the input or is it a bug?

Thanks!
msg340931 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-04-26 16:22
This is one of the infelicities of the translation of the old API to python3: 'get_payload(decode=True)' actually means "give me the bytes version of this payload", which in this case is the utf-8, which is what you got. get_payload() means "give me the payload as a string without doing CTE decoding". In a sort of accident-of-translation this turns out to mean "give me the unicode" in this particular case. If the payload had been base64 encoded, you'd have gotten a unicode string containing the base64 characters. Which I grant you is all very confusing.

For a more consistent API, use the new one:

    >>> import email.policy
    >>> m = email.message_from_bytes(msg_bytes, policy=email.policy.default)
    >>> bytes(m)
    b'MIME-Version: 1.0\nContent-Type: text/plain;\n charset=utf-8\nContent-Transfer-Encoding: 8bit\nContent-Disposition: attachment;\n filename="camper_store.csv"\n\nBeyo\xc4\x9flu-\xc4\xb0st'
    >>> m.get_content()
    'Beyoğlu-İst'

Here we don't even pretend that you have any use for the encoded version, either CTE encoding or binary encoding: get_content gives you the "fully decoded" payload (decoded from CTE and decoded to unicode).
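The base64 case mentioned in this reply can be sketched as follows, reusing the body text from the report but with a base64 content transfer encoding (the message itself is constructed for illustration). It contrasts the three accessors: `get_payload()`, `get_payload(decode=True)`, and the new API's `get_content()`.

```python
import base64
import email
import email.policy

text = "Beyoğlu-İst"
b64_body = base64.b64encode(text.encode("utf-8"))

raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + b64_body + b"\r\n"
)

legacy = email.message_from_bytes(raw)
# No CTE decoding: a str holding the base64 characters.
undecoded = legacy.get_payload()
# CTE decoding only: the UTF-8 bytes, still not text.
payload_bytes = legacy.get_payload(decode=True)

modern = email.message_from_bytes(raw, policy=email.policy.default)
# New API: CTE decoded and charset decoded in one step.
content = modern.get_content()

print(repr(undecoded))
print(payload_bytes)
print(content)
```

With the legacy accessors the caller still has to apply the charset by hand; `get_content()` does both layers of decoding.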
msg340934 - (view)	Author: immerrr again (immerrr again)	Date: 2019-04-26 16:47
Oh, wow, confusing indeed, but in historical context it makes slightly more sense. Thank you for the explanation!

History
Date	User	Action	Args
2022-04-11 14:58:23	admin	set	github: 69731
2019-04-26 16:47:53	immerrr again	set	messages: + msg340934
2019-04-26 16:22:30	r.david.murray	set	messages: + msg340931
2019-04-25 14:01:12	immerrr again	set	nosy: + immerrr again messages: + msg340841
2018-12-05 22:21:01	r.david.murray	set	messages: + msg331183
2018-12-05 20:36:43	jaraco	set	files: + parse-text.py nosy: + barry, jaraco messages: + msg331159
2015-11-08 05:55:37	r.david.murray	set	messages: + msg254325
2015-11-08 05:20:31	jayvdb	set	nosy: + jayvdb messages: + msg254322
2015-11-08 00:56:08	r.david.murray	set	messages: + msg254315
2015-11-07 08:24:40	tanzer@swing.co.at	set	messages: + msg254262
2015-11-06 22:49:57	terry.reedy	set	title: email parsing docs need to be clear that only ASCII strings are supported -> email parsing docs: clarify that only ASCII strings are supported
2015-11-06 13:20:57	r.david.murray	set	messages: + msg254189
2015-11-06 09:59:59	tanzer@swing.co.at	set	messages: + msg254179
2015-11-05 17:35:37	r.david.murray	set	messages: + msg254131
2015-11-05 09:58:19	tanzer@swing.co.at	set	messages: + msg254095
2015-11-04 18:29:00	r.david.murray	set	messages: + msg254067
2015-11-04 17:41:28	tanzer@swing.co.at	set	messages: + msg254066
2015-11-04 15:36:27	r.david.murray	set	status: closed -> open versions: + Python 3.4, Python 3.6 title: email.message.get_payload returns wrong encoding -> email parsing docs need to be clear that only ASCII strings are supported messages: + msg254058 resolution: not a bug -> stage: resolved -> needs patch
2015-11-04 09:04:29	tanzer@swing.co.at	set	messages: + msg254041
2015-11-03 19:59:53	r.david.murray	set	status: open -> closed nosy: + r.david.murray messages: + msg254014 resolution: not a bug stage: resolved
2015-11-03 14:43:30	tanzer@swing.co.at	create