classification
Title: email parsing docs: clarify that only ASCII strings are supported
Type: behavior
Stage: needs patch
Components: Library (Lib)
Versions: Python 3.6, Python 3.5, Python 3.4
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, jason.coombs, jayvdb, r.david.murray, tanzer@swing.co.at
Priority: normal
Keywords:

Created on 2015-11-03 14:43 by tanzer@swing.co.at, last changed 2018-12-05 22:21 by r.david.murray.

Files
File name Uploaded Description
email_get_payload__test.py tanzer@swing.co.at, 2015-11-03 14:43
parse-text.py jason.coombs, 2018-12-05 20:36
Messages (16)
msg253994 - Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-11-03 14:43
For an email message with `Content-type: text/plain; charset=utf-8`, in Python 3.5, get_payload returns a bytes object encoded with `latin-1`. Python 2.7 returns a str object encoded with `utf-8` as expected.

Running the attached test script `email_get_payload__test.py`  with Python 2.7 and 3.5 shows the difference.

Python 2.7::

    2.7.10.final.0 *** utf8 ***
    From: Christian Tanzer <tanzer@swing.co.at>
    To: Christian Tanzer <tanzer@swing.co.at>
    Content-type: text/plain; charset=utf-8


    Sehr geehrte Damen und Herren,

    ...

    Danke und mit freundlichen Grüssen,

    --
    Christian Tanzer                                    http://www.c-tanzer.at/

Python 3.5::

    3.5.0.final.0 *** latin-1 ***
    From: Christian Tanzer <tanzer@swing.co.at>
    To: Christian Tanzer <tanzer@swing.co.at>
    Content-type: text/plain; charset=utf-8


    Sehr geehrte Damen und Herren,

    ...

    Danke und mit freundlichen Grüssen,

    --
    Christian Tanzer                                    http://www.c-tanzer.at/

In both Python versions, `msg.get_content_charset()` returns None, which is not correct, either.
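The attached test script is not reproduced in this report, so the following is only a minimal sketch of the Python 3 behavior described above. The message text is a hypothetical stand-in, written without a leading newline so that the headers are actually parsed.

```python
import email

# Hypothetical stand-in for the attached script: a unicode string with a
# non-ASCII body, passed to the string parser.
text = (
    "From: Christian Tanzer <tanzer@swing.co.at>\n"
    "Content-type: text/plain; charset=utf-8\n"
    "\n"
    "Danke und mit freundlichen Grüssen,\n"
)

msg = email.message_from_string(text)
payload = msg.get_payload(decode=True)

# On Python 3 the result is a bytes object in which "ü" is the single
# byte 0xfc (as in latin-1), not the two-byte UTF-8 sequence.
print(type(payload).__name__)
print(payload)
```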
msg254014 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-03 19:59
Your problem is that your input email is a unicode string.  A unicode string has no RFC definition as an email, so things do not work right, as you observed.  Whether or not email should throw an error when fed a non-ascii unicode string is an interesting question, but it hasn't in the past and so for backward compatibility reasons we won't change that.

If you add an "encode('utf-8')" to the end of your email string, and then use message_from_bytes, you will get the correct result.  You might also be interested in the newer email API, currently documented in the 'contentmanager' and 'policy' chapters of the documentation.  It says it is provisional, but the changes (other than bug fixes) between the current API and what will be final in 3.6 are trivial.
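A minimal sketch of the suggested workaround, assuming Python 3 and a hypothetical message text: encode the string as UTF-8 and hand the bytes to `message_from_bytes`.

```python
import email

# Hypothetical message text with a non-ASCII body.
text = (
    "From: Christian Tanzer <tanzer@swing.co.at>\n"
    "Content-type: text/plain; charset=utf-8\n"
    "\n"
    "Danke und mit freundlichen Grüssen,\n"
)

# Encode first, then use the bytes parser instead of the string parser.
msg = email.message_from_bytes(text.encode("utf-8"))

charset = msg.get_content_charset()   # read from the Content-type header
raw = msg.get_payload(decode=True)    # bytes, still UTF-8 encoded
body = raw.decode(charset)

print(charset)
print(body)
```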

get_content_charset is None because you don't have any actual headers in your message, just body.  This is because of the leading newline in your triple quoted string, which the email package takes as the end of the headers.
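The effect of the leading newline can be sketched as follows; both message texts are hypothetical and ASCII-only so that nothing else interferes.

```python
import email

# The triple-quoted string starts with a newline, which the parser
# treats as the blank line that ends the (empty) header section.
with_newline = """
Content-type: text/plain; charset=utf-8

body
"""

# The backslash suppresses that newline, so the header is parsed.
without_newline = """\
Content-type: text/plain; charset=utf-8

body
"""

a = email.message_from_string(with_newline)
b = email.message_from_string(without_newline)

print(a.keys(), a.get_content_charset())
print(b.keys(), b.get_content_charset())
```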
msg254041 - Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-11-04 09:04
R. David Murray wrote at Tue, 03 Nov 2015 19:59:53 +0000:

>  Your problem is that your input email is a unicode string.  A unicode
>  string has no RFC definition as an email, so things do not work right,
>  as you observed.  Whether or not email should throw an error when fed
>  a non-ascii unicode string is an interesting question, but it hasn't
>  in the past and so for backward compatibility reasons we won't change
>  that.

Excuse me, I am using `email.message_from_string` which is documented
to convert a unicode string to an email object. If you are serious
`message_from_string` should not even exist! As long as it is there
and documented as::

  email.message_from_string(s, _class=email.message.Message, *, policy=policy.compat32)

    Return a message object structure from a string. This is exactly
    equivalent to Parser().parsestr(s). _class and policy are
    interpreted as with the Parser class constructor.

    Changed in version 3.3: Removed the strict argument. Added the
    policy keyword.

your argument is unfounded and this is definitely a serious bug!

> You might also be interested in the newer email API, currently
> documented in the 'contentmanager' and 'policy' chapters of the
> documentation.  It says it is provisional, but the changes (other than
> bug fixes) between the current API and what will be final in 3.6 are
> trivial.

I'm using Python 2.7 and only just exploring 3.5.

Unfortunately, there are many bugs and your response is a typical
example why moving from 2.7 to 3.x is hard.

There is gratuitous breakage but the reaction is::

    resolution:  -> not a bug

I would ask you to reconsider that stance.

As long as my code needs to support 2.7, use of any new API doesn't
fly. After an eventual switch to 3.5 (probably years in the future), I
might use new APIs for new code but changing existing code that used
to work won't be in the cards.

> get_content_charset is None because you don't have any actual headers
> in your message, just body.  This is because of the leading newline in
> your triple quoted string, which the email package takes as the end of
> the headers.

Thanks for the hint. BTW, removing the leading newline doesn't change
the buggy behavior of `message_from_string`!
msg254058 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-04 15:36
There is no problem with supporting both 2.7 and python3 with the same email API as long as your input strings are ASCII only, which is what is required by the email RFCs (as I said, they do not support unicode...even the new one only supports utf8 (a unicode encoding) not unicode itself).  So if your input is RFC compliant (using content transfer encoding to encode non-ASCII characters), things will work fine.  Just think of unicode as a 7-bit transmission channel (which is what it is from email's perspective).  Otherwise the bytes/string issues are no different than they are for any other shared-code-base application.
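As a sketch of what "RFC compliant" input means here, assuming Python 3 and a hypothetical body text: the non-ASCII body is carried base64-encoded, so the string handed to `message_from_string` contains only ASCII characters.

```python
import base64
import email

body = "Danke und mit freundlichen Grüssen,\n"
encoded = base64.b64encode(body.encode("utf-8")).decode("ascii")

# The serialized message is pure ASCII; the content transfer encoding
# carries the non-ASCII characters.
text = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + encoded + "\n"
)

msg = email.message_from_string(text)
decoded = msg.get_payload(decode=True).decode(msg.get_content_charset())

print(decoded)
```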

I have an extensive doc rewrite in process, but I'm not sure when it will land.  I thought I had already added the note about ASCII-only to the parser docs, but I see that I did not.  I'll reopen this issue to remind myself to do that, since the doc rewrite will only apply to 3.6 (when the new API will no longer be provisional).
msg254066 - Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-11-04 17:41
R. David Murray wrote at Wed, 04 Nov 2015 15:36:27 +0000:

> There is no problem with supporting both 2.7 and python3 with the same
> email API as long as your input strings are ASCII only, which is what
> is required by the email RFCs (as I said, they do not support
> unicode...even the new one only supports utf8 (a unicode encoding) not
> unicode itself).

You are talking about byte strings. And of course the email RFCs only
talk about byte strings.

But the email package offers the use of unicode strings for various
functions, including `email.message_from_string`,
`email.Message.as_string`, and `email.Message.__str__`. These
functions could be useful (and were useful in Python 2) but aren't in
Python 3.

Assume I load an email satisfying all relevant RFCs from a file. Say
that email contains three MIMEText parts with
content-transfer-encoding "8bit", all with different
encodings:

* I don't see any use for `as_string` to obfuscate that by
  re-encoding each of the three to content-transfer-encoding "base64",
  which is completely unreadable when it could be converted painlessly
  to a real unicode string.

  One of my usage scenarios is something of the form::

    >>> print(msg)

  Of course, in this case I'd better use `utf-8` as my output
  encoding, otherwise the print might fail.

  If I wanted to output a RFC-compliant byte string, I should have
  used `as_bytes`, not `as_string`. But that would be a different
  usage scenario.

* The same argument applies in reverse to `message_from_string`. If
  one wants RFC compliance one should use `message_from_bytes`.

  But if one builds up a unicode string for an email in Python, it
  should be possible to convert that to a `email.Message` instance via
  `message_from_string`.

I have several use cases where I want to convert an `email.Message`
to a unicode string without any embedded content-transfer-encodings
like "base64", do some transformations on that string and then
convert that back into an `email.Message` instance.

> I have an extensive doc rewrite in process, but I'm not sure when it
> will land.  I thought I had already added the note about ASCII-only to
> the parser docs, but I see that I did not.  I'll reopen this issue to
> remind myself to do that, since the doc rewrite will only apply to 3.6
> (when the new API will no longer be provisional).

I don't see any point in the semantics of the string-functions as they
are currently implemented; after all, one can easily do things like
`msg.as_bytes().decode("latin-1")` or
`message_from_bytes(s.encode("latin-1"))` if one really wants to convert an
RFC-compatible byte-string to/from unicode strings as-is. But this
as-is conversion normally isn't very useful because it isn't

* human-readable

* well suited to search and replace operations or any other text
  transformations

So documenting the current situation would improve the situation slightly
but it's more like putting lipstick on a pig.
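The as-is conversion through `latin-1` mentioned above can be sketched like this, assuming Python 3.4 or later (for `as_bytes`) and a hypothetical 8bit message. The round trip is lossless, but the text is not readable in between.

```python
import email

# Hypothetical RFC-style message with an 8bit UTF-8 body.
raw = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "Grüssen\n"
).encode("utf-8")

msg = email.message_from_bytes(raw)

# latin-1 maps every byte to exactly one code point, so no information
# is lost, but the two UTF-8 bytes of "ü" become two unrelated characters.
as_text = msg.as_bytes().decode("latin-1")

# Going back: encode with latin-1 and use the bytes parser again.
round_tripped = email.message_from_bytes(as_text.encode("latin-1"))

print(as_text)
```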
msg254067 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-04 18:28
Yes, the port from python2 to python3 of the email package was...suboptimal. (I wasn't a contributor when that happened, and the person who did it simply did not have time to do the needed rewrite...he had to settle for just making it more-or-less work.)  The whole concept of using unicode as a 7bit data channel only is just...weird.  But, we are now stuck with maintaining that API for backward compatibility reasons.  To fix it, I rewrote significant parts of the email package, which is the new API.  And even with that the internals are more than a bit hackish and I'd love to make further changes.  I probably won't have time, though, since what we have now works and I'm not (currently) getting paid to work on it.

It also is...fraught with the danger of bugs...to talk about serializing an email message as a string, transforming it, and then trying to re-parse it as an email message.  If your transformations are simple, it will probably work, but anything at all complex runs the risk of breaking the message.  And having non-ascii bodies counts as non-trivial.  The whole point of the Message model is to allow you to transform an email message and be able to produce an RFC valid serialization as the output after you are done.

You do have to conditionalize your 2/3 code to use the bytes parser and generator if you are dealing with 8-bit messages. There's just no way around that.
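One way such conditional code might look is sketched below. The helper names `parse_message` and `serialize_message` are hypothetical and not part of the email package; only the Python 3 branch has been exercised here.

```python
import email
import sys

if sys.version_info[0] >= 3:
    def parse_message(raw_bytes):
        # Python 3: 8-bit input must go through the bytes parser.
        return email.message_from_bytes(raw_bytes)

    def serialize_message(msg):
        # Python 3: the bytes generator keeps 8-bit bodies as they are.
        return msg.as_bytes()
else:
    def parse_message(raw_bytes):
        # Python 2: str already is a byte string.
        return email.message_from_string(raw_bytes)

    def serialize_message(msg):
        return msg.as_string()

# Hypothetical 8bit message used to exercise the helpers.
raw = (
    u"Content-Type: text/plain; charset=utf-8\n"
    u"Content-Transfer-Encoding: 8bit\n"
    u"\n"
    u"Gr\u00fcssen\n"
).encode("utf-8")

msg = parse_message(raw)
out = serialize_message(msg)
```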
msg254095 - Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-11-05 09:58
> Yes, the port from python2 to python3 of the email package
> was...suboptimal.
> ...
> The whole concept of using unicode as a 7bit data channel only is
> just...weird.

+100 to both.

> But, we are now stuck with maintaining that API for backward
> compatibility reasons.

That's a weird definition of backward compatibility, though. The API
breaks backward compatibility to Python 2. Any Python 3 user shouldn't
use the broken API anyway, IMHO.

> To fix it, I rewrote significant parts of the email package, which
> is the new API.

Which unfortunately isn't any help if one needs to stay compatible to
2.7.

> It also is...fraught with the danger of bugs...to talk about
> serializing an email message as a string, transforming it, and then
> trying to re-parse it as an email message.  If your transformations
> are simple, it will probably work, but anything at all complex runs
> the risk of breaking the message.

One of Python's mottos used to be:

   We are all consenting adults here.

But there are other uses for converting a message instance to a
unicode string. Display, printing, and grepping come to mind.

> And having non-ascii bodies counts as non-trivial.

For anybody living in a non-ascii country that statement sounds
**very strange**.

To start with, I have many friends with names that contain non-ascii
characters.

> You do have to conditionalize your 2/3 code to use the bytes parser
> and generator if you are dealing with 8-bit messages. There's just no
> way around that.

I did that yesterday. There are problems with that though:

* Recognizing the problem for what it is.

  Trying to run Python 2.7 code that *should* run under 3.5 but breaks
  with weird errors wastes a lot of time.

  Multiply with the number of Python programmers that want to migrate
  and you get a problem.

  If `message_from_string` and `as_string` just weren't there in 3.x it
  would be much less of a problem (clear documentation would also help
  but not as much).

* Lots of ugly workarounds for the same problem.

  Most of them (mine certainly included) are done quick and ad-hoc and
  probably break in many ways.

  The question then arises: why should one use the email package at
  all? But of course that way lies madness.

Just more roadblocks for the move to Python 3.
msg254131 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-05 17:35
I agree that the situation is not the best, but it is the one we have.  I can't delete those methods now; they've existed in Python3 for too long, and initially were the only thing that worked (albeit only with ASCII-only strings).

If you can suggest ways of improving the string support without breaking existing python3 code that may be using it (most likely wrongly, but working for them), then I will happily review them.

As for "that sounds strange" about non-ascii bodies being non-trivial, remember that the context is the byte-string serialization protocol defined in RFC 5322.  This is the *evolution* of a protocol that started out ascii only, learned something about 8-bit data, then learned something about using bytes for handling other languages.  It is an evolutionary mess that has lots of pitfalls.  You can't simply serialize a message to unicode, preserving the RFC 5322/MIME markup, and have a valid email, unless you make it a 7-bit clean (ascii only) representation.  And that is what the email package does.  So, conversely, email can only *parse* (as a string) a 7-bit, ASCII only, representation.
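The difference between the two serializations can be sketched as follows, assuming Python 3.4 or later, the default compat32 policy, and a hypothetical 8bit message. In the CPython versions I am aware of, `as_string` re-encodes the body (with base64 for utf-8) so that the result stays 7-bit clean, while `as_bytes` leaves the 8-bit body alone.

```python
import email

# Hypothetical message with an 8bit UTF-8 body, parsed from bytes.
raw = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "Danke und mit freundlichen Grüssen,\n"
).encode("utf-8")

msg = email.message_from_bytes(raw)

as_text = msg.as_string()   # str serialization, kept ASCII-only
as_wire = msg.as_bytes()    # bytes serialization, 8-bit body preserved

print(as_text)
```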

To do what you appear to want, to be able to represent non-ascii as the equivalent unicode *cannot work*, because email messages may contain binary data which *cannot* be represented in printable unicode.

So, it is *unfortunate* that a non-ascii body is non-trivial in email, but there's no getting around the fact that it is.  The new API in python3 aims to make it as simple as possible, but of course that doesn't help python2 users.  But, making unicode easier is one big reason python3 exists (the biggest one, in practice).
msg254179 - Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-11-06 09:59
> If you can suggest ways of improving the string support without
> breaking existing python3 code that may be using it (most likely
> wrongly, but working for them), then I will happily review them.

At the moment, I'm mainly interested in having code that runs
correctly in both python2.7 and python3.

Having the same method behave totally differently in the two versions
is what triggered this bug report.

Adding new methods won't help with 2.7.

> To do what you appear to want, to be able to represent non-ascii as
> the equivalent unicode *cannot work*, because email messages may
> contain binary data which *cannot* be represented in printable
> unicode.

I have no problem whatsoever if, and would actually expect that,
binary message parts are encoded as necessary for RFC compliance. My
beef is with message parts that are text and are naturally represented
as unicode not as charset- and transfer-encoded 7-bit strings!

I also don't see how such a representation would break existing
python3 code but that might just be another example of famous last
words.

> But, making unicode easier is one big reason python3 exists (the
> biggest one, in practice).

From what I have seen up to now, that has failed (spectacularly, in my
opinion, if you consider things like unpickling python2-created
pickles with binary strings, e.g., datetime instances).

Using unicode in python2 worked well enough although there was the
problem that one couldn't specify which strings were supposed to be
binary. Exactly those strings are a big problem for code that wants to
run in both python2 and python3.

python3 solves the problem of binary strings, though badly because
of the various missing string functions. But there seem to be bugs all
over the standard library and in third party modules.

That library APIs still haven't settled down yet in python3 is even
worse!

Maybe python3 would work well if one threw away all existing code and
started with completely new code but I don't think that was the
intention.
msg254189 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-06 13:20
Python3 is easier to do unicode in for programs that start with a clear bytes/string split.  Yes, porting from python2 has bumps arising from the places where bytes and string are blurred.  Yes, if we could redo python3 knowing what we know now, we could improve matters.  But IMO we did a pretty good job given that we *didn't* know what we know now.

This is not the forum to discuss such matters further :)
msg254262 - Author: Christian Tanzer (tanzer@swing.co.at) Date: 2015-11-07 08:24
Terry J. Reedy wrote at Fri, 06 Nov 2015 22:49:57 +0000:

> email parsing docs: clarify that only ASCII strings are supported

If that is the decision, `message_from_string` should raise an
exception if it gets a non-ASCII argument!
msg254315 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-08 00:56
Except that that might break code that is currently working, so I can't do that, even though I'd like to.
msg254322 - Author: John Mark Vandenberg (jayvdb) * Date: 2015-11-08 05:20
Could it issue a UnicodeWarning?
msg254325 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-08 05:55
Issuing a warning is an interesting idea.  Basically, deprecate using a non-ASCII string with message_from_string etc formally by issuing a deprecation warning as well as the doc note.
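What such a warning could look like is sketched below as a wrapper around the existing function. The name `message_from_string_checked` and the warning text are hypothetical; nothing like this exists in the email package, and this is not a proposed patch.

```python
import email
import warnings


def message_from_string_checked(text, **kwargs):
    """Hypothetical wrapper: warn before parsing a non-ASCII string."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        warnings.warn(
            "parsing a non-ASCII string as an email message is not "
            "supported; encode it and use message_from_bytes instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # Parsing itself is unchanged, so existing callers keep working.
    return email.message_from_string(text, **kwargs)


# Usage: only the second call triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    message_from_string_checked("Subject: hi\n\nplain ASCII body\n")
    message_from_string_checked("Subject: hi\n\nGrüssen\n")

categories = [w.category for w in caught]
print(categories)
```

Because the wrapper only warns and then delegates, code that currently works keeps working, which is the backward compatibility constraint raised earlier in this thread.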
msg331159 - Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2018-12-05 20:36
I don't think this ticket should be implemented as described.

Consider the use-case in importlib_metadata, which loads metadata from a package, metadata known to be of a specified encoding. It already knows the encoding and has decoded the full message to text and now wants to parse it. It seems very much in the remit of something like email.parser to parse already-decoded content.

Yes, the RFCs describe how to decode bytes content, but that shouldn't preclude the e-mail module from supporting parsing from Unicode text.

And in fact, it does seem that the library is able to parse non-ascii Unicode text, especially on Python 3. Consider 'parse-text.py', attached. It illustrates that the parser currently mostly meets my expectation - on Python 2.7 and 3.7, e-mail messages are parsed from unicode text without any indication of an encoding, and returning unicode text on both Python 2 and Python 3.

Python 2 is deficient in that message_from_string will get a UnicodeEncodeError constructing a bytes-oriented StringIO from the input, which is easily worked around by using the text-oriented io.StringIO.

Still, I would argue the current behavior is desirable and shouldn't be deprecated.
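The attached `parse-text.py` is not reproduced here, so the following is only a sketch of this kind of use on Python 3, with hypothetical metadata text: already-decoded text with non-ASCII characters and no MIME headers is parsed, and header values and body come back as text.

```python
import email

# Hypothetical package metadata, already decoded to text; it uses the
# header/body layout of an email but no MIME headers.
metadata = (
    "Name: example\n"
    "Author: José Example\n"
    "\n"
    "Beschreibung mit Umlauten: Grüsse\n"
)

msg = email.message_from_string(metadata)

author = msg["Author"]           # header value as text
description = msg.get_payload()  # body as text, no decode=True

print(author)
print(description)
```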
msg331183 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-12-05 22:21
The problem comes from thinking you can parse an arbitrary email message if it is in unicode form.  *YOU CANNOT DO THAT* in the general case (ie: non-ascii attachments).

That said, the new email package API is designed to facilitate "off label" uses.  I would have no problem with the definition of a policy object[*] that was basically "use this to parse messages in unicode form as long as they don't use MIME".  As soon as you start parsing MIME headers, the input had better be binary or pure ascii, or the headers *won't make sense*.  You break the MIME API contract if you use MIME with a non-ascii unicode string.

[*] that policy might be a clone of one of the existing policies and not actually do anything to prevent the input having mime headers...ideally it would, but I just don't want to say it is OK to use the standard email policies to do this and expect it to continue to work in the future.  It probably will, but we should not document it that way! :)
History
Date User Action Args
2018-12-05 22:21:01 r.david.murray set messages: + msg331183
2018-12-05 20:36:43 jason.coombs set files: + parse-text.py
    nosy: + barry, jason.coombs
    messages: + msg331159
2015-11-08 05:55:37 r.david.murray set messages: + msg254325
2015-11-08 05:20:31 jayvdb set nosy: + jayvdb
    messages: + msg254322
2015-11-08 00:56:08 r.david.murray set messages: + msg254315
2015-11-07 08:24:40 tanzer@swing.co.at set messages: + msg254262
2015-11-06 22:49:57 terry.reedy set title: email parsing docs need to be clear that only ASCII strings are supported -> email parsing docs: clarify that only ASCII strings are supported
2015-11-06 13:20:57 r.david.murray set messages: + msg254189
2015-11-06 09:59:59 tanzer@swing.co.at set messages: + msg254179
2015-11-05 17:35:37 r.david.murray set messages: + msg254131
2015-11-05 09:58:19 tanzer@swing.co.at set messages: + msg254095
2015-11-04 18:29:00 r.david.murray set messages: + msg254067
2015-11-04 17:41:28 tanzer@swing.co.at set messages: + msg254066
2015-11-04 15:36:27 r.david.murray set status: closed -> open
    versions: + Python 3.4, Python 3.6
    title: email.message.get_payload returns wrong encoding -> email parsing docs need to be clear that only ASCII strings are supported
    messages: + msg254058
    resolution: not a bug ->
    stage: resolved -> needs patch
2015-11-04 09:04:29 tanzer@swing.co.at set messages: + msg254041
2015-11-03 19:59:53 r.david.murray set status: open -> closed
    nosy: + r.david.murray
    messages: + msg254014
    resolution: not a bug
    stage: resolved
2015-11-03 14:43:30 tanzer@swing.co.at create