Author tanzer@swing.co.at
Recipients r.david.murray, tanzer@swing.co.at
Date 2015-11-04.17:41:27
Message-id <E1Zu23p-0004cT-Q2@swing.co.at>
In-reply-to Your message of "Wed, 04 Nov 2015 15:36:27 +0000" <1446651387.54.0.809862214631.issue25545@psf.upfronthosting.co.za>
Content
R. David Murray wrote at Wed, 04 Nov 2015 15:36:27 +0000:

> There is no problem with supporting both 2.7 and python3 with the same
> email API as long as your input strings are ASCII only, which is what
> is required by the email RFCs (as I said, they do not support
> unicode...even the new one only supports utf8 (a unicode encoding) not
> unicode itself).

You are talking about byte strings. And of course the email RFCs only
talk about byte strings.

But the email package offers the use of unicode strings for various
functions, including `email.message_from_string`,
`email.Message.as_string`, and `email.Message.__str__`. These
functions could be useful (and were useful in Python 2) but aren't in
Python 3.

Assume I load an email satisfying all relevant RFCs from a file. Say
that email contains three MIMEText parts with
content-transfer-encoding "8bit", all with different
encodings:

* I don't see any use for `as_string` to obfuscate that by
  re-encoding each of the three to content-transfer-encoding "base64",
  which is completely unreadable when it could be converted painlessly
  to a real unicode string.

  One of my usage scenarios is something of the form::

    >>> print(msg)

  Of course, in this case I had better use `utf-8` as my output
  encoding; otherwise the print might fail.

  If I wanted to output an RFC-compliant byte string, I should have
  used `as_bytes`, not `as_string`. But that would be a different
  usage scenario.

* The same argument applies in reverse to `message_from_string`. If
  one wants RFC compliance one should use `message_from_bytes`.

  But if one builds up a unicode string for an email in Python, it
  should be possible to convert that to an `email.Message` instance via
  `message_from_string`.
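
To make the two points above concrete, here is a minimal sketch for Python 3. The sample message, the addresses, and the variable names are made up for illustration, and it uses a single 8bit text part rather than three. On the CPython 3 versions this was checked against, `as_string` re-encodes the 8bit body into an ASCII-safe transfer encoding, while `as_bytes` keeps the original bytes:

```python
from email import message_from_bytes

# Hypothetical sample: one text part, charset utf-8, sent as 8bit.
BODY = "Grüße aus Wien"
RAW = (
    b"From: a@example.com\n"
    b"To: b@example.com\n"
    b"Subject: demo\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="utf-8"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + BODY.encode("utf-8") + b"\n"

msg = message_from_bytes(RAW)

as_wire = msg.as_bytes()    # bytes; the 8bit body is kept verbatim
as_text = msg.as_string()   # str; the body is re-encoded to stay ASCII-only

print(BODY.encode("utf-8") in as_wire)   # is the raw body still there?
print(BODY in as_text)                   # is the readable body there?
print(as_text)                           # inspect the re-encoded form
```

Printing `as_text` shows which transfer encoding the generator chose for the part; the readable body is no longer visible in it.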

I have several use cases where I want to convert an `email.Message`
to a unicode string without any embedded content-transfer-encodings
like "base64", do some transformations on that string and then
convert that back into an `email.Message` instance.
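
With the existing API, this workflow can be approximated part by part rather than on one big string. The following is only a sketch of a workaround, not the string-level behaviour asked for here: `transform_text_parts` is a hypothetical helper, and the sample message is made up. It decodes each text part with the part's declared charset, applies a transformation, and stores the result back:

```python
from email import message_from_bytes


def transform_text_parts(msg, func):
    """Apply func (str -> str) to every text part of msg, in place.

    Hypothetical helper, not part of the email package.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and non-text parts
        charset = part.get_content_charset() or "ascii"
        # decode=True undoes the transfer encoding and returns bytes.
        text = part.get_payload(decode=True).decode(charset, "replace")
        # Drop the old transfer encoding so set_payload can pick a new one.
        del part["Content-Transfer-Encoding"]
        part.set_payload(func(text), charset)
    return msg


RAW = (
    b"Subject: demo\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + "Caf\u00e9 is open\n".encode("iso-8859-1")

msg = transform_text_parts(
    message_from_bytes(RAW), lambda text: text.replace("open", "closed")
)

# Read the transformed text back as a unicode string.
result = msg.get_payload(decode=True).decode(msg.get_content_charset())
```

The drawback is the one described above: the transformation only ever sees one part at a time, never the message as a single readable string.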

> I have an extensive doc rewrite in process, but I'm not sure when it
> will land.  I thought I had already added the note about ASCII-only to
> the parser docs, but I see that I did not.  I'll reopen this issue to
> remind myself to do that, since the doc rewrite will only apply to 3.6
> (when the new API will no longer be provisional).

I don't see any point in the semantics of the string functions as they
are currently implemented. After all, one can easily do things like
`message_from_bytes(text.encode("latin-1"))` or
`msg.as_bytes().decode("latin-1")` if one really wants to convert an
RFC-compatible byte string to/from unicode strings as-is. But this
as-is conversion normally isn't very useful because it isn't

* human-readable

* well suited to search and replace operations or any other text
  transformations
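
A minimal sketch of that as-is conversion, assuming a made-up sample message with a utf-8 body sent as 8bit. The names `as_is`, `back`, `lossless`, and `readable` are illustrative only:

```python
from email import message_from_bytes

BODY = "Grüße"
RAW = (
    b"Subject: demo\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="utf-8"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + BODY.encode("utf-8") + b"\n"

msg = message_from_bytes(RAW)

# bytes -> str, as-is: latin-1 maps every byte 0..255 to one code point,
# so this cannot fail and loses nothing.
as_is = msg.as_bytes().decode("latin-1")

# str -> bytes -> Message, the reverse direction.
back = message_from_bytes(as_is.encode("latin-1"))

lossless = as_is.encode("latin-1") == msg.as_bytes()
readable = BODY in as_is   # the utf-8 body shows up as mojibake instead
```

The round trip preserves every byte, but the string in between is a byte dump in disguise: a search for the body text in `as_is` finds nothing.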

So documenting the current behavior would improve the situation slightly,
but it's more like putting lipstick on a pig.
History
Date                 User                Action  Args
2015-11-04 17:41:28  tanzer@swing.co.at  set     recipients: + tanzer@swing.co.at, r.david.murray
2015-11-04 17:41:28  tanzer@swing.co.at  link    issue25545 messages
2015-11-04 17:41:27  tanzer@swing.co.at  create