classification
Title: email.parser: impossible to read messages encoded in a different encoding
Type: behavior Stage: resolved
Components: email Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: anadelonbrin, barry, dato, haypo, meatballhat, pitrou, r.david.murray, tercero12, yu.zhao@getcwd.com
Priority: high Keywords: patch

Created on 2008-12-14 16:55 by dato, last changed 2014-10-03 01:07 by yu.zhao@getcwd.com. This issue is now closed.

Files
File name                   Uploaded                               Description
email_parse_bytes.diff      r.david.murray, 2010-09-21 22:38
email_parse_bytes2.diff     r.david.murray, 2010-10-01 02:36
email_parse_bytes3.diff     r.david.murray, 2010-10-02 02:24
email_parse_bytes4.diff     r.david.murray, 2010-10-02 18:49       completed patch set
email_parse_bytes5.diff     r.david.murray, 2010-10-02 21:34       svn diff against current py3k trunk
email_parse_bytes7.diff     r.david.murray, 2010-10-08 03:05
email_parse_bytes8.diff     r.david.murray, 2010-10-08 12:15
email_parse_bytes9.diff     r.david.murray, 2010-10-08 14:02
BytesParser_newline.patch   yu.zhao@getcwd.com, 2014-10-03 01:07
Messages (25)
msg77807 - (view) Author: Adeodato Simó (dato) Date: 2008-12-14 16:55
Currently, email.parser/feedparser can only parse messages that come 
as a string, or from a file opened in text mode.

Email messages, however, can contain 8bit characters in any encoding 
other than the local one (yet still be valid e-mails, of course), so I 
think a method is needed to have the parser be able to receive bytes. 
At the moment, and as far as I can see, it's not possible to parse 
some perfectly valid messages with python 3.0.

I don't think it's appropriate to ask that files be opened with the 
proper encoding, and then for the parser to read them. First, it is 
not possible to know what encoding that would be without parsing the 
message. And second, a message could contain parts in different 
encoding, and many mailboxes with random messages most certainly do.

Also, message objects will need a way to return a bytes representation, 
for the reasons explained above, and particularly if one wants to 
write back the message without modifying it.
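
To make the mixed-encoding problem concrete, here is a minimal sketch. The sample bytes and the helper name try_decode are made up for illustration; they are not taken from a real mailbox or from the email package.

```python
# Hypothetical body with two parts: one in ISO-8859-1, one in UTF-8.
latin1_part = "café".encode("latin-1")   # b'caf\xe9'
utf8_part = "naïve".encode("utf-8")      # b'na\xc3\xafve'
raw = latin1_part + b"\n" + utf8_part


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# No single codec gets both parts right.
utf8_result = try_decode(raw, "utf-8")      # rejected: 0xE9 is not valid UTF-8 here
latin1_result = try_decode(raw, "latin-1")  # accepted, but the UTF-8 part is garbled
```

Opening the file in text mode with any one encoding therefore either raises or silently corrupts one of the parts.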
msg89508 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-06-18 16:44
Is there any use case for feedparser accepting strings as input that
isn't a design error waiting to bite the programmer?
msg89509 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2009-06-18 17:48
dato: We've started some branches that try to address this, by exposing
both a read-a-buncha-bytes interface and a read-a-string interface.

rdm: As it turns out, yes.  There are use cases for reading a string
containing only ascii bytes.

In general, this is part of the big tricky problem in fixing the email
package for Python 3.x.  We had great debates about this at Pycon with
no resolution, and I think everyone has just been too busy to engage on
this since then. :(

email-sig is the best place to try to rally some effort.
msg91737 - (view) Author: Alex Quinn (Alex Quinn) Date: 2009-08-19 19:56
This bug also prevents the cgi module from handling POST data with 
multipart/form-data.  Consequently, 3.x cannot be readily used to write 
web apps for uploading files.  See #4953:
   http://bugs.python.org/issue4953
msg95293 - (view) Author: Timothy Farrell (tercero12) Date: 2009-11-15 14:19
Just an update for people interested:

The email team has a goal of fixing the email module in time for the 3.2
release.  There is the possibility of having to change some interfaces.
See this document: http://wiki.python.org/moin/Email%20SIG/DesignThoughts
msg117118 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-21 22:38
OK, I'm not entirely sure I want to post this, but....

Antoine and I were having a conversation about nntplib and email and I noted that unicode as an email transmission channel acts as if it required 7bit clean data.  That is, that there's no way to use unicode as an 8bit data transmission channel.  Antoine pointed out that there is PEP 383, and that he is using that in his nntplib update to tunnel 8bit data (if there is any) from and back to the nntp server.  I said I couldn't do that with email because I not only needed to transmit the data, I also needed to *parse* it.

Antoine pointed out that you can in fact parse a header even if it has surrogateescape code points in it.

So I started thinking about that.  In point of fact, from the point of view of an email parser, non-ASCII bytes are pretty much opaque.  They don't affect the semantics of the parsing.  Either they are invalid data (in headers), or they are opaque content data (8bit Content-Transfer-Encoding).

So...I came up with a horrible little hack, which is attached here as a patch.  This is horrible because it is a perversion of the Python3 desire to make a clean separation between bytes and strings.  The only thing it really has to recommend it is that it works: it allows email5 (the version of email currently in Python3) to read wire-format messages and parse them into valid message structures.

The patch is a proof of concept and is far from complete.  It handles only message bodies (but those are the most important) and has no doc updates and only one test.  If this approach is deemed worth considering, I will flesh out the tests and make sure the corner cases are handled correctly, and write docs with lots of notes about why this is perverse and email6 will make it all better :)

I feel bad about posting this both because it is an ugly hack and because it will likely slow down email6 development (because it will make email5 mostly work).  But making email5 mostly work in 3.2 seems like a case where practicality beats purity.

The essence of the hack is as follows: Given binary data, we decode it to a string as ASCII using the surrogateescape error handler.  Then, when a message body is retrieved, we check to see if there are any surrogates in it, and if there are we encode it back to bytes as ASCII using surrogateescape, thereby recovering the original bytes.  For "Content-Transfer-Encoding: 8bit" parts we can then try to decode those bytes using the declared charset, or ASCII with the replace error handler if the charset isn't known.  But in any case the original binary data is accessible by using 'decode=True' in the call to get_payload.  (NB for those not familiar with the API: decode=True refers to decoding the Content-Transfer-Encoding, *not* decoding to unicode...which means after CTE decoding you end up with a byte string).
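
The round trip at the heart of this can be sketched with nothing but the codec machinery from PEP 383. The sample bytes are made up, and this shows only the mechanism, not the patch's actual code:

```python
# Made-up wire data with a single 8-bit byte (0xE9) in the body.
raw = b"Subject: test\n\ncaf\xe9\n"

# bytes -> str: ASCII bytes decode normally; each non-ASCII byte 0xNN
# becomes the lone surrogate code point U+DCNN.
text = raw.decode("ascii", "surrogateescape")

# str -> bytes: the surrogates are turned back into the original bytes.
recovered = text.encode("ascii", "surrogateescape")

# With the raw bytes back in hand, a declared charset can be applied.
body = recovered.split(b"\n\n", 1)[1].decode("latin-1")
```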

For headers, which are not supposed to have 8bit data in them, the best we can do is re-decode them with ASCII/replace, but at least it will be possible to parse the messages.  (The current patch doesn't do this.)

Another thing missing from the current patch is the generator side.  But since the binary data for the message content is now available, it should be possible to have a generator that outputs binary.

Note that in this patch I've introduced new functions/methods for getting binary string data in, but for file input one needs to open the file as text using ASCII encoding and the surrogateescape error handler.

I've only done minimal testing on this (obviously), and so I may find a showstopper somewhere along the way, but so far it seems to work, and logically it seems like it should work.

I don't know if that makes me happy or sad :)
msg117119 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-21 23:18
A couple of comments:
- what is `str(self.get_param('charset', 'ascii'))` supposed to achieve? does get_param() return a bytes object?
- instead of ascii+surrogateescape, you could simply use latin1
msg117120 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-21 23:26
The 'str' around get_param shouldn't be there, that was left over from an earlier version of the patch.

I use surrogateescape rather than latin1 because using surrogateescape with ascii encoding gives me a reliable way to know whether or not the original source was bytes containing non-ascii chars.  (If it was bytes containing only ascii chars, then it doesn't matter what the original source was.)
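
A small sketch of that difference. The helper name came_from_8bit_bytes and the regex are illustrative assumptions, not what the patch uses:

```python
import re

# surrogateescape maps bytes 0x80..0xFF to lone surrogates U+DC80..U+DCFF,
# which never occur in well-formed text.
_ESCAPED = re.compile("[\udc80-\udcff]")


def came_from_8bit_bytes(s):
    """True if the string carries surrogate-escaped non-ASCII bytes."""
    return _ESCAPED.search(s) is not None


data = b"caf\xe9"
via_surrogateescape = data.decode("ascii", "surrogateescape")
via_latin1 = data.decode("latin-1")  # indistinguishable from genuine text
```

With latin1 the result is an ordinary string, so there is nothing left to tell the parser that the input was undecoded bytes.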
msg117774 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-01 02:36
New version of the patch that adds many more tests, and handles non-ASCII bytes in header values by changing them to '?'s when the header value is retrieved as a string.  I think I'm half done.  Still to do: generate_bytes, and the doc updates.

By the way, another important reason to use surrogateescape rather than latin1 is that if I miss something and the byte-containing-strings escape, it will be obvious that that is what happened.  Otherwise we're back in Python2 bytes/string conflation land.
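
To illustrate what "obvious" means here, a minimal sketch with made-up data. The bytes are really UTF-8, and can_encode_utf8 is a hypothetical helper:

```python
data = "é".encode("utf-8")  # b'\xc3\xa9', charset unknown to the parser

leaked = data.decode("ascii", "surrogateescape")  # two lone surrogates
conflated = data.decode("latin-1")                # looks like normal text


def can_encode_utf8(s):
    """Strict UTF-8 encoding refuses lone surrogates."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The surrogate-escaped string fails loudly if it leaks into normal text
# handling; the latin-1 string encodes silently to different bytes.
mojibake = conflated.encode("utf-8")
```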

I of course make no promises about performance.  And there is an issue there in that every header value access is now wrapped in an additional function call and a regex test, at a minimum, whether there are bytes present in the input or not :(
msg117856 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-02 02:24
New version of patch including a BytesGenerator.
msg117857 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-02 02:37
In case it isn't clear, the code patch is now complete, so anyone who wants to give it a review, please do.  I'll add the docs soon, but the basic idea is you can put bytes in by either using message_from_bytes or by using the 'ascii' codec and the 'surrogateescape' error handler on a file passed to message_from_file, and you can get bytes out by using BytesGenerator and passing it a file-like object that accepts bytes.  As a side benefit, Generator will correctly render (as unicode) the content of a section with a Content-Transfer-Encoding of '8bit'.
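
A minimal usage sketch of bytes in and bytes out, written against the API as it exists in released Python 3 (email.message_from_bytes and email.generator.BytesGenerator). The sample message is made up:

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# Made-up message with an 8bit latin-1 body.
raw = (
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=latin-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9\n"
)

# Bytes in: parse straight from the wire format.
msg = email.message_from_bytes(raw)

# decode=True undoes the Content-Transfer-Encoding and returns bytes.
body_bytes = msg.get_payload(decode=True)

# Bytes out: BytesGenerator writes to a file-like object that accepts bytes.
out = BytesIO()
BytesGenerator(out).flatten(msg)
wire = out.getvalue()
```

The non-ASCII body byte is expected to survive both the parse and the flatten.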
msg117893 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-02 18:49
Version 4 of patch, now including doc updates.

The patch set is now complete.
msg117897 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-02 21:16
Rietveld issue, with a small doc addition compared to patch 4:

http://codereview.appspot.com/2362041
msg117900 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-02 21:34
Upload svn patch, so that Martin's new rietveld support will (hopefully) create an automatic review link.
msg118160 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-08 03:05
Here is an updated patch incorporating the Rietveld feedback and feedback from python-dev about the API.  Now we have BytesParser instead of Parser with a parsebytes method, and a message_from_binary_file helper.  Generator also now converts bodies with an 8bit CTE into bodies with an appropriate 7bit coding.
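
A short sketch of the two entry points named above, as they exist in released Python 3. The sample message is made up, and BytesIO stands in for a file opened in binary mode:

```python
import email
from email.parser import BytesParser
from io import BytesIO

# Made-up message with an 8bit UTF-8 body.
raw = (
    b"Subject: hello\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xc3\xa9t\xc3\xa9\n"
)

# Parse from an in-memory bytes object.
msg_a = BytesParser().parsebytes(raw)

# Parse from a binary file object (BytesIO here, open(path, "rb") in practice).
msg_b = email.message_from_binary_file(BytesIO(raw))

payload_a = msg_a.get_payload(decode=True)
payload_b = msg_b.get_payload(decode=True)
```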

Things still to do: (1) Add the (documented in this patch) BytesFeedParser class. (2) Figure out how to encode unknown bytes using the 'unknown' MIME charset in headers instead of replacing them with '?'s.  (3) Once I land a revised patch for issue 6302, add a flag to DecodedGenerator to have it fully decode headers.

I'd like to land this patch before Alpha3 if possible, so I'm setting it to release blocker for Georg to decide whether or not that is possible.  I'll complete the work in subsequent patches after the alpha, but everything needed to test the patch in field conditions is already present.

Georg, feel free to knock down the priority right away if you don't think it is ready or don't want to take time to even decide if it is ready :)
msg118175 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-08 10:19
There are a couple of things I don't understand:

+* :class:`~email.generator.Generator` will convert message bodies that
+  have a :mailheader:`ContentTransferEncoding` of 8bit and a known charset to
+  instead have a :mailheader:`ContentTransferEncoding` of ``QuotedPrintable``.

Why so?

+* All operations are on unicode strings.  Text inputs must be strings,
+  text outputs are strings.

This is a rather strange statement given that you are adding bytes-consuming and bytes-producing functions.
msg118191 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-08 11:56
Generator converts 8bit bodies into 7bit bodies by applying an appropriate 7bit CTE.  The reason it does this is that the output of Generator will often be passed to some other Python library function (most often smtplib) that can only handle ASCII unicode input.  That is, Generator now produces a 7bit clean message that can be put on the wire by encoding it to ascii.  This means that RFC-compliant bytes input can be successfully transmitted onward using Generator and smtplib, whereas if Generator produced non-ASCII unicode it would not be possible to pass a message with an 8bit CTE on to smtplib.
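
A sketch of that behavior with the string Generator, using a made-up latin-1 message. Which 7bit CTE gets picked depends on the charset; quoted-printable for latin-1 is what I would expect here:

```python
import email
from email.generator import Generator
from io import StringIO

# Made-up message with an 8bit latin-1 body.
raw = (
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=latin-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9\n"
)

msg = email.message_from_bytes(raw)

# The str-producing Generator re-encodes the 8bit body with a 7bit CTE,
# so the resulting text can be put on the wire by encoding it to ASCII.
out = StringIO()
Generator(out).flatten(msg)
text = out.getvalue()
wire = text.encode("ascii")
```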

The statement about string input and output is a description of email *5.0*, the existing email package in 3.0 and 3.1, before my patch.  The differences between 4.0 and 5.0 were never previously added to the docs, so I had to add them in order to then describe the differences between 5.0 and 5.1.
msg118192 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-08 12:00
> Generator converts 8bit bodies into 7bit bodies by applying an
> appropriate 7bit CTE.  The reason it does this is that the output of
> Generator will often be passed to some other Python library function
> (most often smtplib) that can only handle ASCII unicode input.

What do you mean, ASCII *unicode* input? Any low-level network library
should accept bytes when arbitrary data is possible.

Enforcing 7-bit means things like binary attachments can grow larger for
no real reason. Also, raw message bodies become less readable (which
is obviously a very minor issue).

> The statement about string input and output is a description of email
> *5.0*, the existing email package in 3.0 and 3.1, before my patch.
> The differences between 4.0 and 5.0 were never previously added to the
> docs, so I had to add them in order to then describe the differences
> between 5.0 and 5.1.

Ah, my bad. Sorry.
msg118197 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-08 12:15
Even if smtplib accepted bytes (it currently does not), *Generator* is still producing unicode, and should produce valid unicode and still insofar as possible preserve the meaning of the original message.  This means unicode acts as if it is an SMTP server that does not support the 8bit capability, so we must convert to 7bit clean CTEs.

If smtplib later grows the ability to accept bytes, BytesGenerator can be used to feed it.  I've clarified that BytesGenerator does not do the 7bit transform, and made some other doc tweaks.
msg118198 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-08 12:18
> Even if smtplib accepted bytes (it currently does not),

That sounds like a critical failure.
msg118199 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-08 12:20
I can only fix one package at a time :)

And in case it isn't clear, the "Generator produces ASCII-only unicode" behavior, which is in many ways a rather strange API, is one of the chief motivations for email6.
msg118201 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-08 14:02
Here is the final pre-alpha patch.  This one includes the BytesFeedParser class and a test.
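
For reference, a minimal sketch of incremental parsing with email.feedparser.BytesFeedParser as it exists in released Python 3. The chunk boundaries are arbitrary and made up, and deliberately split lines in the middle:

```python
from email.feedparser import BytesFeedParser

# Made-up message delivered in arbitrary chunks, e.g. from a socket.
chunks = [
    b"Subject: chun",
    b"ked\nContent-Transfer-Encoding: 8bit\n",
    b"\ncaf",
    b"\xe9\n",
]

parser = BytesFeedParser()
for chunk in chunks:
    parser.feed(chunk)  # accepts bytes; partial lines are buffered

msg = parser.close()    # returns the root message object
payload = msg.get_payload(decode=True)
```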

Unless there are objections I'd like to commit this.  Believing the code needs a more thorough review would be a valid objection :)
msg118204 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-08 16:01
After RM approval on irc, committed in r85322, with some additional doc fixes but no code changes relative to the last patch posted here.

I'm leaving this open because I still want to try to improve the handling of non-ascii bytes in headers when decoding them to unicode.
msg123843 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-12 18:03
I've opened issue 10686 to address improving the RFC conformance by using unknown-8bit encoded words for 8bit bytes in headers.
msg228289 - (view) Author: Yu Zhao (yu.zhao@getcwd.com) Date: 2014-10-03 01:07
BytesParser.parse uses TextIOWrapper which by default translates universal newlines to '\n'. This breaks binary payload.

Fix the problem by disabling the translation.
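
The translation can be reproduced with io.TextIOWrapper alone. In this sketch, passing newline="" is one way to turn the translation off; whether the attached patch does exactly this is an assumption, and read_all is a hypothetical helper:

```python
from io import BytesIO, TextIOWrapper

# Made-up binary payload containing all three line-ending styles.
raw = b"line1\r\nline2\rline3\n"


def read_all(data, **kwargs):
    """Read bytes through a TextIOWrapper and convert back to bytes."""
    wrapper = TextIOWrapper(
        BytesIO(data), encoding="ascii", errors="surrogateescape", **kwargs
    )
    return wrapper.read().encode("ascii", "surrogateescape")


translated = read_all(raw)                # default: universal newlines
untranslated = read_all(raw, newline="")  # line endings left untouched
```

With the default setting, "\r\n" and "\r" both come back as "\n", so the bytes no longer match the input.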
History
Date User Action Args
2014-10-03 01:07:07  yu.zhao@getcwd.com  set  files: + BytesParser_newline.patch; nosy: + yu.zhao@getcwd.com; messages: + msg228289; components: + email, - Library (Lib)
2010-12-27 17:04:58  r.david.murray  unlink  issue1685453 dependencies
2010-12-12 18:03:17  r.david.murray  set  status: open -> closed; resolution: fixed; messages: + msg123843; stage: patch review -> resolved
2010-10-08 16:01:50  r.david.murray  set  priority: release blocker -> high; messages: + msg118204
2010-10-08 14:02:42  r.david.murray  set  files: + email_parse_bytes9.diff; messages: + msg118201
2010-10-08 12:20:35  r.david.murray  set  messages: + msg118199
2010-10-08 12:18:15  pitrou  set  messages: + msg118198
2010-10-08 12:16:00  r.david.murray  set  files: + email_parse_bytes8.diff; messages: + msg118197
2010-10-08 12:00:20  pitrou  set  messages: + msg118192
2010-10-08 11:56:34  r.david.murray  set  messages: + msg118191
2010-10-08 10:19:36  pitrou  set  messages: + msg118175
2010-10-08 03:07:00  r.david.murray  set  priority: high -> release blocker; files: + email_parse_bytes7.diff; messages: + msg118160
2010-10-05 03:32:44  anadelonbrin  set  nosy: + anadelonbrin
2010-10-03 04:18:11  r.david.murray  link  issue6302 dependencies
2010-10-02 21:35:47  r.david.murray  set  files: + email_parse_bytes5.diff; messages: + msg117900
2010-10-02 21:16:58  r.david.murray  set  messages: + msg117897
2010-10-02 18:49:51  r.david.murray  set  files: + email_parse_bytes4.diff; messages: + msg117893
2010-10-02 02:57:47  Alex Quinn  set  nosy: - Alex Quinn
2010-10-02 02:37:30  r.david.murray  set  messages: + msg117857; stage: test needed -> patch review
2010-10-02 02:25:03  r.david.murray  set  files: + email_parse_bytes3.diff; messages: + msg117856
2010-10-01 02:36:23  r.david.murray  set  files: + email_parse_bytes2.diff; messages: + msg117774
2010-09-24 00:07:09  meatballhat  set  nosy: + meatballhat
2010-09-21 23:50:22  haypo  set  nosy: + haypo
2010-09-21 23:26:33  r.david.murray  set  messages: + msg117120
2010-09-21 23:18:12  pitrou  set  nosy: + pitrou; messages: + msg117119
2010-09-21 22:38:07  r.david.murray  set  files: + email_parse_bytes.diff; keywords: + patch; messages: + msg117118; versions: - Python 3.1
2010-05-05 13:32:29  barry  set  assignee: barry -> r.david.murray
2009-11-15 14:19:41  tercero12  set  messages: + msg95293
2009-08-19 19:56:30  Alex Quinn  set  nosy: + Alex Quinn; messages: + msg91737
2009-08-18 18:59:09  tercero12  set  nosy: + tercero12
2009-06-18 17:48:30  barry  set  messages: + msg89509
2009-06-18 16:44:50  r.david.murray  set  priority: high; type: behavior; versions: + Python 3.1, Python 3.2, - Python 3.0; nosy: + r.david.murray; messages: + msg89508; stage: test needed
2009-03-30 22:56:23  ajaksu2  link  issue1685453 dependencies
2008-12-14 17:24:43  benjamin.peterson  set  assignee: barry; nosy: + barry
2008-12-14 16:55:45  dato  create