diff -r e0f8bed0435c Doc/library/email.generator.rst --- a/Doc/library/email.generator.rst Tue Sep 21 14:28:43 2010 +0200 +++ b/Doc/library/email.generator.rst Sat Oct 02 14:45:20 2010 -0400 @@ -22,6 +22,13 @@ result in changes to the :class:`~email.message.Message` object as defaults are filled in. +:class:`bytes` output can be generated using the :class:`BytesGenerator` class. +If the message object structure contains non-ASCII bytes encoded via +`surrogateescape`, this generator's :meth:`~BytesGenerator.flatten` method will +turn them back into the original bytes. Parsing a message decoded to ASCII +with the `surrogateescape` error handler and then flattening it with +:class:`BytesGenerator` should be idempotent for standards-compliant messages. + Here are the public methods of the :class:`Generator` class, imported from the :mod:`email.generator` module: @@ -81,6 +88,18 @@ of a formatted string representation of a message object. For more detail, see :mod:`email.message`. +.. class:: BytesGenerator(outfp, mangle_from_=True, maxheaderlen=78) + + Identical to :class:`Generator` except that *outfp* must be a file-like + object that will accept :class:`bytes` input to its `write` method. If the + message object structure contains non-ASCII bytes encoded via + `surrogateescape`, this generator's :meth:`~BytesGenerator.flatten` method + will turn them back into the original bytes. + + Note that even the :meth:`write` method API is identical: it expects + strings as input (it encodes them using the ASCII codec and the + 'surrogateescape' error handler). 
+ The :mod:`email.generator` module also provides a derived class, called :class:`DecodedGenerator` which is like the :class:`Generator` base class, except that non-\ :mimetype:`text` parts are substituted with a format string diff -r e0f8bed0435c Doc/library/email.message.rst --- a/Doc/library/email.message.rst Tue Sep 21 14:28:43 2010 +0200 +++ b/Doc/library/email.message.rst Sat Oct 02 14:45:20 2010 -0400 @@ -111,9 +111,17 @@ be decoded if this header's value is ``quoted-printable`` or ``base64``. If some other encoding is used, or :mailheader:`Content-Transfer-Encoding` header is missing, or if the payload has bogus base64 data, the payload is - returned as-is (undecoded). If the message is a multipart and the - *decode* flag is ``True``, then ``None`` is returned. The default for - *decode* is ``False``. + returned as-is (undecoded). In all cases the returned value is binary + data. If the message is a multipart and the *decode* flag is ``True``, + then ``None`` is returned. + + When *decode* is ``False`` (the default) the body is returned as a string + without decoding the :mailheader:`Content-Transfer-Encoding`. However, for + a :mailheader:`Content-Transfer-Encoding` of 8bit, an attempt is made to + decode the original bytes using the `charset` specified by the + :mailheader:`Content-Type` header, using the `replace` error handler. If + no `charset` is specified, or if the `charset` given is not recognized by + the email package, the body is decoded using the default ASCII charset. .. method:: set_payload(payload, charset=None) @@ -160,6 +168,10 @@ Note that in all cases, any envelope header present in the message is not included in the mapping interface. + In a model generated from bytes, any header values that (in contravention + of the RFCs) contain non-ASCII bytes will have those bytes transformed + into '?' characters when the values are retrieved through this interface. + .. 
method:: __len__() diff -r e0f8bed0435c Doc/library/email.parser.rst --- a/Doc/library/email.parser.rst Tue Sep 21 14:28:43 2010 +0200 +++ b/Doc/library/email.parser.rst Sat Oct 02 14:45:20 2010 -0400 @@ -34,6 +34,12 @@ :class:`~email.message.Message` class, so your custom parser can create message object trees any way it finds necessary. +The parser expects its input in the form of strings. To handle binary input +data, the data must be decoded as ASCII. If the input contains non-ASCII data, +use the `surrogateescape` error handler to do the decoding. The standard model +recognizes binary data escaped in this way and transforms it back to bytes for +processing and re-decoding as needed. + FeedParser API ^^^^^^^^^^^^^^ @@ -71,7 +77,8 @@ :class:`FeedParser` will stitch such partial lines together properly. The lines in the string can have any of the common three line endings, carriage return, newline, or carriage return and newline (they can even be - mixed). + mixed). The lines may contain binary data escaped using the `surrogateescape` + error handler; this data will be recognized as binary data by the model. .. method:: close() @@ -140,23 +147,30 @@ Since creating a message object structure from a string or a file object is such -a common task, two functions are provided as a convenience. They are available +a common task, three functions are provided as a convenience. They are available in the top-level :mod:`email` package namespace. .. currentmodule:: email -.. function:: message_from_string(s[, _class][, strict]) +.. function:: message_from_string(s, _class=email.message.Message, strict=None) Return a message object structure from a string. This is exactly equivalent to ``Parser().parsestr(s)``. Optional *_class* and *strict* are interpreted as with the :class:`Parser` class constructor. +.. function:: message_from_bytes(s, _class=email.message.Message, strict=None) -.. 
function:: message_from_file(fp[, _class][, strict]) + Return a message object structure from a byte string. This is a convenience + wrapper around :func:`message_from_string` that decodes the bytes using + the ASCII codec and the `surrogateescape` error handler. + +.. function:: message_from_file(fp, _class=email.message.Message, strict=None) Return a message object structure tree from an open :term:`file object`. This is exactly equivalent to ``Parser().parse(fp)``. Optional *_class* and *strict* are interpreted as with the :class:`Parser` class constructor. + To process a binary file, open it as a text file decoded with the ASCII + codec and the `surrogateescape` error handler. Here's an example of how you might use this at an interactive Python prompt:: diff -r e0f8bed0435c Doc/library/email.rst --- a/Doc/library/email.rst Tue Sep 21 14:28:43 2010 +0200 +++ b/Doc/library/email.rst Sat Oct 02 14:45:20 2010 -0400 @@ -6,7 +6,7 @@ email messages, including MIME documents. .. moduleauthor:: Barry A. Warsaw .. sectionauthor:: Barry A. Warsaw -.. Copyright (C) 2001-2007 Python Software Foundation +.. Copyright (C) 2001-2010 Python Software Foundation The :mod:`email` package is a library for managing email messages, including @@ -92,6 +92,38 @@ +---------------+------------------------------+-----------------------+ | :const:`4.0` | Python 2.5 | Python 2.3 to 2.5 | +---------------+------------------------------+-----------------------+ +| :const:`5.0` | Python 3.0 and Python 3.1 | Python 3.0 to 3.2 | ++---------------+------------------------------+-----------------------+ +| :const:`5.1` | Python 3.2 | Python 3.0 to 3.2 | ++---------------+------------------------------+-----------------------+ + +Here are the major differences between :mod:`email` version 5.1 and +version 5.0: + +* It is once again possible to parse messages containing non-ASCII bytes, + and to reproduce such messages if the data containing the non-ASCII + bytes is not modified. 
For models built by reading a file, the + file must be decoded using the ASCII codec and the `surrogateescape` + error handler. + +* Given bytes input to the model, :meth:`~email.message.Message.get_payload` will + by default return the data from a message body using a + :mailheader:`Content-Transfer-Encoding` of `8bit` decoded into strings using + the charset specified in the MIME headers. + +* New function :func:`message_from_bytes` accepts byte strings as input. + +* New class :class:`~email.generator.BytesGenerator` produces bytes + as output, preserving any unchanged non-ASCII data if it was + present in the input used to build the model. + +Here are the major differences between :mod:`email` version 5 and version 3: + +* All operations are on unicode strings. Text inputs must be strings, + text outputs are strings. Outputs are limited to the ASCII character + set and so can be encoded to ASCII for transmission. Inputs are also + limited to ASCII; this is an acknowledged limitation of email 5.0 and + means it can only be used to parse email that is 7bit clean. Here are the major differences between :mod:`email` version 4 and version 3: diff -r e0f8bed0435c Lib/email/__init__.py --- a/Lib/email/__init__.py Tue Sep 21 14:28:43 2010 +0200 +++ b/Lib/email/__init__.py Sat Oct 02 14:45:20 2010 -0400 @@ -4,7 +4,7 @@ """A package for parsing, handling, and generating email messages.""" -__version__ = '5.0.0' +__version__ = '5.1.0' __all__ = [ 'base64mime', @@ -36,6 +36,14 @@ from email.parser import Parser return Parser(*args, **kws).parsestr(s) +def message_from_bytes(s, *args, **kws): + """Parse a bytes string into a Message object model. + + Optional _class and strict are passed to the Parser constructor. + """ + from email.parser import Parser + return Parser(*args, **kws).parsebytes(s) + def message_from_file(fp, *args, **kws): """Read a file and parse its contents into a Message object model. 
diff -r e0f8bed0435c Lib/email/generator.py --- a/Lib/email/generator.py Tue Sep 21 14:28:43 2010 +0200 +++ b/Lib/email/generator.py Sat Oct 02 14:45:20 2010 -0400 @@ -12,8 +12,9 @@ import random import warnings -from io import StringIO +from io import StringIO, BytesIO from email.header import Header +from email.message import has_surrogates UNDERSCORE = '_' NL = '\n' @@ -72,7 +73,7 @@ ufrom = msg.get_unixfrom() if not ufrom: ufrom = 'From nobody ' + time.ctime(time.time()) - print(ufrom, file=self._fp) + self.write(ufrom + '\n') self._write(msg) def clone(self, fp): @@ -83,6 +84,21 @@ # Protected interface - undocumented ;/ # + # Note that we use 'self.write' when what we are writing is coming from + # the source, and self._fp.write when what we are writing is coming from a + # buffer (because the Bytes subclass has already had a chance to transform + # the data in its write method in that case). This is an entirely + # pragmatic split determined by experiment; we could be more general by + # always using write and having the Bytes subclass write method detect when + # it has already transformed the input; but, since this whole thing is a + # hack anyway this seems good enough. + + _NL = NL + _EMPTY = '' + + def _new_buffer(self): + return StringIO() + def _write(self, msg): # We can't write the headers yet because of the following scenario: # say a multipart message includes the boundary string somewhere in @@ -91,13 +107,13 @@ # parameter. # # The way we do this, so as to make the _handle_*() methods simpler, - # is to cache any subpart writes into a StringIO. The we write the - # headers and the StringIO contents. That way, subpart handlers can + # is to cache any subpart writes into a buffer. Then we write the + # headers and the buffer contents. That way, subpart handlers can # Do The Right Thing, and can still modify the Content-Type: header if # necessary. 
oldfp = self._fp try: - self._fp = sfp = StringIO() + self._fp = sfp = self._new_buffer() self._dispatch(msg) finally: self._fp = oldfp @@ -132,16 +148,16 @@ def _write_headers(self, msg): for h, v in msg.items(): - print('%s:' % h, end=' ', file=self._fp) + self.write('%s: ' % h) if isinstance(v, Header): - print(v.encode(maxlinelen=self._maxheaderlen), file=self._fp) + self.write(v.encode(maxlinelen=self._maxheaderlen)+'\n') else: # Header's got lots of smarts, so use it. header = Header(v, maxlinelen=self._maxheaderlen, header_name=h) - print(header.encode(), file=self._fp) + self.write(header.encode()+'\n') # A blank line always separates headers from body - print(file=self._fp) + self.write('\n') # # Handlers for writing types and subtypes @@ -155,7 +171,7 @@ raise TypeError('string payload expected: %s' % type(payload)) if self._mangle_from_: payload = fcre.sub('>From ', payload) - self._fp.write(payload) + self.write(payload) # Default body handler _writeBody = _handle_text @@ -170,21 +186,21 @@ subparts = [] elif isinstance(subparts, str): # e.g. a non-strict parse of a message with no starting boundary. - self._fp.write(subparts) + self.write(subparts) return elif not isinstance(subparts, list): # Scalar payload subparts = [subparts] for part in subparts: - s = StringIO() + s = self._new_buffer() g = self.clone(s) g.flatten(part, unixfrom=False) msgtexts.append(s.getvalue()) # Now make sure the boundary we've selected doesn't appear in any of # the message texts. - alltext = NL.join(msgtexts) + alltext = self._NL.join(msgtexts) # BAW: What about boundaries that are wrapped in double-quotes? - boundary = msg.get_boundary(failobj=_make_boundary(alltext)) + boundary = msg.get_boundary(failobj=self._make_boundary(alltext)) # If we had to calculate a new boundary because the body text # contained that string, set the new boundary. 
We don't do it # unconditionally because, while set_boundary() preserves order, it @@ -195,9 +211,9 @@ msg.set_boundary(boundary) # If there's a preamble, write it out, with a trailing CRLF if msg.preamble is not None: - print(msg.preamble, file=self._fp) + self.write(msg.preamble + '\n') # dash-boundary transport-padding CRLF - print('--' + boundary, file=self._fp) + self.write('--' + boundary + '\n') # body-part if msgtexts: self._fp.write(msgtexts.pop(0)) @@ -206,14 +222,14 @@ # --> CRLF body-part for body_part in msgtexts: # delimiter transport-padding CRLF - print('\n--' + boundary, file=self._fp) + self.write('\n--' + boundary + '\n') # body-part self._fp.write(body_part) # close-delimiter transport-padding - self._fp.write('\n--' + boundary + '--') + self.write('\n--' + boundary + '--') if msg.epilogue is not None: - print(file=self._fp) - self._fp.write(msg.epilogue) + self.write('\n') + self.write(msg.epilogue) def _handle_multipart_signed(self, msg): # The contents of signed parts has to stay unmodified in order to keep @@ -232,23 +248,23 @@ # block and the boundary. Sigh. blocks = [] for part in msg.get_payload(): - s = StringIO() + s = self._new_buffer() g = self.clone(s) g.flatten(part, unixfrom=False) text = s.getvalue() - lines = text.split('\n') + lines = text.split(self._NL) # Strip off the unnecessary trailing empty line - if lines and lines[-1] == '': - blocks.append(NL.join(lines[:-1])) + if lines and lines[-1] == self._EMPTY: + blocks.append(self._NL.join(lines[:-1])) else: blocks.append(text) # Now join all the blocks with an empty line. This has the lovely # effect of separating each block with an empty line, but not adding # an extra one after the last one. - self._fp.write(NL.join(blocks)) + self._fp.write(self._NL.join(blocks)) def _handle_message(self, msg): - s = StringIO() + s = self._new_buffer() g = self.clone(s) # The payload of a message/rfc822 part should be a multipart sequence # of length 1. 
The zeroth element of the list should be the Message @@ -265,12 +281,93 @@ payload = s.getvalue() self._fp.write(payload) + # This used to be a module level function; we use a classmethod for + # this and _compile_re so we can continue to provide the module level + # function for backward compatibility (it *is* internal, so we could + # drop that...) + @classmethod + def _make_boundary(cls, text=None): + # Craft a random boundary. If text is given, ensure that the chosen + # boundary doesn't appear in the text. + token = random.randrange(sys.maxsize) + boundary = ('=' * 15) + (_fmt % token) + '==' + if text is None: + return boundary + b = boundary + counter = 0 + while True: + cre = cls._compile_re('^--' + re.escape(b) + '(--)?$', re.MULTILINE) + if not cre.search(text): + break + b = boundary + '.' + str(counter) + counter += 1 + return b + + @classmethod + def _compile_re(cls, s, flags): + return re.compile(s, flags) + + + +class BytesGenerator(Generator): + """Generates a bytes version of a Message object tree. + + Functionally identical to the base Generator except that the output is + bytes and not string. When surrogates were used in the input to encode + bytes, these are decoded back to bytes for output. + + The outfp object must accept bytes in its write method. + """ + + _NL = NL.encode('ascii') + _EMPTY = b'' + + def write(self, s): + self._fp.write(s.encode('ascii', 'surrogateescape')) + + def _new_buffer(self): + return BytesIO() + + def _write_headers(self, msg): + # This is almost the same as the string version, except for handling + # strings with 8bit bytes. + for h, v in msg._headers: + self.write('%s: ' % h) + if isinstance(v, Header): + self.write(v.encode(maxlinelen=self._maxheaderlen)+'\n') + elif has_surrogates(v): + # If we have raw 8bit data in a byte string, we have no idea + # what the encoding is. There is no safe way to split this + # string. 
If it's ascii-subset, then we could do a normal + # ascii split, but if it's multibyte then we could break the + # string. There's no way to know so the least harm seems to + # be to not split the string and risk it being too long. + self.write(v+'\n') + else: + # Header's got lots of smarts and this string is safe... + header = Header(v, maxlinelen=self._maxheaderlen, + header_name=h) + self.write(header.encode()+'\n') + # A blank line always separates headers from body + self.write('\n') + + def _handle_text(self, msg): + # If the string has surrogates the original source was bytes, so + # just write it back out. + if has_surrogates(msg._payload): + self.write(msg._payload) + else: + super(BytesGenerator,self)._handle_text(msg) + + @classmethod + def _compile_re(cls, s, flags): + return re.compile(s.encode('ascii'), flags) _FMT = '[Non-text (%(type)s) part of message omitted, filename %(filename)s]' class DecodedGenerator(Generator): - """Generator a text representation of a message. + """Generates a text representation of a message. Like the Generator base class, except that non-text parts are substituted with a format string representing the part. @@ -325,23 +422,9 @@ -# Helper +# Helper used by _make_boundary _width = len(repr(sys.maxsize-1)) _fmt = '%%0%dd' % _width -def _make_boundary(text=None): - # Craft a random boundary. If text is given, ensure that the chosen - # boundary doesn't appear in the text. - token = random.randrange(sys.maxsize) - boundary = ('=' * 15) + (_fmt % token) + '==' - if text is None: - return boundary - b = boundary - counter = 0 - while True: - cre = re.compile('^--' + re.escape(b) + '(--)?$', re.MULTILINE) - if not cre.search(text): - break - b = boundary + '.' 
+ str(counter) - counter += 1 - return b +# Backward compatibility +_make_boundary = Generator._make_boundary diff -r e0f8bed0435c Lib/email/message.py --- a/Lib/email/message.py Tue Sep 21 14:28:43 2010 +0200 +++ b/Lib/email/message.py Sat Oct 02 14:45:20 2010 -0400 @@ -24,8 +24,26 @@ # existence of which force quoting of the parameter value. tspecials = re.compile(r'[ \(\)<>@,;:\\"/\[\]\?=]') +# How to figure out if we are processing strings that come from a byte +# source with undecodable characters. +has_surrogates = re.compile( + '([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)').search + # Helper functions +def _sanitize_surrogates(value): + # If the value contains surrogates, re-decode and replace the original + # non-ascii bytes with '?'s. Used to sanitize header values before letting + # them escape as strings. + if not isinstance(value, str): + # Header object + return value + if has_surrogates(value): + original_bytes = value.encode('ascii', 'surrogateescape') + return original_bytes.decode('ascii', 'replace').replace('�', '?') + else: + return value + def _splitparam(param): # Split header parameters. BAW: this may be too simple. It isn't # strictly RFC 2045 (section 5.1) compliant, but it catches most headers @@ -98,7 +116,7 @@ objects, otherwise it is a string. Message objects implement part of the `mapping' interface, which assumes - there is exactly one occurrance of the header per message. Some headers + there is exactly one occurrence of the header per message. Some headers do in fact appear multiple times (e.g. Received) and for those headers, you must use the explicit API to set or get all the headers. Not all of the mapping methods are implemented. @@ -184,44 +202,72 @@ If the message is a multipart and the decode flag is True, then None is returned. 
""" - if i is None: - payload = self._payload - elif not isinstance(self._payload, list): + # Here is the logic table for this code, based on the email5.0.0 code: + # i decode is_multipart result + # ------ ------ ------------ ------------------------------ + # None True True None + # i True True None + # None False True _payload (a list) + # i False True _payload element i (a Message) + # i False False error (not a list) + # i True False error (not a list) + # None False False _payload + # None True False _payload decoded (bytes) + # Note that Barry planned to factor out the 'decode' case, but that + # isn't so easy now that we handle the 8 bit data, which needs to be + # converted in both the decode and non-decode path. + if self.is_multipart(): + if decode: + return None + if i is None: + return self._payload + else: + return self._payload[i] + # For backward compatibility, Use isinstance and this error message + # instead of the more logical is_multipart test. + if i is not None and not isinstance(self._payload, list): raise TypeError('Expected list, got %s' % type(self._payload)) - else: - payload = self._payload[i] + payload = self._payload + cte = self.get('content-transfer-encoding', '').lower() + # payload can be bytes here, (I wonder if that is actually a bug?) + if isinstance(payload, str): + if has_surrogates(payload): + bpayload = payload.encode('ascii', 'surrogateescape') + if not decode: + try: + payload = bpayload.decode(str(self.get_param('charset', 'ascii')), 'replace') + except LookupError: + payload = bpayload.decode('ascii', 'replace') + elif decode: + try: + bpayload = payload.encode('ascii') + except UnicodeError: + # This won't happen for RFC compliant messages (messages + # containing only ASCII codepoints in the unicode input). + # If it does happen, turn the string into bytes in a way + # guaranteed not to fail. + bpayload = payload.encode('raw-unicode-escape') if not decode: return payload - # Decoded payloads always return bytes. 
XXX split this part out into - # a new method called .get_decoded_payload(). - if self.is_multipart(): - return None - cte = self.get('content-transfer-encoding', '').lower() if cte == 'quoted-printable': - if isinstance(payload, str): - payload = payload.encode('ascii') - return utils._qdecode(payload) + return utils._qdecode(bpayload) elif cte == 'base64': try: - if isinstance(payload, str): - payload = payload.encode('ascii') - return base64.b64decode(payload) + return base64.b64decode(bpayload) except binascii.Error: # Incorrect padding - pass + return bpayload elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'): - in_file = BytesIO(payload.encode('ascii')) + in_file = BytesIO(bpayload) out_file = BytesIO() try: uu.decode(in_file, out_file, quiet=True) return out_file.getvalue() except uu.Error: # Some decoding problem - pass - # Is there a better way to do this? We can't use the bytes - # constructor. + return bpayload if isinstance(payload, str): - return payload.encode('raw-unicode-escape') + return bpayload return payload def set_payload(self, payload, charset=None): @@ -290,7 +336,7 @@ Return None if the header is missing instead of raising an exception. Note that if the header appeared multiple times, exactly which - occurrance gets returned is undefined. Use get_all() to get all + occurrence gets returned is undefined. Use get_all() to get all the values matching a header field name. """ return self.get(name) @@ -322,9 +368,6 @@ for field, value in self._headers: yield field - def __len__(self): - return len(self._headers) - def keys(self): """Return a list of all the message's header field names. @@ -343,7 +386,7 @@ Any fields deleted and re-inserted are always appended to the header list. """ - return [v for k, v in self._headers] + return [_sanitize_surrogates(v) for k, v in self._headers] def items(self): """Get all the message's header fields and values. 
@@ -353,6 +396,7 @@ Any fields deleted and re-inserted are always appended to the header list. """ + return [(k, _sanitize_surrogates(v)) for k, v in self._headers] return self._headers[:] def get(self, name, failobj=None): @@ -364,7 +408,7 @@ name = name.lower() for k, v in self._headers: if k.lower() == name: - return v + return _sanitize_surrogates(v) return failobj # @@ -384,7 +428,7 @@ name = name.lower() for k, v in self._headers: if k.lower() == name: - values.append(v) + values.append(_sanitize_surrogates(v)) if not values: return failobj return values diff -r e0f8bed0435c Lib/email/parser.py --- a/Lib/email/parser.py Tue Sep 21 14:28:43 2010 +0200 +++ b/Lib/email/parser.py Sat Oct 02 14:45:20 2010 -0400 @@ -71,6 +71,17 @@ feedparser.feed(data) return feedparser.close() + def parsebytes(self, text, headersonly=False): + """Create a message structure from a byte string. + + Returns the root of the message structure. Optional headersonly is a + flag specifying whether to stop parsing after reading the headers or + not. The default is False, meaning it parses the entire contents of + the file. + """ + text = text.decode('ASCII', errors='surrogateescape') + return self.parsestr(text, headersonly) + def parsestr(self, text, headersonly=False): """Create a message structure from a string. 
diff -r e0f8bed0435c Lib/email/test/test_email.py --- a/Lib/email/test/test_email.py Tue Sep 21 14:28:43 2010 +0200 +++ b/Lib/email/test/test_email.py Sat Oct 02 14:45:20 2010 -0400 @@ -9,8 +9,9 @@ import difflib import unittest import warnings - -from io import StringIO +import textwrap + +from io import StringIO, BytesIO from itertools import chain import email @@ -2064,6 +2065,10 @@ msg, text = self._msgobj('msg_36.txt') self._idempotent(msg, text) + def test_message_signed_idempotent(self): + msg, text = self._msgobj('msg_45.txt') + self._idempotent(msg, text) + def test_content_type(self): eq = self.assertEquals unless = self.assertTrue @@ -2663,6 +2668,207 @@ self.assertTrue(msg.get_payload(0).get_payload().endswith('\r\n')) +class Test8BitBytesHandling(unittest.TestCase): + # In Python3 all input is string, but that doesn't work if the actual input + # uses an 8bit transfer encoding. To hack around that, in email 5.1 we + # decode byte streams using the surrogateescape error handler, and + # reconvert to binary at appropriate places if we detect surrogates. This + # doesn't allow us to transform headers with 8bit bytes (they get munged), + # but it does allow us to parse and preserve them, and to decode body + # parts that use an 8bit CTE. 
+ + bodytest_msg = textwrap.dedent("""\ + From: foo@bar.com + To: baz + Mime-Version: 1.0 + Content-Type: text/plain; charset={charset} + Content-Transfer-Encoding: {cte} + + {bodyline} + """) + + def test_known_8bit_CTE(self): + m = self.bodytest_msg.format(charset='utf-8', + cte='8bit', + bodyline='pöstal').encode('utf-8') + msg = email.message_from_bytes(m) + self.assertEqual(msg.get_payload(), "pöstal\n") + + def test_unknown_8bit_CTE(self): + m = self.bodytest_msg.format(charset='notavalidcharset', + cte='8bit', + bodyline='pöstal').encode('utf-8') + msg = email.message_from_bytes(m) + self.assertEqual(msg.get_payload(), "p��stal\n") + + def test_8bit_in_quopri_body(self): + # This is non-RFC compliant data...without 'decode' the library code + # decodes the body using the charset from the headers, and because the + # source byte really is utf-8 this works. This is likely to fail + # against real dirty data (i.e. produce mojibake), but the data is + # invalid anyway so it is as good a guess as any. But this means that + # this test just confirms the current behavior; that behavior is not + # necessarily the best possible behavior. With 'decode' it is + # returning the raw bytes, so that test should be of correct behavior, + # or at least produce the same result that email4 did. + m = self.bodytest_msg.format(charset='utf-8', + cte='quoted-printable', + bodyline='p=C3=B6stál').encode('utf-8') + msg = email.message_from_bytes(m) + self.assertEqual(msg.get_payload(), 'p=C3=B6stál\n') + self.assertEqual(msg.get_payload(decode=True), + 'pöstál\n'.encode('utf-8')) + + def test_invalid_8bit_in_non_8bit_cte_uses_replace(self): + # This is similar to the previous test, but proves that if the 8bit + # byte is undecodable in the specified charset, it gets replaced + # by the unicode 'unknown' character. Again, this may or may not + # be the ideal behavior. 
Note that if decode=False none of the + # decoders will get involved, so this is the only test we need + # for this behavior. + m = self.bodytest_msg.format(charset='ascii', + cte='quoted-printable', + bodyline='p=C3=B6stál').encode('utf-8') + msg = email.message_from_bytes(m) + self.assertEqual(msg.get_payload(), 'p=C3=B6st��l\n') + + def test_8bit_in_base64_body(self): + # Sticking an 8bit byte in a base64 block makes it undecodable by + # normal means, so the block is returned undecoded, but as bytes. + m = self.bodytest_msg.format(charset='utf-8', + cte='base64', + bodyline='cMO2c3RhbAá=').encode('utf-8') + msg = email.message_from_bytes(m) + self.assertEqual(msg.get_payload(decode=True), + 'cMO2c3RhbAá=\n'.encode('utf-8')) + + def test_8bit_in_uuencode_body(self): + # Sticking an 8bit byte in a uuencode block makes it undecodable by + # normal means, so the block is returned undecoded, but as bytes. + m = self.bodytest_msg.format(charset='utf-8', + cte='uuencode', + bodyline='<,.V