Title: TypeError in e-mail.parser when non-ASCII is present
Components: email Versions: Python 3.4
Messages (9)
msg210390 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2014-02-06 14:44
As reported in, the email.parser no longer accepts Unicode content as it did in 3.3. I searched the What's New and module documentation, but found no indication that this behavior is no longer supported, so it appears to be a regression. If it's an intentional change, the behavior should be documented in one of the aforementioned documents.

Consider this simple test case:

# -*- coding: utf-8 -*-
import email.parser
meta = """
Header: ☃
"""
email.parser.Parser().parsestr(meta)

Run that on Python 3.3.3 or Python 2 and it executes silently. Run it on Python 3.4.0b3 and it produces this traceback:

Traceback (most recent call last):
  File "C:\Users\jaraco\projects\public\wheel\", line 6, in <module>
  File "C:\Program Files\Python34\lib\email\", line 70, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "C:\Program Files\Python34\lib\email\", line 60, in parse
    return feedparser.close()
  File "C:\Program Files\Python34\lib\email\", line 170, in close
  File "C:\Program Files\Python34\lib\email\", line 163, in _call_parse
  File "C:\Program Files\Python34\lib\email\", line 449, in _parsegen
  File "C:\Program Files\Python34\lib\email\", line 311, in set_payload
    " payload") from None
TypeError: charset argument must be specified when non-ASCII characters are used in the payload
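
A note on why set_payload appears in the traceback: the test string begins with a blank line, so the parser appears to treat "Header: ☃" as the message body rather than as a header. A minimal sketch, assuming an interpreter where parsestr accepts this input (3.3, or 3.4 with the check reverted); the variable names are only for illustration:

```python
# -*- coding: utf-8 -*-
import email.parser

# Same text as the test case: a leading blank line, then "Header: ☃".
meta = "\nHeader: ☃\n"

msg = email.parser.Parser().parsestr(meta)

# The blank first line ends the (empty) header block, so no headers are
# recorded and the remaining text becomes the payload.
header_names = msg.keys()
payload = msg.get_payload()

print(header_names)
print(repr(payload))
```

Dropping the leading newline would make "Header" a real header field, which goes through a different code path than set_payload.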
msg210396 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-06 14:56
This was an intentional change, but I'm having second thoughts about it.  I think I need to make it a deprecation warning in 3.4.

Note that it doesn't actually do anything useful in 3.3:

Python 3.3.2 (default, Dec  9 2013, 11:44:21) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.parser
>>> meta = """
... Header: ☃
... """
>>> m = email.parser.Parser().parsestr(meta)
>>> str(m)
'\nHeader: ☃\n'
>>> import email.generator
>>> import io
>>> s = io.BytesIO()
>>> g = email.generator.BytesGenerator(s)
>>> g.flatten(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/email/", line 112, in flatten
  File "/usr/lib/python3.3/email/", line 177, in _write
  File "/usr/lib/python3.3/email/", line 203, in _dispatch
  File "/usr/lib/python3.3/email/", line 421, in _handle_text
  File "/usr/lib/python3.3/email/", line 233, in _handle_text
  File "/usr/lib/python3.3/email/", line 158, in _write_lines
  File "/usr/lib/python3.3/email/", line 395, in write
    self._fp.write(s.encode('ascii', 'surrogateescape'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 8: ordinal not in range(128)

That is, if you pretend the message is a string, it will happily output
it as a string, including perhaps your writing the output to a file as
utf-8...but it will *NOT* be a valid email message, since it will have
non-ascii data in it with no specified CTE.
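
For comparison, a hedged sketch of the case that does produce a valid message: when the payload is set with an explicit charset, the email package records a Content-Type charset and a Content-Transfer-Encoding, and BytesGenerator can then flatten it. This builds the Message by hand rather than through the parser, and the names wire_bytes and body are illustrative only:

```python
# -*- coding: utf-8 -*-
import email
import io
from email.generator import BytesGenerator
from email.message import Message

msg = Message()
# Passing a charset lets set_payload() encode the text and add the
# Content-Type and Content-Transfer-Encoding headers.
msg.set_payload("Header: ☃\n", charset="utf-8")

buf = io.BytesIO()
BytesGenerator(buf).flatten(msg)
wire_bytes = buf.getvalue()

# Parse the bytes back and decode the body using the declared charset.
round_tripped = email.message_from_bytes(wire_bytes)
body = round_tripped.get_payload(decode=True).decode("utf-8")

print(wire_bytes.decode("ascii"))
print(repr(body))
```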
msg210397 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-06 14:59
Ideally what should really happen here, I think, is for email to treat this as analogous to an SMTPUTF8 message.  But I certainly don't have time to do that for the alpha :(

So, yeah, I need to revert that check for 3.4.
msg210398 - (view) Author: Daniel Holth (dholth) * Date: 2014-02-06 15:04
In bdist_wheel I've gone to some lengths to re-use the email module to parse and generate "RFC822 inspired" documents. The output is not a valid e-mail but it is useful.

It is awkward to use the email module this way.

We will sidestep the issue hopefully this year by switching to json.
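
To illustrate the kind of use described above, here is a minimal sketch of parsing an "RFC822 inspired" document with the email module. The field names and text are hypothetical and are not taken from bdist_wheel; it assumes the default (compat32) parser behavior, where a str header value containing non-ASCII comes back unchanged:

```python
# -*- coding: utf-8 -*-
import email.parser

# Hypothetical metadata document: header-style fields, a blank line,
# then a free-form body.
metadata_text = (
    "Name: snowman\n"
    "Summary: A ☃ package\n"
    "\n"
    "Long description here.\n"
)

msg = email.parser.Parser().parsestr(metadata_text)

# Fields are read like mapping keys; the body is the payload.
name = msg["Name"]
summary = msg["Summary"]
description = msg.get_payload()

print(name, summary, repr(description))
```

Reading fields this way works on a str input, but writing the result back out as bytes runs into the encoding questions discussed in the earlier messages.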
msg210399 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-06 15:09
The long-term goal is to make it not-awkward to do the kind of thing you are doing, Daniel.  The change I made was premature in hindsight: I need to address "parsing unicode" comprehensively.  It sort-of-works now, but only by accident.

That said, it is the policy stuff that will really give you the flexibility to manipulate "rfc822-inspired" data, and that doesn't really help you since you need to remain backward compatible with older pythons.
msg210401 - (view) Author: Daniel Holth (dholth) * Date: 2014-02-06 15:17
We do this.

I appreciate the long-term goal. The policy system is really neat.

We are going to json largely because the next version of the metadata is more nested. The decision had nothing to do with the email module itself.
msg210506 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-02-07 18:06
New changeset f942f1eddfea by R David Murray in branch 'default':
#20531: Revert e20f98a8ed71, the 3.4 version of the #19063 fix.

New changeset ef8aaace85ca by R David Murray in branch 'default':
#20531: Apply the 3.3 version of the #19063 fix.
msg210507 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-07 18:09
OK, backward compatibility is restored.  Hopefully I can fix the underlying problem right in 3.5 as part of issue 8489.
msg210508 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2014-02-07 18:14
Thanks David. I've confirmed the fix works (copying 'email' package over Python 3.4.0b3).
