classification
Title: TypeError in email.parser when non-ASCII is present
Type: behavior Stage: resolved
Components: email Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: barry, dholth, jason.coombs, larry, python-dev, r.david.murray
Priority: release blocker Keywords: 3.4regression

Created on 2014-02-06 14:44 by jason.coombs, last changed 2014-02-07 18:30 by r.david.murray. This issue is now closed.

Messages (9)
msg210390 - Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-02-06 14:44
As reported in https://bitbucket.org/dholth/wheel/issue/104, the email.parser no longer accepts Unicode content as it did in 3.3. I searched the What's New and module documentation, but found no indication that this behavior is no longer supported, so it appears to be a regression. If it's an intentional change, the behavior should be documented in one of the aforementioned documents.

Consider this simple test case:

# -*- coding: utf-8 -*-
import email.parser
meta = """
Header: ☃
"""
email.parser.Parser().parsestr(meta)

Run that on Python 3.3.3 or Python 2 and it executes silently. Run it on Python 3.4.0b3 and it produces this traceback:

Traceback (most recent call last):
  File "C:\Users\jaraco\projects\public\wheel\test.py", line 6, in <module>
    email.parser.Parser().parsestr(meta)
  File "C:\Program Files\Python34\lib\email\parser.py", line 70, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "C:\Program Files\Python34\lib\email\parser.py", line 60, in parse
    return feedparser.close()
  File "C:\Program Files\Python34\lib\email\feedparser.py", line 170, in close
    self._call_parse()
  File "C:\Program Files\Python34\lib\email\feedparser.py", line 163, in _call_parse
    self._parse()
  File "C:\Program Files\Python34\lib\email\feedparser.py", line 449, in _parsegen
    self._cur.set_payload(EMPTYSTRING.join(lines))
  File "C:\Program Files\Python34\lib\email\message.py", line 311, in set_payload
    " payload") from None
TypeError: charset argument must be specified when non-ASCII characters are used in the payload
msg210396 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-06 14:56
This was an intentional change, but I'm having second thoughts about it.  I think I need to make it a deprecation warning in 3.4.

Note that it doesn't actually do anything useful in 3.3:

Python 3.3.2 (default, Dec  9 2013, 11:44:21) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.parser
>>> meta = """
... Header: ☃
... """
>>> m = email.parser.Parser().parsestr(meta)
>>> str(m)
'\nHeader: ☃\n'
>>> import email.generator
>>> import io
>>> s = io.BytesIO()
>>> g = email.generator.BytesGenerator(s)
>>> g.flatten(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/email/generator.py", line 112, in flatten
    self._write(msg)
  File "/usr/lib/python3.3/email/generator.py", line 177, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.3/email/generator.py", line 203, in _dispatch
    meth(msg)
  File "/usr/lib/python3.3/email/generator.py", line 421, in _handle_text
    super(BytesGenerator,self)._handle_text(msg)
  File "/usr/lib/python3.3/email/generator.py", line 233, in _handle_text
    self._write_lines(payload)
  File "/usr/lib/python3.3/email/generator.py", line 158, in _write_lines
    self.write(laststripped)
  File "/usr/lib/python3.3/email/generator.py", line 395, in write
    self._fp.write(s.encode('ascii', 'surrogateescape'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 8: ordinal not in range(128)

That is, if you pretend the message is a string, it will happily output
it as a string, including perhaps your writing the output to a file as
utf-8...but it will *NOT* be a valid email message, since it will have
non-ascii data in it with no specified CTE.
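
For contrast, here is a minimal sketch of what a valid message looks like when the charset is given explicitly. It is not part of the original session; it assumes a current Python 3, where Message.set_payload() accepts a charset argument and records the Content-Type and Content-Transfer-Encoding headers itself. The variable names (text, wire, round_tripped) are illustrative only.

```python
import io
from email.generator import BytesGenerator
from email.message import Message

text = "Header: \u2603\n"  # same non-ASCII payload as in the session above

msg = Message()
# Passing charset lets set_payload() add Content-Type and
# Content-Transfer-Encoding headers for the non-ASCII body.
msg.set_payload(text, charset="utf-8")

buf = io.BytesIO()
BytesGenerator(buf).flatten(msg)
wire = buf.getvalue()

# The serialized form should now be pure ASCII; this raises if it is not.
wire.decode("ascii")

# Decoding the transfer encoding recovers the original text.
round_tripped = msg.get_payload(decode=True).decode("utf-8")
```

With the charset declared, BytesGenerator has a transfer encoding to work with, so it no longer hits the bare ascii encode that fails in the traceback above.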
msg210397 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-06 14:59
Ideally what should really happen here, I think, is for email to treat this as analogous to an SMTPUTF8 message.  But I certainly don't have time to do that for 3.4 :(

So, yeah, I need to revert that check for 3.4.
msg210398 - Author: Daniel Holth (dholth) (Python committer) Date: 2014-02-06 15:04
In bdist_wheel I've gone to some lengths to re-use the email module to parse and generate "RFC822 inspired" documents. The output is not a valid e-mail but it is useful.

It is awkward to use the email module this way.

We will sidestep the issue hopefully this year by switching to json.
msg210399 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-06 15:09
The long term goal is to make it not-awkward to do the kind of thing you are doing, Daniel.  The change I made was premature in hindsight; I need to comprehensively address "parsing unicode" instead. It sort-of-works now, but only by accident.

That said, it is the policy stuff that will really give you the flexibility to manipulate "rfc822-inspired" data, and that doesn't really help you since you need to remain backward compatible with older pythons.
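
As a rough illustration of the policy-based approach (a sketch, not taken from wheel or from this thread), the snippet below parses a small "rfc822-inspired" document with email.policy.default on a current Python 3. The field names and values are made up for the example, and, as noted above, this does not help code that must also run on older Pythons.

```python
from email import policy
from email.parser import Parser

# Illustrative "RFC822-inspired" metadata; the field names and values
# are made up for this sketch.
text = "Name: \u2603\nVersion: 1.0\n\nA description.\n"

# policy.default selects the newer header-parsing machinery instead of
# the backward-compatible compat32 behavior.
msg = Parser(policy=policy.default).parsestr(text)

name = msg["Name"]        # header value as a str, non-ASCII intact
version = msg["Version"]
body = msg.get_payload()  # text after the blank separator line
```

Here the headers come back as ordinary strings, so the non-ASCII value can be read without treating the document as a wire-format email.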
msg210401 - Author: Daniel Holth (dholth) (Python committer) Date: 2014-02-06 15:17
We do this. https://bitbucket.org/dholth/wheel/src/tip/wheel/pkginfo.py?at=default

I appreciate the long-term goal. The policy system is really neat.

We are going to json largely because the next version of the metadata is more nested. The decision had nothing to do with the email module itself.
msg210506 - Author: Roundup Robot (python-dev) Date: 2014-02-07 18:06
New changeset f942f1eddfea by R David Murray in branch 'default':
#20531: Revert e20f98a8ed71, the 3.4 version of the #19063 fix.
http://hg.python.org/cpython/rev/f942f1eddfea

New changeset ef8aaace85ca by R David Murray in branch 'default':
#20531: Apply the 3.3 version of the #19063 fix.
http://hg.python.org/cpython/rev/ef8aaace85ca
msg210507 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-07 18:09
OK, backward compatibility is restored.  Hopefully I can fix the underlying problem right in 3.5 as part of issue 8489.
msg210508 - Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-02-07 18:14
Thanks David. I've confirmed the fix works (copying 'email' package over Python 3.4.0b3).
History
Date User Action Args
2014-02-07 18:37:51 r.david.murray link issue20089 superseder
2014-02-07 18:30:31 r.david.murray set type: behavior
resolution: fixed
stage: resolved
2014-02-07 18:14:15 jason.coombs set type: behavior -> (no value)
resolution: fixed -> (no value)
messages: + msg210508
stage: resolved -> (no value)
2014-02-07 18:09:33 r.david.murray set status: open -> closed
type: behavior
messages: + msg210507
resolution: fixed
stage: resolved
2014-02-07 18:06:35 python-dev set nosy: + python-dev
messages: + msg210506
2014-02-06 15:17:48 dholth set messages: + msg210401
2014-02-06 15:09:45 r.david.murray set messages: + msg210399
2014-02-06 15:04:02 dholth set nosy: + dholth
messages: + msg210398
2014-02-06 14:59:37 r.david.murray set priority: normal -> release blocker
nosy: + larry
messages: + msg210397
2014-02-06 14:56:53 r.david.murray set messages: + msg210396
2014-02-06 14:44:54 jason.coombs create