classification
Title: email.Header.Header incorrect/non-smart on international charset address fields
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 3.3, Python 3.4, Python 2.7, Python 2.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: kxroberto, r.david.murray
Priority: normal Keywords:

Created on 2012-01-01 17:25 by kxroberto, last changed 2012-01-02 17:43 by r.david.murray. This issue is now closed.

Messages (3)
msg150434 - Author: kxroberto (kxroberto) Date: 2012-01-01 17:24
The email.* package seems to over-encode address fields that contain international characters, which can even cause display errors in the recipient's mail reader, 
when the message header is composed as recommended in http://docs.python.org/library/email.header.html 

Python 2.7.2
>>> e=email.Parser.Parser().parsestr(getcliptext())
>>> e['From']
'=?utf-8?q?Martin_v=2E_L=C3=B6wis?= <report@bugs.python.org>'
# note the par
>>> email.Header.decode_header(_)
[('Martin v. L\xc3\xb6wis', 'utf-8'), ('<report@bugs.python.org>', None)]
# unfortunately there is no convenient function for this:
>>> u='Martin v. L\xc3\xb6wis'.decode('utf8') + ' <report@bugs.python.org>'
>>> u
u'Martin v. L\xf6wis <report@bugs.python.org>'
>>> msg=email.Message.Message()
>>> msg['From']=u
>>> msg.as_string()
'From: =?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=\n\n'
>>> msg['From']=str(u)
>>> msg.as_string()
'From: =?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=\nFrom: Martin v. L\xf6wis <report@bugs.python.org>\n\n'
>>> msg['From']=email.Header.Header(u)
>>> msg.as_string()
'From: =?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=\nFrom: Martin v. L\xf6wis <report@bugs.python.org>\nFrom: =?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=\n\n'
>>> 

(BTW: it is strange that multiple msg['From']=... _assignments_ end up as multiple additions. Also, msg renders 8-bit header lines without any warning, error, or auto-encoding, while it does auto-encode unicode.)
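
On the repeated From lines: a minimal sketch, assuming Python 3 and the legacy email.message.Message class, of how assignment appends a header and how a header can be replaced instead. The example.org addresses are placeholders, not taken from the report.

```python
from email.message import Message

msg = Message()
msg["From"] = "first@example.org"
msg["From"] = "second@example.org"   # item assignment appends, it does not overwrite
duplicated = msg.get_all("From")     # both values are still present

# Option 1: delete every existing From header, then assign again.
del msg["From"]
msg["From"] = "third@example.org"

# Option 2: replace the first matching header in place.
msg.replace_header("From", "fourth@example.org")
final = msg.get_all("From")
```

Deleting first is the more general choice, since replace_header raises KeyError when no such header exists yet.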

What finally arrives at the receiver is typically something like:

From: "=?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=" <report@bugs.python.org>

Because the servers seem to want the address unencoded, they extract the address and _add_ it (duplicating it) as ASCII, which produces the error.

I have not found any emails in my archives whose address header fields are over-encoded the way Python does it. Even in non-address fields, mostly only those words or groups that need it are encoded.

I assume the sophisticated, high-level looking email.* package doesn't expect the user to fiddle things together at a low level with parseaddr, re.search, make_header, Header.encode, ''.join ... Or is that indeed (undocumented) the case? IMHO it should be smart enough to do this automatically.

Note: there is an old deprecated function, mimify.mime_encode_header, which seemed to try to auto-encode cautiously and sparsely (but it actually fails too on all the examples I tried).
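
The low-level approach mentioned above (split the field, encode only the display name, keep the address in the clear) can be sketched as follows. This is an illustration under assumptions, not the package's prescribed method: it assumes Python 3, and encode_display_name is a hypothetical helper name.

```python
from email.header import Header, decode_header
from email.utils import parseaddr

def encode_display_name(field, charset="utf-8"):
    """Hypothetical helper: encode only the display name of one address field."""
    name, addr = parseaddr(field)
    addr.encode("ascii")          # raises UnicodeEncodeError if the address is not ASCII
    try:
        name.encode("ascii")
        return field              # nothing needs encoding
    except UnicodeEncodeError:
        pass
    # Only the display name becomes an RFC 2047 encoded word.
    return "{} <{}>".format(Header(name, charset).encode(), addr)

result = encode_display_name("Martin v. L\u00f6wis <report@bugs.python.org>")
```

The address part stays readable for servers and clients, and decode_header can recover the original name from the encoded word.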
msg150440 - Author: kxroberto (kxroberto) Date: 2012-01-01 18:57
Next I tried to render this address header field:

u'Name <abc\u03a3@xy>, abc@ewf, "Nameß" <weofij@fjeio>'

with 
h = email.Header.Header(continuation_ws='')
h.append ... / email.Header.make_header via these chunks:

[('Name <', us-ascii), ('abc\xce\xa3', utf-8), ('@xy>, abc@ewf, "', us-ascii), ('Name\xc3\x9f', utf-8), ('" <weofij@fjeio>', us-ascii)]

the outcome is:

'Name < =?utf-8?b?YWJjzqM=?= @xy>, abc@ewf, " =?utf-8?b?TmFtZcOf?=\n " <weofij@fjeio>'


(note: local part of email address can be utf too)

It seems to be impossible to avoid the erroneous extra spaces from outside, within the email.Header framework.
Thus I guess it has not been possible up to now to decently format a non-ASCII MIME message using the official email.Header mechanism, even when pre-digesting things.
msg150468 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-01-02 17:43
Actually, no, the local part cannot be in anything other than ascii (see RFC 5335, which desires to address this problem among others).  Also, an encoded word cannot occur inside quotation marks.  If you correct those two bugs, you can generate an RFC-valid address using Header.append.

There is a project underway to make all of this header parsing and formatting stuff work better: see http://pypi.python.org/pypi/email.

By the way, this is easier already in Python 3.2.  There you can do:

   >>> from email.utils import formataddr
   >>> formataddr(('Nameß', 'weofij@fjeio'))
   '=?utf-8?b?TmFtZcOf?= <weofij@fjeio>'
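
Extending that to the multi-address example from msg150440, with the two corrections applied (ASCII-only local part, no encoded word inside quotation marks): a minimal sketch, assuming Python 3. format_address_list is a hypothetical helper name, and the non-ASCII local part from the report is replaced by plain 'abc'.

```python
from email.utils import formataddr

def format_address_list(pairs):
    """Hypothetical helper: join (display name, address) pairs into one header value."""
    parts = []
    for name, addr in pairs:
        # The address itself has to be ASCII; fail early if it is not.
        addr.encode("ascii")
        # formataddr encodes a non-ASCII display name and leaves the address alone.
        parts.append(formataddr((name, addr)))
    return ", ".join(parts)

header_value = format_address_list([
    ("Name", "abc@xy"),
    ("", "abc@ewf"),
    ("Name\u00df", "weofij@fjeio"),
])
```

Working from already separated pairs avoids re-parsing a combined string, so no stray spaces or quotes end up around the encoded word.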
History
Date User Action Args
2012-01-02 17:43:26  r.david.murray  set  status: open -> closed; nosy: + r.david.murray; messages: + msg150468; resolution: not a bug
2012-01-01 18:57:07  kxroberto  set  messages: + msg150440
2012-01-01 17:25:00  kxroberto  create