classification
Title: email.utils.formataddr is not exactly the reverse of email.utils.parseaddr
Type: behavior
Stage: resolved
Components: email
Versions: Python 3.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, r.david.murray, remi.lapeyre, skreft
Priority: normal
Keywords:

Created on 2018-11-12 22:09 by skreft, last changed 2018-11-13 17:48 by r.david.murray. This issue is now closed.

Messages (5)
msg329765 - (view) Author: (skreft) Date: 2018-11-12 22:09
The docs (https://docs.python.org/3/library/email.util.html#email.utils.formataddr) say that formataddr is the inverse of parseaddr; however, the two functions treat non-ASCII email addresses differently.

parseaddr will return non-ASCII addresses, whereas formataddr will raise a UnicodeError.

Below is an example:

In [1]: import email.utils as u

In [2]: u.parseaddr('skreft+ñandú@sudoai.com')
Out[2]: ('', 'skreft+ñandú@sudoai.com')

In [3]: u.formataddr(u.parseaddr('skreft+ñandú@sudoai.com'))
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-3-1323122e1773> in <module>()
----> 1 u.formataddr(u.parseaddr('skreft+ñandú@sudoai.com'))

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/email/utils.py in formataddr(pair, charset)
     89     name, address = pair
     90     # The address MUST (per RFC) be ascii, so raise a UnicodeError if it isn't.
---> 91     address.encode('ascii')
     92     if name:
     93         try:

UnicodeEncodeError: 'ascii' codec can't encode character '\xf1' in position 7: ordinal not in range(128)
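
The report above can also be reproduced as a standalone script. This is only a minimal sketch: `round_trip` is a hypothetical helper introduced here for illustration, and `jane@example.com` is a made-up ASCII address used for contrast with the non-ASCII one from the report.

```python
from email.utils import formataddr, parseaddr


def round_trip(raw):
    """Parse an address string and format it back (hypothetical helper)."""
    return formataddr(parseaddr(raw))


# A plain ASCII name and address come back unchanged for this input.
ascii_result = round_trip('Jane Doe <jane@example.com>')
print(ascii_result)

# A non-ASCII local part is rejected by formataddr with a UnicodeError
# (UnicodeEncodeError is a subclass of it).
try:
    formataddr(('', 'skreft+ñandú@sudoai.com'))
    non_ascii_ok = True
except UnicodeError as exc:
    non_ascii_ok = False
    print('formataddr rejected the address:', exc)
```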
msg329772 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2018-11-12 23:25
This is indeed an issue with formataddr: it expects the input to be ASCII, as RFC 2822 requires.

Email is much more complicated though, and has since been internationalized. A summary of this work is available at https://en.wikipedia.org/wiki/Email_address#Internationalization.

I think the check in formataddr is not desirable anymore and should be removed.

I'm not sure whether the resulting value should be encoded using email.header or not.
msg329775 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-11-12 23:50
Thanks for the report, but parseaddr and formataddr are defined *only* for ASCII.  In the port to python3, parseaddr sort-of-maybe-sometimes does the naively expected thing with non-ascii, but that's just an accident.  We could have added a check for non-ascii to parseaddr during the python3 port, but we didn't think of it, and it is too late now since adding it would break otherwise working code even though that code is technically broken.

So, for the defined API of parseaddr/formataddr, there is no bug here.

As for handling non-ascii in email per your link:

    >>> from email.message import EmailMessage
    >>> from email.policy import default
    >>> m = EmailMessage(policy=default.clone(utf8=True))
    >>> m['From'] = 'skreft+ñandú@sudoai.com'
    >>> bytes(m)
    b'From: skreft+\xc3\xb1and\xc3\xba@sudoai.com\n\n'

(NB: in testing the above I discovered there is actually a recent bug in the serialization when utf8 is *False*: it does RFC2047 encoding of the username, which it should not do...instead it should raise an error.  Feel free to open a bug report for that...)
msg329782 - (view) Author: (skreft) Date: 2018-11-13 01:05
@r.david.murray where do you see that those functions are only defined for ascii? There's nothing in the online docs stating that and furthermore `formataddr` has supported non-ascii names since version 3.3. RFC 2822 is however mentioned in the docstrings.

The fact that `formataddr` is not really the inverse warrants at least a note or clarification in the docs.
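
To make the asymmetry concrete, the sketch below contrasts a non-ASCII display name with a non-ASCII address. The name `Ñandú` and the `example.com` addresses are illustrative assumptions, not values taken from the docs.

```python
from email.utils import formataddr

# A non-ASCII *name* is accepted: formataddr emits it as an RFC 2047
# encoded word, using the default charset 'utf-8'.
encoded = formataddr(('Ñandú', 'skreft@example.com'))
print(encoded)

# A non-ASCII *address* is rejected, even alongside the same name.
try:
    formataddr(('Ñandú', 'skreft+ñandú@example.com'))
    address_accepted = True
except UnicodeError:
    address_accepted = False

print('non-ASCII address accepted:', address_accepted)
```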
msg329858 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-11-13 17:48
Because the RFCs are defined only for ascii.  Non-ascii in RFC 2822 addresses is an RFC violation.  In python2 non-ascii would usually round-trip through these functions, but again that was an accident.

If you'd like to propose a doc clarification that would be fine, but the clarification would be that behavior on strings containing non-ascii is undefined.

Note that these functions are considered soft-deprecated...they are in modules that are in the "Legacy API" section of the email docs.
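
For code that wants to stay out of the legacy functions, one possible direction is `email.headerregistry.Address` from the newer API. The thread does not recommend it as a replacement, so treat this as a rough sketch under that assumption: it builds the address from separate `username` and `domain` parts rather than parsing a string.

```python
from email.headerregistry import Address

# Build the address from its parts; nothing is parsed here, and the
# local part is kept as given.
addr = Address(username='skreft+ñandú', domain='sudoai.com')

# With no display name, str() yields just the addr-spec.
formatted = str(addr)
print(formatted)
print(addr.username, addr.domain)
```

An `Address` object can then be assigned to a header of an `EmailMessage`, where the message policy (for example `utf8=True`, as in the example earlier in this thread) decides how non-ASCII is serialized.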
History
Date                 User            Action  Args
2018-11-13 17:48:46  r.david.murray  set     messages: + msg329858
2018-11-13 01:05:51  skreft          set     messages: + msg329782
2018-11-12 23:50:38  r.david.murray  set     status: open -> closed; type: behavior; messages: + msg329775; resolution: not a bug; stage: resolved
2018-11-12 23:25:18  remi.lapeyre    set     nosy: + remi.lapeyre; messages: + msg329772
2018-11-12 22:09:52  skreft          create