classification
Title: smtplib: support for IDN (international domain names)
Type: enhancement
Stage:
Components: email, Library (Lib)
Versions: Python 3.5

process
Status: open
Resolution:
Dependencies: 11783
Superseder:
Assigned To:
Nosy List: barry, jesstess, macfreek, r.david.murray, zvyn
Priority: normal
Keywords:

Created on 2013-12-28 01:49 by macfreek, last changed 2014-06-12 20:23 by zvyn.

Messages (7)
msg207017 - Author: Freek Dijkstra (macfreek) Date: 2013-12-28 01:49
smtplib has limited support for non-ASCII domain names in the From or To mail addresses. It only works for punycode-encoded domain names, submitted as a unicode string (e.g. server.rcpt(u"user@xn--e1afmkfd.ru")).

The following two calls fail:

server.rcpt(u"user@пример.ru"):
  File smtplib.py, line 332, in send
    s = s.encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode character '\u03c0' in position 19: ordinal not in range(128)
http://hg.python.org/cpython/file/3.3/Lib/smtplib.py#l332

server.rcpt(b"user@xn--e1afmkfd.ru"):
  File email/_parseaddr.py, line 236, in gotonext
    if self.field[self.pos] in self.LWS + '\n\r':
TypeError: 'in <string>' requires string as left operand, not int
http://hg.python.org/cpython/file/3.3/Lib/email/_parseaddr.py#l236

There are three ways to solve this (from trivial to complex):
* Make it clear in the documentation what type of input is expected.
* Accept punycode-encoded domain names in email addresses, either in string or binary format.
* Accept Unicode-encoded domain names, and do the punycode encoding in the smtplib if required.
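
A minimal sketch of the third option, assuming the domain is everything after the last "@". The helper name encode_domain is hypothetical and not part of smtplib. Note that Python's built-in "idna" codec implements the older IDNA 2003 rules (RFC 3490), which may differ from RFC 5891 for some names.

```python
def encode_domain(address):
    """Return address with its domain part punycode (IDNA) encoded.

    Hypothetical helper for illustration; smtplib does not provide it.
    """
    # Split on the last "@" so a quoted local part containing "@" survives.
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" found: nothing to encode, return the input unchanged.
        return address
    # The stdlib "idna" codec encodes each label; pure-ASCII labels pass through.
    # It raises UnicodeError for labels it cannot encode (e.g. empty or too long).
    return local + "@" + domain.encode("idna").decode("ascii")


print(encode_domain("user@пример.ru"))
```

An address whose domain is already punycode-encoded is returned unchanged, so the helper could be applied unconditionally before calling server.rcpt().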

See also 

References:
https://tools.ietf.org/html/rfc5891: Internationalized Domain Names in Applications (IDNA): Protocol
msg207019 - Author: Freek Dijkstra (macfreek) Date: 2013-12-28 01:53
This issue deals with international domain names in email addresses (the part behind the "@"). See issue 20084 for the issue that deals with the part before the "@".
msg207041 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-28 17:46
Thanks for the suggestion.

Once the issue 11783 patch is committed, smtplib can be changed to use formataddr in quoteaddr, which will result in the domain being punycoded automatically.  (It's too bad I forgot about that issue, since the 3.4 beta deadline has already passed :(

The input to the commands is string, not bytes, so you can already pre-encode yourself, as you noted.  The commands don't accept bytes, and should not, since the data they cause to be sent on the wire may not contain non-ASCII characters; there is thus no need to generate binary.  SMTPUTF8 will of course require generating binary data in these contexts, but in that case the correct way to generate the binary is by utf-8 encoding the unicode input, so there will again be no reason for the commands to accept binary input, and it will be better if they don't.  (If you need to generate invalid data, say for a test scenario, you can drop down to executing 'send' calls manually.)
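
To illustrate the distinction, here is a small sketch of how a str address might be turned into wire bytes in the two cases. The helper wire_bytes and its smtputf8 flag are hypothetical names for illustration only; this is not how smtplib is implemented.

```python
def wire_bytes(address, smtputf8=False):
    """Encode a str address for the wire (hypothetical illustration).

    Without SMTPUTF8 the command line must be pure ASCII, so a
    non-ASCII address raises UnicodeEncodeError, as in the traceback
    reported above. With SMTPUTF8 the str is simply utf-8 encoded.
    """
    codec = "utf-8" if smtputf8 else "ascii"
    return address.encode(codec)


# A pre-encoded (punycode) domain is plain ASCII and passes as-is.
print(wire_bytes("user@xn--e1afmkfd.ru"))

# A Unicode domain only works when utf-8 is allowed on the wire.
print(wire_bytes("user@пример.ru", smtputf8=True))
```

In both cases the caller passes str, which is why accepting bytes in the commands would add nothing.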

(Note: using the 'u' prefix in Python 3, while supported for backward compatibility, is only confusing when used outside of that context...I thought you were talking about 2.7 until I read carefully.)
msg207044 - Author: Freek Dijkstra (macfreek) Date: 2013-12-28 19:16
Great to hear that a patch already exists (sorry I couldn't find it in the tracker).

Feel free to close this issue as duplicate of issue 11783.

(As for the u"string", I wanted to distinguish it from b'string'. I don't use it in code (since the backward compatibility is only present in 3.3+, not in 3.2). Sorry for the confusion.)
msg207045 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-28 19:18
No, that issue is about the email library.  So we need this one too for the equivalent enhancement to smtplib.
msg207053 - Author: Freek Dijkstra (macfreek) Date: 2013-12-28 20:44
Since smtplib.quoteaddr() uses email.utils.parseaddr(), and the patch for issue 11783 fixes email.utils.parseaddr(), that patch will hopefully solve this issue as well (though a test case wouldn't hurt for sure).

What I had not realised is that hostnames are also used elsewhere, in particular in ehlo() and helo(), but also in connect(). Do you consider that a separate issue or part of this issue?

Are there other places where you think a fix is needed?

I may be able to create a patch, though bear with me: I never checked out the source for Python or the standard library (other than installing point releases through my package manager).
msg207064 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-29 02:12
A call to formataddr will need to be added to quoteaddr.  And yes, test cases are needed.
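
As a rough sketch of what that could look like: the function below mirrors the general shape of quoteaddr (parse with email.utils.parseaddr, wrap the result in angle brackets) but, as an assumption, encodes the domain by hand with the stdlib "idna" codec instead of relying on the formataddr behaviour proposed in issue 11783. The name quoteaddr_idna is hypothetical.

```python
from email.utils import parseaddr


def quoteaddr_idna(addrstring):
    """Quote an address for MAIL FROM / RCPT TO with an IDNA-encoded domain.

    Hypothetical sketch; the manual IDNA step stands in for the
    formataddr change proposed in issue 11783.
    """
    displayname, addr = parseaddr(addrstring)
    if (displayname, addr) == ("", ""):
        # parseaddr could not parse the input: pass it through as is.
        if addrstring.strip().startswith("<"):
            return addrstring
        return "<%s>" % addrstring
    local, sep, domain = addr.rpartition("@")
    if sep:
        # Punycode-encode each non-ASCII label; ASCII labels are unchanged.
        # May raise UnicodeError for labels the codec rejects.
        domain = domain.encode("idna").decode("ascii")
    return "<%s%s%s>" % (local, sep, domain)


print(quoteaddr_idna("user@пример.ru"))
```

A test case along these lines would check both a bare address and one with a display name, since only the address part should reach the SMTP command.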

I don't believe that the format of the HELO/EHLO message is defined by the RFC, so I don't think we can automatically parse it.  I think we just have to leave the domain name encoded as punycode there.  Regardless, though, yes I would consider that a separate issue.

If you want to work on a patch, that would be great.  For guidance on doing so, you can take a look at http://docs.python.org/devguide.

You can also help me to remember to commit 11783 after the final release of 3.4.0.
History
Date                 User            Action  Args
2014-06-12 20:23:23  zvyn            set     nosy: + jesstess, zvyn
2013-12-29 02:12:48  r.david.murray  set     messages: + msg207064
2013-12-28 20:44:17  macfreek        set     messages: + msg207053
2013-12-28 19:18:45  r.david.murray  set     resolution: duplicate ->
                                             messages: + msg207045
2013-12-28 19:16:39  macfreek        set     resolution: duplicate
                                             messages: + msg207044
2013-12-28 17:46:01  r.david.murray  set     versions: + Python 3.5
                                             nosy: + barry, r.david.murray
                                             messages: + msg207041
                                             dependencies: + email parseaddr and formataddr should be IDNA aware
                                             components: + email
2013-12-28 01:53:10  macfreek        set     messages: + msg207019
2013-12-28 01:49:23  macfreek        create