classification
Title: email parseaddr and formataddr should be IDNA aware
Type: enhancement Stage: needs patch
Components: email Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, christian.heimes, jesstess, macfreek, r.david.murray, sdaoden, torsten.becker, zvyn
Priority: normal Keywords: patch

Created on 2011-04-06 14:12 by r.david.murray, last changed 2017-06-02 16:29 by r.david.murray.

Files
File name Uploaded Description Edit
issue-11783-v1.patch torsten.becker, 2011-04-09 20:58 IDNA awareness for formataddr() and parseaddr() + tests + docs review
issue-11783-v2.patch torsten.becker, 2011-04-11 13:02 Fix for local-part only addresses + test case review
email_address_idna.patch r.david.murray, 2011-04-13 19:14 review
issue-11783-v4.patch torsten.becker, 2011-04-17 19:06 getaddresses() also decodes IDNs review
issue11783.patch zvyn, 2014-06-11 08:54 Update issue-11783-v4.patch to be applieable on 3.5. review
issue11783-rdm-fixed.patch zvyn, 2014-06-11 11:20 review
Pull Requests
URL Status Linked Edit
PR 1900 closed zvyn, 2017-06-01 04:18
Messages (32)
msg133133 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-06 14:12
The patch for issue 1690608 adds support for unicode in the realname field to formataddr.  To complete the currently-workable internationalization of address specs, both parseaddr and formataddr should be made IDNA aware.  It is probably a good idea to do some research on this to double check, but it seems as though recognizing IDNA during parsing and generating it during formatting when the domain contains non-ASCII characters should be both safe and useful.

See also issue 963906, which contains some test cases that could be adapted.
msg133335 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-08 19:32
If Torsten does not have time for this i'll do that next.
I hate this useless Punycode (but for nameprep),
so it's exactly the right thing for me to do next in 2011.
Unless someone complains.
I've not looked at formataddr/parseaddr, so that's a motivation.
Torsten?
msg133338 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-08 20:22
I believe Torsten is interested, but he can of course speak for himself.

Just to make sure you are aware: Python has a built in idna codec, so this should be a fairly simple patch.
msg133343 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-08 21:03
I was about to look into this over the weekend, but of course I don't
want to steal your fun, Steffen. :)
msg133381 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-09 11:37
On Fri, Apr 08, 2011 at 09:03:51PM +0000, Torsten Becker wrote:
> I was about to look into this over the weekend, but of course I don't
> want to steal your fun, Steffen. :)

Toll, toll, toll!!
Still cherry blossom, thanks to the weather, apple blossom 
started, more than 20 degrees.
I really understand if a young german man rather prefers staying 
at home over the weekend, looking at its beautiful, intelligent 
chancellor whenever the TV is started, than doing whatever else!

Dear god, all these nice memories ...
14" BW CRT monitors, 30 hours at a time ...
Have a nice weekend!
msg133421 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-09 20:58
> Have a nice weekend!

Thank you for the wishes, I hope yours is going well, too!

I added IDNA awareness to formataddr() and parseaddr(), updated the docs and wrote 2 tests for it.

I wasn't sure if the IDNA awareness should be optional via a argument or always automatically enabled, I favored the latter.

Also, is it safe to split at "@" and encode/decode the last component?  I am not familiar with all the weird variants a email address could be in strictly after the RFCs.
msg133512 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-11 11:11
Patch mostly looks good to me, modulo some English wording that I'll fix up when I commit it.

The issue with the '@' is that it might not be there.  So you do need to check for that case (it means the domain part defaults to the 'local' domain).  I should double check the RFC to make sure that's the only special case, but I believe it is.
msg133519 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-11 13:02
> modulo some English wording that I'll fix up when I commit it.

Yeah, sorry for that, I seem to have trouble with writing good documentation. :)  I'll have a look at the documents referenced by [1] to improve my writing.

> The issue with the '@' is that it might not be there.

I added a fix and a test for this in v2.  However, when reading through the RFC [2] and Wikipedia [3], it seems like this is not actually allowed.

Is there a way to internationalize the local-part as well?  That is the only part which is missing now that domain and real name are covered.


[1]: http://docs.python.org/devguide/docquality.html
[2]: http://tools.ietf.org/html/rfc5322#section-3.4
[3]: http://en.wikipedia.org/wiki/Email_address#Invalid_email_addresses
msg133522 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-11 14:21
Hmm.  You are correct.  I thought the RFC's covered this case, but apparently they don't.

The email package gets used in MUA contexts, where the domain part of the address may be omitted and the MUA must fill it in before transmitting the message to the MTA.  And some MTAs will fill in the local domain if it is omitted, so actually it applies in an MTA context, too.  So we need to support this case even if it isn't covered by the RFCs.

Thanks, the revised patch looks good.
msg133686 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-13 19:14
OK, so when I went to apply this, I figured out that the patch isn't quite right.  I've redone the doc updates, and am attaching a version of the patch containing them.

The issue is that the place that the IDNA decode support needs to be added isn't in parseaddr, it's in _parseaddr.py's AddresslistClass.  Tests are then needed to make sure that the IDNA decoding gets done both when parseaddr and getaddresslist are used.

Do you want to tackle this, Torsten?
msg133696 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-13 21:42
> OK, so when I went to apply this, I figured out that the patch isn't quite right.  I've redone the doc updates, and am attaching a version of the patch containing them.
>
> The issue is that the place that the IDNA decode support needs to be added isn't in parseaddr, it's in _parseaddr.py's AddresslistClass.  Tests are then needed to make sure that the IDNA decoding gets done both when parseaddr and getaddresslist are used.
>
> Do you want to tackle this, Torsten?

I would like to, but I probably will not get to it before Monday.  So
if anybody wants to work on this before that time, please feel free to
fix it properly. :)

Just two questions for the implementation:
  1. Would it be fine to move the helper _encode_decode_addr() into
_parseaddr.py and then import it in util.py, so it can be shared
between the two?
  2. Would line 232 in _parseaddr.py (AddrlistClass.getaddrlist) be a
good place to integrate it?
msg133699 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-13 22:02
Yes, putting the function in _parseaddr is fine.  And yes, 232 looks like a good place.  The alternative would be understanding the rfc822 parser, which is pretty mind bending, and of doubtful additional benefit.  (At most it would save a pair of split/join calls.)
msg133754 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-14 17:33
On Wed, Apr 13, 2011 at 09:42:22PM +0000, Torsten Becker wrote:
> So if anybody wants to work on this before that time,
> please feel free to fix it properly. :)

(The word anybody made me think.
But "fix properly" ... i'm sure you cannot refer to myself.
:))
msg133757 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-14 18:05
> (The word anybody made me think.
> But "fix properly" ... i'm sure you cannot refer to myself.
> :))

"fix properly" referred to my inferior implementation and "anybody"
should probably have been worded "Steffen or David".  So sure .. go
ahead. :)
msg133767 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-14 19:52
On Thu, Apr 14, 2011 at 06:05:59PM +0000, Torsten Becker wrote:
> So sure .. go ahead. :)

Grrrrr....
Wuah, Wuuuah, Wuuuuuuaaaah!
No.  Now i'm exhausted.
msg133768 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-14 19:53
On Thu, Apr 14, 2011 at 06:05:59PM +0000, Torsten Becker wrote:
> :)

Mumble...
:)
msg133935 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-17 19:06
Hi, here is my revised patch with email.utils.getaddresses() also decoding IDNs.

I decided to integrate IDN decoding in AddrlistClass.getaddress() instead of AddrlistClass.getaddrlist() since that function is one level lower and if somebody should ever all it directly, the conversion would not happen.

I also fixed a glitch in the docs, "versionchanged" seems to need two colons to end up in the generated HTML.


As a follow up, wouldn't it be helpful if email.Message would do the conversions directly?  So when you parse a mail into a Message and access the "To" field, you get a list of tuples which are decoded properly?

For example the following test currently still fails because the quoted header value is not decoded by email.feedparser.FeedParser nor email.Message:

    def test_email_decodes_idns_and_unicode(self):
        text = '''\
To: =?utf-8?b?SMOkbnMgV8O8cnN0?= <hans@xn--dm-fka.ain>

Hello World!'''
        msg = Parser().parsestr(text)
        self.assertEqual(utils.getaddresses(msg.get_all('To')),
            [('H\xe4ns W\xfcrst', 'hans@d\xf6m.ain')])

Am I using the package wrong here or is this actually missing?  email.header.decode_header seems to be able to do this already but it is not used.  Would it be safe to integrate this into the email.message._sanitize_header helper?
msg133936 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-17 19:37
Thanks.  I should be able to look at this tomorrow.

You are correct about the fact that Message currently doesn't do any decoding.  That is part of the design: you get the string out of Message and use the helper decoding functions (decode_header, getaddresses, etc) to get actual usable data out of it.

Part of the email6 design addresses this with a new API that will make it much easier to get at the parsed data (methods on a header object returned from msg['xxx']).  But for the current package the way it works it the way it is supposed to work ;)
msg161807 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-28 20:51
I'm finally getting back around to this.  Torsten, could you submit a contributor agreement, please?  (http://www.python.org/psf/contrib/)

And to answer the question you had about the 'still failing' test, parseaddr isn't currently doing the encoded-word decode.  Either it should do that too, or it shouldn't do the IDNA decode.  I think it should do the decode, because parseaddr and formataddr are supposed to be inverses.  And decoding when decoding didn't used to be done seems like a backward compatible change: a program doing its own decoding just won't find anything that needs decoding.
msg170381 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-09-12 14:14
3.3 is in feature freeze mode. This new feature has to go into 3.4.
msg213328 - (view) Author: Freek Dijkstra (macfreek) Date: 2014-03-12 23:24
@r.david.murray
Quote from issue20083:

"You can also help me to remember to commit 11783 after the final release of 3.4.0."

Ping!
msg213332 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-12 23:37
Final isn't out yet :)
msg220240 - (view) Author: Milan Oberkirch (zvyn) * Date: 2014-06-11 06:55
Ping :)
msg220253 - (view) Author: Milan Oberkirch (zvyn) * Date: 2014-06-11 11:20
Here comes an updated patch based on 'email_address_idna.patch' without breaking smtplib (as the previous patches did).
msg294937 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-06-01 13:24
zvyn, thanks for your patch.

However I'm sorry to say that Python stdlib's IDNA support is fundamentally broken by design. Therefore I'm against any IDNA related patches until we have addresses multiple issues with internationalized domain names. Our naive support of IDNA in socket module and ssl module is a security issue waiting to be happening.

* Python blindly assume that 'idna' is the only transformation of IDN U-labels into IDN A-labels. That's just plain wrong. Python's idna is really IDNA-2003.
* Besides IDNA 2003 there is also IDNA 2008. Of course the encodings are not compatible to each other.
* The old encoding IDNA-2003 and *MUST NOT* be used for some TLDs like .de because has an incorrect mapping for several characters like 'ß'.
* IDNA-2008 does not support upper case letters. Most applications want to use UTR46 mapping for IDNA-2008.
* On the application side, mapping of IDN U-labels must go through an additional validation layer to counteract homoglyphic confusion attacks. (e.g. cyrillic 'r' looks like latin 'p').

Before we add more security issues to libraries, we should come up with a plan to address this mess. First step: add IDNA-2008 and UTR46 support to stdlib.

I'm deeply sorry for dragging you into this mess. :/

PS: I have removed the 'easy' keyword.
msg294938 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-06-01 13:31
http://www.unicode.org/reports/tr46/#IDNA2008-Section

    Additions. Some IDNs are invalid in IDNA2003, but valid in IDNA2008.
    Subtractions. Some IDNs are valid in IDNA2003, but invalid in IDNA2008.
    Deviations. Some IDNs are valid in both, but resolve to different destinations.
msg294987 - (view) Author: Milan Oberkirch (zvyn) * Date: 2017-06-02 07:24
I see your point, but I'm not fully convinced it relates to this PR directly: the code here just uses the standard interface to use an 'idna' codec. If that codec is buggy that is a different issue.

If it's so buggy that using it is absolutely pointless, it should not exist in the first place IMHO. But if we need to keep a useless codec for compabilety reasons and you prefer adding a now codec (e.g. 'idna-2008') I agree that we should put this PR on hold until a codec it could use exists.
msg295023 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-06-02 15:12
We can't replace IDNA-2003 with IDNA-2008 either. Some applications / protocols / TLDs still use IDNA-2003. I see two options, (1) make encoding configurable somehow, (2) drop implicit IDNA encoding/decoding and force applications to perform explicit IDNA.
msg295027 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-06-02 15:41
The email package currently forces explicit IDNA use currently.  The new policies are supposed to support it automatically but I they currently don't.  I'm not sure we should add it to the old interface (parseaddr/formataddr) any longer, but I don't object to doing it, either.

Not handling idna automatically would go against the entire design of the new email policies, which is to produce unicode from the wire encoding for programs to work with, and convert back to wire protocol on output.

The work on resolving the idna2008 issue belongs in issue #17305, where MvL (who wrote the original idna codec) points to IMO the correct solution (a uts46 codec) in msg217218.
msg295028 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-06-02 15:55
Adding automatic IDNA decoding is like opening Pandora's box.

uts46 is not necessarily the correct solution. The correct handling of internationalized domain names depends on multiple factors: TLD, remote application and more. You actually have to know if the remote peer uses 2003 or 2008. For any user facing application you cannot simply expose all of IDNA to the user (homoglypic confusion attacks).
msg295031 - (view) Author: Paul Kehrer (reaperhulk) Date: 2017-06-02 16:08
As someone who built an idna aware API for pyca/cryptography and deeply regrets it I'd like to weigh in on the side of saying that IDNA is a presentation issue and that supporting it in lower level APIs is the cause of many bugs, some of which can potentially be security issues. Users wanting to make requests to IDNA domains should be responsible for the encoding themselves so that impedance mismatches in encoding version are both discoverable and correctable.
msg295035 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-06-02 16:29
In other words, this was a major standards screwup and we get to deal with the consequences :(

All right, since I'm hardly likely to have time to deal with it anyway, we'll just say that email isn't going to handle unicode domain names until *someone* figures out how to do this right.  And it sounds like that may be never.  Because saying that "users that want to make requests to IDNA domains should be responsible for the encoding themselves" is, really, a *complete* non-starter from any perspective I can think of, and has its own security issues.  If UTF46 does not do the job, we are just out of luck and the users will pay the price.
History
Date User Action Args
2017-06-02 16:29:14r.david.murraysetmessages: + msg295035
2017-06-02 16:09:01reaperhulksetnosy: - reaperhulk
2017-06-02 16:08:27reaperhulksetnosy: + reaperhulk
messages: + msg295031
2017-06-02 15:55:39christian.heimessetmessages: + msg295028
2017-06-02 15:41:34r.david.murraysetmessages: + msg295027
versions: + Python 3.7, - Python 3.5
2017-06-02 15:12:03christian.heimessetmessages: + msg295023
2017-06-02 07:24:41zvynsetmessages: + msg294987
2017-06-01 13:31:25christian.heimessetmessages: + msg294938
2017-06-01 13:24:08christian.heimessetkeywords: - easy

messages: + msg294937
2017-06-01 04:18:11zvynsetpull_requests: + pull_request1979
2014-06-11 11:20:02zvynsetfiles: + issue11783-rdm-fixed.patch

messages: + msg220253
2014-06-11 08:54:45zvynsetfiles: + issue11783.patch
nosy: + jesstess
2014-06-11 06:55:13zvynsetnosy: + zvyn
messages: + msg220240
2014-03-12 23:37:01r.david.murraysetmessages: + msg213332
2014-03-12 23:24:34macfreeksetnosy: + macfreek

messages: + msg213328
versions: + Python 3.5, - Python 3.4
2013-12-28 17:46:01r.david.murraylinkissue20083 dependencies
2012-09-12 14:14:02christian.heimessetnosy: + christian.heimes

messages: + msg170381
versions: + Python 3.4, - Python 3.3
2012-05-28 20:51:35r.david.murraysetassignee: r.david.murray ->
2012-05-28 20:51:26r.david.murraysetmessages: + msg161807
2012-05-28 20:24:53r.david.murraysetnosy: + barry
components: + email, - Library (Lib)
2011-04-17 19:37:50r.david.murraysetmessages: + msg133936
2011-04-17 19:06:30torsten.beckersetfiles: + issue-11783-v4.patch

messages: + msg133935
2011-04-14 19:53:44sdaodensetmessages: + msg133768
2011-04-14 19:52:32sdaodensetmessages: + msg133767
2011-04-14 18:05:58torsten.beckersetmessages: + msg133757
2011-04-14 17:33:50sdaodensetmessages: + msg133754
2011-04-13 22:02:22r.david.murraysetmessages: + msg133699
2011-04-13 21:42:21torsten.beckersetmessages: + msg133696
2011-04-13 19:14:52r.david.murraysetfiles: + email_address_idna.patch

messages: + msg133686
2011-04-11 14:21:23r.david.murraysetmessages: + msg133522
2011-04-11 13:02:21torsten.beckersetfiles: + issue-11783-v2.patch

messages: + msg133519
2011-04-11 11:11:06r.david.murraysetmessages: + msg133512
2011-04-09 20:58:39torsten.beckersetfiles: + issue-11783-v1.patch
keywords: + patch
messages: + msg133421
2011-04-09 11:37:20sdaodensetmessages: + msg133381
2011-04-08 21:03:51torsten.beckersetmessages: + msg133343
2011-04-08 20:22:49r.david.murraysetmessages: + msg133338
2011-04-08 19:32:22sdaodensetnosy: + sdaoden
messages: + msg133335
2011-04-06 14:12:00r.david.murraycreate