classification
Title: email parseaddr and formataddr should be IDNA aware
Type: enhancement Stage: needs patch
Components: email Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, christian.heimes, r.david.murray, sdaoden, torsten.becker
Priority: normal Keywords: easy, patch

Created on 2011-04-06 14:12 by r.david.murray, last changed 2012-09-12 14:14 by christian.heimes.

Files
File name Uploaded Description Edit
issue-11783-v1.patch torsten.becker, 2011-04-09 20:58 IDNA awareness for formataddr() and parseaddr() + tests + docs review
issue-11783-v2.patch torsten.becker, 2011-04-11 13:02 Fix for local-part only addresses + test case review
email_address_idna.patch r.david.murray, 2011-04-13 19:14 review
issue-11783-v4.patch torsten.becker, 2011-04-17 19:06 getaddresses() also decodes IDNs review
Messages (20)
msg133133 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-06 14:12
The patch for issue 1690608 adds support for unicode in the realname field to formataddr.  To complete the currently-workable internationalization of address specs, both parseaddr and formataddr should be made IDNA aware.  It is probably a good idea to do some research on this to double check, but it seems as though recognizing IDNA during parsing and generating it during formatting when the domain contains non-ASCII characters should be both safe and useful.

See also issue 963906, which contains some test cases that could be adapted.
msg133335 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-08 19:32
If Torsten does not have time for this i'll do that next.
I hate this useless Punycode (but for nameprep),
so it's exactly the right thing for me to do next in 2011.
Unless someone complains.
I've not looked at formataddr/parseaddr, so that's a motivation.
Torsten?
msg133338 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-08 20:22
I believe Torsten is interested, but he can of course speak for himself.

Just to make sure you are aware: Python has a built in idna codec, so this should be a fairly simple patch.
msg133343 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-08 21:03
I was about to look into this over the weekend, but of course I don't
want to steal your fun, Steffen. :)
msg133381 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-09 11:37
On Fri, Apr 08, 2011 at 09:03:51PM +0000, Torsten Becker wrote:
> I was about to look into this over the weekend, but of course I don't
> want to steal your fun, Steffen. :)

Toll, toll, toll!!
Still cherry blossom, thanks to the weather, apple blossom 
started, more than 20 degrees.
I really understand if a young german man rather prefers staying 
at home over the weekend, looking at its beautiful, intelligent 
chancellor whenever the TV is started, than doing whatever else!

Dear god, all these nice memories ...
14" BW CRT monitors, 30 hours at a time ...
Have a nice weekend!
msg133421 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-09 20:58
> Have a nice weekend!

Thank you for the wishes, I hope yours is going well, too!

I added IDNA awareness to formataddr() and parseaddr(), updated the docs and wrote 2 tests for it.

I wasn't sure if the IDNA awareness should be optional via a argument or always automatically enabled, I favored the latter.

Also, is it safe to split at "@" and encode/decode the last component?  I am not familiar with all the weird variants a email address could be in strictly after the RFCs.
msg133512 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-11 11:11
Patch mostly looks good to me, modulo some English wording that I'll fix up when I commit it.

The issue with the '@' is that it might not be there.  So you do need to check for that case (it means the domain part defaults to the 'local' domain).  I should double check the RFC to make sure that's the only special case, but I believe it is.
msg133519 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-11 13:02
> modulo some English wording that I'll fix up when I commit it.

Yeah, sorry for that, I seem to have trouble with writing good documentation. :)  I'll have a look at the documents referenced by [1] to improve my writing.

> The issue with the '@' is that it might not be there.

I added a fix and a test for this in v2.  However, when reading through the RFC [2] and Wikipedia [3], it seems like this is not actually allowed.

Is there a way to internationalize the local-part as well?  That is the only part which is missing now that domain and real name are covered.


[1]: http://docs.python.org/devguide/docquality.html
[2]: http://tools.ietf.org/html/rfc5322#section-3.4
[3]: http://en.wikipedia.org/wiki/Email_address#Invalid_email_addresses
msg133522 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-11 14:21
Hmm.  You are correct.  I thought the RFC's covered this case, but apparently they don't.

The email package gets used in MUA contexts, where the domain part of the address may be omitted and the MUA must fill it in before transmitting the message to the MTA.  And some MTAs will fill in the local domain if it is omitted, so actually it applies in an MTA context, too.  So we need to support this case even if it isn't covered by the RFCs.

Thanks, the revised patch looks good.
msg133686 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-13 19:14
OK, so when I went to apply this, I figured out that the patch isn't quite right.  I've redone the doc updates, and am attaching a version of the patch containing them.

The issue is that the place that the IDNA decode support needs to be added isn't in parseaddr, it's in _parseaddr.py's AddresslistClass.  Tests are then needed to make sure that the IDNA decoding gets done both when parseaddr and getaddresslist are used.

Do you want to tackle this, Torsten?
msg133696 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-13 21:42
> OK, so when I went to apply this, I figured out that the patch isn't quite right.  I've redone the doc updates, and am attaching a version of the patch containing them.
>
> The issue is that the place that the IDNA decode support needs to be added isn't in parseaddr, it's in _parseaddr.py's AddresslistClass.  Tests are then needed to make sure that the IDNA decoding gets done both when parseaddr and getaddresslist are used.
>
> Do you want to tackle this, Torsten?

I would like to, but I probably will not get to it before Monday.  So
if anybody wants to work on this before that time, please feel free to
fix it properly. :)

Just two questions for the implementation:
  1. Would it be fine to move the helper _encode_decode_addr() into
_parseaddr.py and then import it in util.py, so it can be shared
between the two?
  2. Would line 232 in _parseaddr.py (AddrlistClass.getaddrlist) be a
good place to integrate it?
msg133699 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-13 22:02
Yes, putting the function in _parseaddr is fine.  And yes, 232 looks like a good place.  The alternative would be understanding the rfc822 parser, which is pretty mind bending, and of doubtful additional benefit.  (At most it would save a pair of split/join calls.)
msg133754 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-14 17:33
On Wed, Apr 13, 2011 at 09:42:22PM +0000, Torsten Becker wrote:
> So if anybody wants to work on this before that time,
> please feel free to fix it properly. :)

(The word anybody made me think.
But "fix properly" ... i'm sure you cannot refer to myself.
:))
msg133757 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-14 18:05
> (The word anybody made me think.
> But "fix properly" ... i'm sure you cannot refer to myself.
> :))

"fix properly" referred to my inferior implementation and "anybody"
should probably have been worded "Steffen or David".  So sure .. go
ahead. :)
msg133767 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-14 19:52
On Thu, Apr 14, 2011 at 06:05:59PM +0000, Torsten Becker wrote:
> So sure .. go ahead. :)

Grrrrr....
Wuah, Wuuuah, Wuuuuuuaaaah!
No.  Now i'm exhausted.
msg133768 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-04-14 19:53
On Thu, Apr 14, 2011 at 06:05:59PM +0000, Torsten Becker wrote:
> :)

Mumble...
:)
msg133935 - (view) Author: Torsten Becker (torsten.becker) Date: 2011-04-17 19:06
Hi, here is my revised patch with email.utils.getaddresses() also decoding IDNs.

I decided to integrate IDN decoding in AddrlistClass.getaddress() instead of AddrlistClass.getaddrlist() since that function is one level lower and if somebody should ever all it directly, the conversion would not happen.

I also fixed a glitch in the docs, "versionchanged" seems to need two colons to end up in the generated HTML.


As a follow up, wouldn't it be helpful if email.Message would do the conversions directly?  So when you parse a mail into a Message and access the "To" field, you get a list of tuples which are decoded properly?

For example the following test currently still fails because the quoted header value is not decoded by email.feedparser.FeedParser nor email.Message:

    def test_email_decodes_idns_and_unicode(self):
        text = '''\
To: =?utf-8?b?SMOkbnMgV8O8cnN0?= <hans@xn--dm-fka.ain>

Hello World!'''
        msg = Parser().parsestr(text)
        self.assertEqual(utils.getaddresses(msg.get_all('To')),
            [('H\xe4ns W\xfcrst', 'hans@d\xf6m.ain')])

Am I using the package wrong here or is this actually missing?  email.header.decode_header seems to be able to do this already but it is not used.  Would it be safe to integrate this into the email.message._sanitize_header helper?
msg133936 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-17 19:37
Thanks.  I should be able to look at this tomorrow.

You are correct about the fact that Message currently doesn't do any decoding.  That is part of the design: you get the string out of Message and use the helper decoding functions (decode_header, getaddresses, etc) to get actual usable data out of it.

Part of the email6 design addresses this with a new API that will make it much easier to get at the parsed data (methods on a header object returned from msg['xxx']).  But for the current package the way it works it the way it is supposed to work ;)
msg161807 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-28 20:51
I'm finally getting back around to this.  Torsten, could you submit a contributor agreement, please?  (http://www.python.org/psf/contrib/)

And to answer the question you had about the 'still failing' test, parseaddr isn't currently doing the encoded-word decode.  Either it should do that too, or it shouldn't do the IDNA decode.  I think it should do the decode, because parseaddr and formataddr are supposed to be inverses.  And decoding when decoding didn't used to be done seems like a backward compatible change: a program doing its own decoding just won't find anything that needs decoding.
msg170381 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-09-12 14:14
3.3 is in feature freeze mode. This new feature has to go into 3.4.
History
Date User Action Args
2012-09-12 14:14:02christian.heimessetnosy: + christian.heimes

messages: + msg170381
versions: + Python 3.4, - Python 3.3
2012-05-28 20:51:35r.david.murraysetassignee: r.david.murray ->
2012-05-28 20:51:26r.david.murraysetmessages: + msg161807
2012-05-28 20:24:53r.david.murraysetnosy: + barry
components: + email, - Library (Lib)
2011-04-17 19:37:50r.david.murraysetmessages: + msg133936
2011-04-17 19:06:30torsten.beckersetfiles: + issue-11783-v4.patch

messages: + msg133935
2011-04-14 19:53:44sdaodensetmessages: + msg133768
2011-04-14 19:52:32sdaodensetmessages: + msg133767
2011-04-14 18:05:58torsten.beckersetmessages: + msg133757
2011-04-14 17:33:50sdaodensetmessages: + msg133754
2011-04-13 22:02:22r.david.murraysetmessages: + msg133699
2011-04-13 21:42:21torsten.beckersetmessages: + msg133696
2011-04-13 19:14:52r.david.murraysetfiles: + email_address_idna.patch

messages: + msg133686
2011-04-11 14:21:23r.david.murraysetmessages: + msg133522
2011-04-11 13:02:21torsten.beckersetfiles: + issue-11783-v2.patch

messages: + msg133519
2011-04-11 11:11:06r.david.murraysetmessages: + msg133512
2011-04-09 20:58:39torsten.beckersetfiles: + issue-11783-v1.patch
keywords: + patch
messages: + msg133421
2011-04-09 11:37:20sdaodensetmessages: + msg133381
2011-04-08 21:03:51torsten.beckersetmessages: + msg133343
2011-04-08 20:22:49r.david.murraysetmessages: + msg133338
2011-04-08 19:32:22sdaodensetnosy: + sdaoden
messages: + msg133335
2011-04-06 14:12:00r.david.murraycreate