Title: smtpd should not decode utf-8
Type: enhancement Stage: resolved
Components: email, Library (Lib) Versions: Python 3.5
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Duke.Dougal, Illirgway, barry, jesstess, lpolzer, maciej.szulik, python-dev, r.david.murray, richard, sreepriya, vstinner, zvyn
Priority: normal Keywords: patch

Created on 2013-11-20 10:51 by lpolzer, last changed 2015-05-19 11:19 by r.david.murray. This issue is now closed.

File name Uploaded Description
smtpd_charset_latin1.diff lpolzer, 2013-11-20 12:48 Make smtpd use latin1 instead of utf-8 as default decoding.
python3.3-lib-smtpd-patch.diff Illirgway, 2013-11-26 20:29 move utf-8 decode to the end of line rcv process
switch_while_decode1.patch sreepriya, 2014-04-02 11:39 Patch to switch between utf8 and binary decode with decode_data variable
switch_while_decode2.patch sreepriya, 2014-04-02 20:26 Switch between utf8 and binary decode based on decode_data var
issue19662_v1.patch maciej.szulik, 2014-05-28 21:52 decode_data extension for smtpd (patch v1)
issue19662_v2.patch maciej.szulik, 2014-05-29 20:36 decode_data extension for smtpd (patch v2)
issue19662_v3.patch maciej.szulik, 2014-05-30 10:21 decode_data extension for smtpd (patch v3)
Messages (29)
msg203467 - (view) Author: Leslie P. Polzer (lpolzer) Date: 2013-11-20 10:51

smtpd as of now decodes incoming bytes as UTF-8.

An SMTP server must not attempt to interpret characters beyond ASCII, however. Originally mail servers were not 8-bit clean, meaning they would only guarantee the lower 7 bits of each octet to be preserved.
However even then they were not expected to choke on any input because of attempts to decode it into a specific extended charset. Whenever a mail server does not need to interpret data (like base64-encoded auth information) it is simply left alone and passed through.

I am not aware of the reasons that caused the current state, but to correct this behavior and make it possible to support the 8BITMIME feature I suggest decoding received bytes as latin1, leaving it to the user to reinterpret it as UTF-8 or whatever charset they need. Any other simple extended encoding could be used for this, but latin1 is the default in asynchat.

The documentation should also mention charset handling. I'll be happy to submit a patch for both code and docs.
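To illustrate the difference described above, here is a minimal sketch (not code from smtpd; the sample payload is made up). It shows that latin1 maps every byte value to a code point and round-trips losslessly, whereas UTF-8 decoding raises on arbitrary 8-bit input:

```python
# Made-up 8-bit payload: b"\xe9" is "é" in latin1 but an invalid lone byte in UTF-8.
payload = b"Subject: caf\xe9\r\n"

# latin1 decodes any byte sequence and encodes back to the identical bytes.
as_text = payload.decode("latin1")
round_tripped = as_text.encode("latin1")

# UTF-8 rejects the same input.
try:
    payload.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

The text obtained this way is only meaningful as latin1; a consumer that expects another charset still has to re-encode to bytes and decode again.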
msg203473 - (view) Author: Leslie P. Polzer (lpolzer) Date: 2013-11-20 12:48
Patch attached. This also adds some more charset clarification to the docs and corrects a minor spelling issue.

It is also conceivable that we add a charset attribute to the class. This should have the safe default of latin1, and some notes in the docs that setting this to utf-8 (and probably other utf-* encodings) is not really standards-compliant.
msg203477 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-20 13:53
This bug was apparently introduced as part of the work from issue 4184 in python 3.2.  My guess, looking at the code, is that the module simply didn't work before that patch, since it would have been attempting to join binary data using a string join (''.join(...)).  Richard says in the issue that he wrote tests, so he probably figured out it wasn't working and "fixed" it.  It looks like there was no final review of his patch (at least not via the tracker...the patch uploaded to the tracker did not include the decode).  Not that a final review would necessarily have caught the bug...

The problem here is backward compatibility.

In terms of the API, it really ought to be producing binary data, and not decoding at all.  But, at the time he wrote the patch the email package couldn't handle binary data (Richard's patch landed in July 2010, binary support in the email package landed in October), so presumably nobody was thinking about binary emails.

I'm really not sure what to do here, I'll have to give it some thought.
msg203488 - (view) Author: Leslie P. Polzer (lpolzer) Date: 2013-11-20 15:02
Since this is my first contribution I'm not entirely sure about the fine details of backwards compatibility in Python, so please forgive me if I'm totally missing the mark here.

There are facilities in smtpd's parent class asynchat that perform the necessary conversions automatically if the user sets an encoding, so smtpd should be adjusted to rely on that and thus give the user the opportunity to choose for themselves.

Then it boils down to breaking backwards compatibility by setting a default encoding, which could be none as you suggest or latin1 as I suggest; either will probably be painful for current users.

My take here is that whoever is using this code for their SMTP server and hasn't given the encoding issues any thought will need to take a look at their code in that respect anyway, so IMHO a break with compatibility might be a bit painful but necessary.

If you agree then I will gladly rework the patch to have smtpd work with an underlying byte stream by default, rejecting anything non-ASCII where necessary.

Later patches could bring 8BITMIME support to smtpd, with charset conversion as specified by the MIME metadata.
msg203496 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-20 16:06
I think the only backward compatible solution is to add a switch of *some* sort (exact API TBD), whose default is to continue to decode using utf-8, and document it as wrong.

Conversion of an email to unicode should be handled by the email package, not by smtpd, which is why I say smtpd should be emitting binary.

As I say, I need to find time to look at the current API in more detail before I'll be comfortable discussing the new API.  I've put it on my list, but likely I won't get to it until the weekend.
msg203497 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-20 16:10
Oh, and to clarify: the backward compatibility is that if code works with X.Y.Z, it should work with X.Y.Z+1.  So even though correctly handling binary mail would indeed require someone to reexamine their code, if things happen to be working OK for them (eg: their program only needs to handle utf-8 email), we don't want to break their working program.
msg204527 - (view) Author: (Illirgway) Date: 2013-11-26 20:29
Here is another patch for fixing this issue:

Sorry for my bad english
msg204540 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-26 22:22
As I said, the decoding needs to be controlled by a switch (presumably a keyword argument to SMTPServer) that defaults to the present (incorrect) behavior.
msg210431 - (view) Author: Duke Dougal (Duke.Dougal) Date: 2014-02-07 01:58
Is there a workaround for this? I'd like to just receive binary data from smtpd. I'm new to this system - is this scheduled for fixing in Python 3.4?
msg210433 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-07 02:38
Unfortunately I did not get to this before the 3.4 beta release, so no, it won't be fixed in 3.4.

You can work around it by overriding collect_incoming_data in your subclass and doing data.decode('ascii', 'surrogateescape') instead of str(data, 'utf-8'), and then doing mydata.encode('ascii', 'surrogateescape') at the point where you want to turn the data back into binary.
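The round trip at the heart of this workaround can be sketched without smtpd itself. The helper names and sample data below are hypothetical; only the `decode`/`encode` calls with the `surrogateescape` error handler come from the suggestion above:

```python
def to_text(data: bytes) -> str:
    # Non-ASCII bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
    return data.decode("ascii", "surrogateescape")


def to_bytes(text: str) -> bytes:
    # The surrogates are turned back into the original bytes.
    return text.encode("ascii", "surrogateescape")


raw = b"8-bit body: \xff\xfe\xe9\r\n"  # made-up sample data
text = to_text(raw)
restored = to_bytes(text)
```

In a real subclass, `to_text` would correspond to what the overridden collect_incoming_data does, and `to_bytes` to the point where the message is turned back into binary.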
msg213897 - (view) Author: Sreepriya Chalakkal (sreepriya) * Date: 2014-03-17 21:17
Hi David, 

I would like to work on this bug. Can you give some more insight into the main issue? As far as I understand, the smtp server currently decodes the incoming bytes as UTF-8. Why do you say that this is not the right way? Can you give some idea of the right convention? Also, you mention a solution with a switch whose default case is utf8. What are the other cases? You also mention that smtpd should be emitting binary and that unicode should be handled by the email package.
But is it possible to make that change now, given that other functions depending on this might be affected?
msg214010 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-18 19:48
I propose that we add a new keyword argument to SMTP's __init__, 'decode_data'.  This would be set to True by default, and would preserve the current behavior of passing utf-8 decoded data to process_message.

Setting it to False would mean that process_message would get passed binary (undecoded) data.

In 3.5 we add this keyword, but we immediately deprecate 'decode_data=True'.  In 3.6 we change the default to decode_data=False, and we deprecate the decode_data keyword.  Then in 3.7 we drop the decode_data keyword.

Now, as for implementation: what 'push' currently does (encode to ascii) is just fine for now.  What we need to change is collect_incoming_data (where the decode happens) and found_terminator (where the data is passed to other parts of the class or its subclasses).

When decode_data is False, collect_incoming_data should not decode.  received_lines should be binary.  Then, in found_terminator the else branch of the if can pass the binary received_lines into process_message (care will be needed to use the correct data types for the various operations).  In the first branch of the if, though, when decode_data is False the data will now need to be decoded (still, I think, using utf-8) so that text can still be used to manipulate this part of the API, since unlike the message data it *is* conceptually text, just encoded as ASCII.  (I suggest still decoding using utf-8 rather than ASCII because this will be useful when we implement RFC6531.)  This will provide for the smallest number of needed changes to subclasses when converting to decode_data=False mode.
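The control flow proposed above can be sketched roughly as follows. This is a simplified, hypothetical stand-in class, not smtpd's actual code: the state handling is reduced to two states, there is no terminator switching, and `delivered` merely records what process_message would be given.

```python
class SketchChannel:
    """Hypothetical stand-in for the channel class; not smtpd's actual code."""

    COMMAND, DATA = range(2)

    def __init__(self, decode_data=True):
        self.decode_data = decode_data
        self.state = self.COMMAND
        self.received_lines = []
        self.commands = []      # command lines, always str
        self.delivered = None   # what process_message would be given

    def collect_incoming_data(self, data):
        # Decode only in decode_data mode; otherwise keep the raw bytes.
        if self.decode_data:
            data = str(data, "utf-8")
        self.received_lines.append(data)

    def found_terminator(self):
        empty = "" if self.decode_data else b""
        line = empty.join(self.received_lines)
        self.received_lines = []
        if self.state == self.COMMAND:
            # Commands are conceptually text, so decode them in bytes mode too.
            if not self.decode_data:
                line = str(line, "utf-8")
            self.commands.append(line)
            if line.upper() == "DATA":
                self.state = self.DATA
        else:
            # Message data is passed on in the type it was collected in.
            self.delivered = line
            self.state = self.COMMAND


binary = SketchChannel(decode_data=False)
for chunk in (b"DATA", b"caf\xe9"):
    binary.collect_incoming_data(chunk)
    binary.found_terminator()

decoded = SketchChannel(decode_data=True)
for chunk in (b"DATA", "caf\u00e9".encode("utf-8")):
    decoded.collect_incoming_data(chunk)
    decoded.found_terminator()
```

In both modes the command line arrives as text, so command-handling code in subclasses would not need to change; only the message payload differs in type.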
msg215375 - (view) Author: Sreepriya Chalakkal (sreepriya) * Date: 2014-04-02 11:39
Hi David,
The variable decode_data is included to control decoding. But I am not sure what needs to be done when calling process_message inside found_terminator with binary data. How should binary data be handled there? Can you tell me which data types are involved?
msg216843 - (view) Author: Maciej Szulik (maciej.szulik) * (Python triager) Date: 2014-04-19 05:07
Sreepriya, are you still working on this issue? If not, I'll be happy to take it over; if yes, start by fixing the following things:
- start with tests - it is most important to have each feature tested
- decode_data, as David mentioned, needs to have the default value True, meaning that __init__ should look like this:
def __init__(self, server, conn, addr, data_size_limit=DATA_SIZE_DEFAULT, map=None, decode_data=True)
Assigning True in __init__ will make this value always True, and that's not the point.
- add a deprecation warning about this parameter using the warnings module:
warnings.warn('decode_data=True is deprecated, data will not be decoded by default', DeprecationWarning, 2)
- as for the found_terminator method, what David means is to decode the data in the first if, where commands are checked, to simplify processing of this part (David, please correct me if I'm wrong), and not what you did
- and finally you need to update the docs to include the decode_data parameter with information about how it works and its deprecation
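The second and third points of the list above can be sketched like this. `DemoChannel` is a hypothetical class with a reduced signature, used only to show the keyword being stored rather than hardcoded and the warning being raised with the warnings module:

```python
import warnings


class DemoChannel:
    """Hypothetical class used only to illustrate the keyword argument."""

    def __init__(self, decode_data=True):
        # Store the caller's value; hardcoding `self.decode_data = True`
        # here would make the keyword useless.
        self.decode_data = decode_data
        if decode_data:
            warnings.warn(
                "decode_data=True is deprecated, data will not be decoded by default",
                DeprecationWarning, 2)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    decoding = DemoChannel()                   # default value: warns
    binary = DemoChannel(decode_data=False)    # opt-out: silent
```

The stacklevel of 2 makes the warning point at the caller's line instead of at the `warnings.warn` call inside `__init__`.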
msg217135 - (view) Author: Sreepriya Chalakkal (sreepriya) * Date: 2014-04-24 18:44
Hi Maciej,
I am travelling now, so it may be a while before I can work on this! I got to know that you are working on RFC 6532. You might take this up and fix it, as it is related to your work and I don't want to create delays.
msg218888 - (view) Author: Duke Dougal (Duke.Dougal) Date: 2014-05-21 22:22
Is this one likely to be included in 3.5? It effectively breaks smtpd so it would be good to see it working again.
msg218899 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-05-22 15:21
Yes, this will be fixed in 3.5 one way or another.
msg218900 - (view) Author: Maciej Szulik (maciej.szulik) * (Python triager) Date: 2014-05-22 15:23
I'll try to take care of this issue in the following few days.
msg219308 - (view) Author: Maciej Szulik (maciej.szulik) * (Python triager) Date: 2014-05-28 21:52
I'm attaching file issue19662_v1.patch. David please have a look at it and let me know if this is it, if not I'm waiting for your suggestions.
msg219353 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-05-29 17:31
Added review comments.
msg219363 - (view) Author: Maciej Szulik (maciej.szulik) * (Python triager) Date: 2014-05-29 20:35
I've implemented all your proposed changes; for most of them I had been thinking pretty much the same way all day today, to make the code more elegant. The current state of work is attached as issue19662_v2.patch
msg219382 - (view) Author: Maciej Szulik (maciej.szulik) * (Python triager) Date: 2014-05-30 10:21
I've included Leslie's comments in the rst file. The 3rd version is attached as issue19662_v3.patch.
msg220278 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-06-11 15:18
New changeset 4e22213ca275 by R David Murray in branch 'default':
#19662: add decode_data to smtpd so you can get at DATA in bytes form.
msg220279 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-06-11 15:25
Thanks, Maciej. 

I tweaked the patch a bit, you might want to take a look just for your own information.  Mostly I fixed the warning stuff, which I didn't explain very well.  The idea is that if the default is used (no value is specified), we want there to be a warning.  But if a value *is* specified, there should be no warning (the user knows what they want).  To accomplish that we make the actual default value None, and check for that.  I also had to modify the tests so that warnings aren't issued, as well as test that they actually get issued when the default is used.

I also added versionchanged directives and a whatsnew entry, and expanded the decode_data docs a bit.
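The None-sentinel approach described in this message can be sketched as follows. `DemoServer`, `count_warnings`, and the warning text are hypothetical; only the idea of defaulting to None and warning when the default is used comes from the message above:

```python
import warnings


class DemoServer:
    """Hypothetical class; only the None-sentinel idea comes from the thread."""

    def __init__(self, decode_data=None):
        if decode_data is None:
            # The caller relied on the default: warn, then keep the old behavior.
            warnings.warn(
                "The decode_data default will change to False; "
                "pass decode_data explicitly to silence this warning.",
                DeprecationWarning, 2)
            decode_data = True
        self.decode_data = decode_data


def count_warnings(**kwargs):
    # Helper for the demonstration: how many warnings does construction emit?
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        DemoServer(**kwargs)
    return len(caught)
```

With this sketch, `count_warnings()` reports one warning, while passing either `decode_data=True` or `decode_data=False` explicitly reports none, which is the distinction a plain `decode_data=True` default cannot make.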
msg220284 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-06-11 16:27
New changeset a6c846ec5fd3 by R David Murray in branch 'default':
#19662: Eliminate warnings in other test modules that use smtpd.
msg243348 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-05-16 18:18
New changeset a7d3074fa888 by R David Murray in branch 'default':
#19662: Make requirement to support arbitrary keywords explicit.
msg243564 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * (Python triager) Date: 2015-05-19 08:00
> New changeset a7d3074fa888 by R David Murray in branch 'default':
> #19662: Make requirement to support arbitrary keywords explicit.

msg243579 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-05-19 11:18
New changeset a3f2b171b765 by R David Murray in branch 'default':
#19662: fix typo
msg243580 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-05-19 11:19
Thanks, Arfrever.
Date User Action Args
2015-05-19 11:19:16 r.david.murray set messages: + msg243580
2015-05-19 11:18:53 python-dev set messages: + msg243579
2015-05-19 08:00:16 Arfrever set nosy: + Arfrever; messages: + msg243564
2015-05-16 18:18:25 python-dev set messages: + msg243348
2014-06-11 16:27:57 python-dev set messages: + msg220284
2014-06-11 15:25:23 r.david.murray set status: open -> closed; resolution: fixed; messages: + msg220279; stage: patch review -> resolved
2014-06-11 15:18:34 python-dev set nosy: + python-dev; messages: + msg220278
2014-06-10 16:50:56 zvyn set nosy: + jesstess, zvyn
2014-05-30 10:21:31 maciej.szulik set files: + issue19662_v3.patch; messages: + msg219382
2014-05-29 20:36:06 maciej.szulik set files: + issue19662_v2.patch; messages: + msg219363
2014-05-29 17:31:39 r.david.murray set messages: + msg219353
2014-05-28 21:53:02 maciej.szulik set files: + issue19662_v1.patch; messages: + msg219308
2014-05-22 15:23:33 maciej.szulik set messages: + msg218900
2014-05-22 15:21:27 r.david.murray set messages: + msg218899
2014-05-21 22:22:01 Duke.Dougal set messages: + msg218888
2014-04-24 18:44:59 sreepriya set messages: + msg217135
2014-04-19 05:07:12 maciej.szulik set nosy: + maciej.szulik; messages: + msg216843
2014-04-02 20:26:24 sreepriya set files: + switch_while_decode2.patch
2014-04-02 11:39:07 sreepriya set files: + switch_while_decode1.patch; messages: + msg215375
2014-03-18 19:48:57 r.david.murray set messages: + msg214010
2014-03-17 21:17:35 sreepriya set nosy: + sreepriya; messages: + msg213897
2014-02-07 02:38:31 r.david.murray set messages: + msg210433
2014-02-07 01:58:05 Duke.Dougal set nosy: + Duke.Dougal; messages: + msg210431
2013-11-26 22:23:12 r.david.murray set versions: + Python 3.5, - Python 3.4
2013-11-26 22:22:49 r.david.murray set messages: + msg204540
2013-11-26 20:29:45 pitrou set stage: patch review; versions: + Python 3.4, - Python 3.3
2013-11-26 20:29:07 Illirgway set files: + python3.3-lib-smtpd-patch.diff; versions: + Python 3.3, - Python 3.5; nosy: + Illirgway; messages: + msg204527
2013-11-20 16:10:56 r.david.murray set messages: + msg203497
2013-11-20 16:06:33 r.david.murray set messages: + msg203496; versions: + Python 3.5, - Python 3.3, Python 3.4
2013-11-20 15:02:42 lpolzer set messages: + msg203488
2013-11-20 13:53:53 r.david.murray set versions: + Python 3.4, - Python 2.6, Python 3.1, Python 2.7, Python 3.2; nosy: + barry, richard, r.david.murray; messages: + msg203477; components: + email
2013-11-20 12:48:27 lpolzer set files: + smtpd_charset_latin1.diff; keywords: + patch; messages: + msg203473
2013-11-20 10:52:31 vstinner set nosy: + vstinner
2013-11-20 10:51:43 lpolzer create