This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: smtpd crashes when a multi-byte UTF-8 sequence is split between consecutive data packets
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.5
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, Illirgway, barry, doerwalter, giampaolo.rodola, jesstess, petri.lehtinen, r.david.murray, zvyn
Priority: normal Keywords: needs review, patch

Created on 2013-11-27 06:10 by petri.lehtinen, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
smtpd.patch petri.lehtinen, 2013-11-27 06:10 Patch from GitHub
smtpd_undecodable_data_does_not_raise.patch zvyn, 2014-07-09 15:33 review
smtpd_undecodable_data_does_not_raiseV2.patch zvyn, 2014-07-16 19:56 review
Messages (14)
msg204553 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2013-11-27 06:10
From Illirgway, https://github.com/akheron/cpython/pull/2:

Fix utf-8 bug, when symbol byte chain splits on 2 tcp-packets:

error: uncaptured python exception, closing channel (:'utf-8' codec can't decode byte 0xd0 in position 1474: unexpected end of data [/usr/lib/python3.3/asyncore.py|read|83] [/usr/lib/python3.3/asyncore.py|handle_read_event|441] [/usr/lib/python3.3/asynchat.py|handle_read|178] [/usr/lib/python3.3/smtpd.py|collect_incoming_data|289])
msg204577 - (view) Author: (Illirgway) Date: 2013-11-27 12:34
"base64", "quoted-printable" and "7bit" messages only use ASCII range of characters so raw_message.decode('ascii') == raw_message.decode('latin1') == etc. == raw_message.decode('utf-8') due to internal representation of utf-8 characters

P.S. Python 3.4.0 beta 1 Lib/smtpd has the same bug and needs the same patch to fix this issue
msg204578 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-27 13:14
It occurs to me to wonder why smtpd is receiving 8bit data at all.  It isn't advertising the 8BITMIME capability, so the client should be sending only 7bit data.  See also issue 19662.  I'm not sure that making utf-8 decoding work better is a good idea, but I suppose that since it is a data-dependent exception in a situation that would work with different data, we should go ahead and fix it.
msg204580 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2013-11-27 13:45
Is there a reason why this patch changes the decoding error handler to 'ignore' (from the default 'strict')?
msg204585 - (view) Author: (Illirgway) Date: 2013-11-27 14:17
because "strict" throws an exception (for example, on raw win1251) which brings down production smtp daemon and "replace" embeds "ugly" characters into received (and passed) email messages
msg221208 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-21 23:11
@Petri/Illirgway could one of you produce a new patch including the code change and a test for this?
msg222622 - (view) Author: Milan Oberkirch (zvyn) * Date: 2014-07-09 15:33
I do not think that the proposed patch solves the bug because it silently changes binary input. With the patch for issue 19662 a proper solution to avoid this bug has been applied. The only thing left is to prevent the server to raise the exception when in legacy mode. Instead of deleting single bytes from the input (which is what .decode('utf-8', 'ignore') does) I would reply with an error to the client. The attached patch implements and tests this behaviour.
msg222735 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2014-07-11 10:31
I don't know anything about SMTP, but would it make sense to use an incremental decoder for decoding UTF-8?
msg223023 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-07-14 13:09
As Milan said, the problem doesn't arise in 3.5 with decode_data=False, since there's no decoding.  His patch doesn't actually fix the bug for the decode_data=True case, though, since the bug is a *valid* utf-8 sequence getting split across tcp buffers.

To fix it, we would need to change the implementation of decode_data.  Instead of conditionally decoding in collect_data, we'd need to postpone decoding to found_terminator.  This would have the undesirable affect of changing what is in the received_lines attribute, which is why we didn't do it in the decode_data patch.  Using an incremental decoder won't solve that problem, since it too would change what gets stored in received_lines.

Since decode_data=True is really not a legitimate mode for smtpd (it is an historical accident/bug) and we are planning on removing it eventually, I think we should go ahead and apply Milan's patch as is, since it does improve the error reporting.  The message would need to be adjusted though, since it can trigger on valid utf-8 data.  It should say that smtpd should be run with decode_data=False in order to fix the decode problem.

That would leave the bug as-is in 3.4, but a similar patch with an error message suggesting an upgrade to 3.5/decode_data=True could be applied.  That feels a little weird, though :).
msg223263 - (view) Author: Milan Oberkirch (zvyn) * Date: 2014-07-16 19:56
Note that in this (and my previous) patch the message is sent to the client (the idea was not to raise an exception). Maybe it would be better to raise an exception with the information you mentioned?
msg223267 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-07-16 20:30
No, I think the error return is better.  Crashing the server on bad input data (which, worse, could be perfectly valid if the server were standards compliant) is a bad thing, I think.  That is, if someone has an application that actually works with decode_data=True, I think it is better if it doesn't crash just because of different input data or TCP packet boundaries, as long as the error is reported.  I'm willing to be convinced otherwise, but that's how I'm leaning.
msg223274 - (view) Author: Milan Oberkirch (zvyn) * Date: 2014-07-16 21:04
Agreed. It just feels a bit weird to send programming instructions for the server to the client (but it's probably the best we can do here).
msg223278 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-07-16 21:17
Oh, that's a good point, I hadn't thought of that.  Maybe I can come up with a better wording.
msg309754 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2018-01-10 00:56
I'm closing this as won't fix since smtpd.py is deprecated and will likely not get any future development.  Please see aiosmtpd as a much better third party replacement.
History
Date User Action Args
2022-04-11 14:57:54adminsetgithub: 64005
2018-01-10 00:56:05barrysetstatus: open -> closed

nosy: + barry
messages: + msg309754

resolution: wont fix
stage: patch review -> resolved
2014-07-16 21:17:09r.david.murraysetmessages: + msg223278
2014-07-16 21:04:20zvynsetmessages: + msg223274
2014-07-16 20:30:12r.david.murraysetmessages: + msg223267
2014-07-16 19:56:13zvynsetfiles: + smtpd_undecodable_data_does_not_raiseV2.patch

messages: + msg223263
2014-07-14 13:09:49r.david.murraysetmessages: + msg223023
2014-07-11 10:31:13doerwaltersetnosy: + doerwalter
messages: + msg222735
2014-07-09 15:34:39zvynsetnosy: + jesstess
2014-07-09 15:33:02zvynsetfiles: + smtpd_undecodable_data_does_not_raise.patch
nosy: + zvyn
messages: + msg222622

2014-06-21 23:11:47BreamoreBoysetnosy: + BreamoreBoy

messages: + msg221208
versions: + Python 3.5, - Python 3.3
2013-11-27 14:17:07Illirgwaysetmessages: + msg204585
2013-11-27 13:45:19petri.lehtinensetmessages: + msg204580
2013-11-27 13:14:24r.david.murraysetnosy: + r.david.murray
messages: + msg204578
2013-11-27 12:34:08Illirgwaysetnosy: + Illirgway

messages: + msg204577
versions: + Python 3.4
2013-11-27 06:10:48petri.lehtinencreate