smtpd.py should not decode utf-8 #63861

Closed
lpolzer mannequin opened this issue Nov 20, 2013 · 29 comments
Labels
stdlib (Python modules in the Lib dir), topic-email, type-feature (A feature request or enhancement)

Comments

@lpolzer
Mannequin

lpolzer mannequin commented Nov 20, 2013

BPO 19662
Nosy @warsaw, @vstinner, @bitdancer, @soltysh, @zvyn
Files
  • smtpd_charset_latin1.diff: Make smtpd.py use latin1 instead of utf-8 as default decoding.
  • python3.3-lib-smtpd-patch.diff: move utf-8 decode to the end of line rcv process
  • switch_while_decode1.patch: Patch to switch between utf8 and binary decode with decode_data variable
  • switch_while_decode2.patch: Switch between utf8 and binary decode based on decode_data var
  • issue19662_v1.patch: decode_data extension for smtpd (patch v1)
  • issue19662_v2.patch: decode_data extension for smtpd (patch v2)
  • issue19662_v3.patch: decode_data extension for smtpd (patch v3)

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2014-06-11.15:25:23.243>
    created_at = <Date 2013-11-20.10:51:43.730>
    labels = ['type-feature', 'library', 'expert-email']
    title = 'smtpd.py should not decode utf-8'
    updated_at = <Date 2015-05-19.11:19:16.216>
    user = 'https://bugs.python.org/lpolzer'

    bugs.python.org fields:

    activity = <Date 2015-05-19.11:19:16.216>
    actor = 'r.david.murray'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-06-11.15:25:23.243>
    closer = 'r.david.murray'
    components = ['Library (Lib)', 'email']
    creation = <Date 2013-11-20.10:51:43.730>
    creator = 'lpolzer'
    dependencies = []
    files = ['32719', '32861', '34700', '34704', '35390', '35404', '35409']
    hgrepos = []
    issue_num = 19662
    keywords = ['patch']
    message_count = 29.0
    messages = ['203467', '203473', '203477', '203488', '203496', '203497', '204527', '204540', '210431', '210433', '213897', '214010', '215375', '216843', '217135', '218888', '218899', '218900', '219308', '219353', '219363', '219382', '220278', '220279', '220284', '243348', '243564', '243579', '243580']
    nosy_count = 13.0
    nosy_names = ['barry', 'richard', 'vstinner', 'Arfrever', 'r.david.murray', 'jesstess', 'python-dev', 'maciej.szulik', 'lpolzer', 'Illirgway', 'Duke.Dougal', 'zvyn', 'sreepriya']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue19662'
    versions = ['Python 3.5']

    @lpolzer
    Mannequin Author

    lpolzer mannequin commented Nov 20, 2013

    http://hg.python.org/cpython/file/3.3/Lib/smtpd.py#l289

    as of now decodes incoming bytes as UTF-8.

    An SMTP server must not attempt to interpret characters beyond ASCII, however. Originally mail servers were not 8-bit clean, meaning they would only guarantee the lower 7 bits of each octet to be preserved.
    However, even then they were not expected to choke on any input because of attempts to decode it into a specific extended charset. Whenever a mail server does not need to interpret data (like base64-encoded auth information), it is simply left alone and passed through.

    I am not aware of the reasons that caused the current state, but to correct this behavior and make it possible to support the 8BITMIME feature I suggest decoding received bytes as latin1, leaving it to the user to reinterpret it as UTF-8 or whatever charset they need. Any other simple extended encoding could be used for this, but latin1 is the default in asynchat.
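
To illustrate the difference between the two decodings, here is a minimal sketch. It assumes nothing about smtpd itself; the sample bytes and variable names are made up for illustration.

```python
# Bytes that are not valid UTF-8: 0xE9 is "é" in latin1, but in UTF-8 it
# starts a multi-byte sequence that is never completed here.
raw = b"Subject: caf\xe9\r\n"

# Decoding as UTF-8 fails on such input.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin1 maps every byte 0x00-0xFF to the code point of the same value,
# so decoding never fails and encoding restores the original bytes.
text = raw.decode("latin1")
roundtrip = text.encode("latin1")

print(utf8_ok)           # False
print(roundtrip == raw)  # True
```

Because the latin1 round-trip is lossless, a caller could re-encode the string and decode it again with whatever charset the message actually uses.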

    The documentation should also mention charset handling. I'll be happy to submit a patch for both code and docs.

    @lpolzer lpolzer mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Nov 20, 2013
    @lpolzer
    Mannequin Author

    lpolzer mannequin commented Nov 20, 2013

    Patch attached. This also adds some more charset clarification to the docs and corrects a minor spelling issue.

    It is also conceivable that we add a charset attribute to the class. This should have the safe default of latin1, and some notes in the docs that setting this to utf-8 (and probably other utf-* encodings) is not really standards-compliant.

    @bitdancer
    Member

    This bug was apparently introduced as part of the work from bpo-4184 in Python 3.2. My guess, looking at the code, is that the module simply didn't work before that patch, since it would have been attempting to join binary data using a string join (''.join(...)). Richard says in the issue that he wrote tests, so he probably figured out it wasn't working and "fixed" it. It looks like there was no final review of his patch (at least not via the tracker...the patch uploaded to the tracker did not include the decode). Not that a final review would necessarily have caught the bug...

    The problem here is backward compatibility.

    In terms of the API, it really ought to be producing binary data, and not decoding at all. But, at the time he wrote the patch the email package couldn't handle binary data (Richard's patch landed in July 2010, binary support in the email package landed in October), so presumably nobody was thinking about binary emails.

    I'm really not sure what to do here, I'll have to give it some thought.

    @lpolzer
    Mannequin Author

    lpolzer mannequin commented Nov 20, 2013

    Since this is my first contribution I'm not entirely sure about the fine details of backwards compatibility in Python, so please forgive me if I'm totally missing the mark here.

    There are facilities in smtpd's parent class asynchat that perform the necessary conversions automatically if the user sets an encoding, so smtpd should be adjusted to rely on that and thus give the user the opportunity to choose for themselves.

    Then it boils down to breaking backwards compatibility by setting a default encoding, which could be none as you suggest or latin1 as I suggest; either will probably be painful for current users.

    My take here is that whoever is using this code for their SMTP server and hasn't given the encoding issues any thought will need to take a look at their code in that respect anyway, so IMHO a break with compatibility might be a bit painful but necessary.

    If you agree then I will gladly rework the patch to have smtpd work with an underlying byte stream by default, rejecting anything non-ASCII where necessary.

    Later patches could bring 8BITMIME support to smtpd, with charset conversion as specified by the MIME metadata.

    @bitdancer
    Member

    I think the only backward compatible solution is to add a switch of *some* sort (exact API TBD), whose default is to continue to decode using utf-8, and document it as wrong.

    Conversion of an email to unicode should be handled by the email package, not by smtpd, which is why I say smtpd should be emitting binary.

    As I say, I need to find time to look at the current API in more detail before I'll be comfortable discussing the new API. I've put it on my list, but likely I won't get to it until the weekend.

    @bitdancer
    Member

    Oh, and to clarify: the backward compatibility is that if code works with X.Y.Z, it should work with X.Y.Z+1. So even though correctly handling binary mail would indeed require someone to reexamine their code, if things happen to be working OK for them (eg: their program only needs to handle utf-8 email), we don't want to break their working program.

    @Illirgway
    Mannequin

    Illirgway mannequin commented Nov 26, 2013

    Here is another patch for fixing this issue:

    Illirgway/cpython@12d7c59

    Sorry for my bad English

    @bitdancer
    Member

    As I said, the decoding needs to be controlled by a switch (presumably a keyword argument to SMTPServer) that defaults to the present (incorrect) behavior.

    @DukeDougal
    Mannequin

    DukeDougal mannequin commented Feb 7, 2014

    Is there a workaround for this as I'd like to just be receiving binary data from SMTPD. I'm new to this system - is this scheduled for fixing in Python 3.4?

    @bitdancer
    Member

    Unfortunately I did not get to this before the 3.4 beta release, so no, it won't be fixed in 3.4.

    You can work around it by overriding collect_incoming_data in your subclass and doing data.decode('ascii', 'surrogateescape') instead of str(data, 'utf-8'), and then doing mydata.encode('ascii', 'surrogateescape') at the point where you want to turn the data back into binary.
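
A minimal sketch of that round-trip, isolating the decode/encode step so that it runs without smtpd. The helper names to_text and to_bytes are hypothetical; in a real subclass the first call would presumably live in the overridden collect_incoming_data, and the second wherever the message is handed on as binary.

```python
def to_text(data: bytes) -> str:
    """Decode without loss: each non-ASCII byte becomes a lone surrogate."""
    return data.decode("ascii", "surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse to_text(), restoring the original bytes exactly."""
    return text.encode("ascii", "surrogateescape")


# Arbitrary 8-bit data, including bytes that are invalid as UTF-8.
raw = b"caf\xe9 \xff\xfe binary"
text = to_text(raw)

# The byte 0xE9 is smuggled through as the surrogate U+DCE9.
print(text[3] == "\udce9")    # True
print(to_bytes(text) == raw)  # True
```

Note that such a string cannot be encoded with a strict codec (for example when printing or writing it as UTF-8), so it is best treated as an opaque carrier until it is converted back to bytes.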

    @sreepriya
    Mannequin

    sreepriya mannequin commented Mar 17, 2014

    Hi David,

    I would like to work on this bug. Can you give some more insights about the main issue? As far as I understood, the smtp server is now decoding the incoming bytes as UTF-8. Why do you say that it is not the right way? Can you give some idea about the right convention? Also, you mention about a solution with a switch statement having default case as utf8. What are the other cases? And you also mention that smtpd should be emitting binary and unicode should be handled by the email package.
    But is it possible to make that change now as other functions depending on this might be affected?

    @bitdancer
    Member

    I propose that we add a new keyword argument to SMTP's __init__, 'decode_data'. This would be set to True by default, and would preserve the current behavior of passing utf-8 decoded data to process_message.

    Setting it to False would mean that process_message would get passed binary (undecoded) data.

    In 3.5 we add this keyword, but we immediately deprecate 'decode_data=True'. In 3.6 we change the default to decode_data=False, and we deprecate the decode_data keyword. Then in 3.7 we drop the decode_data keyword.

    Now, as for implementation: what 'push' currently does (encode to ascii) is just fine for now. What we need to change is collect_incoming_data (where the decode happens) and found_terminator (where the data is passed to other parts of the class or its subclasses).

    When decode_data is False, collect_incoming_data should not decode. received_lines should be binary. Then, in found_terminator the else branch of the if can pass the binary received_lines into process_message (care will be needed to use the correct data types for the various operations). In the first branch of the if, though, when decode_data is False the data will now need to be decoded (still, I think, using utf-8) so that text can still be used to manipulate this part of the API, since unlike the message data it *is* conceptually text, just encoded as ASCII. (I suggest still decoding using utf-8 rather than ASCII because this will be useful when we implement RFC6531.) This will provide for the smallest number of needed changes to subclasses when converting to decode_data=False mode.
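
The design described here can be sketched roughly as follows. ToyChannel and its commands and messages attributes are hypothetical stand-ins, not the smtpd classes or the eventual patch; only decode_data, collect_incoming_data, found_terminator and received_lines are names taken from the discussion.

```python
class ToyChannel:
    """Hypothetical stand-in for an SMTP channel; not the smtpd API."""

    COMMAND, DATA = 0, 1

    def __init__(self, decode_data=True):
        self.decode_data = decode_data
        self.state = self.COMMAND
        self.received_lines = []
        self.commands = []   # command lines, always str
        self.messages = []   # what process_message would receive

    def collect_incoming_data(self, data: bytes):
        # Decode only in the legacy mode; otherwise keep the raw bytes.
        if self.decode_data:
            self.received_lines.append(str(data, "utf-8"))
        else:
            self.received_lines.append(data)

    def found_terminator(self):
        empty = "" if self.decode_data else b""
        line = empty.join(self.received_lines)
        self.received_lines = []
        if self.state == self.COMMAND:
            # Commands are conceptually text, so decode them even in
            # binary mode before they are parsed.
            if not self.decode_data:
                line = str(line, "utf-8")
            self.commands.append(line)
            if line.upper() == "DATA":
                self.state = self.DATA
        else:
            # Message data is passed on in whichever type was collected.
            self.messages.append(line)
            self.state = self.COMMAND


binary = ToyChannel(decode_data=False)
for chunk in (b"DATA", b"caf\xe9"):
    binary.collect_incoming_data(chunk)
    binary.found_terminator()

print(binary.commands)  # ['DATA']
print(binary.messages)  # [b'caf\xe9']
```

In this sketch the command branch always sees str, so command-handling code written against the decoding behaviour would need few changes, while the message body stays as bytes when decode_data is False.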

    @sreepriya
    Mannequin

    sreepriya mannequin commented Apr 2, 2014

    Hi David,
    The variable decode_data is included to control decoding. But I am not sure what needs to be done while calling the process_message inside found_terminator when it is binary data. How to work around with binary data? Can you tell me what are the data types concerning binary data?

    @soltysh

    soltysh commented Apr 19, 2014

    Sreepriya, are you still working on this issue? If not, I'll be happy to take it over; if so, start by fixing the following things:

    • start with tests - it is most important to have each feature tested
    • decode_data, as David mentioned, needs to have a default value of True, meaning that __init__ should look like this:
      def __init__(self, server, conn, addr, data_size_limit=DATA_SIZE_DEFAULT, map=None, decode_data=True):
      Assigning True inside __init__ will make this value always True, and that's not the point.
    • add a deprecation warning about this parameter using the warnings module:
      warnings.warn('decode_data=True is deprecated, data will not be decoded by default', DeprecationWarning, 2)
    • as for the found_terminator method, what David means is to decode the data in the first if, where commands are checked, to simplify processing of this part (David, please correct me if I'm wrong), and not what you did
    • and finally you need to update the docs to include the decode_data parameter, with information about how it works and its deprecation

    @sreepriya
    Mannequin

    sreepriya mannequin commented Apr 24, 2014

    Hi Maciej,
    I am travelling now and it might take some delay for me to work on this! I got to know that you are working on RFC 6532. You might take this up and fix it as this is related to your work and I don't want to create delays.

    @DukeDougal
    Mannequin

    DukeDougal mannequin commented May 21, 2014

    Is this one likely to be included in 3.5? It effectively breaks smtpd so it would be good to see it working again.

    @bitdancer
    Member

    Yes, this will be fixed in 3.5 one way or another.

    @soltysh

    soltysh commented May 22, 2014

    I'll try to take care of this issue in the following few days.

    @soltysh

    soltysh commented May 28, 2014

    I'm attaching file issue19662_v1.patch. David please have a look at it and let me know if this is it, if not I'm waiting for your suggestions.

    @bitdancer
    Member

    Added review comments.

    @soltysh

    soltysh commented May 29, 2014

    I've implemented all of your proposed changes; I had been thinking along much the same lines all day today about how to make the code more elegant. The current state of the work is attached as issue19662_v2.patch.

    @soltysh

    soltysh commented May 30, 2014

    I've included Leslie's comments in rst file. The 3rd version is attached in issue19662_v3.patch.

    @python-dev
    Mannequin

    python-dev mannequin commented Jun 11, 2014

    New changeset 4e22213ca275 by R David Murray in branch 'default':
    bpo-19662: add decode_data to smtpd so you can get at DATA in bytes form.
    http://hg.python.org/cpython/rev/4e22213ca275

    @bitdancer
    Member

    Thanks, Maciej.

    I tweaked the patch a bit, you might want to take a look just for your own information. Mostly I fixed the warning stuff, which I didn't explain very well. The idea is that if the default is used (no value is specified), we want there to be a warning. But if a value *is* specified, there should be no warning (the user knows what they want). To accomplish that we make the actual default value None, and check for that. I also had to modify the tests so that warnings aren't issued, as well as test that they actually get issued when the default is used.
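
The None-as-default idea can be sketched like this. The Server class, the count_warnings helper and the warning text are hypothetical and only illustrate the pattern; they are not the code from the changeset.

```python
import warnings


class Server:
    """Hypothetical class showing the 'warn only on the default' pattern."""

    def __init__(self, decode_data=None):
        if decode_data is None:
            # No value given: warn, then fall back to the old behaviour.
            warnings.warn(
                "The decode_data default of True will change to False; "
                "pass decode_data explicitly to silence this warning",
                DeprecationWarning, 2)
            decode_data = True
        # An explicit True or False is taken as-is, without a warning.
        self.decode_data = decode_data


def count_warnings(**kwargs):
    """Return how many warnings constructing Server(**kwargs) emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        Server(**kwargs)
    return len(caught)


print(count_warnings())                   # 1
print(count_warnings(decode_data=True))   # 0
print(count_warnings(decode_data=False))  # 0
```

A plain decode_data=True default could not distinguish "the caller asked for True" from "the caller said nothing", which is why the sentinel is needed.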

    I also added versionchanged directives and a whatsnew entry, and expanded the decode_data docs a bit.

    @python-dev
    Mannequin

    python-dev mannequin commented Jun 11, 2014

    New changeset a6c846ec5fd3 by R David Murray in branch 'default':
    bpo-19662: Eliminate warnings in other test modules that use smtpd.
    http://hg.python.org/cpython/rev/a6c846ec5fd3

    @python-dev
    Mannequin

    python-dev mannequin commented May 16, 2015

    New changeset a7d3074fa888 by R David Murray in branch 'default':
    bpo-19662: Make requirement to support arbitrary keywords explicit.
    https://hg.python.org/cpython/rev/a7d3074fa888

    @Arfrever
    Mannequin

    Arfrever mannequin commented May 19, 2015

    New changeset a7d3074fa888 by R David Murray in branch 'default':
    bpo-19662: Make requirement to support arbitrary keywords explicit.
    https://hg.python.org/cpython/rev/a7d3074fa888

    s/keword/keyword/

    @python-dev
    Mannequin

    python-dev mannequin commented May 19, 2015

    New changeset a3f2b171b765 by R David Murray in branch 'default':
    bpo-19662: fix typo
    https://hg.python.org/cpython/rev/a3f2b171b765

    @bitdancer
    Member

    Thanks, Arfrever.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022