Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ftplib uses latin-1 as default encoding #83561

Closed
SebastianGPedersen mannequin opened this issue Jan 18, 2020 · 15 comments
Closed

ftplib uses latin-1 as default encoding #83561

SebastianGPedersen mannequin opened this issue Jan 18, 2020 · 15 comments
Labels
3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@SebastianGPedersen
Copy link
Mannequin

SebastianGPedersen mannequin commented Jan 18, 2020

BPO 39380
Nosy @malemburg, @vstinner, @ericvsmith, @giampaolo, @benjaminp, @ezio-melotti, @methane, @corona10, @SebastianGPedersen
PRs
  • bpo-39380: Change ftplib encoding from latin-1 to utf-8 #18048
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-04-13.23:10:47.080>
    created_at = <Date 2020-01-18.09:57:38.519>
    labels = ['type-bug', 'library', '3.9']
    title = 'ftplib uses latin-1 as default encoding'
    updated_at = <Date 2020-04-13.23:10:47.079>
    user = 'https://github.com/SebastianGPedersen'

    bugs.python.org fields:

    activity = <Date 2020-04-13.23:10:47.079>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-04-13.23:10:47.080>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2020-01-18.09:57:38.519>
    creator = 'SebastianGPedersen'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39380
    keywords = ['patch']
    message_count = 15.0
    messages = ['360549', '360610', '360633', '360683', '360696', '360738', '360740', '360741', '364237', '364702', '364704', '364705', '364806', '366348', '366349']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'vstinner', 'eric.smith', 'giampaolo.rodola', 'benjamin.peterson', 'ezio.melotti', 'methane', 'corona10', 'SebastianGPedersen']
    pr_nums = ['18048']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue39380'
    versions = ['Python 3.9']

    @SebastianGPedersen SebastianGPedersen mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 18, 2020
    @SilentGhost SilentGhost mannequin removed 3.7 (EOL) end of life 3.8 only security fixes labels Jan 18, 2020
    @ericvsmith
    Copy link
    Member

    What's the reason behind this change? I love UTF-8 and all, but is there some standard (de facto or de jure) that discusses this?

    What does this change fix? What's the implication to existing clients?

    I'm not opposed to the change per se, but I'd like to understand the reasoning.

    @methane
    Copy link
    Member

    methane commented Jan 24, 2020

    Can we have some deprecation period?

    def __init__(self, ..., encoding=None):
        ...
        if encoding is not None:
            self.encoding = encoding
        else:
            warnings.warn("The default encoding of the FTP class will be changed from 'latin1' to 'utf-8' in 3.11. Pass the encoding explicitly for now.",
                DeprecationWarning)

    @corona10
    Copy link
    Member

    I agree with inada.naoki

    @SebastianGPedersen
    Copy link
    Mannequin Author

    SebastianGPedersen mannequin commented Jan 25, 2020

    Thank you for the feedback.

    I will elaborate a little bit on the reasons and thoughts behind the pull request:
    Since RFC 2640, the industry standard within FTP Clients is UTF-8 (see e.g. FileZilla here: https://wiki.filezilla-project.org/Character_Encoding, or WinSCP here: https://winscp.net/eng/docs/faq_utf8).

    Given this, I believe the majority of the users that have not investigated the code wrongly assumes UTF-8 encoding for ftplib as well (as it is now).

    I am new to contributing, and not sure, how much deprecation warnings are used (I simply followed the previous encoding change on ftplib), so I will change it based on the feedback. However, shouldn't it be a FutureWarning, so it will be reported by default at initialisation?

    @giampaolo
    Copy link
    Contributor

    It's been a long time since I implemented UTF-8 support in pyftpdlib, but long story short is that:

    • most recent servers are supposed to use UTF-8 by default
    • such servers must include "UTF-8" in the FEAT command response
    • some servers may enable UTF-8 only if client explicitly sends "OPTS UTF-8 ON" first, but that is based on an draft RFC. Server implementors usually treat this command as a no-op and simply assume UTF-8 as the default.

    With that said, I am -1 about implementing logic based on FEAT/OPTS: that should be done before login, and at that point some servers may erroneously reject any command other than USER, PASS and ACCT.

    Personally I think it makes more sense to just use UTF-8 without going through a deprecation period, document *encoding* attribute and mention that in order to deal with servers not supporting UTF8 you can pre-emptively check FEAT response and set ASCII encoding. But I am not a unicode expert, so I would like to hear some other opinion re. the implications of going from latin-1 to utf8 in terms of potential code breakage.

    @methane
    Copy link
    Member

    methane commented Jan 27, 2020

    I'm not FTP user so I don't have strong opinion.
    If it is too late to change already, change it in 3.9 might be OK.

    However, shouldn't it be a FutureWarning, so it will be reported by default at initialisation?

    If it is warning for end users, it should be FutureWarning.
    If the warning is for developers, it should be DeprecationWarning.
    Warning for API changes is DeprecationWarning in general.

    @vstinner
    Copy link
    Member

    I'm in favor of changing the default encoding to UTF-8, but it requires good documentation, especially to provide a solution working on Python 3.8 and 3.9 to change the encoding (see below).

    --

    The encoding is used to encode commands with the FTP server and decode the server replies. I expect that most replies are basically letters, digits and spaces. I guess that the most problematic commands are:

    • send user and password
    • decode filenames of LIST command reply
    • encode filename in STOR command

    I expect that the original FTP protocol doesn't specify any encoding and so that FTP server implementations took some freedom. I would not be surprised to use ANSI code pages used on servers running on Windows.

    Currently, encoding is a class attribute: it's not convenient to override it depending on the host. I would prefer to have a new parameter for the constructor.

    Giampaolo:

    some servers may enable UTF-8 only if client explicitly sends "OPTS UTF-8 ON" first, but that is based on an draft RFC. Server implementors usually treat this command as a no-op and simply assume UTF-8 as the default.
    With that said, I am -1 about implementing logic based on FEAT/OPTS: that should be done before login, and at that point some servers may erroneously reject any command other than USER, PASS and ACCT.

    Oh. In this case, always send "OPTS UTF-8 ON" just after the socket is connected sounds like a bad idea.

    Sebastian:

    Since RFC 2640, the industry standard within FTP Clients is UTF-8 (see e.g. FileZilla here: https://wiki.filezilla-project.org/Character_Encoding, or WinSCP here: https://winscp.net/eng/docs/faq_utf8).

    "Internationalization of the File Transfer Protocol" was published in 1999. It recommends the UTF-8. Following a RFC recommendation is good argument to change the default encoding to UTF-8.
    https://tools.ietf.org/html/rfc2640

    Giampaolo:

    Personally I think it makes more sense to just use UTF-8 without going through a deprecation period

    I concur. Deprecation is usually used for features which are going to be removed (module, function or function parameter). Here it's just about a default parameter value. I expect to have encoding="utf-8" default in the constructor.

    The annoying part is that Python 3.8 only has a class attribute. The simplest option seems to be creating a FTP object, modify its encoding attribute and *then* logs in. Another options is to subclass the FTP class. IMO the worst is to modify ftplib.FTP.encoding attribute (monkey patch the module).

    I expect that most users use username, password and filenames encodable to ASCII and so will not notify the change to UTF-8. We can document a solution working on all Python versions to use different encoding name.

    @vstinner
    Copy link
    Member

    Before changing the default, I would prefer to see a PR adding an encoding parameter to the FTP constructor.

    @SebastianGPedersen
    Copy link
    Mannequin Author

    SebastianGPedersen mannequin commented Mar 15, 2020

    Thanks again for the engagement.
    I am also in favor of adding the encoding to the constructor. However, following Giampaolo's comment, that utf-8 is standard and has been for a long time (and the RFC dating back to year 1999), I am also in favor of changing the default encoding in the same PR. As mentioned in the previous comments, most users use ASCII and will not be affected by this anyway - which is probably also why this issue hasn't been raised before.
    Please let me know your thoughts on this.

    @vstinner
    Copy link
    Member

    I'm now fine with changing the default. But I would still prefer to have an encoding parameter in the constructor. Making these two changes at once now makes sense to me.

    @ericvsmith
    Copy link
    Member

    I agree with Victor.

    @giampaolo
    Copy link
    Contributor

    +1 from me as well. @SebastianGPedersen could you update the PR (constructor + doc changes)?

    @SebastianGPedersen
    Copy link
    Mannequin Author

    SebastianGPedersen mannequin commented Mar 22, 2020

    Yes, I will update the PR before the end of next week.

    @vstinner
    Copy link
    Member

    New changeset a1a0eb4 by Sebastian Pedersen in branch 'master':
    bpo-39380: Change ftplib encoding from latin-1 to utf-8 (GH-18048)
    a1a0eb4

    @vstinner
    Copy link
    Member

    It's now fixed. Thanks Eric V. Smith for suggesting the change and thanks Sebastian Pedersen for the very complete implementation (with tests and new constructor parameters!).

    The default encoding change is documented properly in the "Changes in the Python API" section of What's New in Python 3.9.

    Hopefully, it was already possible to override the encoding in Python 3.8 and older by modifying FTP.encoding class attribute.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants