Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock when using SSLContext._msg_callback and SSLContext.sni_callback #87743

Closed
theandrew168 mannequin opened this issue Mar 21, 2021 · 8 comments
Closed

Deadlock when using SSLContext._msg_callback and SSLContext.sni_callback #87743

theandrew168 mannequin opened this issue Mar 21, 2021 · 8 comments
Assignees
Labels
3.8 only security fixes 3.9 only security fixes 3.10 only security fixes topic-SSL type-bug An unexpected behavior, bug, or error

Comments

@theandrew168
Copy link
Mannequin

theandrew168 mannequin commented Mar 21, 2021

BPO 43577
Nosy @tiran, @miss-islington, @theandrew168
PRs
  • bpo-43577: Fix deadlock with SSLContext._msg_callback and sni_callback (GH-24957) #24957
  • [3.9] bpo-43577: Fix deadlock with SSLContext._msg_callback and sni_callback (GH-24957) #24958
  • [3.8] bpo-43577: Fix deadlock with SSLContext._msg_callback and sni_callback (GH-24957) #24959
  • Files
  • deadlock.zip: Deadlock reproduction and backtrace
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/tiran'
    closed_at = <Date 2021-03-21.16:03:49.009>
    created_at = <Date 2021-03-21.05:00:42.206>
    labels = ['expert-SSL', 'type-bug', '3.8', '3.9', '3.10']
    title = 'Deadlock when using SSLContext._msg_callback and SSLContext.sni_callback'
    updated_at = <Date 2021-03-21.16:03:49.008>
    user = 'https://github.com/theandrew168'

    bugs.python.org fields:

    activity = <Date 2021-03-21.16:03:49.008>
    actor = 'christian.heimes'
    assignee = 'christian.heimes'
    closed = True
    closed_date = <Date 2021-03-21.16:03:49.009>
    closer = 'christian.heimes'
    components = ['SSL']
    creation = <Date 2021-03-21.05:00:42.206>
    creator = 'theandrew168'
    dependencies = []
    files = ['49897']
    hgrepos = []
    issue_num = 43577
    keywords = ['patch']
    message_count = 8.0
    messages = ['389216', '389222', '389229', '389232', '389234', '389235', '389238', '389240']
    nosy_count = 3.0
    nosy_names = ['christian.heimes', 'miss-islington', 'theandrew168']
    pr_nums = ['24957', '24958', '24959']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue43577'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @theandrew168
    Copy link
    Mannequin Author

    theandrew168 mannequin commented Mar 21, 2021

    Hello,

    I think I might've stumbled onto an oversight with how an SSLSocket handles overwriting its SSLContext within an sni_callback. If both "_msg_callback" and "sni_callback" are defined on an SSLContext object and the sni_callback replaces the context with new one, the interpreter locks up indefinitely. It fails to respond to keyboard interrupts and must be forcefully killed.

    This seems to be a common use case of the sni_callback: create a new context with a different cert chain and attach it to the current socket (which replaces the existing one). If _msg_callback never gets defined on the original context then this deadlock never occurs. Curiously, if you assign the same _msg_callback to the new context before replacement, this also avoids the deadlock.

    I've attached as minimal of a reproduction as I could come up with. I think the code within will probably do a better job explaining this problem than I've done here in prose. I've only tested it on a couple Linux distros (Ubuntu Server and Void Linux) but the lock occurs 100% of the time in my experience.

    In the brief time I've spent digging into the CPython source, I've come to understand that replacing the SSLContext on an SSLSocket isn't "just" a simple replacement but actually involves some OpenSSL mechanics (specifically, SSL_set_SSL_CTX) [0]. I'm wondering if maybe this context update routine isn't properly cleaning up whatever resources / references were being used by the msg_callback? Maybe this is even closer to an OpenSSL bug (or a least a gotcha)?

    I also feel the need to explain why I'd even be using an undocumented property (SSLContext._msg_callback) in the first place. I'm trying to implement a program that automatically manages TLS certs on a socket via Let's Encrypt and the ACME protocol (RFC8555). Part of this process involves serving up a specific cert when a connection requests the acme-tls/1 ALPN protocol. Given the existing Python SSL API, I don't believe there is any way for me to do this "correctly".

    The documentation for SSLContext.sni_callback [1] mentions that the selected_alpn_protocol function should be usable within the callback but I don't that is quite true. According to the OpenSSL docs [2]:
    Several callbacks are executed during ClientHello processing, including the ClientHello, ALPN, and servername callbacks. The ClientHello callback is executed first, then the servername callback, followed by the ALPN callback.

    If there is a better way for me to identify a specific ALPN protocol _before_ the sni_callback, I could definitely use the guidance. That would avoid this deadlock altogether (even though it'd still be waiting to catch someone else...).

    This is my first Python issue so I hope what I've supplied makes sense. If there is anything more I can do to help or provide more info, please let me know.

    [0] https://github.com/python/cpython/blob/3.9/Modules/_ssl.c#L2194
    [1] https://docs.python.org/3/library/ssl.html#ssl.SSLContext.sni_callback
    [2] https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_tlsext_servername_callback.html

    @theandrew168 theandrew168 mannequin added 3.8 only security fixes 3.9 only security fixes labels Mar 21, 2021
    @theandrew168 theandrew168 mannequin assigned tiran Mar 21, 2021
    @theandrew168 theandrew168 mannequin added topic-SSL 3.9 only security fixes type-bug An unexpected behavior, bug, or error labels Mar 21, 2021
    @theandrew168 theandrew168 mannequin assigned tiran Mar 21, 2021
    @theandrew168 theandrew168 mannequin added topic-SSL type-bug An unexpected behavior, bug, or error labels Mar 21, 2021
    @tiran
    Copy link
    Member

    tiran commented Mar 21, 2021

    Thanks for the excellent bug report and reproducer! I have identified the issue and submitted a fix for review. OpenSSL copies the internal msg_callback to SSL struct, but SSL_set_SSL_CTX() does not update the msg_callback with value from new context.

    Could you please open a new bug regarding the issue with SNI and ALPN order? This is unrelated. It looks like OpenSSL processes the ALPN extension after the SNI extension, https://github.com/openssl/openssl/blob/abded2ced44b94d96f08ea5cf01df6519b80f5d3/ssl/ssl_local.h#L740-L769 . I can see that the state machines fires "final_server_name" first (which triggers the SNI callback), then "tls_handle_alpn". This makes sense. This allows the new context to select ALPNs.

    #0 final_server_name (s=0x8a4080, context=128, sent=1) at ssl/statem/extensions.c:925
    #1 0x00007fffea3ec146 in tls_parse_all_extensions (x=<optimized out>, chainidx=<optimized out>, fin=<optimized out>, exts=<optimized out>, context=<optimized out>,
    s=<optimized out>) at ssl/statem/extensions.c:762
    #2 tls_parse_all_extensions (s=0x8a4080, context=128, exts=<optimized out>, x=<optimized out>, chainidx=<optimized out>, fin=1) at ssl/statem/extensions.c:737
    #3 0x00007fffea417db6 in tls_early_post_process_client_hello (s=0x8a4080) at ssl/statem/statem_srvr.c:1906
    #4 tls_post_process_client_hello (wst=<optimized out>, s=0x8a4080) at ssl/statem/statem_srvr.c:2249
    #5 ossl_statem_server_post_process_message (s=s@entry=0x8a4080, wst=<optimized out>) at ssl/statem/statem_srvr.c:1243
    #6 0x00007fffea3fe34c in read_state_machine (s=0x8a4080) at ssl/statem/statem.c:664
    #7 state_machine (s=0x8a4080, server=<optimized out>) at ssl/statem/statem.c:434
    #8 0x00007fffea48a9df in _ssl__SSLSocket_do_handshake_impl (self=0x7fffe9fe3ed0) at /home/heimes/dev/python/cpython/Modules/_ssl.c:1084
    #9 _ssl__SSLSocket_do_handshake (self=0x7fffe9fe3ed0, _unused_ignored=<optimized out>) at /home/heimes/dev/python/cpython/Modules/clinic/_ssl.c.h:19

    #0 tls_handle_alpn (s=0x8a4080) at ssl/statem/statem_srvr.c:2167
    #1 0x00007fffea3ec146 in tls_parse_all_extensions (x=<optimized out>, chainidx=<optimized out>, fin=<optimized out>, exts=<optimized out>, context=<optimized out>,
    s=<optimized out>) at ssl/statem/extensions.c:762
    #2 tls_parse_all_extensions (s=0x8a4080, context=128, exts=<optimized out>, x=<optimized out>, chainidx=<optimized out>, fin=1) at ssl/statem/extensions.c:737
    #3 0x00007fffea417db6 in tls_early_post_process_client_hello (s=0x8a4080) at ssl/statem/statem_srvr.c:1906
    #4 tls_post_process_client_hello (wst=<optimized out>, s=0x8a4080) at ssl/statem/statem_srvr.c:2249
    #5 ossl_statem_server_post_process_message (s=s@entry=0x8a4080, wst=<optimized out>) at ssl/statem/statem_srvr.c:1243
    #6 0x00007fffea3fe34c in read_state_machine (s=0x8a4080) at ssl/statem/statem.c:664
    #7 state_machine (s=0x8a4080, server=<optimized out>) at ssl/statem/statem.c:434
    #8 0x00007fffea48a9df in _ssl__SSLSocket_do_handshake_impl (self=0x7fffe9fe3ed0) at /home/heimes/dev/python/cpython/Modules/_ssl.c:1084

    @theandrew168
    Copy link
    Mannequin Author

    theandrew168 mannequin commented Mar 21, 2021

    I'm glad that the info I provided was helpful! I'll go ahead and create another issue for the misleading docs surrounding SSLContext.sni_callback. Thanks for looking into this and coming up with a fix so quickly.

    I do have one more question: does python provide a "safe" way to test for deadlocks like this? I noticed that you added a test case to verify that this lockup doesn't happen but what would happen if someone ran that test on an earlier version? Would the test runner also freeze or are there facilities in-place to catch such behavior? Maybe something nutty like:

    with should_deadlock():
        my_buggy_test()

    @tiran
    Copy link
    Member

    tiran commented Mar 21, 2021

    New changeset 77cde50 by Christian Heimes in branch 'master':
    bpo-43577: Fix deadlock with SSLContext._msg_callback and sni_callback (GH-24957)
    77cde50

    @miss-islington
    Copy link
    Contributor

    New changeset 93b0da7 by Miss Islington (bot) in branch '3.8':
    bpo-43577: Fix deadlock with SSLContext._msg_callback and sni_callback (GH-24957)
    93b0da7

    @tiran
    Copy link
    Member

    tiran commented Mar 21, 2021

    No, there is no check for that. This kind of deadlock should never occur. The problem was an implementation bug in low-level C code that had bad interaction with the global interpreter lock. Python releases the GIL around OpenSSL calls. Callbacks have to re-acquire the GIL correctly and release it again at the end of a callback.

    By the way the _msg_cb attribute is deliberately undocumented and marked as an internal property. I implemented the callback to debug some issues with TLS 1.3 and OpenSSL 1.1.1. It's neither well tested nor stable.

    @miss-islington
    Copy link
    Contributor

    New changeset c145c03 by Miss Islington (bot) in branch '3.9':
    bpo-43577: Fix deadlock with SSLContext._msg_callback and sni_callback (GH-24957)
    c145c03

    @tiran
    Copy link
    Member

    tiran commented Mar 21, 2021

    The fix will be available in next 3.9 and 3.8 release.

    @tiran tiran added the 3.10 only security fixes label Mar 21, 2021
    @tiran tiran closed this as completed Mar 21, 2021
    @tiran tiran added the 3.10 only security fixes label Mar 21, 2021
    @tiran tiran closed this as completed Mar 21, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes 3.9 only security fixes 3.10 only security fixes topic-SSL type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants