classification
Title: test_ssl: test_wrong_cert_tls13() and test_pha_required_nocert() fail randomly on Windows
Type: behavior Stage: needs patch
Components: SSL, Tests Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: christian.heimes Nosy List: christian.heimes, miss-islington, pablogsal, steve.dower
Priority: Keywords: patch

Created on 2021-04-23 10:56 by pablogsal, last changed 2021-06-04 12:30 by vstinner.

Pull Requests
URL Status Linked Edit
PR 25561 merged christian.heimes, 2021-04-23 17:26
PR 25574 merged christian.heimes, 2021-04-24 10:22
PR 26489 merged vstinner, 2021-06-02 15:42
PR 26494 merged miss-islington, 2021-06-02 23:25
PR 26501 closed vstinner, 2021-06-03 00:31
PR 26502 merged vstinner, 2021-06-03 01:32
PR 26518 merged miss-islington, 2021-06-03 16:10
PR 26520 merged vstinner, 2021-06-03 16:35
Messages (28)
msg391677 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-04-23 10:56
https://buildbot.python.org/all/#/builders/405/builds/9

Extract from the logs:

The Buildbot has detected a new failure on builder AMD64 Windows8.1 Non-Debug 3.x while building python/cpython.
Full details are available at:
    https://buildbot.python.org/all/#builders/405/builds/9

Buildbot URL: https://buildbot.python.org/all/

Worker for this Build: ware-win81-release

Build Reason: <unknown>
Blamelist: E-Paine <63801254+E-Paine@users.noreply.github.com>, Raymond Hettinger <rhettinger@users.noreply.github.com>, Simon Charette <charette.s@gmail.com>, Steve Dower <steve.dower@python.org>

BUILD FAILED: failed test (failure)


Summary of the results of the build (if available):
===================================================

== Tests result: FAILURE then FAILURE ==

395 tests OK.

10 slowest tests:
- test_multiprocessing_spawn: 2 min 43 sec
- test_concurrent_futures: 2 min 25 sec
- test_io: 1 min 43 sec
- test_largefile: 1 min 41 sec
- test_peg_generator: 1 min 39 sec
- test_mmap: 1 min 19 sec
- test_asyncio: 58.8 sec
- test_regrtest: 45.3 sec
- test_unparse: 44.1 sec
- test_tokenize: 41.1 sec

1 test failed:
    test_ssl

30 tests skipped:
    test_curses test_dbm_gnu test_dbm_ndbm test_devpoll test_epoll
    test_fcntl test_fork1 test_gdb test_grp test_ioctl test_kqueue
    test_multiprocessing_fork test_multiprocessing_forkserver test_nis
    test_openpty test_ossaudiodev test_pipes test_poll test_posix
    test_pty test_pwd test_readline test_resource test_spwd
    test_syslog test_threadsignals test_wait3 test_wait4
    test_xxtestfuzz test_zipfile64

1 re-run test:
    test_ssl

Total duration: 9 min 15 sec



Captured traceback
==================

Traceback (most recent call last):
   File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\test\test_ssl.py", line 2333, in wrap_conn
    self.sslconn = self.server.context.wrap_socket(
   File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\ssl.py", line 518, in wrap_socket
    return self.sslsocket_class._create(
   File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\ssl.py", line 1070, in _create
    self.do_handshake()
   File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\ssl.py", line 1339, in do_handshake
    self._sslobj.do_handshake()
 ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:969)


Traceback (most recent call last):
  File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\test\test_ssl.py", line 255, in wrapper
    return func(*args, **kw)
  File "D:\buildarea\3.x.ware-win81-release.nondebug\build\lib\test\test_ssl.py", line 3171, in test_wrong_cert_tls13
    self.fail("Use of invalid cert should have failed!")
AssertionError: Use of invalid cert should have failed!



Test report
===========

Failed tests:

- test_ssl

Failed subtests:

- test_wrong_cert_tls13 - test.test_ssl.ThreadedTests




Sincerely,
 -The Buildbot
msg391717 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-04-23 18:03
New changeset e047239eafefe8b19725efffe7756443495cf78b by Christian Heimes in branch 'master':
bpo-43921: ignore failing test_wrong_cert_tls13 on Windows (GH-25561)
https://github.com/python/cpython/commit/e047239eafefe8b19725efffe7756443495cf78b
msg391725 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-04-23 18:47
Another TLS 1.3 client cert auth test is flaky, too.

https://buildbot.python.org/all/#/builders/577/builds/14/steps/4/logs/stdio

======================================================================
FAIL: test_pha_required_nocert (test.test_ssl.TestPostHandshakeAuth)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\buildarea\pull_request.bolen-windows10\build\lib\test\test_ssl.py", line 4412, in test_pha_required_nocert
    s.recv(1024)
AssertionError: SSLError not raised
msg391775 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-04-24 13:08
New changeset ce9a0643496ba802ea97a3da20eace3a1117ea48 by Christian Heimes in branch 'master':
bpo-43921: also accept EOF in post-handshake auth test (GH-25574)
https://github.com/python/cpython/commit/ce9a0643496ba802ea97a3da20eace3a1117ea48
msg394317 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-05-25 10:27
Unfortunately this is still failing sporadically. Last failure 5 days ago:

https://buildbot.python.org/all/#/builders/405/builds/153
msg394318 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-05-25 10:28
The issue is marked as referred blocker, it won't block this beta release, but it will block the next, so we should decide how to proceed by then
msg394320 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-05-25 10:39
I neither have a macOS nor a Windows machine to reproduce and debug the issue. Since I'm cannot reproduce the problem on Linux, I'm unable to debug and fix it.
msg394340 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-05-25 12:55
Ok, but bear in mind that this will block the next beta and you are the expert in this area.
msg394343 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-05-25 13:13
Also, this started to happen after the changes in PEP 644, and seems that reverting all changes related to the pep fixes the issue (at least I cannot reproduce it anymore after manually reverting the changes).
msg394346 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-05-25 13:21
Also, I think the solution in:

https://github.com/python/cpython/pull/25561

(skipping the test) is not acceptable for the next beta. 

Christian, I know that this is complicated for you yo reproduce and that this is a hard issue to debug, but this seems linked to the changes in PEP 644 and the reality is that a release manager I don't know the cause of these new failures and the facts are that this is not happening in 3.9 or before.

If we don't fix this for the next beta, I will be forced to revert  PEP 644 until we have a permanent fix.
msg394397 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-05-25 20:33
I can't reproduce these on my own (Windows) machine either.

Looking at the output, I think the tests are just going to be inherently flakey. It's not testing the specific scenario directly enough, and relying heavily on implicit synchronization.

We probably just need a different approach to testing these. I think at the very least starting the server on the main test thread and using a helper to send the request is going to be easier to keep things in sync. 

But that's not going to be done in time to unblock. Skipping the test seems like an okay compromise.
msg394398 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-05-25 20:37
> Looking at the output, I think the tests are just going to be inherently flakey. It's not testing the specific scenario directly enough, and relying heavily on implicit synchronization.

My notes to debug race conditions:
https://pythondev.readthedocs.io/unstable_tests.html#debug-race-conditions

In general, you should run the same test in a loop in many processes in parallel *and* stress the machine with a random workload.

My favorite recipe:

* Terminal 1: python -m test -F -j20 <... options for the test ...>
* Terminal 2: python -m test -j0 -r -F

Sadly, there is no silver bullet for -j20: sometimes, the machine must be "more idle" to trigger the bug (ex: -j5), sometimes the machine must almost die, be more stressed (-j100).

Happy hacking!
msg394399 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-05-25 20:39
> Looking at the output, I think the tests are just going to be inherently flakey. It's not testing the specific scenario directly enough, and relying heavily on implicit synchronization.

But 3.9 and 3.8 are not failing on test_ssl, so do something has clearly changed. Are the failing tests new?
msg394400 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-05-25 20:42
Also, this seems to be reproducible in MacOS at least:

https://bugs.python.org/issue44229
msg394407 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-05-25 22:44
> But 3.9 and 3.8 are not failing on test_ssl, so do something has clearly changed. Are the failing tests new?

They're flakey in the sense that they're not failing reliably, when they clearly should be :)

If something changed about how any part of the response is written into the buffer, it could change when the OS decides to dispatch it and unblock the receiving thread, which is going to affect the state observed by the test at the time it observes it (and as we already know, there are inherent race conditions that sometimes recognise the close of the connection before it's finished reading). These are difficult tests to write, because there are so many weird interactions like this.

In terms of this actual change, my best guess is that one of the removed checks actually mattered on Windows, probably because of a different compile flag. We just use the defaults as generated by their Perl scripts, and I'd hope that there'd be compile errors if anything too blatant was changed (I don't see anything obvious looking through 39258d3595300bc7b952854c915f63ae2d4b9c3e). But short of that, I don't see anything deterministic changed here.
msg394408 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-05-26 00:19
Thanks for the explanation!

> They're flakey in the sense that they're not failing reliably, when they clearly should be :)

Yeah, I do agree with you in this, but I am still not happy that they *realiably* pass in 3.9 and 3.8 in two different OS where now they fail: Windows and MacOS

> But short of that, I don't see anything deterministic changed here.

Well, but the test *deterministically* fail in 3.10 and main and don't fail in 3.9 or 3.8. And it also happens in MacOS.

I really don't want to be annoying here, I really don't. But please, understand my position as RM when I say that I feel uncomfortable with this situation as the test pass without any problem or complication in previous releases and now they don't and there has been a huge change in OpenSSL in the middle.

Also, the asyncio tests for SSL now hang a lot in the refleak buildbots. For example:

https://buildbot.python.org/all/#/builders/673/builds/20

I feel uncomfortable with a solution that basically is "skip the tests", when the test before used to pass deterministically and without problems. I am not an expert in the SSL area by far, so I don't know exactly what is the risk here.

I really apologize if I am insisting on this, but I think is important.
msg394916 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-02 14:54
On the main branch, I can reproduce test_pha_required_nocert() failure:

vstinner@DESKTOP-DK7VBIL C:\vstinner\python\main>python -m test test_ssl -u all -v -F -j5 -m test_pha_required_nocert

test_pha_required_nocert (test.test_ssl.TestPostHandshakeAuth) ...  server:  new connection from ('127.0.0.1', 57456)
 client cert is None
 client did not provide a cert
 server: connection cipher is now ('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256)
TLS: (<ssl.SSLSocket fd=644, family=AF_INET, type=SOCK_STREAM, proto=0, laddr=('127.0.0.1', 57455), raddr=('127.0.0.1', 57456)>, 'write', TLSVersion.TLSv1_3, _TLSContentType.ALERT, _TLSAlertType.CERTIFICA
TE_REQUIRED, b'\x02t')
Test server failure:
Traceback (most recent call last):
   File "C:\vstinner\python\main\lib\test\test_ssl.py", line 2444, in run
    msg = self.read()
   File "C:\vstinner\python\main\lib\test\test_ssl.py", line 2421, in read
    return self.sslconn.read()
   File "C:\vstinner\python\main\lib\ssl.py", line 1131, in read
    return self._sslobj.read(len)
 ssl.SSLError: [SSL: PEER_DID_NOT_RETURN_A_CERTIFICATE] peer did not return a certificate (_ssl.c:2522)
FAIL

======================================================================
FAIL: test_pha_required_nocert (test.test_ssl.TestPostHandshakeAuth)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\vstinner\python\main\lib\test\test_ssl.py", line 4458, in test_pha_required_nocert
    with self.assertRaisesRegex(
AssertionError: SSLError not raised
msg394924 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-02 15:45
> FAIL: test_pha_required_nocert (test.test_ssl.TestPostHandshakeAuth)

When this bug occurs, s.recv(1024) returns an empty byte string (b'').

I wrote PR 26489 to handle this case.
msg394944 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-02 20:25
New changeset 320eaa7f42b413cd5e5436ec92d4dc5ba150395f by Victor Stinner in branch 'main':
bpo-43921: Fix test_ssl.test_pha_required_nocert() (GH-26489)
https://github.com/python/cpython/commit/320eaa7f42b413cd5e5436ec92d4dc5ba150395f
msg394955 - (view) Author: miss-islington (miss-islington) Date: 2021-06-02 23:48
New changeset e5e93e6145090a636e67766a53b758d7ac78e3ad by Miss Islington (bot) in branch '3.10':
bpo-43921: Fix test_ssl.test_pha_required_nocert() (GH-26489)
https://github.com/python/cpython/commit/e5e93e6145090a636e67766a53b758d7ac78e3ad
msg394967 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-03 01:37
See my comparison of read() and write() errors on Linux vs Windows:
https://github.com/python/cpython/pull/26501#issuecomment-853489167

I wrote PR 26502 to fix test_wrong_cert_tls13() on Windows (currently, the test is skipped).

On Linux, read() always raises an exception when the connection is reset.

On Windows, read() sometimes fails with SSL_ERROR_SYSCALL+WSAECONNRESET, and in this case the internal C function raises a SSLEOFError. But the outer Python wrapper method converts SSLEOFError to an empty string because the SSL socket is created with suppress_ragged_eofs=True by default.

I don't know why on Linux read() only fails with SSL_ERROR_SSL with ERR_peek_last_error()=0x14094418, whereas it's not the case on Windows. It may be an implementation detail, different between Windows socket and Linux socket.
msg394968 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-03 01:45
In Python 3.9, test_pha_required_nocert() looks more strict, it requires read() to fail with the 'tlsv13 alert certificate required' error message:

        # receive alert
        with self.assertRaisesRegex(
                ssl.SSLError,
                'tlsv13 alert certificate required'):
            s.recv(1024)

In the main branch, it tolerates "EOF occurred" error:

    # test sometimes fails with EOF error. Test passes as long as
    # server aborts connection with an error.
    with self.assertRaisesRegex(
        ssl.SSLError,
        '(certificate required|EOF occurred)'
    ):
        # receive CertificateRequest
        data = s.recv(1024)
        self.assertEqual(data, b'OK\n')

        # send empty Certificate + Finish
        s.write(b'HASCERT')

        # receive alert
        s.recv(1024)
msg395011 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-03 16:04
New changeset ea0210fa8ccca769896847f25fc6fadfe9a717bc by Victor Stinner in branch 'main':
bpo-43921: Fix test_ssl.test_wrong_cert_tls13() on Windows (GH-26502)
https://github.com/python/cpython/commit/ea0210fa8ccca769896847f25fc6fadfe9a717bc
msg395043 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-03 20:12
New changeset 5c2191df9a21a3b3d49dd0711b8d2b92591ce82b by Victor Stinner in branch 'main':
bpo-43921: Cleanup test_ssl.test_wrong_cert_tls13() (GH-26520)
https://github.com/python/cpython/commit/5c2191df9a21a3b3d49dd0711b8d2b92591ce82b
msg395044 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-03 20:15
New changeset d2ab15f5376aa06ed120164f1b84bb40adbdd068 by Miss Islington (bot) in branch '3.10':
bpo-43921: Fix test_ssl.test_wrong_cert_tls13() on Windows (GH-26502) (GH-26518)
https://github.com/python/cpython/commit/d2ab15f5376aa06ed120164f1b84bb40adbdd068
msg395045 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-03 20:23
Ok, test_wrong_cert_tls13() and test_pha_required_nocert() of test_ssl should now be more reliable on Windows. I consider that the initial issue is now fixed and I close the issue.
msg395072 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-06-04 10:46
Reopening

* GH-26502 is missing backport to 3.9. I also don't consider the changeset a proper fix. It's a patch that makes the test pass when something goes wrong. We have not yet figured out why something goes wrong on Windows sometimes.

* GH-26520 is missing backports to 3.9 and 3.10
msg395078 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-04 11:59
Christian: I mostly care about buildbots. I cannot reproduce the issue in 3.9. If it's not broken, I don't want to touch the code :-)

For GH-26520, I explained that I don't want to backport it (it's only in main). Moreover, you asked for automated backports, but it failed. If you consider that it should be backported, please go ahead :-)
https://github.com/python/cpython/pull/26520#issuecomment-854148173


"GH-26502 is missing backport to 3.9. I also don't consider the changeset a proper fix. It's a patch that makes the test pass when something goes wrong. We have not yet figured out why something goes wrong on Windows sometimes."

I am not convinced that getting an SSLEOFError on Windows is a bug. I cannot explain it why 3.10 and main branches behave differently, but I'm not interested to investigate.
History
Date User Action Args
2021-06-04 12:30:13vstinnersetnosy: - vstinner
2021-06-04 11:59:20vstinnersetmessages: + msg395078
2021-06-04 10:46:39christian.heimessetstatus: closed -> open
type: behavior
messages: + msg395072

resolution: fixed ->
stage: resolved -> needs patch
2021-06-03 20:39:58vstinnersetpriority: deferred blocker ->
2021-06-03 20:23:54vstinnersetstatus: open -> closed
messages: + msg395045

components: + Tests, SSL
resolution: fixed
stage: patch review -> resolved
2021-06-03 20:15:22vstinnersetmessages: + msg395044
2021-06-03 20:12:40vstinnersetmessages: + msg395043
2021-06-03 16:35:18vstinnersetpull_requests: + pull_request25114
2021-06-03 16:10:19miss-islingtonsetpull_requests: + pull_request25113
2021-06-03 16:04:29vstinnersetmessages: + msg395011
2021-06-03 01:45:46vstinnersetmessages: + msg394968
2021-06-03 01:38:54vstinnersettitle: test_ssl fails on Windows buildbots -> test_ssl: test_wrong_cert_tls13() and test_pha_required_nocert() fail randomly on Windows
2021-06-03 01:37:19vstinnersetmessages: + msg394967
2021-06-03 01:32:00vstinnersetpull_requests: + pull_request25098
2021-06-03 00:31:37vstinnersetpull_requests: + pull_request25097
2021-06-02 23:48:43miss-islingtonsetmessages: + msg394955
2021-06-02 23:25:34miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request25090
2021-06-02 20:25:35vstinnersetmessages: + msg394944
2021-06-02 15:45:36vstinnersetmessages: + msg394924
2021-06-02 15:42:17vstinnersetpull_requests: + pull_request25085
2021-06-02 14:54:20vstinnersetmessages: + msg394916
2021-05-26 00:19:05pablogsalsetmessages: + msg394408
2021-05-25 22:44:47steve.dowersetmessages: + msg394407
2021-05-25 20:42:08pablogsalsetmessages: + msg394400
2021-05-25 20:39:22pablogsalsetmessages: + msg394399
2021-05-25 20:37:03vstinnersetnosy: + vstinner
messages: + msg394398
2021-05-25 20:33:55steve.dowersetmessages: + msg394397
2021-05-25 13:21:20pablogsalsetmessages: + msg394346
2021-05-25 13:13:35pablogsalsetmessages: + msg394343
2021-05-25 12:55:29pablogsalsetmessages: + msg394340
2021-05-25 10:39:50christian.heimessetmessages: + msg394320
2021-05-25 10:28:41pablogsalsetmessages: + msg394318
2021-05-25 10:27:48pablogsalsetmessages: + msg394317
2021-04-24 13:08:16christian.heimessetmessages: + msg391775
2021-04-24 10:22:33christian.heimessetpull_requests: + pull_request24293
2021-04-23 18:47:42christian.heimessetmessages: + msg391725
2021-04-23 18:03:33christian.heimessetmessages: + msg391717
2021-04-23 17:37:13pablogsalsetpriority: normal -> deferred blocker
2021-04-23 17:26:43christian.heimessetkeywords: + patch
stage: patch review
pull_requests: + pull_request24280
2021-04-23 11:18:15christian.heimessetnosy: + steve.dower
2021-04-23 10:56:18pablogsalcreate