classification
Title: test_socket failing in solaris
Type: behavior Stage:
Components: Tests Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jim Crigler, blastwave, christian.heimes, petriborg, phantal
Priority: normal Keywords:

Created on 2017-01-13 20:55 by phantal, last changed 2020-08-04 16:01 by vstinner.

Messages (10)
msg285440 - Author: Brian Vandenberg (phantal) Date: 2017-01-13 20:55
I started looking into this failure to see if I could figure out why, but it looks like finding the cause would take more time than I have available.

Environment/setup:
* air-gapped network (no internet access)
* sparc / Solaris 10
* Built with gcc 6.3.0
* Altered configure script to change -std=c99 to -std=gnu99 (see issue 29264)
* The only configure flags used were --prefix and --with-universal-archs=all


When I run test_socket, I see the following 5 problems (4 errors and 1 failure). Please note that I'm hand-typing the results, so I may have made typos:


ERROR: testCount (test.test_socket.SendfileUsingSendfileTest)
Traceback:
  File "(...)/test_socket.py", line 5248, in testCount
  File "(...)/test_socket.py", line 5151, in recv_data
MemoryError

ERROR: testCount (test.test_socket.SendfileUsingSendfileTest)
Traceback:
  File "(...)/test_socket.py", line 277, in _tearDown
  File "(...)/test_socket.py", line 289, in clientRun
  File "(...)/test_socket.py", line 5241, in _testCount
  File "(...)/Lib/socket.py", line 296, in _sendfile_use_sendfile
socket.timeout: timed out

ERROR: testWithTimeout (test.test_socket.SendfileUsingSendfileTest)
Traceback:
  File "(...)/test_socket.py", line 5318, in testWithTimeout
    data = self.recv_data(conn)
  File "(...)/test_socket.py", line 5151, in recv_data
    chunk = conn.recv(self.BUFSIZE)
MemoryError

ERROR: testWithTimeout (test.test_socket.SendfileUsingSendfileTest)
Traceback:
  File "(...)/test_socket.py", line 277, in _tearDown
    raise exc
  File "(...)/test_socket.py", line 289, in clientRun
    test_func()
  File "(...)/test_socket.py", line 5313, in _testWithTimeout
    sent = meth(file)
  File "(...)/Lib/socket.py", line 296, in _sendfile_use_sendfile
socket.timeout: timed out

FAIL: testCountWithOffset (test.test_socket.SendfileUsingSendfileTest)
Traceback:
  File "(...)/test_socket.py", line 5287, in testCountWithOffset
    self.assertEqual(len(data), count)
AssertionError: 4376231 != 100007

Ran 539 tests in 69.166s

FAILED (failures=1, errors=4, skipped=324)
test test_socket failed
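To look at the testCountWithOffset failure outside the test suite, a stand-alone check along these lines might help. It is a hypothetical sketch, only loosely modeled on the test: it uses socket.socketpair() instead of the test's own connection setup, a 1 MiB temporary file, and a reader thread so the sender never blocks on a full socket buffer. Only count=100007 comes from the output above; the offset is arbitrary, and check_sendfile() is an illustrative name. Note that socket.sendfile() uses os.sendfile() where available but may fall back to send(), so it is not guaranteed to take exactly the same path as SendfileUsingSendfileTest.

```python
import socket
import tempfile
import threading


def check_sendfile(offset: int, count: int, file_size: int = 1024 * 1024):
    """Hypothetical check: send `count` bytes from `offset` of a temporary
    file with socket.sendfile() and return
    (bytes reported sent, bytes received, received data matches)."""
    data = bytes(range(256)) * (file_size // 256)
    received = []

    def drain(sock: socket.socket) -> None:
        # Read until the sender shuts down its side of the connection.
        while True:
            chunk = sock.recv(8192)
            if not chunk:
                break
            received.append(chunk)

    with tempfile.TemporaryFile() as f:  # opened in binary mode ("w+b")
        f.write(data)
        f.flush()
        sender, receiver = socket.socketpair()
        with sender, receiver:
            reader = threading.Thread(target=drain, args=(receiver,))
            reader.start()
            sent = sender.sendfile(f, offset=offset, count=count)
            sender.shutdown(socket.SHUT_WR)  # lets drain() see EOF
            reader.join()
    got = b"".join(received)
    return sent, len(got), got == data[offset:offset + count]


if __name__ == "__main__":
    # count=100007 matches the AssertionError above; offset=2007 is arbitrary.
    print(check_sendfile(2007, 100007))
```

Where sendfile() behaves as expected, the result is (100007, 100007, True); a received length that differs from count, like the 4376231 reported above, would show up in the second element.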
msg296475 - Author: Jim Crigler (Jim Crigler) Date: 2017-06-20 17:39
I'm having the same problem with gcc 6.2.

Is there any update?
msg296546 - Author: Peter (petriborg) Date: 2017-06-21 11:26
Getting the same test_socket errors on Solaris 11 with Python 3.5.3.


======================================================================
ERROR: testCount (test.test_socket.SendfileUsingSendfileTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 5204, in testCount
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 5107, in recv_data
MemoryError

======================================================================
ERROR: testCount (test.test_socket.SendfileUsingSendfileTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 266, in _tearDown
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 278, in clientRun
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 5197, in _testCount
  File "/usr/local/src/Python-3.5.3/Lib/socket.py", line 286, in _sendfile_use_sendfile
    raise _socket.timeout('timed out')
socket.timeout: timed out

======================================================================
ERROR: testWithTimeout (test.test_socket.SendfileUsingSendfileTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 5274, in testWithTimeout
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 5107, in recv_data
MemoryError

======================================================================
ERROR: testWithTimeout (test.test_socket.SendfileUsingSendfileTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 266, in _tearDown
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 278, in clientRun
  File "/usr/local/src/Python-3.5.3/Lib/test/test_socket.py", line 5269, in _testWithTimeout
  File "/usr/local/src/Python-3.5.3/Lib/socket.py", line 286, in _sendfile_use_sendfile
    raise _socket.timeout('timed out')
socket.timeout: timed out

----------------------------------------------------------------------
Ran 530 tests in 54.577s

FAILED (errors=4, skipped=315)
test test_socket failed
1 test failed again:
    test_socket
msg301914 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-11 22:06
Since it seems like Solaris is dying, I'm not sure that it still makes sense to fix Python issues specific to Solaris. Here, I don't understand the issue, no patch is proposed, and I'm not really interested in investigating :-/
msg374660 - Author: Dennis Clarke (blastwave) Date: 2020-08-01 11:43
Well, here we are in 2020 and Solaris systems are still running just fine. In fact, some big Fujitsu SPARC systems have been running in production for years and years, and, no surprise, this test still fails horrifically on old, stable Solaris 10. Python is turning into a piece of supposedly open source software with many commercial interests with their hands inside of it. I am not sure how to get this bug fixed, but I can certainly report that it is still broken in 3.7.8 on a very stable and reliable platform.
msg374723 - Author: Brian Vandenberg (phantal) Date: 2020-08-03 04:52
Solaris will be around for at least another 10-15 years.

The least you could do is look into it and offer some speculations.
msg374754 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-08-03 20:17
What do you expect us to do? No Python core dev has access to a Solaris machine. We cannot debug the issue and have to rely on external contributions. We have not declared Solaris as unsupported yet because people are still contributing fixes.

If you are looking for wild speculations: I guess Solaris' sendfile() is either broken or does not behave like it does on other platforms.
msg374767 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-08-03 23:04
> I am not sure how to get this bug fixed (...)

Someone has to write a fix. You may contact the Solaris vendor, or a company using Solaris that wants to pay a developer to write a fix.
msg374790 - Author: Brian Vandenberg (phantal) Date: 2020-08-04 03:19
Christian, you did exactly what I needed.  Thank you.

I don't have the means to do a git bisect to find where it broke. It wasn't a problem around the 3.3 timeframe, and I'm not sure when this sendfile stuff was implemented.

The man page for sendfile says "The sendfile() function does not modify the current file pointer of in_fd, (...)". In other words, the read pointer for the input descriptor won't be advanced. They expect you to use it like this:

off_t offset = 0;
do {
  ret = sendfile(out, in, &offset, len);
} while( ret < 0 && (errno == EAGAIN || errno == EINTR) );

... though making that change in posixmodule.c would break this test severely since the send & receive code is running on the same thread.

In posixmodule.c I don't see anything that attempts to return the number of bytes successfully sent. Since the input file descriptor won't have its read pointer advanced, the variable "offset" must be set to the correct offset value; otherwise, it just keeps reading the first 32k of the file that was generated for the test.
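The offset bookkeeping described above can be sketched in Python with os.sendfile(), which takes an explicit offset and returns the number of bytes sent. This is a minimal illustration, not code from posixmodule.c or socket.py: send_range() and demo() are hypothetical names, and the loop assumes sendfile() returns 0 at end of file, as it does on Linux, which is not guaranteed on every platform.

```python
import errno
import os
import socket
import tempfile


def send_range(out_fd: int, in_fd: int, offset: int, count: int) -> int:
    """Send up to `count` bytes of in_fd, starting at `offset`, to out_fd.

    sendfile() leaves in_fd's file pointer alone, so the caller owns the
    offset and advances it by each call's return value.
    Returns the number of bytes actually sent.
    """
    total = 0
    while total < count:
        try:
            sent = os.sendfile(out_fd, in_fd, offset, count - total)
        except OSError as exc:
            # Same retry policy as the C loop: try again on EAGAIN/EINTR.
            # A real implementation would wait with select() rather than spin.
            if exc.errno in (errno.EAGAIN, errno.EINTR):
                continue
            raise
        if sent == 0:
            break  # 0 bytes means EOF on in_fd: nothing left to send
        offset += sent
        total += sent
    return total


def demo(offset: int, count: int) -> bytes:
    """Send a slice of a temporary file through a socketpair; return what arrives."""
    payload = bytes(range(256)) * 40  # 10240 bytes of known data
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        sender, receiver = socket.socketpair()
        with sender, receiver:
            send_range(sender.fileno(), f.fileno(), offset, count)
            sender.shutdown(socket.SHUT_WR)  # lets the receiver see EOF
            chunks = []
            while True:
                chunk = receiver.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
    return b"".join(chunks)
```

On Linux, demo(100, 5000) returns the 5000 bytes starting at offset 100, and demo(10000, 5000) stops after the last 240 bytes of the 10240-byte file instead of repeating the start of the file.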
msg374794 - Author: Brian Vandenberg (phantal) Date: 2020-08-04 04:04
I accidentally hit submit too early.

I tried changing the code in posixmodule.c to use lseek(), something like the following:

offset = lseek( in, 0, SEEK_CUR );

do {
  ret = sendfile(...);
} while( ... );
lseek( in, offset, SEEK_SET );

... however, in addition to sendfile() not advancing the file pointer, it also doesn't seem to cause an EOF condition. In my first attempt at the above, I was doing this after the loop:

lseek( in, offset, SEEK_CUR );

... and it just kept advancing the file pointer well beyond the end of the file and sendfile() had absolutely no qualms about reading beyond the end of the file.

I even tried adding a read() after the 2nd lseek to see if I could force an EOF condition but that didn't do it.
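The lseek() bookkeeping described above can also be sketched in Python. This is a hypothetical helper, not code from posixmodule.c: it records the current position with SEEK_CUR, passes an explicit offset to os.sendfile(), and afterwards moves the file pointer with SEEK_SET to the absolute end offset. Because sendfile() reportedly did not signal EOF on this system, the sketch also assumes that clamping the request to the size reported by os.fstat() is an acceptable guard; that choice is an assumption, not something confirmed in this thread.

```python
import os
import socket
import tempfile


def send_from_current_position(out_fd: int, in_fd: int, count: int) -> int:
    """Send up to `count` bytes starting at in_fd's current file position,
    then move that position past the bytes actually sent.

    Assumption: the request is clamped to the file size from fstat(), so
    sendfile() is never asked to read past EOF, even if it would not report
    EOF by returning 0.
    """
    start = os.lseek(in_fd, 0, os.SEEK_CUR)       # where the caller left off
    size = os.fstat(in_fd).st_size
    remaining = max(0, min(count, size - start))  # never request bytes past EOF
    offset = start
    while remaining > 0:
        sent = os.sendfile(out_fd, in_fd, offset, remaining)
        if sent == 0:
            break
        offset += sent
        remaining -= sent
    # SEEK_SET to the absolute end offset; SEEK_CUR would add it to the
    # (unchanged) current position and overshoot.
    os.lseek(in_fd, offset, os.SEEK_SET)
    return offset - start


def demo():
    """Two back-to-back calls starting at position 1000 of a 6000-byte file."""
    payload = b"x" * 3000 + b"y" * 3000
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        fd = f.fileno()
        os.lseek(fd, 1000, os.SEEK_SET)
        sender, receiver = socket.socketpair()
        with sender, receiver:
            first = send_from_current_position(sender.fileno(), fd, 4000)
            second = send_from_current_position(sender.fileno(), fd, 4000)
            position = os.lseek(fd, 0, os.SEEK_CUR)
            sender.shutdown(socket.SHUT_WR)
            chunks = []
            while True:
                chunk = receiver.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
    return first, second, position, b"".join(chunks)
```

On Linux, demo() reports 4000 and 1000 bytes for the two calls and a final position of 6000: the second call is clamped to what is left in the file, and the file pointer ends exactly at EOF. Using SEEK_CUR with the absolute offset, as in the first attempt, adds that offset to the current position instead of setting it, which is consistent with the file pointer running past the end of the file.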
History
Date                 User              Action  Args
2020-08-04 16:01:56  vstinner          set     nosy: - vstinner
2020-08-04 04:04:56  phantal           set     messages: + msg374794
2020-08-04 03:19:24  phantal           set     messages: + msg374790
2020-08-03 23:04:49  vstinner          set     messages: + msg374767
2020-08-03 20:17:35  christian.heimes  set     nosy: + christian.heimes
                                               messages: + msg374754
2020-08-03 04:52:47  phantal           set     messages: + msg374723
2020-08-01 11:43:23  blastwave         set     nosy: + blastwave
                                               messages: + msg374660
                                               versions: + Python 3.7, - Python 3.6
2017-09-11 22:06:38  vstinner          set     nosy: + vstinner
                                               messages: + msg301914
2017-06-21 11:26:31  petriborg         set     nosy: + petriborg
                                               messages: + msg296546
2017-06-20 17:39:37  Jim Crigler       set     nosy: + Jim Crigler
                                               messages: + msg296475
2017-01-13 20:55:24  phantal           create