Message 346156 - Python tracker

Author	josnyder
Recipients	christian.heimes, josnyder
Date	2019-06-20.18:42:48
Message-id	<1561056169.29.0.814417679233.issue37355@roundup.psfhosted.org>

Content
Background:

SSLSocket.read drops the GIL and performs exactly one successful call to OpenSSL's `SSL_read`, whose documentation states "At most the contents of one record will be returned". TLS records carry at most 16 KB of payload, so each call returns at most 16 KB before the GIL has to be reacquired. As a result, high-throughput (especially multithreaded) TLS reception can become bottlenecked on the GIL.

Proposal:

For non-blocking sockets, call `SSL_read` in a loop until the user-supplied limit is reached or no more bytes are available on the socket. I don't know of a way to safely improve performance for blocking sockets.
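A rough user-level sketch of the looping idea, for illustration only. The proposal itself concerns the C implementation behind SSLSocket.read, where the loop would run without the GIL; this Python version still takes the GIL on every iteration and so would not deliver the same benefit. The helper name `read_until_limit` is hypothetical, and `sock` is assumed to be a non-blocking `ssl.SSLSocket` (or anything with a compatible `recv_into`).

```python
import ssl


def read_until_limit(sock, limit):
    """Read up to `limit` bytes from a non-blocking SSL socket.

    Hypothetical helper: keeps calling recv_into() until the buffer is
    full, the peer closes the connection, or no more data is available.
    """
    buf = bytearray(limit)
    view = memoryview(buf)
    total = 0
    while total < limit:
        try:
            # Each call typically yields at most one TLS record.
            n = sock.recv_into(view[total:], limit - total)
        except ssl.SSLWantReadError:
            if total == 0:
                # Nothing was read at all: let the caller see the error,
                # so "no data yet" is not confused with EOF.
                raise
            # Some data was read and the socket is now drained.
            break
        if n == 0:
            # The peer closed the connection.
            break
        total += n
    return bytes(view[:total])
```

A caller would typically wait for readability (for example with `select`) before calling the helper again after an `ssl.SSLWantReadError`.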

Initial testing:

I performed initial testing using 32 threads pinned to 16 cores, downloading and re-assembling a single 140,270 MB file from a "real world" TLS sender. This resulted in a 4x increase in throughput, a 6.6x reduction in voluntary context switches, and a 3.5x reduction in system time. User time did increase by 43%, so the overall reduction in CPU time (user + system) was only 2.67x.

                                 before     after
     wall clock time (s)    :    29.637     7.116
     user time (s)          :     8.793    12.584
     system time (s)        :   105.118    30.010
     user + system time (s) :   113.911    42.594
     cpu utilization (%)    :       384       599 
     voluntary switches     : 1,653,065   248,484
     speed (MB/s)           :      4733     19712
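For reference, the ratios quoted above can be re-derived from the table. This is only an arithmetic check on the reported figures; the variable names are made up for the example.

```python
# Figures copied from the table above (before / after the change).
before = {"wall": 29.637, "user": 8.793, "sys": 105.118,
          "switches": 1_653_065, "speed": 4733}
after = {"wall": 7.116, "user": 12.584, "sys": 30.010,
         "switches": 248_484, "speed": 19712}


def cpu_time(m):
    """Total CPU time: user + system seconds."""
    return m["user"] + m["sys"]


def utilization(m):
    """CPU utilization in percent: CPU time relative to wall clock time."""
    return 100 * cpu_time(m) / m["wall"]


throughput_gain = after["speed"] / before["speed"]        # about 4x
switch_reduction = before["switches"] / after["switches"]  # about 6.6x
sys_reduction = before["sys"] / after["sys"]               # about 3.5x
user_increase = after["user"] / before["user"] - 1         # about 43%
cpu_reduction = cpu_time(before) / cpu_time(after)         # about 2.67x

print(f"throughput gain:      {throughput_gain:.2f}x")
print(f"switch reduction:     {switch_reduction:.2f}x")
print(f"system time reduction: {sys_reduction:.2f}x")
print(f"user time increase:   {user_increase:.0%}")
print(f"CPU time reduction:   {cpu_reduction:.2f}x")
print(f"utilization: {utilization(before):.0f}% -> {utilization(after):.0f}%")
```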

My git branch (currently a draft) is at https://github.com/hashbrowncipher/cpython/commits/faster_tls

History
Date	User	Action	Args
2019-06-20 18:42:49	josnyder	set	recipients: + josnyder, christian.heimes
2019-06-20 18:42:49	josnyder	set	messageid: <1561056169.29.0.814417679233.issue37355@roundup.psfhosted.org>
2019-06-20 18:42:49	josnyder	link	issue37355 messages
2019-06-20 18:42:48	josnyder	create