Message 310032 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	njs
Recipients	benjamin.peterson, njs, stutzbach
Date	2018-01-16.02:07:19
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1516068442.29.0.467229070634.issue32561@psf.upfronthosting.co.za>
In-reply-to

Content
Background: Doing I/O to files on disk has a hugely bimodal latency. If the I/O happens to be in or going to cache (either user-space cache, like in io.BufferedIOBase, or the OS's page cache), then the operation returns instantly (~1 µs) without blocking. OTOH if the I/O isn't cached (for reads) or cacheable (for writes), then the operation may block for 10 ms or more. This creates a problem for async programs that want to do disk I/O. You have to use a thread pool for reads/writes, because sometimes they block for a long time, and you want to let your event loop keep doing other useful work while it's waiting. But dispatching to a thread pool adds a lot of overhead (~100 µs), so you'd really rather not do it for operations that can be serviced directly through cache. For uncached operations a thread gives a 100x speedup, but for cached operations it's a 100x slowdown, and -- this is the kicker -- there's no way to predict which ahead of time. But, io.BufferedIOBase at least knows when it can satisfy a request directly from its buffer without issuing any syscalls. And in Linux 4.14, it's even possible to issue a non-blocking read to the kernel that will only succeed if the data is immediately available in page cache (bpo-31368). So, it would be very nice if there were some way to ask a Python file object to do a "nonblocking read/write", which either succeeds immediately or else raises an error. The intended usage pattern would be: async def read(self, args): try: self._fileobj.read(args, nonblock=True) except BlockingIOError: # maybe? return await run_in_worker_thread(self._fileobj.read, args) It would really* help for this to be in the Python core, because right now the convenient way to do non-blocking disk I/O is to re-use the existing Python I/O stack, with worker threads. (This is how both aiofiles and trio's async file support work. I think maybe curio's too.) But to implement this feature ourselves, we'd have to first reimplement the whole I/O stack, because the important caching information, and choice of what syscall to use, are hidden inside.

Background: Doing I/O to files on disk has a hugely bimodal latency. If the I/O happens to be in or going to cache (either user-space cache, like in io.BufferedIOBase, or the OS's page cache), then the operation returns instantly (~1 µs) without blocking. OTOH if the I/O isn't cached (for reads) or cacheable (for writes), then the operation may block for 10 ms or more.

This creates a problem for async programs that want to do disk I/O. You have to use a thread pool for reads/writes, because sometimes they block for a long time, and you want to let your event loop keep doing other useful work while it's waiting. But dispatching to a thread pool adds a lot of overhead (~100 µs), so you'd really rather not do it for operations that can be serviced directly through cache. For uncached operations a thread gives a 100x speedup, but for cached operations it's a 100x slowdown, and -- this is the kicker -- there's no way to predict which ahead of time.

But, io.BufferedIOBase at least knows when it can satisfy a request directly from its buffer without issuing any syscalls. And in Linux 4.14, it's even possible to issue a non-blocking read to the kernel that will only succeed if the data is immediately available in page cache (bpo-31368).

So, it would be very nice if there were some way to ask a Python file object to do a "nonblocking read/write", which either succeeds immediately or else raises an error. The intended usage pattern would be:

async def read(self, *args):
    try:
        self._fileobj.read(*args, nonblock=True)
    except BlockingIOError: # maybe?
        return await run_in_worker_thread(self._fileobj.read, *args)

It would *really* help for this to be in the Python core, because right now the convenient way to do non-blocking disk I/O is to re-use the existing Python I/O stack, with worker threads. (This is how both aiofiles and trio's async file support work. I think maybe curio's too.) But to implement this feature ourselves, we'd have to first reimplement the whole I/O stack, because the important caching information, and choice of what syscall to use, are hidden inside.

History
Date	User	Action	Args
2018-01-16 02:07:22	njs	set	recipients: + njs, benjamin.peterson, stutzbach
2018-01-16 02:07:22	njs	set	messageid: <1516068442.29.0.467229070634.issue32561@psf.upfronthosting.co.za>
2018-01-16 02:07:22	njs	link	issue32561 messages
2018-01-16 02:07:19	njs	create