classification
Title: Race condition in asyncore may access the wrong dispatcher
Type: resource usage Stage:
Components: Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: giampaolo.rodola Nosy List: Nir Soffer, giampaolo.rodola, haypo, walkhour
Priority: normal Keywords:

Created on 2017-07-14 16:44 by walkhour, last changed 2017-07-29 09:53 by giampaolo.rodola.

Pull Requests
URL Status Linked Edit
PR 2707 open walkhour, 2017-07-14 16:48
PR 2764 closed walkhour, 2017-07-19 15:54
PR 2854 closed haypo, 2017-07-24 22:29
Messages (15)
msg298359 - (view) Author: Jaume (walkhour) * Date: 2017-07-14 16:44
If a socket is closed and polled here https://github.com/python/cpython/blob/0d0a32fb91cdfea1626e6c6b77a9bc44e15a2b8a/Lib/asyncore.py#L183 the flag returned will be select.POLLNVAL (note that this can happen despite dispatcher.close() being called).

This wouldn't be a problem but before here https://github.com/python/cpython/blob/0d0a32fb91cdfea1626e6c6b77a9bc44e15a2b8a/Lib/asyncore.py#L185 a new socket may be created with the same fd (which wouldn't be strange since that fd is now available), so the retrieved socket will be the newly created one instead of the old one and the new one will be close.

This is regularly happening to us, I could try to explain better, but that's really it.
msg298362 - (view) Author: Nir Soffer (Nir Soffer) Date: 2017-07-14 17:17
Can you provide a minimal reproducer, or best add a failing test?
msg298682 - (view) Author: Nir Soffer (Nir Soffer) Date: 2017-07-19 16:21
Adding more info after discussion in github.

After polling readable/writeable dispatchers, asyncore.poll(2) receive a list of ready file descriptors, and invoke callbacks on the dispatcher objects.

If a dispatchers is closed and and a new dispatcher is created, the new dispatcher may get the same file descriptor. If the file descriptor was in the ready list returned from poll()/select(), asyncore may try to invoke one of the callbacks (e.g. handle_write) on the new dispatcher.

Here is an example in asycore.poll()

        r, w, e = select.select(r, w, e, timeout)

        for fd in r:
            obj = map.get(fd)
            if obj is None:
                continue
            read(obj)

read close obj, removing fd from map, then creates a new one
getting the same fd...

        for fd in w:
            obj = map.get(fd)

this get the new object from the map, instead of the closed one.

            if obj is None:
                continue
            write(obj)

invoke write on the wrong socket, which is not writable

        for fd in e:
            obj = map.get(fd)

same issue here

            if obj is None:
                continue
            _exception(obj)


asyncore.poll2() has same issue:

        r = pollster.poll(timeout)
        for fd, flags in r:
            obj = map.get(fd)
            if obj is None:
                continue
            readwrite(obj, flags)

fd may have been closed and recreated by in a previous iteration of the loop.

This issue is demonstrated in the failing test:
https://github.com/python/cpython/pull/2707/commits/5aeb0098d2347984f3a89cf036c305edd2b8382b
msg299030 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-24 23:14
On my PR 2854, Nir added these comments (extract):

"And now we try to read from close dispatcher. I think we should wait for #2804 (...)"

Sorry, I don't understand. My PR fixes described the bug that you described in msg298682:

"If a dispatchers is closed and and a new dispatcher is created, the new dispatcher may get the same file descriptor. If the file descriptor was in the ready list returned from poll()/select(), asyncore may try to invoke one of the callbacks (e.g. handle_write) on the new dispatcher."

About reading from closed dispatcher: sorry, I don't know the asyncore concurrency model, but I know the asyncio concurrency model. In asyncio, I don't see why a "dispatcher" would affect any "unrelated" dispatcher. Said differently: if a dispatcher A closes the dispatcher B and dispatcher B is "readable", the read event will be handled anyway. The "close event" will be handled at the next loop iteration. The oversimplified asyncio model is "each iteration is atomic".

Closing a dispatcher calls its del_channel() which removes its file descriptor from the asyncore map.

--

If I understand correctly, the *current behaviour* depends on the file descriptor order, since select.select() and select.poll.poll() returns ordered file descriptors (smallest to largest). If a dispatcher A is called and closes a dispatcher B, the dispatcher B is called or not depending on the file descriptor. If A.fileno() < B.fileno(), B is not called. If A.fileno() > B.fileno(), B was already called. Am I right?

What is the use case of a dispatcher closing another one? What is the consequence on my PR? Does the dispatcher B expects to not be called to read if it was already closed?

I see that close() also immediately closes the underlying socket. Does it mean that my PR can triggered OSError since we try recv() on a closed socket?
msg299132 - (view) Author: Nir Soffer (Nir Soffer) Date: 2017-07-25 18:58
On my PR 2854, Nir added these comments (extract):

> "And now we try to read from close dispatcher. I think we should wait
> for #2804 (...)"

> Sorry, I don't understand. My PR fixes described the bug that you
> described in msg298682:

> "If a dispatchers is closed and and a new dispatcher is created, the
> new dispatcher may get the same file descriptor. If the file
> descriptor was in the ready list returned from poll()/select(),
> asyncore may try to invoke one of the callbacks (e.g.  handle_write)
> on the new dispatcher."

> About reading from closed dispatcher: sorry, I don't know the asyncore
> concurrency model, but I know the asyncio concurrency model. In
> asyncio, I don't see why a "dispatcher" would affect any "unrelated"
> dispatcher. Said differently: if a dispatcher A closes the dispatcher
> B and dispatcher B is "readable", the read event will be handled
> anyway. The "close event" will be handled at the next loop iteration.
> The oversimplified asyncio model is "each iteration is atomic".

I don't know asyncio enough, but I suspect this issue may effect it if
it is using the fd to check if a reader/writer is closed. This is the
main issue in asyncore, assuming that during the loop:

    obj = map.get(fd)

Means that a obj is the dispatcher that owned fd when preparing for
poll()/select(), and we can use obj for calling the I/O callbacks.

The dispatcher owning this fd may have been closed during the loop, and
a new dispatcher may have been created with the *same* fd. The only way
to check if a dispatcher is closed is to check the object.

> Closing a dispatcher calls its del_channel() which removes its file
> descriptor from the asyncore map.

And creating new one at that point will add a new channel using the same
fd.

> If I understand correctly, the *current behaviour* depends on the file
> descriptor order, since select.select() and select.poll.poll() returns
> ordered file descriptors (smallest to largest). If a dispatcher A is
> called and closes a dispatcher B, the dispatcher B is called or not
> depending on the file descriptor. If A.fileno() < B.fileno(), B is not
> called. If A.fileno() > B.fileno(), B was already called. Am I right?

I don't think the order matter, only the fact that closing a dispatcher
and creating a new one is likely to use the same fd.

> What is the use case of a dispatcher closing another one? What is the
> consequence on my PR? Does the dispatcher B expects to not be called
> to read if it was already closed?

I don't know what is the use case of the reporter, but here is one
possible example - you implement a proxy, each proxy connection has 2
legs, each using a dispatcher connected to an endpoint. If one leg is
closed, you close also the other leg (one dispatcher closing another).
Now if you create a new dispatcher during the same loop iteration, it
may get the same fd of the other closed leg, and asyncore may try to
read/write with this dispatcher. This may work or not depending on how
you implement the dispatcher.

This is domonstrated in
https://github.com/python/cpython/pull/2764/commits/5aeb0098d2347984f3a89cf036c305edd2b8382b

> I see that close() also immediately closes the underlying socket. Does
> it mean that my PR can triggered OSError since we try recv() on a
> closed socket?

Yes, this is a typical issue in asyncore if you don't protect your
subclass to handle double close. asyncore.file_wrapper is protected, but
asyncore.dispatcher and its subclasses are not.
msg299144 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-07-25 21:21
> Yes, this is a typical issue in asyncore if you don't protect your
> subclass to handle double close.

I use the same trick all over the place in pyftpdlib:
https://github.com/giampaolo/pyftpdlib/blob/1268bb185cd63c657d78bc33309041628e62360a/pyftpdlib/handlers.py#L537
In practical terms, does this bug report aim to fix this issue?
msg299150 - (view) Author: Nir Soffer (Nir Soffer) Date: 2017-07-25 21:52
> I use the same trick all over the place in pyftpdlib:
> https://github.com/giampaolo/pyftpdlib/blob/1268bb185cd63c657d78bc33309041628e62360a/pyftpdlib/handlers.py#L537

This allow detection of closed sockets, but does not fix the issue of
accessing the wrong dispatcher.

> In practical terms, does this bug report aim to fix this issue?

Yes, see the attached PR's.
msg299156 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-25 22:42
I'm not sure that it's a good idea to compare asyncore and asyncio. While their name are similar, their design are *very* different. I'm only talking about the kernel, the core event loop checking for file descriptors.

In asyncio, when you close a transport, the transport is not *closed* immediately. It is scheduled to be closed as soon as possible: usually in the next loop iteration, but it can longer to complex transports like subprocesses or TLS connections.

Thanks to this design, asyncio doesn't have the race condition described in this issue.
msg299158 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-25 22:51
If I understood correctly, to fix this race condition, there are two main constraints:

(A) We must call handlers of dispatcher registered before calling select()/poll(): dispatchers can be closed and their file descriptor can be reused while we execute other dispatchers

(B) Don't call closed dispatcher

--

For (A), I only see one option: copy somehow the map, for example using: ready = [(map[fd], flags) for fd, flags in r]

--

For (B), it's more tricky.

The bpo-30985 proposes to start using the closing: I'm not sure that it's doable in a stable version like 3.6, or worse in 2.7, since existing code may use closing differently.
http://bugs.python.org/issue30985#msg299154

I'm also concerned about subclassing: if a subclass overrides *completely* close(), we would fail to detect that the dispatcher was closed. Maybe this corner doesn't occur in the wild?

I see a different approach: while iterating on "ready" objects, check again if map.get(fd) still is the expected dispatcher. If the file descriptor is no more registered in map: it was closed. If we get a new dispatcher: not only the previous dispatcher was closed, but a new dispatcher also got the same file descriptor and was registered in the map.

Do you think that this approach would work? Would it be safer in term of backward compatibility?
msg299162 - (view) Author: Jaume (walkhour) * Date: 2017-07-25 23:08
There's an alternative fix which follows a similar approach to the one you mention: https://github.com/python/cpython/pull/2707/.

I personally don't like too much to reuse again again an old variable as it's true that we don't know how people are using it (although I take they use it as a boolean since it's assigned to False).

On the other side this approach copies the map when it's not strictly necessary to do use just because we can't do it the proper way by using closing, as shown in the PR using it: https://github.com/python/cpython/pull/2764/files
msg299164 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-25 23:21
> There's an alternative fix which follows a similar approach to the one you mention: https://github.com/python/cpython/pull/2707/.

Sorry, I'm slow to understand. Now that the bug and expected behaviour is better explained, I can now review correctly your PR and I like it :-)

While you wrote you previous comment, I updated my PR to implement a similar idea:
https://github.com/python/cpython/pull/2854

My code now detects if a dispatcher was closed, but also if it's file descriptor was reused by a newly registered dispatcher.

Our PR are similar:

* create a copy before starting to run handler
* check if our copy is still consistent to the current map: if not, skip the dispatcher handler

My PR is different:

* It only creates a list of ready objects: it might be cheaper if only few file descriptors are ready, whereas the map is large... I don't think that it's a real performance bottleneck in practice, but well, I'm just trying to compare our implementations :-) ... To be honest, my PR is a copy of your other PR, not you wrote both versions in fact ;-) (that's why I credited you as co-author in my PR)

* My PR has a simpler and more specific unit test: only testing the poll() function, no server, no client: just handlers.

* My PR also tests when a dispatcher was only closed, not replaced.

--

Good! It seems like slowly we converge to a solution, no? Now, it's more bikeshedding on the exact implementation ;-)
msg299300 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-27 10:11
Ok, wow, the discussion on this issue, bpo-30985, PR 2854, PR 2804, etc. was very productive. We identified much more race conditions than the first one that I spotted.


(Bug 1) The dispatcher A closes the dispatcher B. Currently, asyncore calls the handlers of the dispatcher B.

Technically, the close() method of a dispatcher closes immediately a socket object. So it's very likely that handlers of the dispatcher B fails on recv(), send(), or any other operation on the socket.

There is also a risk that somehow dispatcher B tries to closes again the socket (technically, socket, pipe or something else) and gets a OSError(EBADF) or worse: closes an unrelated socket!


(Bug 2) Similar to (Bug 1), but the subtle coming from the current implementation causes even worse bugs. The dispatcher A closes the dispatcher B, and then the dispatcher A (or another dispatcher, it doesn't matter who make that) creates a new dispatcher C which reuses the same file descriptor than the old closed file descriptor of dispatcher B.

The consequence is that handlers of dispatcher C are called even if the socket is not ready to read or to write. Since sockets are non-blocking, recv() and send() are likely to fail with BlockingIOError. These errors are usually correctly handled in asynchronous code, but they can cause bugs. More generally, calling handlers when the socket is not ready can cause many kinds of bugs.


(Bug 3) If a socket is ready to read and ready to write, handle_read() and handle_write() handlers are expected to be called. The problem is when handle_read() (first called handler) closes the dispatcher.

It was decided to not call handle_write() is that case.


(Bug 4) Ok, this one is the best one :-) When a socket is ready, asyncore calls methods like handle_write_event(). These methods usually call a single handler like handle_write(), *but* sometimes can call multiple handlers like handle_connect() followed by handle_write(). If a first handler closes the dispatcher, the following handler may fail badly on trying to use a closed socket.

IMHO (Bug 3) and (Bug 4) are less surprising than (Bug 1) or (Bug 2) since they are determistic and so less "race condition". Since these bugs are determistic, I expect that applications are already written to handle these corner cases, somehow. The simplest option is to catch and ignore errors, asyncore already ignores many errors for us. For example asyncore already ignores EBADF, whereas IMHO it hides a design issue (the 4 bugs I just listed).

Technically, fixing (Bug 3) using my PR 2854 design (check if map[fd] is still the same dispatcher) is safe since it doesn't rely at all on the actual implementation of asyncore handle_xxx_event() (which can be overriden) not handle_xxx() methods defined by third party code.

Fixing (Bug 4) is more complex since it requires to decide how to detect that a dispatcher was closed (it seems like "dispatcher._fileno is None" test is the best choice), and to modify handle_xxx_event() methods, whereas these methods can and *are* overridden in subclasses.

--

About backward compatibility, in short, fixing any of these bug changes the exact behaviour, since fixing all these bugs require to *not* call a handler if the dispatcher was closed. *If* a specific application requires that handlers are called anyway, changes are likely to break these applications.

A workaround for these applications is to copy asyncore.py from an old Python version to keep the exact same behaviour. Another smoother workaround is to monkey-patch poll() and poll2(), and maybe also override handle_xxx_event() methods, to restore the old behaviour.

I don't want to add a flag to re-enable the old behaviour, since it seems like we all agree that the described bugs are clearly very bad design bugs and that they must be fixed.

--

Another option, Nir will probably not like it, is to consider that asyncore is dead and will not be fixed.

That's not my favorite option. Nir explained that on Python 2.7, asyncore is the most convenient module for his use case, he deeply rely on it, and so "just want" to fix know bugs.

--

I suggest to use this bpo to fix (Bug 1), (Bug 2) and (Bug 3), since they can all be fixed in poll() and poll2() with the same design, like my PR 2854.

For (Bug 4), I suggest to use bpo-30985 where we already discussed many options how to detect that a dispatcher was closed. I have no strong opinion on this bug, if it should be fixed or not. IMHO it's less important and applications already know how handle the bug.
msg299301 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-27 10:14
Much shorter summary:

* Handlers of dispatcher must not be called if the dispatcher was called.
* If a dispatcher is closed between the execution of two of his handlers, the next handlers should not be called

My PR 2854 adds 3 unit tests to verify this behaviour of 3 different "race conditions".
msg299350 - (view) Author: Nir Soffer (Nir Soffer) Date: 2017-07-27 20:47
Victor, I mostly agree with you, but I think we have here several bugs, and
we should do the minimal fix for each of them. Your PR is too big, trying
to fix too much.

(Bug 1) The dispatcher A closes the dispatcher B. Currently, asyncore calls the handlers of the dispatcher B.

We don't have such bug now - if dispatcher B is closed, it is not in the map, 
and current code will skip the fd when checking:

    for fd in r:
        obj = map.get(fd)
        if obj is None:
            continue

(Bug 2) Dispatcher A close Dispatcher B, create Dispatcher C. C get may use the same
fd as B. If B was ready, asyncore may get C from the map and access it instead of B.

This issue is reported in this bug.

(Bug 3) if handle_read() closed the dispatcher, asyncore will call handle_write() on
a closed dispatcher. This is a very old bug with asyncore.readwrite().

(Bug 4) handle_xxx_event() internal methods may call multiple handle_xxx() user
methods, again not checking if dispatcher was closed after each invocation.
Same, very old bug.

So I suggest that we fix *only* bug 2 in https://github.com/python/cpython/pull/2854.

The issue in readwrite() can be solved with the approach you suggest, but I prefer
to make small an careful steps so we don't introduce regressions in stable versions.

Also, there is already too much noise here an in the various PRs about this issue,
better file bugs for the rest of the issues and discuss them separately.

This commit
https://github.com/python/cpython/pull/2854/commits/bbd2d09ab999fa2214cbbd2589ae3642facd3057

Looks fine with the test_poll_close_replace_dispatchers test added in the later
commit.
msg299461 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2017-07-29 09:52
+1, I would push https://github.com/python/cpython/pull/2854/ first and fix the race condition only.
History
Date User Action Args
2017-07-29 09:53:08giampaolo.rodolasetassignee: giampaolo.rodola
2017-07-29 09:52:24giampaolo.rodolasetmessages: + msg299461
2017-07-27 20:47:50Nir Soffersetmessages: + msg299350
2017-07-27 10:14:20hayposetmessages: + msg299301
2017-07-27 10:11:28hayposetmessages: + msg299300
2017-07-25 23:21:08hayposetmessages: + msg299164
2017-07-25 23:08:20walkhoursetmessages: + msg299162
2017-07-25 22:51:31hayposetmessages: + msg299158
2017-07-25 22:42:04hayposetmessages: + msg299156
2017-07-25 21:52:05Nir Soffersetmessages: + msg299150
2017-07-25 21:21:02giampaolo.rodolasetnosy: + giampaolo.rodola
messages: + msg299144
2017-07-25 18:58:07Nir Soffersetmessages: + msg299132
2017-07-24 23:14:22hayposetnosy: + haypo
messages: + msg299030
2017-07-24 22:29:18hayposetpull_requests: + pull_request2905
2017-07-21 00:07:26hayposetversions: - Python 3.3, Python 3.4
2017-07-19 16:21:53Nir Soffersetmessages: + msg298682
title: Race condition in asyncore wrongly closes channel -> Race condition in asyncore may access the wrong dispatcher
2017-07-19 15:54:35walkhoursetpull_requests: + pull_request2823
2017-07-14 17:17:04Nir Soffersetnosy: + Nir Soffer
messages: + msg298362
2017-07-14 16:48:25walkhoursetpull_requests: + pull_request2771
2017-07-14 16:44:47walkhoursettype: resource usage
versions: + Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7
2017-07-14 16:44:08walkhourcreate