I haven't started working on a PR, so please go ahead if you're interested.

One small suggestion: if you implement this, please note that the proof-of-concept implementation shown in the description is not very efficient, because each call to `wait` has to iterate over all the futures (potentially a large number) to set up and tear down the done callbacks on each one. A more efficient implementation would set up the callbacks only once - see for an example.
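
To make the "set up the callbacks only once" idea concrete, here is a rough sketch. It assumes `concurrent.futures.Future` objects; `CompletionWaiter` and its `add`/`wait` methods are hypothetical names I made up for illustration, not the API proposed in the description. Each future gets exactly one done callback, which pushes it onto a queue, so a call to `wait` just pops from the queue instead of touching every pending future.

```python
import queue
from concurrent.futures import Future


class CompletionWaiter:
    """Hypothetical helper: one done callback per future, registered once."""

    def __init__(self, futures=()):
        self._done = queue.SimpleQueue()  # completed futures, in completion order
        self._pending = 0                 # futures added but not yet returned by wait()
        for fut in futures:
            self.add(fut)

    def add(self, fut):
        self._pending += 1
        # Registered exactly once; called with the future when it finishes
        # (or immediately if it is already done).
        fut.add_done_callback(self._done.put)

    def wait(self, timeout=None):
        """Return the next completed future, or None if nothing is left.

        Raises queue.Empty if the timeout expires first.
        """
        if self._pending == 0:
            return None
        fut = self._done.get(timeout=timeout)
        self._pending -= 1
        return fut


a, b = Future(), Future()
waiter = CompletionWaiter([a, b])
b.set_result("second")
a.set_result("first")
print(waiter.wait().result())  # second
```

With this shape, the per-call cost of `wait` no longer depends on how many futures are still pending. The sketch assumes `add` and `wait` are called from a single consumer thread; the callbacks themselves may run on any thread, which is why the hand-off goes through a thread-safe queue.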
