-
Notifications
You must be signed in to change notification settings - Fork 699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP][Runtime] Add runtime Executor class #2267
Conversation
Description: This commit adds `Executor`, an interface for a component of the Glow runtime that is capable of running a partitioned directed acyclic Glow graph and returning the results of that execution. This commit also adds `ThreadPoolExecutor`, an implementation of this interface that runs graphs using a thread pool. The workers essentially do a concurrent breadth-first search through the node graph to execute all nodes, repeatedly copying partition components outputs to their descendents along the way. Testing: It compiles. Documentation: Comments.
Overall, I think this is simpler and cleaner than v1, but there is still some iteration and cleanup required. There is still some complicated locking/unlocking going on, and there is probably a cleaner way to split up the state and responsibilities between |
Does this PR include the changes in #2249? The device manager stuff looks similar... |
@bertmaher it's got the interface and runtime types from that PR yes, ideally we land #2249 first and rebase onto it. |
Yep, @nickgg is correct. I will rebase on top of that PR after it lands. |
@@ -59,6 +65,8 @@ class RuntimeBundle { | |||
size_t getValueOffset(const Named *v) const; | |||
/// Helper function, gets symbol info for \p v. | |||
RuntimeSymbolInfo getSymbolInfo(const Named *v) const; | |||
/// Get a const reference to the symbol table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Channeling @bertmaher and since I'm a frequent transgressor, remember vertical whitespace between declarations.
|
||
using DeviceNetworkID = size_t; | ||
|
||
enum ResultCode { READY, EXECUTED, FAILED, CANCELLED }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ResultCode enum leaks into the global namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These now live in RuntimeTypes.h anyway so can just include that
@@ -30,13 +30,19 @@ struct RuntimeSymbolInfo { | |||
size_t offset; | |||
/// Type of symbol. | |||
Type type; | |||
/// Is the symbol an input for the function. | |||
bool input; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input -> isInput
#include "glow/Graph/Graph.h" | ||
#include "glow/Support/ThreadPool.h" | ||
|
||
#include <atomic> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move some of the includes from the header file to the cpp file?
namespace runtime { | ||
|
||
/// Forward declaration of getDeviceManager function. | ||
/// TODO: Talk to gcatron about the details of this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't leave usernames and todos in the codebase. You can file an issue.
|
||
using DeviceNetworkID = size_t; | ||
|
||
enum ResultCode { READY, EXECUTED, FAILED, CANCELLED }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These now live in RuntimeTypes.h anyway so can just include that
}; | ||
|
||
/// Create a executor of kind \p kind. | ||
Executor *createExecutor(ExecutorKind executorKind); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could have a default argument since we wouldn't expect to call it more than once in the runtime.
namespace runtime { | ||
|
||
/// This enum lists the available executors. | ||
enum class ExecutorKind { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we need this? I could see maybe we decide on a different executor than a Thread Pool but would we still want to support both?
executionStateLocks_[runId]; | ||
executionStates_.insert(std::make_pair( | ||
runId, std::make_shared<ExecutionState>(runId, std::move(cb)))); | ||
lock.unlock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really prefer using scope to control locking rather than a manual unlock()
, its easy for the next developer to introduce a path that locks or unlocks at the wrong point.
DAGNode *node, | ||
Context *ctx) { | ||
// Get the execution state for the run. | ||
std::shared_ptr<ExecutionState> executionState = executionStates_[runId]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to hold the executionStateLocksMtx_
lock while reading executionStates_
since it may be being modified by another thread. If you want to release the lock during the propagation process I think you should pass the ExecutionState* in rather than the RunIdentifier.
// Run the node using the DeviceManager. | ||
deviceManager->runFunction( | ||
node->name, std::move(nodeCtx), | ||
[this, runId, node](RunIdentifierTy id, ResultCode resultCode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id
is unused, but it was added specifically for this use case - looks like you have your own runId. Do we still need the RunIdenitifier generated by runFunction?
/// Locks for the execution state objects. These are used to ensure that | ||
/// the shared state contained in the ExecutionState object for a run is | ||
/// mutated by only one thread at a time. | ||
std::unordered_map<RunIdentifierTy, std::mutex> executionStateLocks_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lifetime of these locks appears to be identical to the ExecutionState
, why not make the lock a member of the state?
// Get the execution state for the run. | ||
auto executionStateIt = executionStates_.find(runId); | ||
if (executionStateIt == executionStates_.end()) { | ||
// This should never happen. TODO: Log, assert, something. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels safer to start with an assert than not.
|
||
std::shared_ptr<ExecutionState> executionState = executionStateIt->second; | ||
|
||
if (resultCode == ResultCode::CANCELLED || resultCode == ResultCode::FAILED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This seems clearer as if (resultCode != ResultCode::EXECUTED)
if ((executionState->inflightNodes).empty()) { | ||
// If there are no nodes inflight, that means all nodes are done. Call | ||
// the callback and erase the state information. | ||
executionState->cb(runId, ResultCode::EXECUTED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holding a lock while calling the callback is dangerous, this could deadlock if the cb calls back into the Executor.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
This PR has been automatically closed due to being stale for 5 days. Thank you for your contributions and feel free to reopen it in case of further progress. |
Description
This commit adds
Executor
, an interface for a component of the Glowruntime that is capable of running a partitioned directed acyclic Glow
graph and returning the results of that execution. This commit also adds
ThreadPoolExecutor
, an implementation of this interface that runsgraphs using a thread pool. The workers essentially do a
concurrent breadth-first search through the node graph to execute all
nodes, repeatedly copying partition components outputs to their descendants along
the way.
Testing
Tests WIP. I need to write some code to mock DAGs, symbol information and
DeviceManager
; let me know if you've done this already and can share the code.