Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][Runtime] Add runtime Executor class #2267

Closed
wants to merge 1 commit into from

Conversation

SplitInfinity
Copy link
Contributor

Description
This commit adds Executor, an interface for a component of the Glow
runtime that is capable of running a partitioned directed acyclic Glow
graph and returning the results of that execution. This commit also adds
ThreadPoolExecutor, an implementation of this interface that runs
graphs using a thread pool. The workers essentially do a
concurrent breadth-first search through the node graph to execute all
nodes, repeatedly copying partition components outputs to their descendants along
the way.

Testing
Tests WIP. I need to write some code to mock DAGs, symbol information and DeviceManager; let me know if you've done this already and can share the code.

Description:
This commit adds `Executor`, an interface for a component of the Glow
runtime that is capable of running a partitioned directed acyclic Glow
graph and returning the results of that execution. This commit also adds
`ThreadPoolExecutor`, an implementation of this interface that runs
graphs using a thread pool. The  workers essentially do a
concurrent breadth-first search through the node graph to execute all
nodes, repeatedly copying partition components outputs to their descendents along
the way.

Testing:
It compiles.

Documentation:
Comments.
@SplitInfinity
Copy link
Contributor Author

Overall, I think this is simpler and cleaner than v1, but there is still some iteration and cleanup required. There is still some complicated locking/unlocking going on, and there is probably a cleaner way to split up the state and responsibilities between ThreadPoolExecutor and ExecutionState. Right now, the latter is a simple contained for all the state related to a single run that allows me to avoid creating several instance variables in ThreadPoolExecutor of type std::unordered_map<RuntimeIdTy, std::unordered_map<DAGNode*, T>> which is annoying to read and reason about.

@bertmaher
Copy link
Contributor

Does this PR include the changes in #2249? The device manager stuff looks similar...

@nickgg
Copy link
Contributor

nickgg commented Jan 15, 2019

@bertmaher it's got the interface and runtime types from that PR yes, ideally we land #2249 first and rebase onto it.

@SplitInfinity
Copy link
Contributor Author

Yep, @nickgg is correct. I will rebase on top of that PR after it lands.

@@ -59,6 +65,8 @@ class RuntimeBundle {
size_t getValueOffset(const Named *v) const;
/// Helper function, gets symbol info for \p v.
RuntimeSymbolInfo getSymbolInfo(const Named *v) const;
/// Get a const reference to the symbol table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Channeling @bertmaher and since I'm a frequent transgressor, remember vertical whitespace between declarations.


using DeviceNetworkID = size_t;

enum ResultCode { READY, EXECUTED, FAILED, CANCELLED };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResultCode enum leaks into the global namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These now live in RuntimeTypes.h anyway so can just include that

@@ -30,13 +30,19 @@ struct RuntimeSymbolInfo {
size_t offset;
/// Type of symbol.
Type type;
/// Is the symbol an input for the function.
bool input;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input -> isInput

#include "glow/Graph/Graph.h"
#include "glow/Support/ThreadPool.h"

#include <atomic>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move some of the includes from the header file to the cpp file?

namespace runtime {

/// Forward declaration of getDeviceManager function.
/// TODO: Talk to gcatron about the details of this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't leave usernames and todos in the codebase. You can file an issue.


using DeviceNetworkID = size_t;

enum ResultCode { READY, EXECUTED, FAILED, CANCELLED };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These now live in RuntimeTypes.h anyway so can just include that

};

/// Create a executor of kind \p kind.
Executor *createExecutor(ExecutorKind executorKind);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could have a default argument since we wouldn't expect to call it more than once in the runtime.

namespace runtime {

/// This enum lists the available executors.
enum class ExecutorKind {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we need this? I could see maybe we decide on a different executor than a Thread Pool but would we still want to support both?

executionStateLocks_[runId];
executionStates_.insert(std::make_pair(
runId, std::make_shared<ExecutionState>(runId, std::move(cb))));
lock.unlock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really prefer using scope to control locking rather than a manual unlock(), its easy for the next developer to introduce a path that locks or unlocks at the wrong point.

DAGNode *node,
Context *ctx) {
// Get the execution state for the run.
std::shared_ptr<ExecutionState> executionState = executionStates_[runId];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to hold the executionStateLocksMtx_ lock while reading executionStates_ since it may be being modified by another thread. If you want to release the lock during the propagation process I think you should pass the ExecutionState* in rather than the RunIdentifier.

// Run the node using the DeviceManager.
deviceManager->runFunction(
node->name, std::move(nodeCtx),
[this, runId, node](RunIdentifierTy id, ResultCode resultCode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idis unused, but it was added specifically for this use case - looks like you have your own runId. Do we still need the RunIdenitifier generated by runFunction?

/// Locks for the execution state objects. These are used to ensure that
/// the shared state contained in the ExecutionState object for a run is
/// mutated by only one thread at a time.
std::unordered_map<RunIdentifierTy, std::mutex> executionStateLocks_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lifetime of these locks appears to be identical to the ExecutionState, why not make the lock a member of the state?

// Get the execution state for the run.
auto executionStateIt = executionStates_.find(runId);
if (executionStateIt == executionStates_.end()) {
// This should never happen. TODO: Log, assert, something.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels safer to start with an assert than not.


std::shared_ptr<ExecutionState> executionState = executionStateIt->second;

if (resultCode == ResultCode::CANCELLED || resultCode == ResultCode::FAILED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This seems clearer as if (resultCode != ResultCode::EXECUTED)

if ((executionState->inflightNodes).empty()) {
// If there are no nodes inflight, that means all nodes are done. Call
// the callback and erase the state information.
executionState->cb(runId, ResultCode::EXECUTED,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holding a lock while calling the callback is dangerous, this could deadlock if the cb calls back into the Executor.

@stale
Copy link

stale bot commented Jan 22, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@stale
Copy link

stale bot commented Jan 27, 2019

This PR has been automatically closed due to being stale for 5 days. Thank you for your contributions and feel free to reopen it in case of further progress.

@stale stale bot closed this Jan 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants