Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add host manager #2125

Closed
wants to merge 7 commits into from
Closed

Conversation

gcatron
Copy link
Contributor

@gcatron gcatron commented Dec 5, 2018

Description: This is a work in progress to add the HostManager to glow.
The HostManager will serve as the entry point for execution calls and handle orchestrating the rest of the runtime components.
Testing: Not Yet
Documentation:

@@ -0,0 +1,58 @@
/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a space in the path name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make it one word, like runtime or ExecutionEngine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, space removed.

@opti-mix
Copy link
Contributor

opti-mix commented Dec 5, 2018

@gcatron Quick comment about naming. Please do not use spaces in the directory or file names.

@nadavrot
Copy link
Contributor

nadavrot commented Dec 5, 2018

ONNXifi is an example for one client that runs the multi-threaded inference. Correct? We should allow other clients (for example our unit tests and image-classifier) to also run inference concurrently.

@gcatron
Copy link
Contributor Author

gcatron commented Dec 5, 2018

ONNXifi is an example for one client that runs the multi-threaded inference. Correct? We should allow other clients (for example our unit tests and image-classifier) to also run inference concurrently.

Agreed, The description was poorly worded. This would serve as the execution entry point for any client.

void clearHost();
/// Runs the network specified by \p networkID and \p functionName using the
/// provided \p context.
bool runNetwork(int networkID, llvm::StringRef functionName, Context context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where function name is coming from? how is it used?

Can you expand on what each method does in details? (e.g., what removeNetwork actually does, etc)

/// provided \p context.
bool runNetwork(int networkID, llvm::StringRef functionName, Context context);
HostManager();
~HostManager();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either virtual or class should be final

~HostManager();
};

} // end namespace glow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no end here to be consistent with the rest of the codebase

@rdzhabarov
Copy link
Contributor

rdzhabarov commented Dec 5, 2018

Do we have a design doc shared on github?

/// Returns true if \p networkID is already added to the host.
bool networkAdded(int networkID);
/// Removes all networks from the host.
void clearHost();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also stops devices?

@SplitInfinity SplitInfinity changed the title Add host manager [WIP] Add host manager Dec 5, 2018
@gcatron
Copy link
Contributor Author

gcatron commented Dec 5, 2018

Added a description of the design to #2045, the high level task tracking this work.

class Executor {}; // temporary the Executor header will need to be included
class Context;
class HostManager final {
int activeCount_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why int and not unsigned? Can we have negative active tasks?

@yinghai
Copy link
Contributor

yinghai commented Dec 6, 2018

What's the relationship between host manager and the onnxifi manager that @jackm321 is adding?

@jackm321
Copy link
Contributor

jackm321 commented Dec 6, 2018

@yinghai good question, @gcatron and I were talking about about this offline a bit as well. Basically I think both of these things are fairly complementary and will be useful for their own tasks.
GlowOnnxifiManager will continue to manage things across the Onnxifi interface (especially things like onnxifi Events that we don't want to pull further into glow) and probably should continue to manage onnxifi Graphs too because it needs to manage not just the underlying glow graph (though this graph will probably be replaced with just the id returned from HostManager:: addNetwork) but it also will need to manage the context which can be updated by calls to onnxSetGraphIO(). Then a call to onnxRunGraph() become as simple as taking the onnxifi Graph and calling HostManager ::runNetwork() with the networkID generated by HostManager:: addNetwork and the context that it has stored for this graph in GlowOnnxifiManager.

class HostManager final {
int activeCount_;
int totalCount_;
std::unordered_map<int, DependencyGraph> networks_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an overkill IMHO. Dense vector with index should be good. Better performance too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yinghai networks can be added and removed but their Ids (the keys in this map) are monotonically increasing so a vector could end up being sparse and unnecessarily wasting space

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, how do we initiate removal of network? We'd probably need high level ONNXIFI support for this.
I think use case is when model needs to be refreshed with new weights. There might be other use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We remove network because high level application wants to retire certain model. From c2 level, onnxReleaseGraph is called and corresponding graph hit refcouter==0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agree

/// Adds the network to the host and does the necessary setup work. This
/// includes partitioning, provisioning, compiling and initializing backends.
/// Returns the networkID of the network.
int addNetwork(Function *F);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who compiles the Function? Where is weight sharing and partitioning done?

}

HostManager::~HostManager() { clearHost(); }
int HostManager::addNetwork(Function *F) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any of these methods should consider thread safety.

@yinghai
Copy link
Contributor

yinghai commented Dec 6, 2018

@jackm321 @gcatron Thanks for the explanation. Let's put onnxifi manager to the glow runtime design doc so that we can see where it sits.

class HostManager final {
/// Count of current networks being run.
std::atomic<unsigned int> activeCount_;
/// Count of networks initialized on this node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"node" is kind of ambiguous here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also totalCount_ is never decremented (because it's used to make networkIDs) so it's the total number of networks ever seen on the host not the total number loaded right now but this is confusing because activeCount_ sounds like it might be this however it is the number of in flight requests not networks (above it says "Count of current networks being run." but it could be multiple requests to the same network right so that should be changed?). I think it would be better to call these activeRequestsCount_ (same thing with activeRequestsLimit_) and something like nextNetworkId_ or something instead of totalCount_

auto result = promise.get_future();
executor_.runNetwork(networks_[networkID], functionName, context,
[&promise](ResultCode id) { promise.set_value(id); });
result.wait();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is runNetwork() blocking when everything beneath it is async? Onnxifi's runGraph() is async and it will have to call runNetwork() so this should probably be async also or at least have an async version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually since onnxifi has it's own thread pool this doesn't have to be async but it would be nice if we could just replace the onnxifi thread pool with this, otherwise we'll just have a bunch of onnxifi threads that are just sitting here blocking on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just chain the callback to signal the output event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use one threadpool.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yinghai yeah that's what I was thinking too

if (networks_.find(networkID) == networks_.end()) {
return FAILED;
}
if (activeCount_ > activeLimit_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of doing this over just blocking until activeCount_ < activeLimit_?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought is failing out would allow the request to get routed elsewhere. Blocking until available could lead to failure because too many requests have piled up.

if (networks_.find(networkID) == networks_.end()) {
return FAILED;
}
if (activeCount_ > activeLimit_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this should be if (activeCount_ >= activeLimit_)

int networkID = totalCount_;
auto dependencyGraph = partitioner_.partition(M);
auto executionDAG = provisioner_.provision(dependencyGraph);
networks_.emplace(networkID, executionDAG);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a lock around networks_? Otherwise onnxifi layer would have to handle synchronization between calls to initGraph(), runGraph() etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would help it we specify which methods needs to handle concurrency in the host manager.

}; // This will likely be defined in one common runtime place.
class HostManager final {
/// Count of current networks being run.
std::atomic<unsigned int> activeCount_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand why this is atomic, but there is no locking around other structs here.

}

HostManager::~HostManager() { clearHost(); }
unsigned int HostManager::addNetwork(Module *M) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, if this takes in a module, how do we reuse weights in functions across module? @qcolombet told us that weights can be shared within module but not across module.

auto result = promise.get_future();
executor_.runNetwork(networks_[networkID], functionName, context,
[&promise](ResultCode id) { promise.set_value(id); });
result.wait();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just chain the callback to signal the output event.

auto result = promise.get_future();
executor_.runNetwork(networks_[networkID], functionName, context,
[&promise](ResultCode id) { promise.set_value(id); });
result.wait();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use one threadpool.

@stale
Copy link

stale bot commented Jan 11, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@stale
Copy link

stale bot commented Jan 16, 2019

This PR has been automatically closed due to being stale for 5 days. Thank you for your contributions and feel free to reopen it in case of further progress.

@stale stale bot closed this Jan 16, 2019
@jackm321
Copy link
Contributor

@gcatron can we resurrect this? I think once we have this in I can get moving on moving all of the onnxifi stuff over to use host manager interface instead of ExecutionEngine which would be great.

@gcatron
Copy link
Contributor Author

gcatron commented Jan 19, 2019

@gcatron can we resurrect this? I think once we have this in I can get moving on moving all of the onnxifi stuff over to use host manager interface instead of ExecutionEngine which would be great.

Absolutely, I just pushed some updates, looks like I can't reopen this because I did a force push. Opened #2284.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants