-
Notifications
You must be signed in to change notification settings - Fork 698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add host manager #2125
[WIP] Add host manager #2125
Conversation
@@ -0,0 +1,58 @@ | |||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a space in the path name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can make it one word, like runtime or ExecutionEngine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, space removed.
@gcatron Quick comment about naming. Please do not use spaces in the directory or file names. |
ONNXifi is an example for one client that runs the multi-threaded inference. Correct? We should allow other clients (for example our unit tests and image-classifier) to also run inference concurrently. |
Agreed, The description was poorly worded. This would serve as the execution entry point for any client. |
void clearHost(); | ||
/// Runs the network specified by \p networkID and \p functionName using the | ||
/// provided \p context. | ||
bool runNetwork(int networkID, llvm::StringRef functionName, Context context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where function name is coming from? how is it used?
Can you expand on what each method does in details? (e.g., what removeNetwork actually does, etc)
/// provided \p context. | ||
bool runNetwork(int networkID, llvm::StringRef functionName, Context context); | ||
HostManager(); | ||
~HostManager(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either virtual or class should be final
~HostManager(); | ||
}; | ||
|
||
} // end namespace glow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no end
here to be consistent with the rest of the codebase
Do we have a design doc shared on github? |
/// Returns true if \p networkID is already added to the host. | ||
bool networkAdded(int networkID); | ||
/// Removes all networks from the host. | ||
void clearHost(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also stops devices?
Added a description of the design to #2045, the high level task tracking this work. |
class Executor {}; // temporary the Executor header will need to be included | ||
class Context; | ||
class HostManager final { | ||
int activeCount_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why int and not unsigned? Can we have negative active tasks?
What's the relationship between host manager and the onnxifi manager that @jackm321 is adding? |
@yinghai good question, @gcatron and I were talking about about this offline a bit as well. Basically I think both of these things are fairly complementary and will be useful for their own tasks. |
class HostManager final { | ||
int activeCount_; | ||
int totalCount_; | ||
std::unordered_map<int, DependencyGraph> networks_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an overkill IMHO. Dense vector with index should be good. Better performance too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yinghai networks can be added and removed but their Ids (the keys in this map) are monotonically increasing so a vector could end up being sparse and unnecessarily wasting space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, how do we initiate removal of network? We'd probably need high level ONNXIFI support for this.
I think use case is when model needs to be refreshed with new weights. There might be other use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We remove network because high level application wants to retire certain model. From c2 level, onnxReleaseGraph
is called and corresponding graph hit refcouter==0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, agree
/// Adds the network to the host and does the necessary setup work. This | ||
/// includes partitioning, provisioning, compiling and initializing backends. | ||
/// Returns the networkID of the network. | ||
int addNetwork(Function *F); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who compiles the Function? Where is weight sharing and partitioning done?
} | ||
|
||
HostManager::~HostManager() { clearHost(); } | ||
int HostManager::addNetwork(Function *F) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any of these methods should consider thread safety.
class HostManager final { | ||
/// Count of current networks being run. | ||
std::atomic<unsigned int> activeCount_; | ||
/// Count of networks initialized on this node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"node" is kind of ambiguous here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also totalCount_
is never decremented (because it's used to make networkID
s) so it's the total number of networks ever seen on the host not the total number loaded right now but this is confusing because activeCount_
sounds like it might be this however it is the number of in flight requests not networks (above it says "Count of current networks being run." but it could be multiple requests to the same network right so that should be changed?). I think it would be better to call these activeRequestsCount_
(same thing with activeRequestsLimit_
) and something like nextNetworkId_
or something instead of totalCount_
auto result = promise.get_future(); | ||
executor_.runNetwork(networks_[networkID], functionName, context, | ||
[&promise](ResultCode id) { promise.set_value(id); }); | ||
result.wait(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is runNetwork()
blocking when everything beneath it is async? Onnxifi's runGraph()
is async and it will have to call runNetwork()
so this should probably be async also or at least have an async version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually since onnxifi has it's own thread pool this doesn't have to be async but it would be nice if we could just replace the onnxifi thread pool with this, otherwise we'll just have a bunch of onnxifi threads that are just sitting here blocking on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just chain the callback to signal the output event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use one threadpool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yinghai yeah that's what I was thinking too
if (networks_.find(networkID) == networks_.end()) { | ||
return FAILED; | ||
} | ||
if (activeCount_ > activeLimit_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the advantage of doing this over just blocking until activeCount_ < activeLimit_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought is failing out would allow the request to get routed elsewhere. Blocking until available could lead to failure because too many requests have piled up.
if (networks_.find(networkID) == networks_.end()) { | ||
return FAILED; | ||
} | ||
if (activeCount_ > activeLimit_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this should be if (activeCount_ >= activeLimit_)
int networkID = totalCount_; | ||
auto dependencyGraph = partitioner_.partition(M); | ||
auto executionDAG = provisioner_.provision(dependencyGraph); | ||
networks_.emplace(networkID, executionDAG); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have a lock around networks_
? Otherwise onnxifi layer would have to handle synchronization between calls to initGraph()
, runGraph()
etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would help it we specify which methods needs to handle concurrency in the host manager.
}; // This will likely be defined in one common runtime place. | ||
class HostManager final { | ||
/// Count of current networks being run. | ||
std::atomic<unsigned int> activeCount_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why this is atomic, but there is no locking around other structs here.
} | ||
|
||
HostManager::~HostManager() { clearHost(); } | ||
unsigned int HostManager::addNetwork(Module *M) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question, if this takes in a module, how do we reuse weights in functions across module? @qcolombet told us that weights can be shared within module but not across module.
auto result = promise.get_future(); | ||
executor_.runNetwork(networks_[networkID], functionName, context, | ||
[&promise](ResultCode id) { promise.set_value(id); }); | ||
result.wait(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just chain the callback to signal the output event.
auto result = promise.get_future(); | ||
executor_.runNetwork(networks_[networkID], functionName, context, | ||
[&promise](ResultCode id) { promise.set_value(id); }); | ||
result.wait(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use one threadpool.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
This PR has been automatically closed due to being stale for 5 days. Thank you for your contributions and feel free to reopen it in case of further progress. |
@gcatron can we resurrect this? I think once we have this in I can get moving on moving all of the onnxifi stuff over to use host manager interface instead of ExecutionEngine which would be great. |
Absolutely, I just pushed some updates, looks like I can't reopen this because I did a force push. Opened #2284. |
Description: This is a work in progress to add the HostManager to glow.
The HostManager will serve as the entry point for execution calls and handle orchestrating the rest of the runtime components.
Testing: Not Yet
Documentation: