Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DeviceManager interface and CPUDeviceManager #2249

Merged
merged 1 commit into from
Jan 16, 2019

Conversation

nickgg
Copy link
Contributor

@nickgg nickgg commented Jan 9, 2019

Description: A second prototype of the DeviceManager, which is the interface to a single inference Device. The interface supports three operations:

** addNetwork() - load an existing compiled module (or sub module) onto the device ready to be executed.
** evictNetwork() - unload a module from the device and from memory.
** runFunction() - run a Function (subgraph) with the provided context and callback with results.

The DeviceManager operates in it's own thread, which for the CPU backend implemented here also does compilation and running of the function.

Testing: new tests, parallelized example in a follow up.
Documentation:
Part of the high level Runtime issue #2045.

@opti-mix
Copy link
Contributor

opti-mix commented Jan 9, 2019

@nickgg Quick question: What happens if someone would call evictNetwork while this network is running because it was launched by runFunction? Do we need an API to cancel the execution of the network before it finishes on its own?

@nickgg
Copy link
Contributor Author

nickgg commented Jan 9, 2019

@opti-mix I think we should keep the interface here as simple as possible, I'm not sure that all devices Glow supports will support being able to interrupt a running inference, so we'd be setting ourselves up for failure by supporting it in the DeviceManger layer. Do you know a use case that would require this?

The basic idea here is that we're serializing all calls into a single queue, and so the evict would take place after previous runs but prevent future runs.

@nickgg
Copy link
Contributor Author

nickgg commented Jan 9, 2019

That ASAN failure doesn't seem related, the alloc & overrun both happen within the call to compile(). Does anyone know about this, if not i'll look into it.

return maxMemoryBytes >= (usedMemoryBytes + estimate);
}

void CPUDeviceManager::addNetworkImpl(const Module *module,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Module or Function? We are loading a specific function to be executed. Np?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're loading a set of functions (which we expect to share constants, e.g. the same graph with different batch sizes). The FunctionMap contains CompiledFunctions, but to collect the constants we need access back to the Placeholders which are owned by the module.

#include <chrono>
#include <future>

using namespace glow;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great unit tests!

@nickgg
Copy link
Contributor Author

nickgg commented Jan 11, 2019

Added RuntimeTypes, fixed some formatting. Didn't attack the header include part yet.

@nickgg nickgg force-pushed the deviceManger2 branch 3 times, most recently from 4e85dd1 to b30b3c1 Compare January 11, 2019 18:33
@nickgg
Copy link
Contributor Author

nickgg commented Jan 11, 2019

Pulled out implementations to a QueueBackedDeviceMangerhelper class.

Copy link
Contributor

@rdzhabarov rdzhabarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and comments.

include/glow/Backends/DeviceManager.h Outdated Show resolved Hide resolved
include/glow/Backends/DeviceManager.h Show resolved Hide resolved
include/glow/Backends/DeviceManager.h Outdated Show resolved Hide resolved
include/glow/Backends/DeviceManager.h Outdated Show resolved Hide resolved
include/glow/Backends/DeviceManager.h Outdated Show resolved Hide resolved

/// Remove (and delete) the provided network and all it's functions, freeing
/// up space on the device.
virtual void evictNetwork(const Module *module) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we know if eviction succeeded? (e.g. stats/logging etc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could generate the event/log from within the device manager.

include/glow/Backends/QueueBackedDeviceManager.h Outdated Show resolved Hide resolved
include/glow/Backends/QueueBackedDeviceManager.h Outdated Show resolved Hide resolved

/// Remove (and delete) the provided network and all it's functions, freeing
/// up space on the device.
void evictNetwork(const Module *module);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can module be nullptr? if not const Module &module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep it consistent with addNetwork and the ReadyCB, and since those are used to do comparisons I think better to keep it as pointer (ie. do a ptr compare, not an operator==).

Honestly I think the raw ptrs here should be shared_ptrs.

include/glow/Backends/QueueBackedDeviceManager.h Outdated Show resolved Hide resolved
@nickgg nickgg force-pushed the deviceManger2 branch 2 times, most recently from 71760ef to addd995 Compare January 14, 2019 20:05
@nickgg
Copy link
Contributor Author

nickgg commented Jan 15, 2019

@rdzhabarov are you satisfied with those explanations or do you want me to change the interface?

Copy link
Contributor

@bertmaher bertmaher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time to unblock this :-)

Copy link
Contributor

@rdzhabarov rdzhabarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickgg yes, thanks!

include/glow/Runtime/RuntimeTypes.h Outdated Show resolved Hide resolved
lib/Backends/CPU/CMakeLists.txt Outdated Show resolved Hide resolved
include/glow/Backends/DeviceManager.h Show resolved Hide resolved
@nickgg
Copy link
Contributor Author

nickgg commented Jan 16, 2019

Fixed remaining comments and will commit, build permitting.

@nickgg nickgg merged commit d34207a into pytorch:master Jan 16, 2019
@ayermolo
Copy link
Contributor

@nickgg nickgg deleted the deviceManger2 branch February 5, 2019 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants