Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate workspace/allocations from graph #57

Open
jerinphilip opened this issue Sep 21, 2021 · 7 comments
Open

Separate workspace/allocations from graph #57

jerinphilip opened this issue Sep 21, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@jerinphilip
Copy link

Feature description

In the current state of browsermt/marian-dev, the concept of a workspace which manages allocation of tensors is placed behind a graph accessible to the library API bergamot-translator uses. This leads to a temporarily inefficient implementation of multiple-models handling (browsermt/bergamot-translator#210), where the workspaces grow proportional to the number of models active.

@XapaJIaMnu and @kpu have previously solved swapping multiple models by means of swapping tensors onto an active graph. This is "dynamic" and a reference implementation available at https://github.com/kpu/marian-dev/blob/dynamic_swap_mvp/src/translator/swappable.cpp. While this is doable in the case of shared-architectures without incurring much expense, a change in architecture involves reconstructing the graph (eg: tied embedding model swapped out for a non-tied embedding model).

It is optimal to keep the concept of a workspace bound to threads/workers active instead, separate the graph and architecture aside to avoid the blow-up in memory usage than what is originally required.

This issue is intended to investigate how best to make the modifications to solve the above problem in this repository.

/cc @graemenail

@jerinphilip jerinphilip added the enhancement New feature or request label Sep 21, 2021
@jerinphilip
Copy link
Author

jerinphilip commented Sep 21, 2021

Found this function, which replaces tensors_ with something supplied externally:

void reuseWorkspace(Ptr<ExpressionGraph> graph) {
tensors_ = graph->tensors_;
}

The following example snippet from translator is what eventually makes way into bergamot-translator.

auto graph = New<ExpressionGraph>(true);
auto prec = options_->get<std::vector<std::string>>("precision", {"float32"});
graph->setDefaultElementType(typeFromString(prec[0]));
graph->setDevice(device);
graph->getBackend()->configureDevice(options_);
graph->reserveWorkspaceMB(options_->get<size_t>("workspace"));
graphs_[id] = graph;
std::vector<Ptr<Scorer>> scorers;
if(options_->get<bool>("model-mmap", false)) {
scorers = createScorers(options_, model_mmaps_);
}
else {
scorers = createScorers(options_, model_items_);
}
for(auto scorer : scorers) {
scorer->init(graph);
if(shortlistGenerator_)
scorer->setShortlistGenerator(shortlistGenerator_);
}
scorers_[id] = scorers;
graph->forward();
};

This is super stateful, taking a shot at figuring what's happening below:

// creates empty vessel
auto graph = New<ExpressionGraph>(/*inference=*/true); 
auto prec = options_->get<std::vector<std::string>>("precision", {"float32"}); 
graph->setDefaultElementType(typeFromString(prec[0])); 

// configures device (CPU/GPU) and prepares a Tensors/TensorAllocator.
graph->setDevice(device); 

// GEMM config stuff.
graph->getBackend()->configureDevice(options_); 

// Grows the static storage on the Tensors/TensorAllocator and manually
// manages allocations ahead on this to minimize lot of malloc troubles.
graph->reserveWorkspaceMB(options_->get<size_t>("workspace")); 

// Could replace with a graph supplied from worker externally?
// Overwrites an already allocated something, which hasn't grown to workspace-size. 
// (Why are things like this?).
graph->reuseWorkspace(workerGraph);

// Later down the line
   scorer->init(graph); 
   // ^ Only here are the notions of an architecture (transformer, s2s etc coming). 
   // So this should be where graph is constructed. Allocations done.
   // When are these deallocated? 

The above has to be "dynamic", i.e, check if the worker already has a graph and related created for the particular TranslationModel. If not trigger the creation routine. Say a thread never sees a TranslationModel, it doesn't create unnecessary allocations in this case.

We additionally want the allocations created on this workerGraph by a given TranslationModel M to be freed when the M's lifetime is over. Figuring out lifetime seems to be tricky with New<...> and graph being provided everywhere. Can I trust whatever memory management inside to figure if something is in use or not in use and free space?

Current understanding is that scorer->init(graph) triggers the creation of architecture notions on the graph.

There is a long-term and short-term memory of which long-term corresponds to constant nodes which are not "parameters" (Can't all weights be treated thus in inference mode?). Short-term memory is everything else, so should be gradients, parameters being modified, inputs and that sort. I'm guessing keeping long-term has performance benefits.

Where are these getting freed?

The following clears everything except longterm-storage.

void clear() {
// clear everything apart from parameters and memoized nodes
count_ = 0;
nodesForward_.clear();
nodesBackward_.clear();
topNodes_.clear();
tensors_->clear();
}

I understand this to be triggered every BeamSearch call.

for(auto scorer : scorers_) {
scorer->clear(graph);
}

Marian operates always with one-model? Does this imply that the long-term created corresponding to constants or params are alive forever?

void clearLongtermMemory() { longterm_->clear(); }

The above function is available to clear everything in long-term, but is unused in source(?).

@jerinphilip
Copy link
Author

jerinphilip commented Sep 22, 2021

If I understand the graph subsystem to be an EDSL with storage using some hash-map of string-identifiers and corresponding allocated tensors, there seems to be a way to hijack creating scorers onto the same graph by modifying the fname below:

for(auto ptr : ptrs) {
std::string fname = "F" + std::to_string(i);
// load options specific for the scorer
auto modelOptions = New<Options>(options->clone());
if(!options->get<bool>("ignore-model-config")) {
YAML::Node modelYaml;
io::getYamlFromModel(modelYaml, "special:model.yml", ptr);
if(!modelYaml.IsNull()) {
LOG(info, "Loaded model config");
modelOptions->merge(modelYaml, true);
}
else {
LOG(warn, "No model settings found in model file");
}
}
scorers.push_back(scorerByType(fname, weights[i], ptr, modelOptions));

I expect all further parameters would be prefixed (namespaced) by this string

Do I create a new method to find all variable names by prefix and trigger a free along with TranslationModel destruction? Seems like an idea. I can filter for keys in the storage and manually free them?

@jerinphilip
Copy link
Author

Global config object fail. I'm guessing I have to fix API to take only parameters required (this is common to TranslationModels, I no longer have Options to create this. Going to do horror for now).

void configureDevice(Ptr<const Options> options) override {
setClip(options->get<float>("clip-gemm"));
setGemmPrecision(options);
setLegacyBatchedGemm(options->get<bool>("use-legacy-batching"));
}

@jerinphilip
Copy link
Author

jerinphilip commented Sep 22, 2021

The case of same arch models being init on same graph. If params dim match, they get loaded and all sorts of funny things can happen?

(p, params) = findParams(name, elementType, typeSpecified);
if(!params) {
params = New<Parameters>(elementType);
params->init(backend_);
paramsByElementType_.insert({elementType, params});
} else {
if(p) {
// if yes add to tape and return
ABORT_IF(shape != p->shape(),
"Requested shape {} for existing parameter '{}' does not match "
"original shape {}",
shape,
name,
p->shape());
p->setTrainable(!fixed);
add(p);
return p;
}

This can potentially be fixed by providing an alternate way to createScorers namespacing these correctly?

for(auto ptr : ptrs) {
std::string fname = "F" + std::to_string(i);
// load options specific for the scorer
auto modelOptions = New<Options>(options->clone());
if(!options->get<bool>("ignore-model-config")) {
YAML::Node modelYaml;
io::getYamlFromModel(modelYaml, "special:model.yml", ptr);
if(!modelYaml.IsNull()) {
LOG(info, "Loaded model config");
modelOptions->merge(modelYaml, true);
}
else {
LOG(warn, "No model settings found in model file");
}
}
scorers.push_back(scorerByType(fname, weights[i], ptr, modelOptions));

@jerinphilip
Copy link
Author

jerinphilip commented Sep 23, 2021

@XapaJIaMnu suspects the above namespace hijack might not be feasible due to possible AD assertions of all nodes in the graph to be connected.

In this case, the map containing variable names to tensors is with expression-graph, at:

ElementTypeParamsMap paramsByElementType_;

In this case, we can have dummy graph at each worker (workerGraph), place another graph with TranslationModel and share storage with the following function by grabbing from workerGraph?

void reuseWorkspace(Ptr<ExpressionGraph> graph) {
tensors_ = graph->tensors_;
}

Nobody appears to be using the above function, at least in master, so uncharted territory. On the bright side, we can do with this function what we want?

@graemenail mentioned some concern about this not inclusive of cache, but looks like it is:

class Tensors {
private:
Ptr<TensorAllocator> tensors_;
Ptr<TensorAllocator> cache_;

@jerinphilip
Copy link
Author

jerinphilip commented Oct 5, 2021

There is a long-term and short-term memory of which long-term corresponds to constant nodes which are not "parameters" (Can't all weights be treated thus in inference mode?).

So if inference mode, params become memoized... possibly long-term and sticks forever...

ParamNode::ParamNode(Ptr<ExpressionGraph> graph,
const Shape& shape,
const Ptr<inits::NodeInitializer>& init,
Type valueType,
bool fixed)
: Node(graph, shape, valueType),
init_(init),
initialized_(false) {
init_->setAllocator(graph->allocator());
setTrainable(!fixed);
setMemoize(graph->isInference());
}

@jerinphilip
Copy link
Author

browsermt/bergamot-translator#225 tried to share workspaces (which are identified as TensorAllocators) and ran into finding tensors active during translation which if manually cleared would corrupt translations afterwards.

While I can identify which tensors are associated with which models, I'll need to put mutexes to protect TensorAllocator to die clearing the tensors associated with a model (with workspaces associated with threads and hence races). @kpu's recommendation is to get the processed form (after conversion to intgemm or whatever), store it as const for the lifetime of the model and all threads can read-access it if need be. When the model is destructed, workers are not holding any shared_ptrs and it's safe to clear these tensors since no translation with this model is active and/or waiting.

Development is on pause until the higher powers decide the right direction/approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant