This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Dev weight sharing #568

Merged
Changes from 92 commits
Commits
93 commits
fb8bc25
add pycharm project files to .gitignore list
leckie-chn Oct 9, 2018
0bf454c
update pylintrc to conform vscode settings
leckie-chn Oct 11, 2018
7f0a418
Merge remote-tracking branch 'upstream/master'
leckie-chn Oct 11, 2018
9fce9bb
Merge remote-tracking branch 'upstream/master'
leckie-chn Oct 17, 2018
73b9b58
Merge remote-tracking branch 'upstream/master'
Oct 18, 2018
d83aa70
Merge remote-tracking branch 'upstream/master'
leckie-chn Oct 19, 2018
16be159
Merge remote-tracking branch 'upstream/master'
leckie-chn Oct 25, 2018
69466b8
fix RemoteMachineMode for wrong trainingServicePlatform
leckie-chn Oct 25, 2018
f1f8339
Merge branch 'master' of https://github.com/leckie-chn/nni
leckie-chn Oct 25, 2018
7a1be57
simple weight sharing
leckie-chn Oct 25, 2018
99b1402
update gitignore file
leckie-chn Oct 25, 2018
253068d
change tuner codedir to relative path
leckie-chn Oct 25, 2018
2faac01
add python cache files to gitignore list
leckie-chn Oct 26, 2018
e412cb5
move extract scalar reward logic from dispatcher to tuner
leckie-chn Oct 26, 2018
e159afb
update tuner code corresponding to last commit
leckie-chn Oct 26, 2018
d664f4f
update doc for receive_trial_result api change
leckie-chn Oct 26, 2018
b5d5f2e
add numpy to package whitelist of pylint
leckie-chn Oct 26, 2018
3b280cd
distinguish param value from return reward for tuner.extract_scalar_r…
leckie-chn Oct 26, 2018
a384da0
update pylintrc
leckie-chn Oct 29, 2018
cd50908
Merge remote-tracking branch 'upstream/master' into dev-mac-support
Oct 29, 2018
5b1320a
add comments to dispatcher.handle_report_metric_data
leckie-chn Oct 29, 2018
3aee412
update install for mac support
Oct 29, 2018
60db733
fix root mode bug on Makefile
triflame92 Oct 29, 2018
094436d
Quick fix bug: nnictl port value error (#245)
SparkSnail Oct 18, 2018
8fca02e
Dev exp stop more (#221)
QuanluZhang Oct 18, 2018
cbc808b
update Makefile (#246)
Crysple Oct 18, 2018
ce17fa3
quick fix for ci (#248)
Crysple Oct 18, 2018
e337541
add update trialNum and fix bugs (#261)
Crysple Oct 23, 2018
07d51dd
Add builtin tuner to CI (#247)
Crysple Oct 23, 2018
0fbe564
Doc refactor (#258)
scarlett2018 Oct 23, 2018
e35f96d
Refactor nnictl to support listing stopped experiments. (#256)
SparkSnail Oct 23, 2018
71dc1ca
Show experiment parameters more beautifully (#262)
lvybriage Oct 24, 2018
5c65cef
fix error on example of RemoteMachineMode (#269)
leckie-chn Oct 25, 2018
07fe4ef
Update docker file to use latest nni release (#263)
chicm-ms Oct 25, 2018
dc688b8
fix bug about execDuration and endTime (#270)
QuanluZhang Oct 26, 2018
f8b131c
Refactor dockerfile (#264)
SparkSnail Oct 26, 2018
ec0c1d5
Support nnictl tensorboard (#268)
SparkSnail Oct 26, 2018
95d8666
Sdk update (#272)
chicm-ms Oct 26, 2018
a3b60cc
add experiment log path to experiment profile (#276)
chicm-ms Oct 30, 2018
a550686
Merge branch 'dev-mac-support' into dev-mac-support
leckie-chn Oct 30, 2018
d4c383a
refactor extract reward from dict by tuner
leckie-chn Oct 31, 2018
4ba1916
Merge remote-tracking branch 'upstream/master'
leckie-chn Nov 1, 2018
641f5d7
Merge remote-tracking branch 'upstream/master'
leckie-chn Nov 1, 2018
78e4209
Merge remote-tracking branch 'upstream/master'
leckie-chn Nov 8, 2018
909d9d9
Merge remote-tracking branch 'upstream/master'
leckie-chn Nov 12, 2018
221e951
Merge remote-tracking branch 'upstream/master'
leckie-chn Nov 13, 2018
cca83a2
Merge branch 'master' into dev-mac-support
leckie-chn Nov 13, 2018
5d001f7
Merge pull request #1 from leckie-chn/dev-mac-support
leckie-chn Nov 13, 2018
dd12229
Merge pull request #1 from Microsoft/master
chicm-ms Nov 13, 2018
8f696ac
update Makefile for mac support, wait for aka.ms support
triflame92 Nov 13, 2018
df36c1f
Merge remote-tracking branch 'upstream/master'
leckie-chn Nov 14, 2018
bedc6fd
Merge branch 'master' into dev-weight-sharing
leckie-chn Nov 14, 2018
3583b52
refix Makefile for colorful echo
leckie-chn Nov 14, 2018
8aeea2e
Merge branch 'master' into dev-weight-sharing
leckie-chn Nov 14, 2018
b9cdde5
unversion config.yml with machine information
leckie-chn Nov 14, 2018
ae979c9
sync graph.py between tuners & trial of ga_squad
leckie-chn Nov 16, 2018
001ecd5
sync graph.py between tuners & trial of ga_squad
leckie-chn Nov 16, 2018
a67a6b8
Merge pull request #2 from Microsoft/master
chicm-ms Nov 16, 2018
75fd2f1
Merge pull request #3 from Microsoft/master
chicm-ms Nov 19, 2018
10e998f
Merge pull request #4 from Microsoft/master
chicm-ms Nov 27, 2018
a0f361c
Merge pull request #5 from Microsoft/master
chicm-ms Nov 27, 2018
76d7142
Merge pull request #6 from Microsoft/master
chicm-ms Nov 28, 2018
bc10bf7
Merge pull request #7 from Microsoft/master
chicm-ms Nov 29, 2018
d76deb3
Merge pull request #8 from Microsoft/master
chicm-ms Dec 4, 2018
af1137c
copy weight shared ga_squad under weight_sharing folder
leckie-chn Dec 10, 2018
7086cb5
mv ga_squad code back to master
leckie-chn Dec 10, 2018
c20abca
Merge remote-tracking branch 'upstream/dev-weight-sharing_1' into dev…
leckie-chn Dec 10, 2018
a0fd7ed
simple tuner & trial ready
leckie-chn Dec 10, 2018
0ffc4b4
Fix nnictl multiThread option
chicm-ms Dec 11, 2018
b3a1131
Merge remote-tracking branch 'chicm/fix_multithread' into dev-weight-…
leckie-chn Dec 11, 2018
4788ad5
weight sharing with async dispatcher simple example ready
leckie-chn Dec 11, 2018
efac915
update for ga_squad
leckie-chn Dec 17, 2018
56105e0
fix bug
leckie-chn Dec 21, 2018
93b8afe
modify multihead attention name
leckie-chn Dec 24, 2018
c9c64e5
add min_layer_num to Graph
leckie-chn Dec 24, 2018
adb9b40
fix bug
leckie-chn Dec 24, 2018
9c15d1a
update share id calc
leckie-chn Dec 25, 2018
041099b
fix bug
leckie-chn Dec 25, 2018
e3ee26f
add save logging
leckie-chn Dec 25, 2018
a9b7e58
fix ga_squad tuner bug
leckie-chn Dec 26, 2018
b5675fa
sync bug fix for ga_squad tuner
leckie-chn Dec 26, 2018
db1d96f
fix same hash_id bug
leckie-chn Dec 27, 2018
20586e3
add lock to simple tuner in weight sharing
leckie-chn Jan 2, 2019
5228657
Add readme to simple weight sharing
leckie-chn Jan 2, 2019
3a0176e
update
leckie-chn Jan 2, 2019
a9f5d62
update
leckie-chn Jan 2, 2019
c6098ed
add paper link
leckie-chn Jan 2, 2019
2469f48
update
leckie-chn Jan 2, 2019
61bc21c
reformat with autopep8
leckie-chn Jan 4, 2019
85d3076
add documentation for weight sharing
leckie-chn Jan 4, 2019
9fc984e
test for weight sharing
leckie-chn Jan 4, 2019
ddbcead
delete irrelevant files
leckie-chn Jan 4, 2019
500852e
move details of weight sharing in to code comments
leckie-chn Jan 7, 2019
111 changes: 111 additions & 0 deletions docs/AdvancedNAS.md
@@ -0,0 +1,111 @@
# Tutorial for Advanced Neural Architecture Search
Currently many NAS algorithms leverage **weight sharing** among trials to accelerate training. For example, [ENAS][1] delivers a 1000x efficiency improvement via '_parameter sharing between child models_', compared with the earlier [NASNet][2] algorithm. Other NAS algorithms, such as [DARTS][3], [Network Morphism][4], and [Evolution][5], also leverage, or have the potential to leverage, weight sharing.
This is a tutorial on how to enable weight sharing in NNI. The example is based on [Neural Architecture Search for Reading Comprehension](../examples/trials/ga_squad/) and is placed [here](../examples/trials/weight_sharing/ga_squad).

## Weight Sharing among trials
Currently we recommend sharing weights through NFS (Network File System), which supports sharing files across machines and is lightweight and (relatively) efficient. We also welcome contributions from the community on more efficient techniques.

### NFS Setup
In NFS, files are physically stored on a server machine, and trials on the client machine can read/write those files in the same way that they access local files.

#### Install NFS on server machine
First, install NFS server:
```bash
sudo apt-get install nfs-kernel-server
```
Suppose `/tmp/nni/shared` is used as the physical storage; then run:
```bash
sudo mkdir -p /tmp/nni/shared
echo "/tmp/nni/shared *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
```
You can check whether the directory is exported successfully with `sudo showmount -e localhost`.

#### Install NFS on client machine
First, install NFS client:
```bash
sudo apt-get install nfs-common
```
Then create the mount point and mount the shared directory:
```bash
sudo mkdir -p /mnt/nfs/nni/
sudo mount -t nfs 10.10.10.10:/tmp/nni/shared /mnt/nfs/nni
```
where `10.10.10.10` should be replaced by the real IP of the NFS server machine in practice.

### Example code for trial
In our example, we assign each layer a `hash_id` that identifies whether a previously trained model weight is sharable, and construct the TensorFlow graph using `hash_id` as the variable scope name:
```python
with tf.variable_scope(p_graph.layers[i].hash_id, reuse=tf.AUTO_REUSE):
# generate tensorflow operators for p_graph.layers[i]
...
```
With the hashes of all sharable layers fed in as the `shared_id` hyper-parameter, we can automatically initialize every sharable layer from the previously trained model:
```python
tf.train.init_from_checkpoint(param['restore_path'], dict(zip(param['shared_id'], param['shared_id'])))
```
Here `param` is retrieved from the customized tuner with `nni.get_next_parameter()`. An example configuration is shown as follows:
```json
{
"shared_id": [
"4a11b2ef9cb7211590dfe81039b27670",
"370af04de24985e5ea5b3d72b12644c9",
"11f646e9f650f5f3fedc12b6349ec60f",
"0604e5350b9c734dd2d770ee877cfb26",
"6dbeb8b022083396acb721267335f228",
"ba55380d6c84f5caeb87155d1c5fa654"
],
"graph": {
"layers": [
...
{
"hash_id": "ba55380d6c84f5caeb87155d1c5fa654",
"is_delete": false,
"size": "x",
"graph_type": 0,
"output": [
6
],
"output_size": 1,
"input": [
7,
1
],
"input_size": 2
},
...
]
},
"restore_dir": "/mnt/nfs/nni/ga_squad/87",
"save_dir": "/mnt/nfs/nni/ga_squad/95"
}
```
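Tying the trial-side pieces together, a hedged sketch is shown below. The key names follow the example configuration above, and `build_graph_from_config` is a hypothetical helper standing in for the example's graph construction code:
```python
import os

import tensorflow as tf
import nni

param = nni.get_next_parameter()

# build the TensorFlow graph; sharable layers use their hash_id as variable scope
build_graph_from_config(param['graph'])  # hypothetical helper, not part of the example

# initialize sharable layers from the parent trial's checkpoint on NFS, if any
if param.get('shared_id'):
    assignment_map = dict(zip(param['shared_id'], param['shared_id']))
    tf.train.init_from_checkpoint(param['restore_dir'], assignment_map)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop ...
    # save this trial's weights where child trials can restore them
    saver = tf.train.Saver()
    saver.save(sess, os.path.join(param['save_dir'], 'model'))
```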

### Tuner customization for sharing policy
We recommend implementing the sharing policy in a customized tuner through the calculation of `Layer.hash_id`. In our example, a layer is sharable if and only if the configurations of the layer itself and of all its preceding layers are unchanged. For details, see the `Layer.update_hash` and `Graph.update_hash` functions in [graph.py](../examples/tuners/weight_sharing/ga_customer_tuner/graph.py).
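As a rough illustration of such a policy (a sketch only, not the actual `graph.py` implementation), a layer's hash can be derived from its own configuration chained with the hashes of its input layers, so that any upstream change yields a new `hash_id` and disables sharing:
```python
import hashlib
import json

def layer_hash(layer_config, input_hashes):
    """Sketch: hash a layer's own configuration together with the hashes
    of the layers feeding into it, so the id changes whenever the layer
    or anything upstream changes."""
    payload = json.dumps(
        {"config": layer_config, "inputs": sorted(input_hashes)},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Example: the second layer's hash depends on the first layer's hash,
# so changing layer 0 also invalidates weight sharing for layer 1.
h0 = layer_hash({"graph_type": 0, "size": "x"}, [])
h1 = layer_hash({"graph_type": 1, "size": "y"}, [h0])
```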


## Asynchronous Dispatcher Mode for trial dependency control
Weight sharing coordinates trials that may run on different machines, and most of the time **read-after-write** consistency must be assured: a child model should not load the parent model's weights before the parent trial finishes training. To deal with this, users can enable **asynchronous dispatcher mode** by setting `multiThread: true` in NNI's `config.yml`. In this mode the dispatcher assigns a tuner thread each time a `NEW_TRIAL` request comes in, and the tuner thread can decide when to submit a new trial by blocking and unblocking itself. For example:
```python
def generate_parameters(self, parameter_id):
    self.thread_lock.acquire()
    indiv = ...  # generate the configuration for a new trial (details omitted)
    self.events[parameter_id] = threading.Event()
    self.thread_lock.release()
    # block until the parent trial, if any, has reported its result
    if indiv.parent_id is not None:
        self.events[indiv.parent_id].wait()

def receive_trial_result(self, parameter_id, parameters, reward):
    self.thread_lock.acquire()
    # code for processing trial results
    self.thread_lock.release()
    # unblock any child trial waiting for this one to finish
    self.events[parameter_id].set()
```
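For context, here is a minimal sketch of the tuner state that the two methods above assume (the lock and the per-trial events). The class name is illustrative, not the actual `CustomerTuner`:
```python
import threading

class WeightSharingTunerSketch:
    """Illustrative skeleton only; the real tuner lives in ga_customer_tuner."""

    def __init__(self):
        # guards tuner state that is accessed from multiple dispatcher threads
        self.thread_lock = threading.Lock()
        # maps parameter_id -> threading.Event, set once that trial reports its result
        self.events = {}
```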


Contributor (review comment): shall we add the link to the simple example in the test here?

[1]: https://arxiv.org/abs/1802.03268
[2]: https://arxiv.org/abs/1707.07012
[3]: https://arxiv.org/abs/1806.09055
[4]: https://arxiv.org/abs/1806.10282
[5]: https://arxiv.org/abs/1703.01041
6 changes: 3 additions & 3 deletions examples/trials/ga_squad/trial.py
@@ -338,7 +338,7 @@ def train_with_graph(graph, qp_pairs, dev_qp_pairs):
answers = generate_predict_json(
position1, position2, ids, contexts)
if save_path is not None:
with open(save_path + 'epoch%d.prediction' % epoch, 'w') as file:
with open(os.path.join(save_path, 'epoch%d.prediction' % epoch), 'w') as file:
json.dump(answers, file)
else:
answers = json.dumps(answers)
@@ -359,8 +359,8 @@ def train_with_graph(graph, qp_pairs, dev_qp_pairs):
bestacc = acc

if save_path is not None:
saver.save(sess, save_path + 'epoch%d.model' % epoch)
with open(save_path + 'epoch%d.score' % epoch, 'wb') as file:
saver.save(sess, os.path.join(save_path, 'epoch%d.model' % epoch))
with open(os.path.join(save_path, 'epoch%d.score' % epoch), 'wb') as file:
pickle.dump(
(position1, position2, ids, contexts), file)
logger.debug('epoch %d acc %g bestacc %g' %
171 changes: 171 additions & 0 deletions examples/trials/weight_sharing/ga_squad/attention.py
@@ -0,0 +1,171 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge,
# to any person obtaining a copy of this software and associated
# documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

import math

import tensorflow as tf
from tensorflow.python.ops.rnn_cell_impl import RNNCell


def _get_variable(variable_dict, name, shape, initializer=None, dtype=tf.float32):
if name not in variable_dict:
variable_dict[name] = tf.get_variable(
name=name, shape=shape, initializer=initializer, dtype=dtype)
return variable_dict[name]


class DotAttention:
'''
DotAttention
'''

def __init__(self, name,
hidden_dim,
is_vanilla=True,
is_identity_transform=False,
need_padding=False):
self._name = '/'.join([name, 'dot_att'])
self._hidden_dim = hidden_dim
self._is_identity_transform = is_identity_transform
self._need_padding = need_padding
self._is_vanilla = is_vanilla
self._var = {}

@property
def is_identity_transform(self):
return self._is_identity_transform

@property
def is_vanilla(self):
return self._is_vanilla

@property
def need_padding(self):
return self._need_padding

@property
def hidden_dim(self):
return self._hidden_dim

@property
def name(self):
return self._name

@property
def var(self):
return self._var

def _get_var(self, name, shape, initializer=None):
with tf.variable_scope(self.name):
return _get_variable(self.var, name, shape, initializer)

def _define_params(self, src_dim, tgt_dim):
hidden_dim = self.hidden_dim
self._get_var('W', [src_dim, hidden_dim])
if not self.is_vanilla:
self._get_var('V', [src_dim, hidden_dim])
if self.need_padding:
self._get_var('V_s', [src_dim, src_dim])
self._get_var('V_t', [tgt_dim, tgt_dim])
if not self.is_identity_transform:
self._get_var('T', [tgt_dim, src_dim])
self._get_var('U', [tgt_dim, hidden_dim])
self._get_var('b', [1, hidden_dim])
self._get_var('v', [hidden_dim, 1])

def get_pre_compute(self, s):
'''
:param s: [src_sequence_length, batch_size, src_dim]
:return: [src_sequence_length, batch_size, hidden_dim]
'''
hidden_dim = self.hidden_dim
src_dim = s.get_shape().as_list()[-1]
assert src_dim is not None, 'src dim must be defined'
W = self._get_var('W', shape=[src_dim, hidden_dim])
b = self._get_var('b', shape=[1, hidden_dim])
return tf.tensordot(s, W, [[2], [0]]) + b

def get_prob(self, src, tgt, mask, pre_compute, return_logits=False):
'''
:param src: [src_sequence_length, batch_size, src_dim]
:param tgt: [batch_size, tgt_dim] or [tgt_sequence_length, batch_size, tgt_dim]
:param mask: [src_sequence_length, batch_size]\
or [tgt_sequence_length, src_sequence_length, batch_size]
:param pre_compute: [src_sequence_length, batch_size, hidden_dim]
:return: [src_sequence_length, batch_size]\
or [tgt_sequence_length, src_sequence_length, batch_size]
'''
s_shape = src.get_shape().as_list()
h_shape = tgt.get_shape().as_list()
src_dim = s_shape[-1]
tgt_dim = h_shape[-1]
assert src_dim is not None, 'src dimension must be defined'
assert tgt_dim is not None, 'tgt dimension must be defined'

self._define_params(src_dim, tgt_dim)

if len(h_shape) == 2:
tgt = tf.expand_dims(tgt, 0)
if pre_compute is None:
pre_compute = self.get_pre_compute(src)

buf0 = pre_compute
buf1 = tf.tensordot(tgt, self.var['U'], axes=[[2], [0]])
buf2 = tf.tanh(tf.expand_dims(buf0, 0) + tf.expand_dims(buf1, 1))

if not self.is_vanilla:
xh1 = tgt
xh2 = tgt
s1 = src
if self.need_padding:
xh1 = tf.tensordot(xh1, self.var['V_t'], 1)
xh2 = tf.tensordot(xh2, self.var['S_t'], 1)
s1 = tf.tensordot(s1, self.var['V_s'], 1)
if not self.is_identity_transform:
xh1 = tf.tensordot(xh1, self.var['T'], 1)
xh2 = tf.tensordot(xh2, self.var['T'], 1)
buf3 = tf.expand_dims(s1, 0) * tf.expand_dims(xh1, 1)
buf3 = tf.tanh(tf.tensordot(buf3, self.var['V'], axes=[[3], [0]]))
buf = tf.reshape(tf.tanh(buf2 + buf3), shape=tf.shape(buf3))
else:
buf = buf2
v = self.var['v']
e = tf.tensordot(buf, v, [[3], [0]])
e = tf.squeeze(e, axis=[3])
tmp = tf.reshape(e + (mask - 1) * 10000.0, shape=tf.shape(e))
prob = tf.nn.softmax(tmp, 1)
if len(h_shape) == 2:
prob = tf.squeeze(prob, axis=[0])
tmp = tf.squeeze(tmp, axis=[0])
if return_logits:
return prob, tmp
return prob

def get_att(self, s, prob):
'''
:param s: [src_sequence_length, batch_size, src_dim]
:param prob: [src_sequence_length, batch_size]\
or [tgt_sequence_length, src_sequence_length, batch_size]
:return: [batch_size, src_dim] or [tgt_sequence_length, batch_size, src_dim]
'''
buf = s * tf.expand_dims(prob, axis=-1)
att = tf.reduce_sum(buf, axis=-3)
return att
31 changes: 31 additions & 0 deletions examples/trials/weight_sharing/ga_squad/config_remote.yml
@@ -0,0 +1,31 @@
authorName: default
experimentName: ga_squad_weight_sharing
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 200
#choice: local, remote, pai
trainingServicePlatform: remote
#choice: true, false
useAnnotation: false
multiThread: true
tuner:
codeDir: ../../../tuners/weight_sharing/ga_customer_tuner
classFileName: customer_tuner.py
className: CustomerTuner
classArgs:
optimize_mode: maximize
population_size: 32
save_dir_root: /mnt/nfs/nni/ga_squad
trial:
command: python3 trial.py --input_file /mnt/nfs/nni/train-v1.1.json --dev_file /mnt/nfs/nni/dev-v1.1.json --max_epoch 1 --embedding_file /mnt/nfs/nni/glove.6B.300d.txt
codeDir: .
gpuNum: 1
machineList:
- ip: remote-ip-0
port: 8022
username: root
passwd: screencast
- ip: remote-ip-1
port: 8022
username: root
passwd: screencast