This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

2bit gradient compression #8662

Merged: 262 commits, Nov 19, 2017
Commits (262)
407c01a
update two bit compression
aksnzhy Aug 29, 2017
8cbb7f6
Update trainer.py
aksnzhy Aug 29, 2017
0dd1874
Update test_operator.py
aksnzhy Aug 29, 2017
bbd21e4
update two bit compression
aksnzhy Aug 31, 2017
72640e9
update two bit compression
aksnzhy Aug 31, 2017
aaafa84
update two bit compression
aksnzhy Aug 31, 2017
2d85430
update
aksnzhy Aug 31, 2017
5a99e6a
update
aksnzhy Sep 1, 2017
861fca5
update two bit compression
aksnzhy Sep 1, 2017
54c6f06
update two bit compression
aksnzhy Sep 1, 2017
03e47a4
update two bit compression
aksnzhy Sep 1, 2017
fedd4b4
update two bit compression
aksnzhy Sep 5, 2017
13ff1bc
update two bit compression
aksnzhy Sep 5, 2017
b75d7ca
update two bit compression
aksnzhy Sep 5, 2017
b84b762
update two bit compression
aksnzhy Sep 5, 2017
260b606
update two bit compression
aksnzhy Sep 5, 2017
7d78e3a
update two bit compression
aksnzhy Sep 5, 2017
1b550eb
Merge branch 'master' into master
aksnzhy Sep 6, 2017
2a90dae
update two bit compression
aksnzhy Sep 6, 2017
baba1d8
update two bit compression
aksnzhy Sep 6, 2017
aac5292
update two bit compression
aksnzhy Sep 6, 2017
b63673a
Update comm.h
aksnzhy Sep 11, 2017
ce0f3b2
Merge branch 'master' into master
aksnzhy Sep 12, 2017
3d3ac92
add original size in compressed array
aksnzhy Sep 14, 2017
f797271
update comm.h
aksnzhy Sep 14, 2017
5807469
update distributed training
aksnzhy Sep 15, 2017
7dbce8b
update distributed training
aksnzhy Sep 15, 2017
d1fdfc4
Merge branch 'master' into master
aksnzhy Sep 15, 2017
112b683
Update ndarray_function.cu
aksnzhy Sep 15, 2017
fe10b7a
Update kvstore_dist.h
aksnzhy Sep 15, 2017
cea9199
Update kvstore_dist.h
aksnzhy Sep 15, 2017
0ad7acc
update
aksnzhy Sep 15, 2017
e44f8fb
update
aksnzhy Sep 15, 2017
09ceb54
update
aksnzhy Sep 18, 2017
09971bf
fix bug
aksnzhy Sep 18, 2017
237dc9b
fix
aksnzhy Sep 18, 2017
2ffcfeb
add GC test
rahul003 Sep 19, 2017
c91cca3
fix bug in push
aksnzhy Sep 19, 2017
261e244
merged changes to kvstore compress setting with test of distsync
rahul003 Sep 19, 2017
54bb44b
fix push and pull
aksnzhy Sep 19, 2017
6baa79e
merge with fix for push/pull
rahul003 Sep 19, 2017
91df1b3
fix
aksnzhy Sep 19, 2017
ec8bbc7
fix
aksnzhy Sep 19, 2017
39f2e44
uncompiled
rahul003 Sep 19, 2017
fd42f8c
kvstore dist changes. added cpp_package. changed strtof function calls
rahul003 Sep 20, 2017
f743ab1
fix usage of keys in dict
rahul003 Sep 20, 2017
e18331f
fix push and pull
aksnzhy Sep 20, 2017
7ec80ed
fix
aksnzhy Sep 20, 2017
e24bc9b
working 2bit dist merged
rahul003 Sep 20, 2017
3657869
fix_test
rahul003 Sep 20, 2017
95e073e
fix_test
rahul003 Sep 20, 2017
d595fa5
fix_test
rahul003 Sep 20, 2017
f6e2b92
add print statements
Sep 20, 2017
6cf214e
more print statements and move send command to server
rahul003 Sep 21, 2017
4b0e756
set compress handling
rahul003 Sep 21, 2017
bc31c4c
kvstore dist changes
rahul003 Sep 21, 2017
8630dbe
working kvstore push and pull. not sure if I committed that. from this…
rahul003 Sep 22, 2017
3a8f709
cleanup test
rahul003 Sep 23, 2017
3f9256e
debug prints
rahul003 Sep 25, 2017
d6be11f
working kvstore dist. includes mutation of inputs and setting thresho…
rahul003 Sep 25, 2017
75363bb
merge changes
rahul003 Sep 25, 2017
e34d263
fix operator
rahul003 Sep 25, 2017
c0894b1
kvstore dist changes
rahul003 Sep 26, 2017
381941e
fix compress kvstore issues. non compress is broken
rahul003 Sep 26, 2017
d5c37eb
fix sparse push issue
rahul003 Sep 26, 2017
c0dc329
fix read lock issue
rahul003 Sep 28, 2017
38e94f5
optimizer is the only issue now?
rahul003 Sep 28, 2017
3f27a14
fix all issues with gc dist
rahul003 Sep 28, 2017
5888e31
fix read lock issue
rahul003 Sep 28, 2017
dbcec87
pushing sharded data works
rahul003 Oct 2, 2017
2d53722
works most times. sometimes val instead of 0 has parts of 1 or 1.5...
rahul003 Oct 3, 2017
0bc1da3
fix read lock issue
rahul003 Oct 4, 2017
d120e9a
prev commit fixed seg fault issue on pull without push in a server
rahul003 Oct 4, 2017
4d5315a
add waittowrite to fix pull before push problems
rahul003 Oct 4, 2017
13b2ce5
refactor quantizing for sharded data
rahul003 Oct 5, 2017
648e0e9
redo break up of data across servers,clearer split
rahul003 Oct 6, 2017
a9bdcdc
refactor to use param for thresholds.
rahul003 Oct 9, 2017
1fac41f
Added many checks for 0
rahul003 Oct 9, 2017
1fdbdf0
cmake changes
rahul003 Oct 9, 2017
5b4e405
merge master
rahul003 Oct 9, 2017
c1a9add
formatting issues for easier merge
rahul003 Oct 9, 2017
24bc361
fix rate
rahul003 Oct 9, 2017
3a7985a
fix compilation errors after merge
rahul003 Oct 9, 2017
953ca95
fix compile error and ndarray thresholds in dequantize
rahul003 Oct 10, 2017
8c6ba4f
fix compile error and ndarray thresholds in dequantize
rahul003 Oct 10, 2017
96fa9b3
fix compile error
rahul003 Oct 10, 2017
baae59d
fix compile error, and add comments
rahul003 Oct 10, 2017
2d5696e
update operator comments
rahul003 Oct 10, 2017
36e1b51
comment checks
rahul003 Oct 10, 2017
f73e463
comment checks
rahul003 Oct 10, 2017
647b2ef
compile error
rahul003 Oct 10, 2017
8fd1cde
working on local kvstore compress test
rahul003 Oct 10, 2017
a0c2a2a
fix module api compressparams, and change quantize tblob to inside en…
rahul003 Oct 11, 2017
52f47e5
2bit arg wrong kvstore
rahul003 Oct 11, 2017
a334924
remove log
rahul003 Oct 11, 2017
be8d01d
fix gpu dequantize and tests
rahul003 Oct 11, 2017
bb473a4
fix seg fault in quantize and test indent
rahul003 Oct 11, 2017
c8cfae5
tests print more info
rahul003 Oct 11, 2017
e2b405a
assert almost equal
rahul003 Oct 12, 2017
3ee9249
more debug stuff
rahul003 Oct 13, 2017
15e4f9c
intermediate test rewrite
rahul003 Oct 13, 2017
39f3bac
small change in pushing op to engine
rahul003 Oct 13, 2017
50fa0fa
fix concurrency of quantization
rahul003 Oct 16, 2017
6bb1411
wait on kernel
rahul003 Oct 17, 2017
558e1b5
Merge branch 'compress_params' of https://github.com/rahul003/mxnet i…
rahul003 Oct 17, 2017
69f9e11
updated tests and removed prints
rahul003 Oct 17, 2017
48591f2
comment unnecessary stuff
rahul003 Oct 17, 2017
4146690
fix test
rahul003 Oct 18, 2017
71296f8
remove print
rahul003 Oct 18, 2017
25cdda3
Update dist_sync_kvstore.py
rahul003 Oct 18, 2017
3234aa4
remove slow kernel launch init
rahul003 Oct 18, 2017
5e333bf
Merge branch 'gc' of https://github.com/rahul003/mxnet into gc
rahul003 Oct 18, 2017
72d28b6
cleanup
rahul003 Oct 18, 2017
8357301
merge master
rahul003 Oct 18, 2017
287e040
undo changes in submodule
rahul003 Oct 18, 2017
9290c23
submodule reset
rahul003 Oct 18, 2017
99154c9
remove files
rahul003 Oct 18, 2017
52b6905
undo changes unrelated to project
rahul003 Oct 18, 2017
b560d25
undo changes unrelated to project
rahul003 Oct 18, 2017
60b1b69
Comments and cleanup.
rahul003 Oct 18, 2017
e3153ce
more cleanup and comments
rahul003 Oct 18, 2017
eeb454b
comments for tests
rahul003 Oct 18, 2017
2f936ee
lint changes and comments
rahul003 Oct 18, 2017
5e849e1
speed up operator test by reducing asnumpy() calls
rahul003 Oct 18, 2017
69608da
random data for test_kvstore_local
rahul003 Oct 18, 2017
847a7f2
fix variable confusion error in test
rahul003 Oct 18, 2017
2f8e86e
fix randomized data test for local kvstore
rahul003 Oct 19, 2017
69af018
add nrepeat for test_kvstore
rahul003 Oct 19, 2017
32b9e7c
merge and fix local kvstore random test
rahul003 Oct 19, 2017
39e2d22
change keys after merge from master introduced same keys
rahul003 Oct 19, 2017
bf3ea61
correct test which fails because grad changes
rahul003 Oct 19, 2017
9c9ae58
change to bit ops
rahul003 Oct 22, 2017
5c42ebb
change to bit ops
rahul003 Oct 23, 2017
49e4ee0
use bit array and revert sign changes
rahul003 Oct 24, 2017
44b20e7
merge conflict. remove server changes
rahul003 Oct 24, 2017
f74d317
correct bits setting to 10 as 2
rahul003 Oct 24, 2017
b67a392
remove switch in dequantize
rahul003 Oct 24, 2017
804f7d1
image classification example changes and remove cpp-api
rahul003 Oct 24, 2017
9629410
merge all quantize, and new type in dist server
rahul003 Oct 25, 2017
0feabd5
fix ndarray dequantize
rahul003 Oct 26, 2017
d3e4df8
debug stuff
rahul003 Oct 26, 2017
d6801dd
fix bug
rahul003 Oct 26, 2017
d63e0b4
Merge remote-tracking branch 'origin/gc-quantall' into gc-quantall
rahul003 Oct 26, 2017
e97c477
trying merge dequantize
rahul003 Oct 26, 2017
18df71e
Framework and validation tests for operator validation and performance…
cjolivier01 Oct 26, 2017
9f480ee
Remove obsolete file
cjolivier01 Oct 26, 2017
92dd85f
Fix compile error for non-CUDA build
cjolivier01 Oct 26, 2017
505d3e7
tweaks in quantize
rahul003 Oct 26, 2017
51d0349
Allow for no backward pass
Oct 26, 2017
3e17ec3
Remove unused var
Oct 26, 2017
12d4499
merge chris pr
rahul003 Oct 26, 2017
248908c
making quantize all compatible as operators
rahul003 Oct 27, 2017
35b42f7
separate mshadow and loop operators
rahul003 Oct 27, 2017
cabb948
working profiler, dequantize mshadow is slow
rahul003 Oct 27, 2017
b8d2b50
fix mshadow dequantize
rahul003 Oct 27, 2017
e09a8fd
fix quantize call by kvdist
rahul003 Oct 27, 2017
b2c9f29
making quantize all compatible as operators
rahul003 Oct 27, 2017
6e651ed
add profile to measure.py
rahul003 Oct 27, 2017
0c48ebb
minor profiler changes
rahul003 Oct 27, 2017
fe66ef9
timing print in cpp operator
rahul003 Oct 27, 2017
f5204ca
time quantize
rahul003 Oct 27, 2017
5e473b2
saving data feature added
rahul003 Oct 27, 2017
88cc0fd
cleanup test
rahul003 Oct 27, 2017
5c7a1ff
small updates
rahul003 Oct 28, 2017
5283035
cleanup
rahul003 Oct 28, 2017
5294d4d
minor fix
rahul003 Oct 28, 2017
6bb9933
passing additional environment variables through launch.py
rahul003 Oct 31, 2017
2a7f2f5
update local test
rahul003 Oct 31, 2017
feaae67
update master
rahul003 Oct 31, 2017
080882b
Merge branch 'pass-env' of https://github.com/rahul003/mxnet into gc-…
rahul003 Oct 31, 2017
a5abca4
update dmlc with pass-env
rahul003 Oct 31, 2017
7e5301d
fix launch pass env issue
rahul003 Oct 31, 2017
594b40c
update with pass-env changes
rahul003 Oct 31, 2017
642cfe4
fix operator increment of block, remove unnecessary commented code
rahul003 Oct 31, 2017
3c8686a
fix operator increment of block, remove unnecessary commented code
rahul003 Oct 31, 2017
483d610
fix operator increment of block, remove unnecessary commented code
rahul003 Oct 31, 2017
bc245b4
fix operator increment of block, remove unnecessary commented code
rahul003 Oct 31, 2017
2f257e5
bring back quantize
rahul003 Oct 31, 2017
106feb8
Merge remote-tracking branch 'origin/gc-quantall' into gc-quantall
rahul003 Oct 31, 2017
46cbf5c
fix test
rahul003 Oct 31, 2017
a351723
fix bug with increment of char pointer
rahul003 Nov 1, 2017
c84af06
fix bug with increment of char pointer
rahul003 Nov 1, 2017
d316700
debug module
rahul003 Nov 1, 2017
f044830
Merge branch 'gc-quantall' of https://github.com/rahul003/mxnet into …
rahul003 Nov 1, 2017
e8aa9b5
update test
rahul003 Nov 1, 2017
5f130dd
comment all debug statements
rahul003 Nov 1, 2017
c1fbeb7
change init to normal for now
rahul003 Nov 2, 2017
180af91
Merge branch 'gc-quantall' of https://github.com/rahul003/mxnet into …
rahul003 Nov 2, 2017
4e0bded
remove debug changes
rahul003 Nov 2, 2017
8a083d2
reorg to create gc class, add delayed start to gc, untested: causing …
rahul003 Nov 3, 2017
c4d9a45
redo header files
rahul003 Nov 7, 2017
3a2060b
remove ps
rahul003 Nov 7, 2017
193586e
remove unused header
rahul003 Nov 7, 2017
75399ff
fix compile issues
rahul003 Nov 7, 2017
ac2886a
merge master
rahul003 Nov 8, 2017
e6e41e4
remove multiple delete of gc
rahul003 Nov 8, 2017
a7d6c68
add expected to local kvstore test
rahul003 Nov 8, 2017
970acbb
fix operator compile issues
rahul003 Nov 8, 2017
7ec0655
fix operator compile issues
rahul003 Nov 8, 2017
b72df8e
fix operator compile and link issues
rahul003 Nov 8, 2017
2913b56
remove gc.cpp
rahul003 Nov 8, 2017
f2e2469
add split function
rahul003 Nov 8, 2017
30eae11
move setting of active gc
rahul003 Nov 8, 2017
f5ddf7f
move all to gc.cpp, compile works for cpu
rahul003 Nov 8, 2017
d3b668d
WIP gpu compile
rahul003 Nov 8, 2017
f19e7ee
compiles and links on both cpu and gpu
rahul003 Nov 8, 2017
42cdbdf
move prototypes to header
rahul003 Nov 8, 2017
82f7964
add split function
rahul003 Nov 9, 2017
b96c3c0
undo changes from master
rahul003 Nov 9, 2017
4bb3701
remove cpp perf quantize
rahul003 Nov 9, 2017
5c0114c
undo more changes
rahul003 Nov 9, 2017
9557795
Merge branch 'master' of https://github.com/dmlc/mxnet into gc-quantall
rahul003 Nov 9, 2017
3945a8f
add inactive function so that multiple kvstore dist inits have no com…
rahul003 Nov 9, 2017
bbdfe1a
undo some formatting changes
rahul003 Nov 9, 2017
80957a7
make sharding same when inactive and active
rahul003 Nov 9, 2017
222f33c
remove counts and get_active_type
rahul003 Nov 9, 2017
dc3b8e6
remove print
rahul003 Nov 10, 2017
ac55cdc
add train caltech
rahul003 Nov 10, 2017
48d54df
increase size of mlp
rahul003 Nov 10, 2017
eea86ff
update to alexa mlp
rahul003 Nov 11, 2017
b37d36d
Merge branch 'gc-quantall' of https://github.com/rahul003/mxnet into …
rahul003 Nov 11, 2017
aa6fb6f
pass-env changes
rahul003 Nov 11, 2017
b694f15
add bucketing module compression
rahul003 Nov 12, 2017
e4c46e0
Merge remote-tracking branch 'origin/gc-quantall' into gc-quantall
rahul003 Nov 12, 2017
b84f179
attempts for alexnet training
rahul003 Nov 14, 2017
2578883
prepare for merge
rahul003 Nov 15, 2017
b60b3fb
fix lint issues
rahul003 Nov 15, 2017
8328923
fix lint issues
rahul003 Nov 15, 2017
aa242b8
remove caltech
rahul003 Nov 15, 2017
62c5255
address some comments: shared_ptr, documentation, indentation, new fun…
rahul003 Nov 16, 2017
b8b1d66
move header
rahul003 Nov 16, 2017
b66a3f2
include header corrected
rahul003 Nov 16, 2017
f32b391
include header corrected
rahul003 Nov 16, 2017
0743f60
indents, documentation and test update
rahul003 Nov 16, 2017
6fd68f7
lint
rahul003 Nov 16, 2017
d7aea02
pylint
rahul003 Nov 16, 2017
40f71f8
rename class, fix local kvstore test, remove confusing active method
rahul003 Nov 16, 2017
eabc503
fix importing of compute expected in test_kvstore
rahul003 Nov 16, 2017
806586f
fix bug in device kvstore
rahul003 Nov 16, 2017
6070450
remove active comment in pull
rahul003 Nov 16, 2017
2289129
docstring
rahul003 Nov 16, 2017
f41e102
use dmlc params, enums,
rahul003 Nov 16, 2017
5acbc9a
doc updates
rahul003 Nov 16, 2017
3c1bacb
lint
rahul003 Nov 16, 2017
18d6a90
update from master
rahul003 Nov 16, 2017
dfe7a7d
typo
rahul003 Nov 16, 2017
4b6f34a
rename field to type
rahul003 Nov 16, 2017
30a197b
fix distributed kvstore stopping issue.
rahul003 Nov 17, 2017
3073bf7
Trigger CI
rahul003 Nov 17, 2017
d5e4b2e
trigger CI
rahul003 Nov 17, 2017
44 changes: 26 additions & 18 deletions example/image-classification/common/fit.py
@@ -103,6 +103,11 @@ def add_fit_args(parser):
help='1 means test reading speed without training')
train.add_argument('--dtype', type=str, default='float32',
help='precision: float32 or float16')
train.add_argument('--gc-type', type=str, default='none',
help='type of gradient compression to use, \
takes `2bit` or `none` for now')
train.add_argument('--gc-threshold', type=float, default=0.5,
help='threshold for 2bit gradient compression')
return train

def fit(args, network, data_loader, **kwargs):
@@ -114,6 +119,9 @@ def fit(args, network, data_loader, **kwargs):
"""
# kvstore
kv = mx.kvstore.create(args.kv_store)
if args.gc_type != 'none':
kv.set_gradient_compression({'compression': args.gc_type,
'threshold': args.gc_threshold})

# logging
head = '%(asctime)-15s Node[' + str(kv.rank) + '] %(message)s'
@@ -162,10 +170,10 @@ def fit(args, network, data_loader, **kwargs):

lr_scheduler = lr_scheduler
optimizer_params = {
'learning_rate': lr,
'wd' : args.wd,
'lr_scheduler': lr_scheduler,
'multi_precision': True}
'learning_rate': lr,
'wd' : args.wd,
'lr_scheduler': lr_scheduler,
'multi_precision': True}

# Only a limited number of optimizers have 'momentum' property
has_momentum = {'sgd', 'dcasgd', 'nag'}
@@ -195,17 +203,17 @@

# run
model.fit(train,
begin_epoch = args.load_epoch if args.load_epoch else 0,
num_epoch = args.num_epochs,
eval_data = val,
eval_metric = eval_metrics,
kvstore = kv,
optimizer = args.optimizer,
optimizer_params = optimizer_params,
initializer = initializer,
arg_params = arg_params,
aux_params = aux_params,
batch_end_callback = batch_end_callbacks,
epoch_end_callback = checkpoint,
allow_missing = True,
monitor = monitor)
begin_epoch = args.load_epoch if args.load_epoch else 0,
num_epoch = args.num_epochs,
eval_data = val,
eval_metric = eval_metrics,
kvstore = kv,
optimizer = args.optimizer,
optimizer_params = optimizer_params,
initializer = initializer,
arg_params = arg_params,
aux_params = aux_params,
batch_end_callback = batch_end_callbacks,
epoch_end_callback = checkpoint,
allow_missing = True,
monitor = monitor)
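In fit.py, the whole integration is two new argparse flags plus one call on the kvstore before training starts. A minimal standalone sketch of the same wiring (flag values are hard-coded here for illustration; 'dist_sync' assumes the script runs under a distributed launcher):

    import mxnet as mx

    # equivalent of passing --gc-type 2bit --gc-threshold 0.5 to the script
    gc_type, gc_threshold = '2bit', 0.5

    kv = mx.kvstore.create('dist_sync')   # or 'local', 'device', ...
    if gc_type != 'none':
        kv.set_gradient_compression({'compression': gc_type,
                                     'threshold': gc_threshold})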
8 changes: 6 additions & 2 deletions example/rnn/lstm_bucketing.py
@@ -47,7 +47,10 @@
help='the batch size.')
parser.add_argument('--disp-batches', type=int, default=50,
help='show progress for every n batches')

parser.add_argument('--gc-type', type=str, default='none',
help='type of gradient compression')
parser.add_argument('--gc-threshold', type=float, default=0.5,
help='threshold for 2bit gradient compression')

def tokenize_text(fname, vocab=None, invalid_label=-1, start_label=0):
if not os.path.isfile(fname):
@@ -111,7 +114,8 @@ def sym_gen(seq_len):
model = mx.mod.BucketingModule(
sym_gen = sym_gen,
default_bucket_key = data_train.default_bucket_key,
context = contexts)
context = contexts,
compression_params = {'compression': args.gc_type})

model.fit(
train_data = data_train,
12 changes: 12 additions & 0 deletions include/mxnet/c_api.h
@@ -1530,6 +1530,18 @@ MXNET_DLL int MXInitPSEnv(mx_uint num_vars,
*/
MXNET_DLL int MXKVStoreCreate(const char *type,
KVStoreHandle *out);

/*!
* \brief Set parameters to use low-bit compressed gradients
* \param handle handle to the kvstore
* \param compression type of compression
* \param threshold set the threshold for 2bit compression
* \return 0 when success, -1 when failure happens
*/
MXNET_DLL int MXKVStoreSetGradientCompression(KVStoreHandle handle,
const char *compression,
const float threshold);
Comment from a Contributor:

The API should be:

MXKVStoreSetGradientCompression(KVStoreHandle handle, mx_uint num_params, const char **keys, const char **vals)

The values should be parsed in the backend with dmlc::Parameter.


/*!
* \brief Delete a KVStore handle.
* \param handle handle to the kvstore
16 changes: 16 additions & 0 deletions include/mxnet/kvstore.h
@@ -30,6 +30,7 @@
#include <string>
#include <functional>
#include <atomic>
#include "../../src/kvstore/gradient_compression.h"
#include "./ndarray.h"
#if MXNET_USE_DIST_KVSTORE
#include "ps/ps.h"
@@ -64,6 +65,14 @@ class KVStore {
*/
inline const std::string& type() { return type_; }

/**
* \brief Set parameters to use low-bit compressed gradients
* \param compression_type type of compression
* \param threshold threshold for 2bit compression
*/
virtual void SetGradientCompression(const std::string& compression_type,
const float threshold) = 0;

/*!
* \brief Initialize a list of key-value pair to the store.
*
@@ -387,6 +396,13 @@
*/
std::string type_;

/** \brief Gradient compression object starts with GC_NONE mode
* Used if SetGradientCompression sets the type.
* Currently there is no support for un-setting gradient compression
*/
std::shared_ptr<kvstore::GradientCompression> gradient_compression_;
Comment from a Member:

No support for un-setting gradient compression? What happens if a user tries to unset it?

Reply from rahul003 (Member Author), Nov 16, 2017:

If the user calls kvstore.set_gradient_compression({'type':'none'}) after setting it to 2bit, it throws an error because none can't be a type.
If the user sets 2bit again with a different threshold, the new threshold is used from then on, but there may be a transition period in which gradients quantized with the old threshold are dequantized with the new threshold, because of the delay in synchronization.



/**
* \brief whether to do barrier when finalize
*/
11 changes: 9 additions & 2 deletions python/mxnet/gluon/trainer.py
@@ -44,14 +44,20 @@ class Trainer(object):
kvstore : str or KVStore
kvstore type for multi-gpu and distributed training. See help on
:any:`mxnet.kvstore.create` for more information.
compression_params : dict
Specifies type of gradient compression and additional arguments depending
on the type of compression being used. For example, 2bit compression requires a threshold.
Arguments would then be {'compression':'2bit', 'threshold':0.5}
See mxnet.KVStore.set_gradient_compression method for more details on gradient compression.

Properties
----------
learning_rate: float
The current learning rate of the optimizer. Given an Optimizer object
optimizer, its learning rate can be accessed as optimizer.learning_rate.
"""
def __init__(self, params, optimizer, optimizer_params=None, kvstore='device'):
def __init__(self, params, optimizer, optimizer_params=None, kvstore='device',
compression_params=None):
if isinstance(params, (dict, ParameterDict)):
params = list(params.values())
if not isinstance(params, (list, tuple)):
@@ -65,7 +71,7 @@ def __init__(self, params, optimizer, optimizer_params=None, kvstore='device'):
"First argument must be a list or dict of Parameters, " \
"got list of %s."%(type(param)))
self._params.append(param)

self._compression_params = compression_params
optimizer_params = optimizer_params if optimizer_params else {}
self._scale = optimizer_params.get('rescale_grad', 1.0)
self._contexts = self._check_contexts()
@@ -104,6 +110,7 @@ def _init_kvstore(self):
kvstore, update_on_kvstore = _create_kvstore(self._kvstore, len(self._contexts),
arg_arrays)
if kvstore:
kvstore.set_gradient_compression(self._compression_params)
if 'dist' in kvstore.type:
update_on_kvstore = False
for i, param in enumerate(self._params):
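For Gluon, the new `compression_params` argument is stored on the Trainer and forwarded to `kvstore.set_gradient_compression` once `_init_kvstore` runs. A hedged usage sketch (the network and optimizer settings are placeholders, not taken from this PR):

    import mxnet as mx
    from mxnet import gluon

    net = gluon.nn.Dense(10)
    net.initialize()
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.1},
                            kvstore='device',
                            compression_params={'compression': '2bit',
                                                'threshold': 0.5})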
73 changes: 72 additions & 1 deletion python/mxnet/kvstore.py
@@ -24,7 +24,7 @@
from .ndarray import NDArray
from .ndarray import _ndarray_cls
from .base import _LIB
from .base import check_call, c_array, c_str, string_types, mx_uint, py_str
from .base import check_call, c_array, c_str, string_types, numeric_types, mx_uint, mx_float, py_str
from .base import NDArrayHandle, KVStoreHandle
from . import optimizer as opt

@@ -349,6 +349,77 @@ def row_sparse_pull(self, key, out=None, priority=0, row_ids=None):
check_call(_LIB.MXKVStorePullRowSparse(
self.handle, mx_uint(len(ckeys)), ckeys, cvals, crow_ids, ctypes.c_int(priority)))

def set_gradient_compression(self, compression_params=(('compression', '2bit'),)):
Comment from a Contributor:
I don't think there should be a default value at all.

Comment from a Contributor:
rename key compression to type

""" Specifies type of low-bit quantization for gradient compression if any, \
and additional arguments depending on the type of compression being used.

2bit Gradient Compression takes a positive float `threshold`.
The technique works by thresholding values such that positive values in the
gradient above threshold will be set to threshold. Negative values whose absolute
values are higher than threshold, will be set to the negative of threshold.
Values whose absolute values are less than threshold will be set to 0.
By doing so, each value in the gradient is in one of three states. Two bits are
used to represent these states, and every 16 float values in the original
gradient can be represented using one float. This compressed representation
can reduce communication costs. The difference between these thresholded values and
original values is stored at the sender's end as residual and added to the
gradient in the next iteration.

When kvstore is 'local', gradient compression is used to reduce communication
between multiple devices (gpus). Gradient is quantized on each GPU which
computed the gradients, then sent to the GPU which merges the gradients. This
receiving GPU dequantizes the gradients and merges them. Note that this
increases memory usage on each GPU because of the residual array stored.

When kvstore is 'dist', gradient compression is used to reduce communication
from worker to server. Gradient is quantized on each worker which
computed the gradients, then sent to the server which dequantizes
this data and merges the gradients from each worker. Note that this
increases CPU memory usage on each worker because of the residual array stored.
Only worker to server communication is compressed in this setting.
If each machine has multiple GPUs, currently this GPU to GPU communication is
not compressed. Server to worker communication (in the case of pull) is also not
compressed.

To use 2bit compression, we need to specify `compression` as `2bit`.
Only specifying `compression` would use default value for the threshold.
To completely specify the arguments for 2bit compression, we would need to pass
a dictionary which includes `threshold` like:
{'compression': '2bit', 'threshold': 0.5}

Parameters
----------
compression_params : dict
Comment from a Contributor:
Would this doc render correctly?

Reply from rahul003 (Member Author), Nov 16, 2017:
Changed it a bit and verified
See here

A dictionary specifying the type and parameters for gradient compression.
The key `compression` in this dictionary is a
required string argument and specifies the type of gradient compression.
Other keys in this dictionary are optional and specific to the type
of gradient compression. Defaults to (('compression', '2bit'),).
The default value is not a dict,
just to avoid pylint warning on dangerous default values.
"""
if compression_params:
Comment from a Contributor:
superfluous if?

if not isinstance(compression_params, dict):
raise ValueError("compression_params needs to be a dictionary")
if 'compression' not in compression_params:
raise ValueError('compression_params requires `compression` to be set')
elif not isinstance(compression_params['compression'], string_types):
raise TypeError('compression must be a string')
elif compression_params['compression'] not in ['none', '2bit']:
raise ValueError('Unsupported type of compression')

if compression_params['compression'] == '2bit':
Comment from a Contributor:
This parsing should be done in the backend with dmlc::Parameter.

The frontend should pass strings of key-value pairs.

if 'threshold' in compression_params:
if not isinstance(compression_params['threshold'], numeric_types):
raise TypeError('threshold must be a numeric type')
if compression_params['threshold'] <= 0:
raise ValueError('threshold must be greater than 0')
else:
compression_params['threshold'] = 0.5

check_call(_LIB.MXKVStoreSetGradientCompression(
self.handle, c_str(compression_params['compression']),
mx_float(compression_params['threshold'])))

def set_optimizer(self, optimizer):
""" Registers an optimizer with the kvstore.
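The docstring above pins down the quantize/dequantize math, so a small NumPy sketch may help; this is an illustration only. The PR's actual implementation is the C++/CUDA GradientCompression kernels, and this sketch skips the bit-packing that fits 16 quantized values into one float:

    import numpy as np

    def quantize_2bit(grad, residual, threshold=0.5):
        # add the stored residual (error feedback) before thresholding
        acc = grad + residual
        out = np.zeros_like(acc)
        out[acc >= threshold] = threshold      # one of three states: +threshold
        out[acc <= -threshold] = -threshold    # state: -threshold
        residual[:] = acc - out                # difference kept at the sender
        return out                             # 3 states -> 2 bits per value

    grad = np.array([0.7, -0.9, 0.2, -0.1])
    residual = np.zeros_like(grad)
    print(quantize_2bit(grad, residual))   # -> [ 0.5 -0.5  0.   0. ]
    print(residual)                        # -> [ 0.2 -0.4  0.2 -0.1] (up to float rounding)

Dequantization on the receiving end only has to map the three states back to {-threshold, 0, +threshold}, which is why both sides must agree on the threshold (see the discussion above about re-setting it mid-training).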
15 changes: 12 additions & 3 deletions python/mxnet/module/bucketing_module.py
@@ -52,10 +52,16 @@ class BucketingModule(BaseModule):
state_names : list of str
States are similar to data and label, but not provided by data iterator.
Instead they are initialized to 0 and can be set by set_states()
compression_params : dict
Specifies type of gradient compression and additional arguments depending
on the type of compression being used. For example, 2bit compression requires a threshold.
Arguments would then be {'compression':'2bit', 'threshold':0.5}
See mxnet.KVStore.set_gradient_compression method for more details on gradient compression.
"""
def __init__(self, sym_gen, default_bucket_key=None, logger=logging,
context=ctx.cpu(), work_load_list=None,
fixed_param_names=None, state_names=None):
fixed_param_names=None, state_names=None,
compression_params=None):
super(BucketingModule, self).__init__(logger=logger)

assert default_bucket_key is not None
@@ -73,6 +79,7 @@ def __init__(self, sym_gen, default_bucket_key=None, logger=logging,
_check_input_names(symbol, state_names, "state", True)
_check_input_names(symbol, fixed_param_names, "fixed_param", True)

self._compression_params = compression_params
self._fixed_param_names = fixed_param_names
self._state_names = state_names
self._context = context
@@ -319,7 +326,8 @@ def bind(self, data_shapes, label_shapes=None, for_training=True,
module = Module(symbol, data_names, label_names, logger=self.logger,
context=self._context, work_load_list=self._work_load_list,
fixed_param_names=self._fixed_param_names,
state_names=self._state_names)
state_names=self._state_names,
compression_params=self._compression_params)
module.bind(data_shapes, label_shapes, for_training, inputs_need_grad,
force_rebind=False, shared_module=None, grad_req=grad_req)
self._curr_module = module
@@ -349,7 +357,8 @@ def switch_bucket(self, bucket_key, data_shapes, label_shapes=None):
logger=self.logger, context=self._context,
work_load_list=self._work_load_list,
fixed_param_names=self._fixed_param_names,
state_names=self._state_names)
state_names=self._state_names,
compression_params=self._compression_params)
module.bind(data_shapes, label_shapes, self._curr_module.for_training,
self._curr_module.inputs_need_grad,
force_rebind=False, shared_module=self._buckets[self._default_bucket_key])
9 changes: 8 additions & 1 deletion python/mxnet/module/module.py
@@ -59,10 +59,15 @@ class Module(BaseModule):
state_names : list of str
states are similar to data and label, but not provided by data iterator.
Instead they are initialized to 0 and can be set by `set_states()`.
compression_params : dict
Specifies type of gradient compression and additional arguments depending
on the type of compression being used. For example, 2bit compression requires a threshold.
Arguments would then be {'compression':'2bit', 'threshold':0.5}
See mxnet.KVStore.set_gradient_compression method for more details on gradient compression.
"""
def __init__(self, symbol, data_names=('data',), label_names=('softmax_label',),
logger=logging, context=ctx.cpu(), work_load_list=None,
fixed_param_names=None, state_names=None):
fixed_param_names=None, state_names=None, compression_params=None):
super(Module, self).__init__(logger=logger)

if isinstance(context, ctx.Context):
@@ -99,6 +104,7 @@ def __init__(self, symbol, data_names=('data',), label_names=('softmax_label',),
self._aux_params = None
self._params_dirty = False

self._compression_params = compression_params
self._optimizer = None
self._kvstore = None
self._update_on_kvstore = None
@@ -521,6 +527,7 @@ def init_optimizer(self, kvstore='local', optimizer='sgd',
self._updater = None

if kvstore:
kvstore.set_gradient_compression(self._compression_params)
# copy initialized local parameters to kvstore
_initialize_kvstore(kvstore=kvstore,
param_arrays=self._exec_group.param_arrays,
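The symbol-API Module takes the same dictionary and hands it to the kvstore inside init_optimizer, as the diff above shows. A sketch of that path (the symbol and shapes are made-up placeholders):

    import mxnet as mx

    sym = mx.sym.FullyConnected(mx.sym.Variable('data'), num_hidden=10)
    mod = mx.mod.Module(sym, data_names=('data',), label_names=None,
                        compression_params={'compression': '2bit',
                                            'threshold': 0.5})
    mod.bind(data_shapes=[('data', (32, 100))])
    mod.init_params()
    # init_optimizer forwards compression_params to the kvstore
    mod.init_optimizer(kvstore='local', optimizer='sgd')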
7 changes: 7 additions & 0 deletions src/c_api/c_api.cc
@@ -733,6 +733,13 @@ int MXKVStoreCreate(const char *type,
API_END();
}

int MXKVStoreSetGradientCompression(KVStoreHandle handle,
const char *compression, const float threshold) {
API_BEGIN();
static_cast<KVStore*>(handle)->SetGradientCompression(compression, threshold);
API_END();
}

int MXKVStoreFree(KVStoreHandle handle) {
API_BEGIN();
delete static_cast<KVStore*>(handle);