This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
2bit gradient compression #8662
Merged

Changes from 250 commits (262 commits in total):
407c01a update two bit compression (aksnzhy)
8cbb7f6 Update trainer.py (aksnzhy)
0dd1874 Update test_operator.py (aksnzhy)
bbd21e4 update two bit compression (aksnzhy)
72640e9 update two bit compression (aksnzhy)
aaafa84 update two bit compression (aksnzhy)
2d85430 update (aksnzhy)
5a99e6a update (aksnzhy)
861fca5 update two bit compression (aksnzhy)
54c6f06 update two bit compression (aksnzhy)
03e47a4 update two bit compression (aksnzhy)
fedd4b4 update two bit compression (aksnzhy)
13ff1bc update two bit compression (aksnzhy)
b75d7ca update two bit compression (aksnzhy)
b84b762 update two bit compression (aksnzhy)
260b606 update two bit compression (aksnzhy)
7d78e3a update two bit compression (aksnzhy)
1b550eb Merge branch 'master' into master (aksnzhy)
2a90dae update two bit compression (aksnzhy)
baba1d8 update two bit compression (aksnzhy)
aac5292 update two bit compression (aksnzhy)
b63673a Update comm.h (aksnzhy)
ce0f3b2 Merge branch 'master' into master (aksnzhy)
3d3ac92 add original size in comrpessed array (aksnzhy)
f797271 update comm.h (aksnzhy)
5807469 update distributed training (aksnzhy)
7dbce8b update distributed training (aksnzhy)
d1fdfc4 Merge branch 'master' into master (aksnzhy)
112b683 Update ndarray_function.cu (aksnzhy)
fe10b7a Update kvstore_dist.h (aksnzhy)
cea9199 Update kvstore_dist.h (aksnzhy)
0ad7acc update (aksnzhy)
e44f8fb update (aksnzhy)
09ceb54 update (aksnzhy)
09971bf fix bug (aksnzhy)
237dc9b fix (aksnzhy)
2ffcfeb add GC test (rahul003)
c91cca3 fix bug in push (aksnzhy)
261e244 merged changes to kvstore compress setting with test of distsync (rahul003)
54bb44b fix push and pull (aksnzhy)
6baa79e merge with fix for push/pull (rahul003)
91df1b3 fix (aksnzhy)
ec8bbc7 fix (aksnzhy)
39f2e44 uncompiled (rahul003)
fd42f8c kvstore dist changes. added cpp_package. changed strtof function calls (rahul003)
f743ab1 fix usage of keys in dict (rahul003)
e18331f fix push and pull (aksnzhy)
7ec80ed fix (aksnzhy)
e24bc9b working 2bit dist merged (rahul003)
3657869 fix_test (rahul003)
95e073e fix_test (rahul003)
d595fa5 fix_test (rahul003)
f6e2b92 add print statements
6cf214e more print statements and move send command to server (rahul003)
4b0e756 set compress handling (rahul003)
bc31c4c kvstore dist changes (rahul003)
8630dbe working kvstore push and pull. not sure if I commited that. from this… (rahul003)
3a8f709 cleanup test (rahul003)
3f9256e debug prints (rahul003)
d6be11f working kvstore dist. includes mutation of inputs and setting thresho… (rahul003)
75363bb merge changes (rahul003)
e34d263 fix operator (rahul003)
c0894b1 kvstore dist changes (rahul003)
381941e fix compress kvstore issues. non compress is broken (rahul003)
d5c37eb fix sparse push issue (rahul003)
c0dc329 fix read lock issue (rahul003)
38e94f5 optimizer is the only issue now? (rahul003)
3f27a14 fix all issues with gc dist (rahul003)
5888e31 fix read lock issue (rahul003)
dbcec87 pushing sharded data works (rahul003)
2d53722 works most times. sometimes val instead of 0 has parts of 1 or 1.5... (rahul003)
0bc1da3 fix read lock issue (rahul003)
d120e9a prev commit fixed seg fault issue on pull without push in a server (rahul003)
4d5315a add waittowrite to fix pull before push problems (rahul003)
13b2ce5 refactor quantizing for sharded data (rahul003)
648e0e9 redo break up of data across servers,clearer split (rahul003)
a9bdcdc refactor to use param for thresholds. (rahul003)
1fac41f Added many checks for 0 (rahul003)
1fdbdf0 cmake changes (rahul003)
5b4e405 merge master (rahul003)
c1a9add formatting issues for easier merge (rahul003)
24bc361 fix rate (rahul003)
3a7985a fix compilation errors after merge (rahul003)
953ca95 fix compile error and ndarray thresholds in dequantize (rahul003)
8c6ba4f fix compile error and ndarray thresholds in dequantize (rahul003)
96fa9b3 fix compile error (rahul003)
baae59d fix compile error, and add comments (rahul003)
2d5696e update operator comments (rahul003)
36e1b51 comment checks (rahul003)
f73e463 comment checks (rahul003)
647b2ef compile error (rahul003)
8fd1cde working on local kvstore compress test (rahul003)
a0c2a2a fix module api compressparams, and change quantize tblob to inside en… (rahul003)
52f47e5 2bit arg wrong kvstore (rahul003)
a334924 remove log (rahul003)
be8d01d fix gpu dequantize and tests (rahul003)
bb473a4 fix seg fault in quantize and test indent (rahul003)
c8cfae5 tests print more info (rahul003)
e2b405a assert almost equal (rahul003)
3ee9249 more debug stuff (rahul003)
15e4f9c intermediate test rewrite (rahul003)
39f3bac small change in pushing op to engineh (rahul003)
50fa0fa fix concurrency of quantization (rahul003)
6bb1411 wait on kernel (rahul003)
558e1b5 Merge branch 'compress_params' of https://github.com/rahul003/mxnet i… (rahul003)
69f9e11 updated tests and removed prints (rahul003)
48591f2 comment unnecessary stuff (rahul003)
4146690 fix test (rahul003)
71296f8 remove print (rahul003)
25cdda3 Update dist_sync_kvstore.py (rahul003)
3234aa4 remove slow kernel launch init (rahul003)
5e333bf Merge branch 'gc' of https://github.com/rahul003/mxnet into gc (rahul003)
72d28b6 cleanup (rahul003)
8357301 merge master (rahul003)
287e040 undo changes in submodule (rahul003)
9290c23 submodule reset (rahul003)
99154c9 remove files (rahul003)
52b6905 undo changes unrelated to project (rahul003)
b560d25 undo changes unrelated to project (rahul003)
60b1b69 Comments and cleanup. (rahul003)
e3153ce more cleanup and comments (rahul003)
eeb454b comments for tests (rahul003)
2f936ee lint changes and comments (rahul003)
5e849e1 speed up operator test by reducing asnumpy() calls (rahul003)
69608da random data for test_kvstore_local (rahul003)
847a7f2 fix variable confusion error in test (rahul003)
2f8e86e fix randomized data test for local kvstore (rahul003)
69af018 add nrepeat for test_kvstore (rahul003)
32b9e7c merge and fix local kvstore random test (rahul003)
39e2d22 change keys after merge from master introduced same keys (rahul003)
bf3ea61 correct test which fails because grad changes (rahul003)
9c9ae58 change to bit ops (rahul003)
5c42ebb change to bit ops (rahul003)
49e4ee0 use bit array and revert sign changes (rahul003)
44b20e7 merge conflict. remove server changes (rahul003)
f74d317 correct bits setting to 10 as 2 (rahul003)
b67a392 remove switch in dequantize (rahul003)
804f7d1 image classification example changes and remove cpp-api (rahul003)
9629410 merge all quantize, and new type in dist server (rahul003)
0feabd5 fix ndarray dequantize (rahul003)
d3e4df8 debug stuff (rahul003)
d6801dd fix bug (rahul003)
d63e0b4 Merge remote-tracking branch 'origin/gc-quantall' into gc-quantall (rahul003)
e97c477 trying merge dequntize (rahul003)
18df71e Frmework and validation tests for operator validation and performance… (cjolivier01)
9f480ee Remove obsolete file (cjolivier01)
92dd85f Fix compile error for non-CUDA build (cjolivier01)
505d3e7 tweaks in quantize (rahul003)
51d0349 Allow for no backward pass
3e17ec3 Remove unused var
12d4499 merge chris pr (rahul003)
248908c making quantize all compatible as operators (rahul003)
35b42f7 separate mshadow and loop operators (rahul003)
cabb948 working profiler, dequantize mshadow is slow (rahul003)
b8d2b50 fix mshadow dequantize (rahul003)
e09a8fd fix quantize call by kvdist (rahul003)
b2c9f29 making quantize all compatible as operators (rahul003)
6e651ed add profile to measure.py (rahul003)
0c48ebb minor profiler changes (rahul003)
fe66ef9 timing print in cpp operator (rahul003)
f5204ca time quantize (rahul003)
5e473b2 saving data feature added (rahul003)
88cc0fd cleanup test (rahul003)
5c7a1ff small updates (rahul003)
5283035 cleanup (rahul003)
5294d4d minor fix (rahul003)
6bb9933 passing additional environment variables through launch.py (rahul003)
2a7f2f5 update local test (rahul003)
feaae67 update master (rahul003)
080882b Merge branch 'pass-env' of https://github.com/rahul003/mxnet into gc-… (rahul003)
a5abca4 update dmlc with pass-env (rahul003)
7e5301d fix launch pass env issue (rahul003)
594b40c update with pass-env changes (rahul003)
642cfe4 fix operator increment of block, remove unncessary commented code (rahul003)
3c8686a fix operator increment of block, remove unncessary commented code (rahul003)
483d610 fix operator increment of block, remove unncessary commented code (rahul003)
bc245b4 fix operator increment of block, remove unncessary commented code (rahul003)
2f257e5 bring back quantize (rahul003)
106feb8 Merge remote-tracking branch 'origin/gc-quantall' into gc-quantall (rahul003)
46cbf5c fix test (rahul003)
a351723 fix bug with increment of char pointer (rahul003)
c84af06 fix bug with increment of char pointer (rahul003)
d316700 debug module (rahul003)
f044830 Merge branch 'gc-quantall' of https://github.com/rahul003/mxnet into … (rahul003)
e8aa9b5 update test (rahul003)
5f130dd comment all debug statements (rahul003)
c1fbeb7 change init to normal for now (rahul003)
180af91 Merge branch 'gc-quantall' of https://github.com/rahul003/mxnet into … (rahul003)
4e0bded remove debug changes (rahul003)
8a083d2 reorg to create gc class, add delayed start to gc, untested: causing … (rahul003)
c4d9a45 redo header files (rahul003)
3a2060b remove ps (rahul003)
193586e remove unused header (rahul003)
75399ff fix compile issues (rahul003)
ac2886a merge master (rahul003)
e6e41e4 remove multiple delete of gc (rahul003)
a7d6c68 add expected to local kvstore test (rahul003)
970acbb fix operator compile issues (rahul003)
7ec0655 fix operator compile issues (rahul003)
b72df8e fix operator compile and link issues (rahul003)
2913b56 remove gc.cpp (rahul003)
f2e2469 add split function (rahul003)
30eae11 move setting of active gc (rahul003)
f5ddf7f move all to gc.cpp, compile works for cpu (rahul003)
d3b668d WIP gpu compile (rahul003)
f19e7ee compiles and links on both cpu and gpu (rahul003)
42cdbdf move prototypes to header (rahul003)
82f7964 add split function (rahul003)
b96c3c0 undo changes from master (rahul003)
4bb3701 remove cpp perf quantize (rahul003)
5c0114c undo more changes (rahul003)
9557795 Merge branch 'master' of https://github.com/dmlc/mxnet into gc-quantall (rahul003)
3945a8f add inactive function so that multiple kvstore dist inits have no com… (rahul003)
bbdfe1a undo some formatting changes (rahul003)
80957a7 make sharding same when inactive and active (rahul003)
222f33c remove counts and get_active_type (rahul003)
dc3b8e6 remove print (rahul003)
ac55cdc add train caltech (rahul003)
48d54df increase size of mlp (rahul003)
eea86ff update to alexa mlp (rahul003)
b37d36d Merge branch 'gc-quantall' of https://github.com/rahul003/mxnet into … (rahul003)
aa6fb6f pass-env changes (rahul003)
b694f15 add bucketing module compression (rahul003)
e4c46e0 Merge remote-tracking branch 'origin/gc-quantall' into gc-quantall (rahul003)
b84f179 attempts for alexnet training (rahul003)
2578883 prepare for merge (rahul003)
b60b3fb fix lint issues (rahul003)
8328923 fix lint issues (rahul003)
aa242b8 remove caltech (rahul003)
62c5255 address some comments: shared_ptr, documentation, indentaion, new fun… (rahul003)
b8b1d66 move header (rahul003)
b66a3f2 include header corrected (rahul003)
f32b391 include header corrected (rahul003)
0743f60 indents, documentation and test update (rahul003)
6fd68f7 lint (rahul003)
d7aea02 pylint (rahul003)
40f71f8 rename class, fix local kvstore test, remove confusing active method (rahul003)
eabc503 fix importing of compute expected in test_kvstore (rahul003)
806586f fix bug in device kvstore (rahul003)
6070450 remove active comment in pull (rahul003)
2289129 docstring (rahul003)
f41e102 use dmlc params, enums, (rahul003)
5acbc9a doc updates (rahul003)
3c1bacb lint (rahul003)
18d6a90 update from master (rahul003)
dfe7a7d typo (rahul003)
4b6f34a rename field to type (rahul003)
30a197b fix distributed kvstore stopping issue. (rahul003)
3073bf7 Trigger CI (rahul003)
d5e4b2e trigger CI (rahul003)
```diff
@@ -63,6 +63,16 @@ def _ctype_key_value(keys, vals):
         else c_array(ctypes.c_int, [keys] * len(vals))
     return (c_keys, c_array(NDArrayHandle, [value.handle for value in vals]), use_str_keys)
 
+def _ctype_dict(param_dict):
+    """
+    Returns ctype arrays for keys and values(converted to strings) in a dictionary
+    """
+    assert(isinstance(param_dict, dict)), \
+        "unexpected type for param_dict: " + str(type(param_dict))
+    c_keys = c_array(ctypes.c_char_p, [c_str(k) for k in param_dict.keys()])
+    c_vals = c_array(ctypes.c_char_p, [c_str(str(v)) for v in param_dict.values()])
+    return (c_keys, c_vals)
+
 def _updater_wrapper(updater):
     """A wrapper for the user-defined handle."""
     def updater_handle(key, lhs_handle, rhs_handle, _):
```
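For context, the new helper simply turns a Python dict into parallel ctypes string arrays. A hypothetical illustration of its output (`c_array` and `c_str` are the existing helpers in `mxnet.base`; the values shown are assumptions for illustration):

```python
# hypothetical illustration of _ctype_dict's output; not part of the diff
ckeys, cvals = _ctype_dict({'type': '2bit', 'threshold': 0.5})
# ckeys is a ctypes array of c_char_p: (b'type', b'threshold')
# cvals holds the stringified values:  (b'2bit', b'0.5')
# key/value pairing follows the dict's iteration order
```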
```diff
@@ -349,6 +359,58 @@ def row_sparse_pull(self, key, out=None, priority=0, row_ids=None):
         check_call(_LIB.MXKVStorePullRowSparse(
             self.handle, mx_uint(len(ckeys)), ckeys, cvals, crow_ids, ctypes.c_int(priority)))
 
+    def set_gradient_compression(self, compression_params):
+        """ Specifies type of low-bit quantization for gradient compression \
+        and additional arguments depending on the type of compression being used.
+
+        2bit Gradient Compression takes a positive float `threshold`.
+        The technique works by thresholding values such that positive values in the
+        gradient above threshold will be set to threshold. Negative values whose absolute
+        values are higher than threshold, will be set to the negative of threshold.
+        Values whose absolute values are less than threshold will be set to 0.
+        By doing so, each value in the gradient is in one of three states. 2bits are
+        used to represent these states, and every 16 float values in the original
+        gradient can be represented using one float. This compressed representation
+        can reduce communication costs. The difference between these thresholded values and
+        original values is stored at the sender's end as residual and added to the
+        gradient in the next iteration.
+
+        When kvstore is 'local', gradient compression is used to reduce communication
+        between multiple devices (gpus). Gradient is quantized on each GPU which
+        computed the gradients, then sent to the GPU which merges the gradients. This
+        receiving GPU dequantizes the gradients and merges them. Note that this
+        increases memory usage on each GPU because of the residual array stored.
+
+        When kvstore is 'dist', gradient compression is used to reduce communication
+        from worker to sender. Gradient is quantized on each worker which
+        computed the gradients, then sent to the server which dequantizes
+        this data and merges the gradients from each worker. Note that this
+        increases CPU memory usage on each worker because of the residual array stored.
+        Only worker to server communication is compressed in this setting.
+        If each machine has multiple GPUs, currently this GPU to GPU or GPU to CPU communication
+        is not compressed. Server to worker communication (in the case of pull)
+        is also not compressed.
+
+        To use 2bit compression, we need to specify `type` as `2bit`.
+        Only specifying `type` would use default value for the threshold.
+        To completely specify the arguments for 2bit compression, we would need to pass
+        a dictionary which includes `threshold` like:
+        {'type': '2bit', 'threshold': 0.5}
+
+        Parameters
+        ----------
+        compression_params : dict
+            A dictionary specifying the type and parameters for gradient compression.
+            The key `type` in this dictionary is a
+            required string argument and specifies the type of gradient compression.
+            Currently `type` can be only `2bit`
+            Other keys in this dictionary are optional and specific to the type
+            of gradient compression.
+        """
+        ckeys, cvals = _ctype_dict(compression_params)
+        check_call(_LIB.MXKVStoreSetGradientCompression(self.handle,
+                                                        mx_uint(len(compression_params)),
+                                                        ckeys, cvals))
+
     def set_optimizer(self, optimizer):
         """ Registers an optimizer with the kvstore.
```

An inline review comment was attached to the `compression_params : dict` line: "Would this doc render correctly?"
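For reference, here is how the new method is invoked from user code; a minimal sketch based on the docstring above, assuming a kvstore has been created:

```python
import mxnet as mx

kv = mx.kv.create('dist_sync')  # gradient compression also applies to a 'local' kvstore
# `type` is required; omitting `threshold` falls back to its default value
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})
```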
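The thresholding-plus-residual scheme described in the docstring can be made concrete with a small NumPy sketch. This is illustrative only, not the MXNet kernel: the real implementation additionally packs the three states into 2 bits each (16 values per 32-bit float), and the function name here is hypothetical.

```python
import numpy as np

def two_bit_quantize_reference(grad, residual, threshold=0.5):
    """Reference semantics only: map each value to {-threshold, 0, +threshold}."""
    grad_r = grad + residual                  # fold in the residual kept from last iteration
    out = np.zeros_like(grad_r)
    out[grad_r >= threshold] = threshold      # large positive values saturate to +threshold
    out[grad_r <= -threshold] = -threshold    # large negative values saturate to -threshold
    residual[:] = grad_r - out                # store the quantization error for next time
    return out

grad = np.array([0.7, -1.2, 0.1, 0.4])
residual = np.zeros_like(grad)
print(two_bit_quantize_reference(grad, residual))  # [ 0.5 -0.5  0.   0. ]
print(residual)                                    # approx. [ 0.2 -0.7  0.1  0.4]
```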
Review comment:
No support for un-setting gradient compression? What happens if a user tries to unset it?
Reply:
If the user calls `kvstore.set_gradient_compression({'type': 'none'})` after setting it to `2bit`, it throws an error because `none` can't be a type. If the user sets `2bit` again with a different threshold, the new threshold will be used from then on, but there might be a transition period in which gradients quantized with the old threshold are dequantized with the new threshold, because of the delay in synchronization.
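A short sketch of the behavior described in this exchange (hedged: the exact exception type is not stated in the thread):

```python
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})  # compression enabled
kv.set_gradient_compression({'type': 'none'})  # raises an error: 'none' is not a valid type
kv.set_gradient_compression({'type': '2bit', 'threshold': 1.0})  # new threshold used from now on
```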