From 7f954fa82a63ca4b27b27ecd38ff48f059af0fcb Mon Sep 17 00:00:00 2001 From: quzha Date: Wed, 24 Jun 2020 18:16:22 +0800 Subject: [PATCH 01/15] update doc for model compression --- docs/en_US/Compressor/Framework.md | 140 +++++++++++- docs/en_US/Compressor/Overview.md | 322 +++------------------------- docs/en_US/Compressor/Pruner.md | 3 +- docs/en_US/Compressor/Quantizer.md | 13 +- docs/en_US/Compressor/QuickStart.md | 131 ++++++++++- docs/en_US/model_compression.rst | 8 +- 6 files changed, 298 insertions(+), 319 deletions(-) diff --git a/docs/en_US/Compressor/Framework.md b/docs/en_US/Compressor/Framework.md index 87c23329d3..00c1c1e6bf 100644 --- a/docs/en_US/Compressor/Framework.md +++ b/docs/en_US/Compressor/Framework.md @@ -1,6 +1,12 @@ -# Design Doc +# Customize A New Compression Algorithm -## Overview +To simplify writing a new compression algorithm, we design programming interfaces which are simple but flexible enough. There are interfaces for pruning and quantization respectively. Below, we first demonstrate how to customize a new pruning algorithm and then demonstrate how to customize a new quantization algorithm. + +## Customize a new pruning algorithm + +To better demonstrate how to customize a new pruning algorithm, it is necessary for users to first understand the framework for supporting various pruning algorithms in NNI. + +### Framework overview for pruning algorithms Following example shows how to use a pruner: @@ -26,11 +32,11 @@ A pruner receives `model`, `config_list` and `optimizer` as arguments. It prunes From implementation perspective, a pruner consists of a `weight masker` instance and multiple `module wrapper` instances. -### Weight masker +#### Weight masker A `weight masker` is the implementation of pruning algorithms, it can prune a specified layer wrapped by `module wrapper` with specified sparsity. -### Module wrapper +#### Module wrapper A `module wrapper` is a module containing: @@ -43,7 +49,7 @@ the reasons to use `module wrapper`: 1. some buffers are needed by `calc_mask` to calculate masks and these buffers should be registered in `module wrapper` so that the original modules are not contaminated. 2. a new `forward` method is needed to apply masks to weight before calling the real `forward` method. -### Pruner +#### Pruner A `pruner` is responsible for: @@ -52,7 +58,7 @@ A `pruner` is responsible for: 3. Use `weight masker` to calculate masks of layers while pruning. 4. Export pruned model weights and masks. -## Implement a new pruning algorithm +### Implement a new pruning algorithm Implementing a new pruning algorithm requires implementing a `weight masker` class which shoud be a subclass of `WeightMasker`, and a `pruner` class, which should a subclass `Pruner`. @@ -142,3 +148,125 @@ self.pruner.remove_activation_collector(collector_id) On multi-GPU training, buffers and parameters are copied to multiple GPU every time the `forward` method runs on multiple GPU. If buffers and parameters are updated in the `forward` method, an `in-place` update is needed to ensure the update is effective. Since `calc_mask` is called in the `optimizer.step` method, which happens after the `forward` method and happens only on one GPU, it supports multi-GPU naturally. + + +## Customize a new quantization algorithm + +To write a new quantization algorithm, you can write a class that inherits `nni.compression.torch.Quantizer`. Then, override the member functions with the logic of your algorithm. The member function to override is `quantize_weight`. 
`quantize_weight` directly returns the quantized weights rather than mask, because for quantization the quantized weights cannot be obtained by applying mask. + +```python +from nni.compression.torch import Quantizer + +class YourQuantizer(Quantizer): + def __init__(self, model, config_list): + """ + Suggest you to use the NNI defined spec for config + """ + super().__init__(model, config_list) + + def quantize_weight(self, weight, config, **kwargs): + """ + quantize should overload this method to quantize weight tensors. + This method is effectively hooked to :meth:`forward` of the model. + + Parameters + ---------- + weight : Tensor + weight that needs to be quantized + config : dict + the configuration for weight quantization + """ + + # Put your code to generate `new_weight` here + + return new_weight + + def quantize_output(self, output, config, **kwargs): + """ + quantize should overload this method to quantize output. + This method is effectively hooked to `:meth:`forward` of the model. + + Parameters + ---------- + output : Tensor + output that needs to be quantized + config : dict + the configuration for output quantization + """ + + # Put your code to generate `new_output` here + + return new_output + + def quantize_input(self, *inputs, config, **kwargs): + """ + quantize should overload this method to quantize input. + This method is effectively hooked to :meth:`forward` of the model. + + Parameters + ---------- + inputs : Tensor + inputs that needs to be quantized + config : dict + the configuration for inputs quantization + """ + + # Put your code to generate `new_input` here + + return new_input + + def update_epoch(self, epoch_num): + pass + + def step(self): + """ + Can do some processing based on the model or weights binded + in the func bind_model + """ + pass +``` + +### Customize backward function + +Sometimes it's necessary for a quantization operation to have a customized backward function, such as [Straight-Through Estimator](https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste), user can customize a backward function as follow: + +```python +from nni.compression.torch.compressor import Quantizer, QuantGrad, QuantType + +class ClipGrad(QuantGrad): + @staticmethod + def quant_backward(tensor, grad_output, quant_type): + """ + This method should be overrided by subclass to provide customized backward function, + default implementation is Straight-Through Estimator + Parameters + ---------- + tensor : Tensor + input of quantization operation + grad_output : Tensor + gradient of the output of quantization operation + quant_type : QuantType + the type of quantization, it can be `QuantType.QUANT_INPUT`, `QuantType.QUANT_WEIGHT`, `QuantType.QUANT_OUTPUT`, + you can define different behavior for different types. + Returns + ------- + tensor + gradient of the input of quantization operation + """ + + # for quant_output function, set grad to zero if the absolute value of tensor is larger than 1 + if quant_type == QuantType.QUANT_OUTPUT: + grad_output[torch.abs(tensor) > 1] = 0 + return grad_output + + +class YourQuantizer(Quantizer): + def __init__(self, model, config_list): + super().__init__(model, config_list) + # set your customized backward function to overwrite default backward function + self.quant_grad = ClipGrad + +``` + +If you do not customize `QuantGrad`, the default backward is Straight-Through Estimator. +_Coming Soon_ ... 
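Putting the pieces together, a customized quantizer is applied in the same way as the built-in compressors. The snippet below is only a minimal sketch: the tiny placeholder network, the 8-bit settings and the `Conv2d` op type are illustrative assumptions, and `YourQuantizer` refers to the customized class defined above.

```python
import torch.nn as nn

# a tiny placeholder network; any torch.nn.Module can be compressed
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

# configuration follows the config specification used throughout these docs:
# quantize weights and outputs of all Conv2d modules to 8 bits (illustrative values)
config_list = [{
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 8},
    'op_types': ['Conv2d'],
}]

# YourQuantizer is the customized quantizer sketched above
quantizer = YourQuantizer(model, config_list)
quantizer.compress()

# afterwards, train and evaluate the model as usual; the overridden quantize_*
# methods are invoked through the hooked forward of the wrapped modules
```

As with the built-in quantizers, `compress()` only wraps the configured modules; the training loop itself stays in user code.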
\ No newline at end of file diff --git a/docs/en_US/Compressor/Overview.md b/docs/en_US/Compressor/Overview.md index 757f13a8ce..639e6d08e8 100644 --- a/docs/en_US/Compressor/Overview.md +++ b/docs/en_US/Compressor/Overview.md @@ -1,17 +1,27 @@ # Model Compression with NNI -As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem. -We are glad to introduce model compression toolkit on top of NNI, it's still in the experiment phase which might evolve based on usage feedback. We'd like to invite you to use, feedback and even contribute. +```eval_rst +.. contents:: +``` + +As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem. + +NNI provides a model compression toolkit to help user compress and speed up their model with state-of-the-art compression algorithms and strategies. There are several core features supported by NNI model compression: -NNI provides an easy-to-use toolkit to help user design and use compression algorithms. It currently supports PyTorch with unified interface. For users to compress their models, they only need to add several lines in their code. There are some popular model compression algorithms built-in in NNI. Users could further use NNI's auto tuning power to find the best compressed model, which is detailed in [Auto Model Compression](./AutoCompression.md). On the other hand, users could easily customize their new compression algorithms using NNI's interface, refer to the tutorial [here](#customize-new-compression-algorithms). Details about how model compression framework works can be found in [here](./Framework.md). +* Support many popular pruning and quantization algorithms. +* Automate model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power. +* Speed up a compressed model to make it have lower inference latency and also make it become smaller. +* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results. +* Concise interface for users to customize their own compression algorithms. -For a survey of model compression, you can refer to this paper: [Recent Advances in Efficient Computation of Deep Convolutional Neural Networks](https://arxiv.org/pdf/1802.00939.pdf). +*Note that the interface and APIs are unified for both PyTorch and TensorFlow, currently only PyTorch version has been supported, TensorFlow version will be supported in future.* -## Supported algorithms -We have provided several compression algorithms, including several pruning and quantization algorithms: +## Supported Algorithms -**Pruning** +The algorithms include pruning algorithms and quantization algorithms. + +### Pruning Algorithms Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and address the over-fitting issue. 
@@ -29,7 +39,7 @@ Pruning algorithms compress the original network by removing redundant weights o | [TaylorFO Pruner](./Pruner.md#taylorfoweightfilterpruner) | Pruning filters based on the first order taylor expansion on weights(Importance Estimation for Neural Network Pruning) [Reference Paper](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf) | -**Quantization** +### Quantization Algorithms Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time. @@ -40,301 +50,21 @@ Quantization algorithms compress the original network by reducing the number of | [DoReFa Quantizer](./Quantizer.md#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)| | [BNN Quantizer](./Quantizer.md#BNN-Quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. [Reference Paper](https://arxiv.org/abs/1602.02830)| -## Usage of built-in compression algorithms - -We use a simple example to show how to modify your trial code in order to apply the compression algorithms. Let's say you want to prune all weight to 80% sparsity with Level Pruner, you can add the following three lines into your code before training your model ([here](https://github.com/microsoft/nni/tree/master/examples/model_compress) is complete code). - -PyTorch code - -```python -from nni.compression.torch import LevelPruner -config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }] -pruner = LevelPruner(model, config_list) -pruner.compress() -``` - -Tensorflow code - -```python -from nni.compression.tensorflow import LevelPruner -config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }] -pruner = LevelPruner(tf.get_default_graph(), config_list) -pruner.compress() -``` +## Automatic Model Compression +TBD. -You can use other compression algorithms in the package of `nni.compression`. The algorithms are implemented in both PyTorch and Tensorflow, under `nni.compression.torch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for detail description of supported algorithms. Also if you want to use knowledge distillation, you can refer to [KDExample](../TrialExample/KDExample.md) +## Model Speedup -The function call `pruner.compress()` modifies user defined model (in Tensorflow the model can be obtained with `tf.get_default_graph()`, while in PyTorch the model is the defined model class), and the model is modified with masks inserted. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms. +The final goal of model compression is to reduce inference latency and model size. However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of compressed model, for example, using masks for pruning algorithms, and storing quantized values still in float32 for quantization algorithms. Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model. The detailed tutorial of Model Speedup can be found [here](./ModelSpeedup.md). -When instantiate a compression algorithm, there is `config_list` passed in. We describe how to write this config below. 
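For a rough sense of how that speedup workflow is triggered, the sketch below feeds masks exported by a pruner back into the model using the `apply_compression_results` API that also appears in the quick start tutorial; the mask file name is a placeholder, and `model` is assumed to be the same network structure that was pruned.

```python
from nni.compression.torch import apply_compression_results

# `model` is the same network structure that was pruned earlier;
# 'mask_vgg19_cifar10.pth' is the mask file exported via pruner.export_model(...)
apply_compression_results(model, 'mask_vgg19_cifar10.pth')

# the model is now sped up according to the exported masks;
# see ModelSpeedup.md for the full speedup tutorial
```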
+## Compression Utilities -### User configuration for a compression algorithm -When compressing a model, users may want to specify the ratio for sparsity, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only a certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a python `list` object, where each element is a `dict` object. - -The `dict`s in the `list` are applied one by one, that is, the configurations in latter `dict` will overwrite the configurations in former ones on the operations that are within the scope of both of them. - -#### Common keys -In each `dict`, there are some keys commonly supported by NNI compression: - -* __op_types__: This is to specify what types of operations to be compressed. 'default' means following the algorithm's default setting. -* __op_names__: This is to specify by name what operations to be compressed. If this field is omitted, operations will not be filtered by it. -* __exclude__: Default is False. If this field is True, it means the operations with specified types and names will be excluded from the compression. - -#### Keys for quantization algorithms -**If you use quantization algorithms, you need to specify more keys. If you use pruning algorithms, you can safely skip these keys** - -* __quant_types__ : list of string. - -Type of quantization you want to apply, currently support 'weight', 'input', 'output'. 'weight' means applying quantization operation -to the weight parameter of modules. 'input' means applying quantization operation to the input of module forward method. 'output' means applying quantization operation to the output of module forward method, which is often called as 'activation' in some papers. - -* __quant_bits__ : int or dict of {str : int} - -bits length of quantization, key is the quantization type, value is the quantization bits length, eg. -``` -{ - quant_bits: { - 'weight': 8, - 'output': 4, - }, -} -``` -when the value is int type, all quantization types share same bits length. eg. -``` -{ - quant_bits: 8, # weight or output quantization are all 8 bits -} -``` -#### Other keys specified for every compression algorithm -There are also other keys in the `dict`, but they are specific for every compression algorithm. For example, [Level Pruner](./Pruner.md#level-pruner) requires `sparsity` key to specify how much a model should be pruned. - - -#### example -A simple example of configuration is shown below: - -```python -[ - { - 'sparsity': 0.8, - 'op_types': ['default'] - }, - { - 'sparsity': 0.6, - 'op_names': ['op_name1', 'op_name2'] - }, - { - 'exclude': True, - 'op_names': ['op_name3'] - } -] -``` - -It means following the algorithm's default setting for compressed operations with sparsity 0.8, but for `op_name1` and `op_name2` use sparsity 0.6, and please do not compress `op_name3`. - -### Other APIs - -Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](./Pruner.md#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke. One is `update_epoch`, you can use it as follows: - -Tensorflow code - -```python -pruner.update_epoch(epoch, sess) -``` - -PyTorch code - -```python -pruner.update_epoch(epoch) -``` - -The other is `step`, it can be called with `pruner.step()` after each minibatch. 
Note that not all algorithms need these two APIs, for those that do not need them, calling them is allowed but has no effect. - -You can easily export the compressed model using the following API if you are pruning your model, ```state_dict``` of the sparse model weights will be stored in ```model.pth```, which can be loaded by ```torch.load('model.pth')``` - -``` -pruner.export_model(model_path='model.pth') -``` +Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users could check sensitivity of each layer to pruning. Users could easily calculate the FLOPs and parameter size of a model. Please refer to [here](./CompressionUtils.md) for a complete list of compression utilities. -```mask_dict ``` and pruned model in ```onnx``` format(```input_shape``` need to be specified) can also be exported like this: - -```python -pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28]) -``` - -## Customize new compression algorithms - -To simplify writing a new compression algorithm, we design programming interfaces which are simple but flexible enough. There are interfaces for pruner and quantizer respectively. - -### Pruning algorithm - -If you want to write a new pruning algorithm, you can write a class that inherits `nni.compression.tensorflow.Pruner` or `nni.compression.torch.Pruner` depending on which framework you use. Then, override the member functions with the logic of your algorithm. - -```python -# This is writing a pruner in tensorflow. -# For writing a pruner in PyTorch, you can simply replace -# nni.compression.tensorflow.Pruner with -# nni.compression.torch.Pruner -class YourPruner(nni.compression.tensorflow.Pruner): - def __init__(self, model, config_list): - """ - Suggest you to use the NNI defined spec for config - """ - super().__init__(model, config_list) - - def calc_mask(self, layer, config): - """ - Pruners should overload this method to provide mask for weight tensors. - The mask must have the same shape and type comparing to the weight. - It will be applied with ``mul()`` operation on the weight. - This method is effectively hooked to ``forward()`` method of the model. - - Parameters - ---------- - layer: LayerInfo - calculate mask for ``layer``'s weight - config: dict - the configuration for generating the mask - """ - return your_mask - - # note for pytorch version, there is no sess in input arguments - def update_epoch(self, epoch_num, sess): - pass - - # note for pytorch version, there is no sess in input arguments - def step(self, sess): - """ - Can do some processing based on the model or weights binded - in the func bind_model - """ - pass -``` - -For the simplest algorithm, you only need to override ``calc_mask``. It receives the to-be-compressed layers one by one along with their compression configuration. You generate the mask for this weight in this function and return. Then NNI applies the mask for you. - -Some algorithms generate mask based on training progress, i.e., epoch number. We provide `update_epoch` for the pruner to be aware of the training progress. It should be called at the beginning of each epoch. - -Some algorithms may want global information for generating masks, for example, all weights of the model (for statistic information). Your can use `self.bound_model` in the Pruner class for accessing weights. 
If you also need optimizer's information (for example in Pytorch), you could override `__init__` to receive more arguments such as model's optimizer. Then `step` can process or update the information according to the algorithm. You can refer to [source code of built-in algorithms](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/compressors) for example implementations. - -### Quantization algorithm - -The interface for customizing quantization algorithm is similar to that of pruning algorithms. The only difference is that `calc_mask` is replaced with `quantize_weight`. `quantize_weight` directly returns the quantized weights rather than mask, because for quantization the quantized weights cannot be obtained by applying mask. - -```python -from nni.compression.torch.compressor import Quantizer - -class YourQuantizer(Quantizer): - def __init__(self, model, config_list): - """ - Suggest you to use the NNI defined spec for config - """ - super().__init__(model, config_list) - - def quantize_weight(self, weight, config, **kwargs): - """ - quantize should overload this method to quantize weight tensors. - This method is effectively hooked to :meth:`forward` of the model. - - Parameters - ---------- - weight : Tensor - weight that needs to be quantized - config : dict - the configuration for weight quantization - """ - - # Put your code to generate `new_weight` here - - return new_weight - - def quantize_output(self, output, config, **kwargs): - """ - quantize should overload this method to quantize output. - This method is effectively hooked to `:meth:`forward` of the model. - - Parameters - ---------- - output : Tensor - output that needs to be quantized - config : dict - the configuration for output quantization - """ - - # Put your code to generate `new_output` here - - return new_output - - def quantize_input(self, *inputs, config, **kwargs): - """ - quantize should overload this method to quantize input. - This method is effectively hooked to :meth:`forward` of the model. - - Parameters - ---------- - inputs : Tensor - inputs that needs to be quantized - config : dict - the configuration for inputs quantization - """ - - # Put your code to generate `new_input` here - - return new_input - - def update_epoch(self, epoch_num): - pass - - def step(self): - """ - Can do some processing based on the model or weights binded - in the func bind_model - """ - pass -``` -#### Customize backward function -Sometimes it's necessary for a quantization operation to have a customized backward function, such as [Straight-Through Estimator](https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste), user can customize a backward function as follow: - -```python -from nni.compression.torch.compressor import Quantizer, QuantGrad, QuantType - -class ClipGrad(QuantGrad): - @staticmethod - def quant_backward(tensor, grad_output, quant_type): - """ - This method should be overrided by subclass to provide customized backward function, - default implementation is Straight-Through Estimator - Parameters - ---------- - tensor : Tensor - input of quantization operation - grad_output : Tensor - gradient of the output of quantization operation - quant_type : QuantType - the type of quantization, it can be `QuantType.QUANT_INPUT`, `QuantType.QUANT_WEIGHT`, `QuantType.QUANT_OUTPUT`, - you can define different behavior for different types. 
- Returns - ------- - tensor - gradient of the input of quantization operation - """ - - # for quant_output function, set grad to zero if the absolute value of tensor is larger than 1 - if quant_type == QuantType.QUANT_OUTPUT: - grad_output[torch.abs(tensor) > 1] = 0 - return grad_output - - -class YourQuantizer(Quantizer): - def __init__(self, model, config_list): - super().__init__(model, config_list) - # set your customized backward function to overwrite default backward function - self.quant_grad = ClipGrad - -``` +## Customize Your Own Compression Algorithms -If you do not customize `QuantGrad`, the default backward is Straight-Through Estimator. -_Coming Soon_ ... +NNI model compression leaves simple interface for users to customize a new compression algorithm. The design philosophy of the interface is making users focus on the compression logic while hiding framework specific implementation details from users. The detailed tutorial for customizing a new compression algorithm (pruning algorithm or quantization algorithm) can be found [here](./Framework.md). ## Reference and Feedback * To [report a bug](https://github.com/microsoft/nni/issues/new?template=bug-report.md) for this feature in GitHub; diff --git a/docs/en_US/Compressor/Pruner.md b/docs/en_US/Compressor/Pruner.md index 102d6471b0..6496590b1c 100644 --- a/docs/en_US/Compressor/Pruner.md +++ b/docs/en_US/Compressor/Pruner.md @@ -1,5 +1,4 @@ -Pruner on NNI Compressor -=== +# Supported Pruning Algorithms on NNI Index of supported pruning algorithms * [Level Pruner](#level-pruner) diff --git a/docs/en_US/Compressor/Quantizer.md b/docs/en_US/Compressor/Quantizer.md index 574926c7ad..ef447d564b 100644 --- a/docs/en_US/Compressor/Quantizer.md +++ b/docs/en_US/Compressor/Quantizer.md @@ -1,5 +1,11 @@ -Quantizer on NNI Compressor -=== +# Supported Quantization Algorithms on NNI + +Index of supported quantization algorithms +* [Naive Quantizer](#naive-quantizer) +* [QAT Quantizer](#qat-quantizer) +* [DoReFa Quantizer](#dorefa-quantizer) +* [BNN Quantizer](#bnn-quantizer) + ## Naive Quantizer We provide Naive Quantizer to quantizer weight to default 8 bits, you can use it to test quantize algorithm without any configure. @@ -10,8 +16,6 @@ pytorch model = nni.compression.torch.NaiveQuantizer(model).compress() ``` -*** - ## QAT Quantizer In [Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf), authors Benoit Jacob and Skirmantas Kligys provide an algorithm to quantize the model with training. @@ -58,7 +62,6 @@ state where activation quantization ranges do not exclude a significant fractio ### note batch normalization folding is currently not supported. -*** ## DoReFa Quantizer In [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160), authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training. diff --git a/docs/en_US/Compressor/QuickStart.md b/docs/en_US/Compressor/QuickStart.md index 656d78c5d9..eb48b1b05d 100644 --- a/docs/en_US/Compressor/QuickStart.md +++ b/docs/en_US/Compressor/QuickStart.md @@ -1,8 +1,12 @@ -# Quick Start to Compress a Model +# Tutorial for Model Compression + +In this tutorial, we use the [first section](#quick-start-to-compress-a-model) to quickly go through the usage of model compression on NNI. 
Then use the [second section](#detailed-usage-guide) to explain more details of the usage. + +## Quick Start to Compress a Model NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. The usage of them are the same, thus, here we use slim pruner as an example to show the usage. -## Write configuration +### Write configuration Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the `BatchNorm2d`s to sparsity 0.7 while keeping other layers unpruned. @@ -15,7 +19,7 @@ configure_list = [{ The specification of configuration can be found [here](Overview.md#user-configuration-for-a-compression-algorithm). Note that different pruners may have their own defined fields in configuration, for exmaple `start_epoch` in AGP pruner. Please refer to each pruner's [usage](Overview.md#supported-algorithms) for details, and adjust the configuration accordingly. -## Choose a compression algorithm +### Choose a compression algorithm Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke `compress()` to compress your model. @@ -26,7 +30,7 @@ model = pruner.compress() Then, you can train your model using traditional training approach (e.g., SGD), pruning is applied transparently during the training. Some pruners prune once at the beginning, the following training can be seen as fine-tune. Some pruners prune your model iteratively, the masks are adjusted epoch by epoch during training. -## Export compression result +### Export compression result After training, you get accuracy of the pruned model. You can export model weights to a file, and the generated masks to a file as well. Exporting onnx model is also supported. @@ -36,7 +40,7 @@ pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19 The complete code of model compression examples can be found [here](https://github.com/microsoft/nni/blob/master/examples/model_compress/model_prune_torch.py). -## Speed up the model +### Speed up the model Masks do not provide real speedup of your model. The model should be speeded up based on the exported masks, thus, we provide an API to speed up your model as shown below. After invoking `apply_compression_results` on your model, your model becomes a smaller one with shorter inference latency. @@ -45,4 +49,119 @@ from nni.compression.torch import apply_compression_results apply_compression_results(model, 'mask_vgg19_cifar10.pth') ``` -Please refer to [here](ModelSpeedup.md) for detailed description. \ No newline at end of file +Please refer to [here](ModelSpeedup.md) for detailed description. + +## Detailed Usage Guide + +The example code for users to apply model compression on a user model can be found below: + +PyTorch code + +```python +from nni.compression.torch import LevelPruner +config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }] +pruner = LevelPruner(model, config_list) +pruner.compress() +``` + +Tensorflow code + +```python +from nni.compression.tensorflow import LevelPruner +config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }] +pruner = LevelPruner(tf.get_default_graph(), config_list) +pruner.compress() +``` + + +You can use other compression algorithms in the package of `nni.compression`. The algorithms are implemented in both PyTorch and TensorFlow (partial support on TensorFlow), under `nni.compression.torch` and `nni.compression.tensorflow` respectively. 
You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for detail description of supported algorithms. Also if you want to use knowledge distillation, you can refer to [KDExample](../TrialExample/KDExample.md) + +A compression algorithm is first instantiated with a `config_list` passed in. The specification of this `config_list` will be described later. + +The function call `pruner.compress()` modifies user defined model (in Tensorflow the model can be obtained with `tf.get_default_graph()`, while in PyTorch the model is the defined model class), and the model is modified with masks inserted. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms. + +*Note that, `pruner.compress` simply adds masks on model weights, it does not include fine tuning logic. If users want to fine tune the compressed model, they need to write the fine tune logic by themselves after `pruner.compress`.* + +### Specification of `config_list` + +Users can specify the configuration (i.e., `config_list`) for a compression algorithm. For example,when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only a certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a python `list` object, where each element is a `dict` object. + +The `dict`s in the `list` are applied one by one, that is, the configurations in latter `dict` will overwrite the configurations in former ones on the operations that are within the scope of both of them. + +There are different keys in a `dict`. Some of them are common keys supported by all the compression algorithms: + +* __op_types__: This is to specify what types of operations to be compressed. 'default' means following the algorithm's default setting. +* __op_names__: This is to specify by name what operations to be compressed. If this field is omitted, operations will not be filtered by it. +* __exclude__: Default is False. If this field is True, it means the operations with specified types and names will be excluded from the compression. + +Some other keys are often specific to a certain algorithms, users can refer to [pruning algorithms](./Pruner.md) and [quantization algorithms](./Quantizer.md) for the keys allowed by each algorithm. + +A simple example of configuration is shown below: + +```python +[ + { + 'sparsity': 0.8, + 'op_types': ['default'] + }, + { + 'sparsity': 0.6, + 'op_names': ['op_name1', 'op_name2'] + }, + { + 'exclude': True, + 'op_names': ['op_name3'] + } +] +``` + +It means following the algorithm's default setting for compressed operations with sparsity 0.8, but for `op_name1` and `op_name2` use sparsity 0.6, and do not compress `op_name3`. + +#### Quantization specific keys + +**If you use quantization algorithms, you need to specify more keys. If you use pruning algorithms, you can safely skip these keys** + +* __quant_types__ : list of string. + +Type of quantization you want to apply, currently support 'weight', 'input', 'output'. 'weight' means applying quantization operation +to the weight parameter of modules. 'input' means applying quantization operation to the input of module forward method. 'output' means applying quantization operation to the output of module forward method, which is often called as 'activation' in some papers. 
+ +* __quant_bits__ : int or dict of {str : int} + +bits length of quantization, key is the quantization type, value is the quantization bits length, eg. +``` +{ + quant_bits: { + 'weight': 8, + 'output': 4, + }, +} +``` +when the value is int type, all quantization types share same bits length. eg. +``` +{ + quant_bits: 8, # weight or output quantization are all 8 bits +} +``` + +### APIs for Updating Fine Tuning Status + +Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](./Pruner.md#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: `pruner.update_epoch(epoch)` and `pruner.step()`. + +`update_epoch` should be invoked in every epoch, while `step` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect. + +### Export Compressed Model + +You can easily export the compressed model using the following API if you are pruning your model, ```state_dict``` of the sparse model weights will be stored in ```model.pth```, which can be loaded by ```torch.load('model.pth')```. In this exported ```model.pth```, the masked weights are zero. + +``` +pruner.export_model(model_path='model.pth') +``` + +```mask_dict ``` and pruned model in ```onnx``` format(```input_shape``` need to be specified) can also be exported like this: + +```python +pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28]) +``` + +If you want to really speed up the compressed model, please refer to [NNI model speedup](./ModelSpeedup.md) for details. \ No newline at end of file diff --git a/docs/en_US/model_compression.rst b/docs/en_US/model_compression.rst index f6821e3045..a1d4f01b8c 100644 --- a/docs/en_US/model_compression.rst +++ b/docs/en_US/model_compression.rst @@ -17,9 +17,9 @@ For details, please refer to the following tutorials: Overview Quick Start - Pruners - Quantizers - Model Speedup + Pruners + Quantizers Automatic Model Compression - Implementation + Model Speedup Compression Utilities + Customize Compression Algorithms From dc48f06b08e9d853211b86eaf225887f1a7f72c0 Mon Sep 17 00:00:00 2001 From: quzha Date: Wed, 24 Jun 2020 18:18:07 +0800 Subject: [PATCH 02/15] remove two files --- docs/en_US/pruners.rst | 16 ---------------- docs/en_US/quantizers.rst | 11 ----------- 2 files changed, 27 deletions(-) delete mode 100644 docs/en_US/pruners.rst delete mode 100644 docs/en_US/quantizers.rst diff --git a/docs/en_US/pruners.rst b/docs/en_US/pruners.rst deleted file mode 100644 index bf3771df16..0000000000 --- a/docs/en_US/pruners.rst +++ /dev/null @@ -1,16 +0,0 @@ -############################ -Supported Pruning Algorithms -############################ - -.. toctree:: - :maxdepth: 1 - - Level Pruner - AGP Pruner - Lottery Ticket Pruner - FPGM Pruner - L1Filter Pruner - L2Filter Pruner - ActivationAPoZRankFilterPruner - ActivationMeanRankFilterPruner - Slim Pruner diff --git a/docs/en_US/quantizers.rst b/docs/en_US/quantizers.rst deleted file mode 100644 index 8b082c2789..0000000000 --- a/docs/en_US/quantizers.rst +++ /dev/null @@ -1,11 +0,0 @@ -################################# -Supported Quantization Algorithms -################################# - -.. 
toctree:: - :maxdepth: 1 - - Naive Quantizer - QAT Quantizer - DoReFa Quantizer - BNN Quantizer \ No newline at end of file From 7a8b7cf825fae583c5b97d8a84943f00d586fe9e Mon Sep 17 00:00:00 2001 From: quzha Date: Wed, 24 Jun 2020 18:29:38 +0800 Subject: [PATCH 03/15] remove doc files --- .../Compressor/LotteryTicketHypothesis.md | 23 ----------- docs/en_US/Compressor/Pruner.md | 30 ++++++++++++++ docs/en_US/Compressor/SlimPruner.md | 39 ------------------- docs/en_US/Compressor/l1filterpruner.md | 38 ------------------ 4 files changed, 30 insertions(+), 100 deletions(-) delete mode 100644 docs/en_US/Compressor/LotteryTicketHypothesis.md delete mode 100644 docs/en_US/Compressor/SlimPruner.md delete mode 100644 docs/en_US/Compressor/l1filterpruner.md diff --git a/docs/en_US/Compressor/LotteryTicketHypothesis.md b/docs/en_US/Compressor/LotteryTicketHypothesis.md deleted file mode 100644 index 5ac64155fa..0000000000 --- a/docs/en_US/Compressor/LotteryTicketHypothesis.md +++ /dev/null @@ -1,23 +0,0 @@ -Lottery Ticket Hypothesis on NNI -=== - -## Introduction - -The paper [The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/abs/1803.03635) is mainly a measurement and analysis paper, it delivers very interesting insights. To support it on NNI, we mainly implement the training approach for finding *winning tickets*. - -In this paper, the authors use the following process to prune a model, called *iterative prunning*: ->1. Randomly initialize a neural network f(x;theta_0) (where theta_0 follows D_{theta}). ->2. Train the network for j iterations, arriving at parameters theta_j. ->3. Prune p% of the parameters in theta_j, creating a mask m. ->4. Reset the remaining parameters to their values in theta_0, creating the winning ticket f(x;m*theta_0). ->5. Repeat step 2, 3, and 4. - -If the configured final sparsity is P (e.g., 0.8) and there are n times iterative pruning, each iterative pruning prunes 1-(1-P)^(1/n) of the weights that survive the previous round. - -## Reproduce Results - -We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be referred [here](https://github.com/microsoft/nni/tree/master/examples/model_compress/lottery_torch_mnist_fc.py). In this experiment, we prune 10 times, for each pruning we train the pruned model for 50 epochs. - -![](../../img/lottery_ticket_mnist_fc.png) - -The above figure shows the result of the fully connected network. `round0-sparsity-0.0` is the performance without pruning. Consistent with the paper, pruning around 80% also obtain similar performance compared to non-pruning, and converges a little faster. If pruning too much, e.g., larger than 94%, the accuracy becomes lower and convergence becomes a little slower. A little different from the paper, the trend of the data in the paper is relatively more clear. diff --git a/docs/en_US/Compressor/Pruner.md b/docs/en_US/Compressor/Pruner.md index 6496590b1c..a60486aaa8 100644 --- a/docs/en_US/Compressor/Pruner.md +++ b/docs/en_US/Compressor/Pruner.md @@ -154,6 +154,14 @@ The above configuration means that there are 5 times of iterative pruning. As th * **prune_iterations:** The number of rounds for the iterative pruning, i.e., the number of iterative pruning. * **sparsity:** The final sparsity when the compression is done. +### Reproduced Experiment + +We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. 
The code can be referred [here](https://github.com/microsoft/nni/tree/master/examples/model_compress/lottery_torch_mnist_fc.py). In this experiment, we prune 10 times, for each pruning we train the pruned model for 50 epochs. + +![](../../img/lottery_ticket_mnist_fc.png) + +The above figure shows the result of the fully connected network. `round0-sparsity-0.0` is the performance without pruning. Consistent with the paper, pruning around 80% also obtain similar performance compared to non-pruning, and converges a little faster. If pruning too much, e.g., larger than 94%, the accuracy becomes lower and convergence becomes a little slower. A little different from the paper, the trend of the data in the paper is relatively more clear. + *** ## Slim Pruner @@ -180,6 +188,17 @@ pruner.compress() - **sparsity:** This is to specify the sparsity operations to be compressed to - **op_types:** Only BatchNorm2d is supported in Slim Pruner +### Reproduced Experiment + +We implemented one of the experiments in ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf), we pruned $70\%$ channels in the **VGGNet** for CIFAR-10 in the paper, in which $88.5\%$ parameters are pruned. Our experiments results are as follows: + +| Model | Error(paper/ours) | Parameters | Pruned | +| ------------- | ----------------- | ---------- | --------- | +| VGGNet | 6.34/6.40 | 20.04M | | +| Pruned-VGGNet | 6.20/6.26 | 2.03M | 88.5% | + +The experiments code can be found at [examples/model_compress]( https://github.com/microsoft/nni/tree/master/examples/model_compress/) + ## WeightRankFilterPruner WeightRankFilterPruner is a series of pruners which prune the filters with the smallest importance criterion calculated from the weights in convolution layers to achieve a preset level of network sparsity @@ -269,6 +288,17 @@ pruner.compress() - **sparsity:** This is to specify the sparsity operations to be compressed to - **op_types:** Only Conv1d and Conv2d is supported in L1Filter Pruner +#### Reproduced Experiment + +We implemented one of the experiments in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) with **L1FilterPruner**, we pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** in the paper, in which $64\%$ parameters are pruned. Our experiments results are as follows: + +| Model | Error(paper/ours) | Parameters | Pruned | +| --------------- | ----------------- | --------------- | -------- | +| VGG-16 | 6.75/6.49 | 1.5x10^7 | | +| VGG-16-pruned-A | 6.60/6.47 | 5.4x10^6 | 64.0% | + +The experiments code can be found at [examples/model_compress]( https://github.com/microsoft/nni/tree/master/examples/model_compress/) + *** ### L2Filter Pruner diff --git a/docs/en_US/Compressor/SlimPruner.md b/docs/en_US/Compressor/SlimPruner.md deleted file mode 100644 index 18fe589155..0000000000 --- a/docs/en_US/Compressor/SlimPruner.md +++ /dev/null @@ -1,39 +0,0 @@ -SlimPruner on NNI Compressor -=== - -## 1. Slim Pruner - -SlimPruner is a structured pruning algorithm for pruning channels in the convolutional layers by pruning corresponding scaling factors in the later BN layers. - -In ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf), authors Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan and Changshui Zhang. 
- -![](../../img/slim_pruner.png) - -> Slim Pruner **prunes channels in the convolution layers by masking corresponding scaling factors in the later BN layers**, L1 regularization on the scaling factors should be applied in batch normalization (BN) layers while training, scaling factors of BN layers are **globally ranked** while pruning, so the sparse model can be automatically found given sparsity. - -## 2. Usage - -PyTorch code - -``` -from nni.compression.torch import SlimPruner -config_list = [{ 'sparsity': 0.8, 'op_types': ['BatchNorm2d'] }] -pruner = SlimPruner(model, config_list) -pruner.compress() -``` - -#### User configuration for Filter Pruner - -- **sparsity:** This is to specify the sparsity operations to be compressed to -- **op_types:** Only BatchNorm2d is supported in Slim Pruner - -## 3. Experiment - -We implemented one of the experiments in ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf), we pruned $70\%$ channels in the **VGGNet** for CIFAR-10 in the paper, in which $88.5\%$ parameters are pruned. Our experiments results are as follows: - -| Model | Error(paper/ours) | Parameters | Pruned | -| ------------- | ----------------- | ---------- | --------- | -| VGGNet | 6.34/6.40 | 20.04M | | -| Pruned-VGGNet | 6.20/6.26 | 2.03M | 88.5% | - -The experiments code can be found at [examples/model_compress]( https://github.com/microsoft/nni/tree/master/examples/model_compress/) diff --git a/docs/en_US/Compressor/l1filterpruner.md b/docs/en_US/Compressor/l1filterpruner.md deleted file mode 100644 index dc42d6478d..0000000000 --- a/docs/en_US/Compressor/l1filterpruner.md +++ /dev/null @@ -1,38 +0,0 @@ -L1FilterPruner on NNI -=== - -## Introduction - -L1FilterPruner is a general structured pruning algorithm for pruning filters in the convolutional layers. - -In ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710), authors Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf. - -![](../../img/l1filter_pruner.png) - -> L1Filter Pruner prunes filters in the **convolution layers** -> -> The procedure of pruning m filters from the ith convolutional layer is as follows: -> -> 1. For each filter ![](http://latex.codecogs.com/gif.latex?F_{i,j}), calculate the sum of its absolute kernel weights![](http://latex.codecogs.com/gif.latex?s_j=\sum_{l=1}^{n_i}\sum|K_l|) -> 2. Sort the filters by ![](http://latex.codecogs.com/gif.latex?s_j). -> 3. Prune ![](http://latex.codecogs.com/gif.latex?m) filters with the smallest sum values and their corresponding feature maps. The -> kernels in the next convolutional layer corresponding to the pruned feature maps are also -> removed. -> 4. A new kernel matrix is created for both the ![](http://latex.codecogs.com/gif.latex?i)th and ![](http://latex.codecogs.com/gif.latex?i+1)th layers, and the remaining kernel -> weights are copied to the new model. - -## Experiment - -We implemented one of the experiments in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) with **L1FilterPruner**, we pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** in the paper, in which $64\%$ parameters are pruned. 
Our experiments results are as follows: - -| Model | Error(paper/ours) | Parameters | Pruned | -| --------------- | ----------------- | --------------- | -------- | -| VGG-16 | 6.75/6.49 | 1.5x10^7 | | -| VGG-16-pruned-A | 6.60/6.47 | 5.4x10^6 | 64.0% | - -The experiments code can be found at [examples/model_compress]( https://github.com/microsoft/nni/tree/master/examples/model_compress/) - - - - - From 9c45713f931036274c119991e3708914df073d79 Mon Sep 17 00:00:00 2001 From: quzha Date: Wed, 24 Jun 2020 18:31:24 +0800 Subject: [PATCH 04/15] update --- docs/en_US/Compressor/Pruner.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/en_US/Compressor/Pruner.md b/docs/en_US/Compressor/Pruner.md index a60486aaa8..b107056e6b 100644 --- a/docs/en_US/Compressor/Pruner.md +++ b/docs/en_US/Compressor/Pruner.md @@ -144,7 +144,7 @@ for _ in pruner.get_prune_iterations(): ... ``` -The above configuration means that there are 5 times of iterative pruning. As the 5 times iterative pruning are executed in the same run, LotteryTicketPruner needs `model` and `optimizer` (**Note that should add `lr_scheduler` if used**) to reset their states every time a new prune iteration starts. Please use `get_prune_iterations` to get the pruning iterations, and invoke `prune_iteration_start` at the beginning of each iteration. `epoch_num` is better to be large enough for model convergence, because the hypothesis is that the performance (accuracy) got in latter rounds with high sparsity could be comparable with that got in the first round. Simple reproducing results can be found [here](./LotteryTicketHypothesis.md). +The above configuration means that there are 5 times of iterative pruning. As the 5 times iterative pruning are executed in the same run, LotteryTicketPruner needs `model` and `optimizer` (**Note that should add `lr_scheduler` if used**) to reset their states every time a new prune iteration starts. Please use `get_prune_iterations` to get the pruning iterations, and invoke `prune_iteration_start` at the beginning of each iteration. `epoch_num` is better to be large enough for model convergence, because the hypothesis is that the performance (accuracy) got in latter rounds with high sparsity could be comparable with that got in the first round. *Tensorflow version will be supported later.* @@ -256,7 +256,7 @@ You can view example for more information ### L1Filter Pruner -This is an one-shot pruner, In ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710), authors Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf. The reproduced experiment results can be found [here](l1filterpruner.md) +This is an one-shot pruner, In ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710), authors Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf. 
![](../../img/l1filter_pruner.png) From 6bb494ed4f14d8d8126f0a7ca789fc084808380a Mon Sep 17 00:00:00 2001 From: quzha Date: Wed, 24 Jun 2020 18:38:16 +0800 Subject: [PATCH 05/15] update --- docs/en_US/Compressor/Pruner.md | 4 ++++ docs/en_US/Compressor/Quantizer.md | 5 +++++ 2 files changed, 9 insertions(+) diff --git a/docs/en_US/Compressor/Pruner.md b/docs/en_US/Compressor/Pruner.md index b107056e6b..5fe7d1c7a0 100644 --- a/docs/en_US/Compressor/Pruner.md +++ b/docs/en_US/Compressor/Pruner.md @@ -199,6 +199,7 @@ We implemented one of the experiments in ['Learning Efficient Convolutional Netw The experiments code can be found at [examples/model_compress]( https://github.com/microsoft/nni/tree/master/examples/model_compress/) +*** ## WeightRankFilterPruner WeightRankFilterPruner is a series of pruners which prune the filters with the smallest importance criterion calculated from the weights in convolution layers to achieve a preset level of network sparsity @@ -321,6 +322,8 @@ pruner.compress() - **sparsity:** This is to specify the sparsity operations to be compressed to - **op_types:** Only Conv1d and Conv2d is supported in L2Filter Pruner +*** + ## ActivationRankFilterPruner ActivationRankFilterPruner is a series of pruners which prune the filters with the smallest importance criterion calculated from the output activations of convolution layers to achieve a preset level of network sparsity. @@ -384,6 +387,7 @@ You can view example for more information - **sparsity:** How much percentage of convolutional filters are to be pruned. - **op_types:** Only Conv2d is supported in ActivationMeanRankFilterPruner. +*** ## GradientRankFilterPruner diff --git a/docs/en_US/Compressor/Quantizer.md b/docs/en_US/Compressor/Quantizer.md index ef447d564b..0cfa9fe70d 100644 --- a/docs/en_US/Compressor/Quantizer.md +++ b/docs/en_US/Compressor/Quantizer.md @@ -16,6 +16,8 @@ pytorch model = nni.compression.torch.NaiveQuantizer(model).compress() ``` +*** + ## QAT Quantizer In [Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf), authors Benoit Jacob and Skirmantas Kligys provide an algorithm to quantize the model with training. @@ -63,6 +65,8 @@ state where activation quantization ranges do not exclude a significant fractio ### note batch normalization folding is currently not supported. +*** + ## DoReFa Quantizer In [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160), authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training. 
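A typical invocation mirrors the other quantizers in this document. The snippet below is a minimal sketch: the 8-bit weight setting and the `default` op type are illustrative choices rather than recommendations, and `model` is assumed to be the user-defined PyTorch model.

```python
from nni.compression.torch import DoReFaQuantizer

# quantize weights of the default-supported modules to 8 bits (illustrative setting)
config_list = [{
    'quant_types': ['weight'],
    'quant_bits': 8,
    'op_types': ['default'],
}]
quantizer = DoReFaQuantizer(model, config_list)
quantizer.compress()
```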
@@ -88,6 +92,7 @@ common configuration needed by compression algorithms can be found at : [Common configuration needed by this algorithm : +*** ## BNN Quantizer In [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830), From ca4d593def08268cd1854a4ac6f9a4129e4b3443 Mon Sep 17 00:00:00 2001 From: quzha Date: Wed, 24 Jun 2020 18:41:13 +0800 Subject: [PATCH 06/15] update --- docs/en_US/Compressor/Framework.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/en_US/Compressor/Framework.md b/docs/en_US/Compressor/Framework.md index 00c1c1e6bf..3586691aa9 100644 --- a/docs/en_US/Compressor/Framework.md +++ b/docs/en_US/Compressor/Framework.md @@ -149,6 +149,7 @@ self.pruner.remove_activation_collector(collector_id) On multi-GPU training, buffers and parameters are copied to multiple GPU every time the `forward` method runs on multiple GPU. If buffers and parameters are updated in the `forward` method, an `in-place` update is needed to ensure the update is effective. Since `calc_mask` is called in the `optimizer.step` method, which happens after the `forward` method and happens only on one GPU, it supports multi-GPU naturally. +*** ## Customize a new quantization algorithm From d4ed83e293ec4e7997b2d49d77911634ef6ba2c9 Mon Sep 17 00:00:00 2001 From: quzha Date: Wed, 24 Jun 2020 18:54:01 +0800 Subject: [PATCH 07/15] update overview --- docs/en_US/Compressor/Overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en_US/Compressor/Overview.md b/docs/en_US/Compressor/Overview.md index 639e6d08e8..61c0b35ac4 100644 --- a/docs/en_US/Compressor/Overview.md +++ b/docs/en_US/Compressor/Overview.md @@ -52,7 +52,7 @@ Quantization algorithms compress the original network by reducing the number of ## Automatic Model Compression -TBD. +Given targeted compression ratio, it is pretty hard to obtain the best compressed ratio in a one shot manner. An automatic model compression algorithm usually need to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying sparsity of each layer in a model. Moreover, users could leverage NNI's auto tuning power to automatically compress a model. Detailed document can be found [here](./AutoCompression.md). ## Model Speedup From 794193fa1b8416b405b42b78ddef612f79149757 Mon Sep 17 00:00:00 2001 From: quzha Date: Sat, 27 Jun 2020 12:11:17 +0800 Subject: [PATCH 08/15] replace md reference in table with readthedocs url --- docs/en_US/Compressor/Overview.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/docs/en_US/Compressor/Overview.md b/docs/en_US/Compressor/Overview.md index 61c0b35ac4..852681cd4b 100644 --- a/docs/en_US/Compressor/Overview.md +++ b/docs/en_US/Compressor/Overview.md @@ -27,16 +27,16 @@ Pruning algorithms compress the original network by removing redundant weights o |Name|Brief Introduction of Algorithm| |---|---| -| [Level Pruner](./Pruner.md#level-pruner) | Pruning the specified ratio on each weight based on absolute values of weights | -| [AGP Pruner](./Pruner.md#agp-pruner) | Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) [Reference Paper](https://arxiv.org/abs/1710.01878)| -| [Lottery Ticket Pruner](./Pruner.md#agp-pruner) | The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". 
It prunes a model iteratively. [Reference Paper](https://arxiv.org/abs/1803.03635)| -| [FPGM Pruner](./Pruner.md#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)| -| [L1Filter Pruner](./Pruner.md#l1filter-pruner) | Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) [Reference Paper](https://arxiv.org/abs/1608.08710) | -| [L2Filter Pruner](./Pruner.md#l2filter-pruner) | Pruning filters with the smallest L2 norm of weights in convolution layers | -| [ActivationAPoZRankFilterPruner](./Pruner.md#ActivationAPoZRankFilterPruner) | Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. [Reference Paper](https://arxiv.org/abs/1607.03250) | -| [ActivationMeanRankFilterPruner](./Pruner.md#ActivationMeanRankFilterPruner) | Pruning filters based on the metric that calculates the smallest mean value of output activations | -| [Slim Pruner](./Pruner.md#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers(Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) | -| [TaylorFO Pruner](./Pruner.md#taylorfoweightfilterpruner) | Pruning filters based on the first order taylor expansion on weights(Importance Estimation for Neural Network Pruning) [Reference Paper](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf) | +| [Level Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#level-pruner) | Pruning the specified ratio on each weight based on absolute values of weights | +| [AGP Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#agp-pruner) | Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) [Reference Paper](https://arxiv.org/abs/1710.01878)| +| [Lottery Ticket Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#lottery-ticket-hypothesis) | The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. [Reference Paper](https://arxiv.org/abs/1803.03635)| +| [FPGM Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)| +| [L1Filter Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#l1filter-pruner) | Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) [Reference Paper](https://arxiv.org/abs/1608.08710) | +| [L2Filter Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#l2filter-pruner) | Pruning filters with the smallest L2 norm of weights in convolution layers | +| [ActivationAPoZRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#ActivationAPoZRankFilterPruner) | Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. 
[Reference Paper](https://arxiv.org/abs/1607.03250) | +| [ActivationMeanRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#ActivationMeanRankFilterPruner) | Pruning filters based on the metric that calculates the smallest mean value of output activations | +| [Slim Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers(Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) | +| [TaylorFO Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#taylorfoweightfilterpruner) | Pruning filters based on the first order taylor expansion on weights(Importance Estimation for Neural Network Pruning) [Reference Paper](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf) | ### Quantization Algorithms @@ -45,10 +45,10 @@ Quantization algorithms compress the original network by reducing the number of |Name|Brief Introduction of Algorithm| |---|---| -| [Naive Quantizer](./Quantizer.md#naive-quantizer) | Quantize weights to default 8 bits | -| [QAT Quantizer](./Quantizer.md#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)| -| [DoReFa Quantizer](./Quantizer.md#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)| -| [BNN Quantizer](./Quantizer.md#BNN-Quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. [Reference Paper](https://arxiv.org/abs/1602.02830)| +| [Naive Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#naive-quantizer) | Quantize weights to default 8 bits | +| [QAT Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)| +| [DoReFa Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)| +| [BNN Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#BNN-Quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. 
[Reference Paper](https://arxiv.org/abs/1602.02830)| ## Automatic Model Compression From 87cec92e5b5a9043a2ad70b40cbaac3de95da2ca Mon Sep 17 00:00:00 2001 From: quzha Date: Sat, 27 Jun 2020 12:18:08 +0800 Subject: [PATCH 09/15] update --- docs/en_US/Compressor/Overview.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/en_US/Compressor/Overview.md b/docs/en_US/Compressor/Overview.md index 852681cd4b..0430f9f39e 100644 --- a/docs/en_US/Compressor/Overview.md +++ b/docs/en_US/Compressor/Overview.md @@ -33,8 +33,8 @@ Pruning algorithms compress the original network by removing redundant weights o | [FPGM Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)| | [L1Filter Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#l1filter-pruner) | Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) [Reference Paper](https://arxiv.org/abs/1608.08710) | | [L2Filter Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#l2filter-pruner) | Pruning filters with the smallest L2 norm of weights in convolution layers | -| [ActivationAPoZRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#ActivationAPoZRankFilterPruner) | Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. [Reference Paper](https://arxiv.org/abs/1607.03250) | -| [ActivationMeanRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#ActivationMeanRankFilterPruner) | Pruning filters based on the metric that calculates the smallest mean value of output activations | +| [ActivationAPoZRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#activationapozrankfilterpruner) | Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. 
[Reference Paper](https://arxiv.org/abs/1607.03250) | +| [ActivationMeanRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#activationmeanrankfilterpruner) | Pruning filters based on the metric that calculates the smallest mean value of output activations | | [Slim Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers(Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) | | [TaylorFO Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#taylorfoweightfilterpruner) | Pruning filters based on the first order taylor expansion on weights(Importance Estimation for Neural Network Pruning) [Reference Paper](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf) | From 923ee2411968e1d406c6aaced5f29fd920555d99 Mon Sep 17 00:00:00 2001 From: quzha Date: Sat, 27 Jun 2020 12:32:01 +0800 Subject: [PATCH 10/15] update --- docs/en_US/Compressor/Framework.md | 4 ++++ docs/en_US/Compressor/ModelSpeedup.md | 2 +- docs/en_US/Compressor/Overview.md | 2 +- docs/en_US/Compressor/QuickStart.md | 4 ++++ 4 files changed, 10 insertions(+), 2 deletions(-) diff --git a/docs/en_US/Compressor/Framework.md b/docs/en_US/Compressor/Framework.md index 3586691aa9..7bac7803e0 100644 --- a/docs/en_US/Compressor/Framework.md +++ b/docs/en_US/Compressor/Framework.md @@ -1,5 +1,9 @@ # Customize A New Compression Algorithm +```eval_rst +.. contents:: +``` + To simplify writing a new compression algorithm, we design programming interfaces which are simple but flexible enough. There are interfaces for pruning and quantization respectively. Below, we first demonstrate how to customize a new pruning algorithm and then demonstrate how to customize a new quantization algorithm. ## Customize a new pruning algorithm diff --git a/docs/en_US/Compressor/ModelSpeedup.md b/docs/en_US/Compressor/ModelSpeedup.md index c3b9c76614..4158532634 100644 --- a/docs/en_US/Compressor/ModelSpeedup.md +++ b/docs/en_US/Compressor/ModelSpeedup.md @@ -1,6 +1,6 @@ # Speed up Masked Model -*This feature is still in Alpha version.* +*This feature is in Beta version.* ## Introduction diff --git a/docs/en_US/Compressor/Overview.md b/docs/en_US/Compressor/Overview.md index 0430f9f39e..a81095ed79 100644 --- a/docs/en_US/Compressor/Overview.md +++ b/docs/en_US/Compressor/Overview.md @@ -48,7 +48,7 @@ Quantization algorithms compress the original network by reducing the number of | [Naive Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#naive-quantizer) | Quantize weights to default 8 bits | | [QAT Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)| | [DoReFa Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)| -| [BNN Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#BNN-Quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. 
[Reference Paper](https://arxiv.org/abs/1602.02830)| +| [BNN Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#bnn-quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. [Reference Paper](https://arxiv.org/abs/1602.02830)| ## Automatic Model Compression diff --git a/docs/en_US/Compressor/QuickStart.md b/docs/en_US/Compressor/QuickStart.md index eb48b1b05d..6bc091afa2 100644 --- a/docs/en_US/Compressor/QuickStart.md +++ b/docs/en_US/Compressor/QuickStart.md @@ -1,5 +1,9 @@ # Tutorial for Model Compression +```eval_rst +.. contents:: +``` + In this tutorial, we use the [first section](#quick-start-to-compress-a-model) to quickly go through the usage of model compression on NNI. Then use the [second section](#detailed-usage-guide) to explain more details of the usage. ## Quick Start to Compress a Model From fa6ba78f338ec4b3052b6bfea068e0369283ac66 Mon Sep 17 00:00:00 2001 From: quzha Date: Sat, 27 Jun 2020 12:38:37 +0800 Subject: [PATCH 11/15] update --- docs/en_US/Compressor/Framework.md | 2 +- docs/en_US/Compressor/QuickStart.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/en_US/Compressor/Framework.md b/docs/en_US/Compressor/Framework.md index 7bac7803e0..9dd2a62356 100644 --- a/docs/en_US/Compressor/Framework.md +++ b/docs/en_US/Compressor/Framework.md @@ -64,7 +64,7 @@ A `pruner` is responsible for: ### Implement a new pruning algorithm -Implementing a new pruning algorithm requires implementing a `weight masker` class which shoud be a subclass of `WeightMasker`, and a `pruner` class, which should a subclass `Pruner`. +Implementing a new pruning algorithm requires implementing a `weight masker` class which shoud be a subclass of `WeightMasker`, and a `pruner` class, which should be a subclass `Pruner`. An implementation of `weight masker` may look like this: diff --git a/docs/en_US/Compressor/QuickStart.md b/docs/en_US/Compressor/QuickStart.md index 6bc091afa2..fa2a1a01a8 100644 --- a/docs/en_US/Compressor/QuickStart.md +++ b/docs/en_US/Compressor/QuickStart.md @@ -8,7 +8,7 @@ In this tutorial, we use the [first section](#quick-start-to-compress-a-model) t ## Quick Start to Compress a Model -NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. The usage of them are the same, thus, here we use slim pruner as an example to show the usage. +NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. The usage of them are the same, thus, here we use [slim pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#slim-pruner) as an example to show the usage. 
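Before going into the configuration details, here is a minimal sketch of the whole quick-start flow with the slim pruner; `model` is assumed to be an already trained `torch.nn.Module`, and the sparsity value is only an example:

```python
from nni.compression.torch import SlimPruner

# Prune 50% of the channels, chosen by the scaling factors in BatchNorm2d layers.
config_list = [{
    'sparsity': 0.5,
    'op_types': ['BatchNorm2d'],
}]

pruner = SlimPruner(model, config_list)
model = pruner.compress()   # masks are applied; fine-tune the returned model as usual
```

The individual steps — writing a configuration, choosing a pruner, and calling `compress()` — are explained one by one in the sections that follow.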
### Write configuration From 745030b3bf8f7ce657b6b221a9a38f8d82cdf620 Mon Sep 17 00:00:00 2001 From: quzha Date: Mon, 29 Jun 2020 15:10:31 +0800 Subject: [PATCH 12/15] update --- docs/en_US/Compressor/Framework.md | 2 +- docs/en_US/Compressor/Pruner.md | 11 +++++++++-- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/docs/en_US/Compressor/Framework.md b/docs/en_US/Compressor/Framework.md index 9dd2a62356..66083197be 100644 --- a/docs/en_US/Compressor/Framework.md +++ b/docs/en_US/Compressor/Framework.md @@ -84,7 +84,7 @@ class MyMasker(WeightMasker): You can reference nni provided [weight masker](https://github.com/microsoft/nni/blob/master/src/sdk/pynni/nni/compression/torch/pruning/structured_pruning.py) implementations to implement your own weight masker. -A basic pruner looks likes this: +A basic `pruner` looks likes this: ```python class MyPruner(Pruner): diff --git a/docs/en_US/Compressor/Pruner.md b/docs/en_US/Compressor/Pruner.md index 5fe7d1c7a0..fcc17fe2b0 100644 --- a/docs/en_US/Compressor/Pruner.md +++ b/docs/en_US/Compressor/Pruner.md @@ -1,9 +1,13 @@ # Supported Pruning Algorithms on NNI -Index of supported pruning algorithms +We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Weight pruning** generally results in unstructured models, which need specialized haredware or software to speed up the sparse network. **Filter Pruning** achieves acceleratation by removing the entire filter. We also provide an algorithm to control the **pruning schedule**. + + +**Weight Pruning** * [Level Pruner](#level-pruner) -* [AGP Pruner](#agp-pruner) * [Lottery Ticket Hypothesis](#lottery-ticket-hypothesis) + +**Filter Pruning** * [Slim Pruner](#slim-pruner) * [Filter Pruners with Weight Rank](#weightrankfilterpruner) * [FPGM Pruner](#fpgm-pruner) @@ -14,6 +18,9 @@ Index of supported pruning algorithms * [Activation Mean Rank Pruner](#activationmeanrankfilterpruner) * [Filter Pruners with Gradient Rank](#gradientrankfilterpruner) * [Taylor FO On Weight Pruner](#taylorfoweightfilterpruner) + +**Pruning Schedule** +* [AGP Pruner](#agp-pruner) ## Level Pruner From 8bb9c3c97c848479133dfdf34ad80582aea61bd7 Mon Sep 17 00:00:00 2001 From: quzha Date: Mon, 29 Jun 2020 15:49:36 +0800 Subject: [PATCH 13/15] fix sphinx error --- docs/en_US/Compressor/Quantizer.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/docs/en_US/Compressor/Quantizer.md b/docs/en_US/Compressor/Quantizer.md index 0cfa9fe70d..55fde6db66 100644 --- a/docs/en_US/Compressor/Quantizer.md +++ b/docs/en_US/Compressor/Quantizer.md @@ -53,7 +53,8 @@ quantizer.compress() You can view example for more information #### User configuration for QAT Quantizer -common configuration needed by compression algorithms can be found at : [Common configuration](./Overview.md#User-configuration-for-a-compression-algorithm) + +common configuration needed by compression algorithms can be found at [Specification of `config_list`](./QuickStart.md). configuration needed by this algorithm : @@ -63,14 +64,17 @@ disable quantization until model are run by certain number of steps, this allows state where activation quantization ranges do not exclude a significant fraction of values, default value is 0 ### note + batch normalization folding is currently not supported. 
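To make the `quant_start_step` setting concrete, here is a hedged configuration sketch for the QAT quantizer; the bit widths, operator types, and step count are placeholder values, not recommendations:

```python
from nni.compression.torch import QAT_Quantizer

config_list = [{
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 8},
    'op_types': ['Conv2d', 'Linear'],
    # delay quantization for the first 1000 training steps so that the collected
    # activation ranges are not dominated by the unstable early phase of training
    'quant_start_step': 1000,
}]

quantizer = QAT_Quantizer(model, config_list)
quantizer.compress()
```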
*** ## DoReFa Quantizer + In [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160), authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training. ### Usage + To implement DoReFa Quantizer, you can add code below before your training code PyTorch code @@ -88,13 +92,15 @@ quantizer.compress() You can view example for more information #### User configuration for DoReFa Quantizer -common configuration needed by compression algorithms can be found at : [Common configuration](./Overview.md#User-configuration-for-a-compression-algorithm) + +common configuration needed by compression algorithms can be found at [Specification of `config_list`](./QuickStart.md). configuration needed by this algorithm : *** ## BNN Quantizer + In [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830), >We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency. @@ -126,11 +132,13 @@ model = quantizer.compress() You can view example [examples/model_compress/BNN_quantizer_cifar10.py]( https://github.com/microsoft/nni/tree/master/examples/model_compress/BNN_quantizer_cifar10.py) for more information. #### User configuration for BNN Quantizer -common configuration needed by compression algorithms can be found at : [Common configuration](./Overview.md#User-configuration-for-a-compression-algorithm) + +common configuration needed by compression algorithms can be found at [Specification of `config_list`](./QuickStart.md). configuration needed by this algorithm : ### Experiment + We implemented one of the experiments in [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830), we quantized the **VGGNet** for CIFAR-10 in the paper. Our experiments results are as follows: | Model | Accuracy | From fee917bd0ec685b6018911ababa8b091c068dff8 Mon Sep 17 00:00:00 2001 From: quzha Date: Mon, 29 Jun 2020 16:39:23 +0800 Subject: [PATCH 14/15] update --- docs/en_US/Compressor/QuickStart.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en_US/Compressor/QuickStart.md b/docs/en_US/Compressor/QuickStart.md index fa2a1a01a8..0d7c9995db 100644 --- a/docs/en_US/Compressor/QuickStart.md +++ b/docs/en_US/Compressor/QuickStart.md @@ -21,7 +21,7 @@ configure_list = [{ }] ``` -The specification of configuration can be found [here](Overview.md#user-configuration-for-a-compression-algorithm). Note that different pruners may have their own defined fields in configuration, for exmaple `start_epoch` in AGP pruner. Please refer to each pruner's [usage](Overview.md#supported-algorithms) for details, and adjust the configuration accordingly. +The specification of configuration can be found [here](#specification-of-config-list). Note that different pruners may have their own defined fields in configuration, for exmaple `start_epoch` in AGP pruner. 
Please refer to each pruner's [usage](./Pruner.md) for details, and adjust the configuration accordingly. ### Choose a compression algorithm From afa70d8bcd2bc1717a38da7384da74e23462ec5f Mon Sep 17 00:00:00 2001 From: quzha Date: Mon, 29 Jun 2020 16:41:31 +0800 Subject: [PATCH 15/15] fix broken link --- docs/en_US/Compressor/QuickStart.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en_US/Compressor/QuickStart.md b/docs/en_US/Compressor/QuickStart.md index 0d7c9995db..b47175e832 100644 --- a/docs/en_US/Compressor/QuickStart.md +++ b/docs/en_US/Compressor/QuickStart.md @@ -150,7 +150,7 @@ when the value is int type, all quantization types share same bits length. eg. ### APIs for Updating Fine Tuning Status -Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](./Pruner.md#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: `pruner.update_epoch(epoch)` and `pruner.step()`. +Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: `pruner.update_epoch(epoch)` and `pruner.step()`. `update_epoch` should be invoked in every epoch, while `step` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
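As a closing illustration, here is a minimal sketch of where these two calls usually sit in a training loop; `model`, `pruner`, `optimizer`, `criterion`, `train_loader`, and `num_epochs` are assumed to have been created as in the quick-start example above:

```python
for epoch in range(num_epochs):
    pruner.update_epoch(epoch)   # lets schedule-based pruners such as AGP advance their sparsity schedule
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        pruner.step()            # per-minibatch hook for the algorithms that need it
```

For pruners that do not use these hooks, the calls are simply no-ops, so the same loop works across algorithms.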