fix source_urls for colossalai 0.1.8 (no longer available via PyPI, only via GitHub repo) #16693

Conversation

ThomasHoffmann77 (Contributor)

(created using eb --new-pr)
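
For reference, the fix amounts to pointing the colossalai extension at the GitHub release archive of hpcaitech/ColossalAI instead of PyPI. The fragment below is only an illustrative sketch, not the actual diff from this PR: the surrounding easyconfig, the exact sources specification, and the checksum are assumptions.

```python
# Hypothetical exts_list entry for the colossalai extension (illustrative only;
# see the PR diff for the real change).
exts_list = [
    ('colossalai', '0.1.8', {
        # the sdist was removed from PyPI, so download the sources from GitHub
        # instead (assumes the release tag is v%(version)s)
        'source_urls': ['https://github.com/hpcaitech/ColossalAI/archive/'],
        'sources': [{'download_filename': 'v%(version)s.tar.gz',
                     'filename': '%(name)s-%(version)s.tar.gz'}],
        # the GitHub tarball differs from the old PyPI sdist (see the diff in the
        # discussion below), so its SHA256 checksum changes as well
        'checksums': ['<sha256 of the GitHub source tarball>'],
    }),
]
```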

boegel changed the title from "fix source_urls" to "fix source_urls for colossalai" on Nov 23, 2022
boegel added the "bug fix" label on Nov 23, 2022
boegel added this to the "next release (4.7.0)" milestone on Nov 23, 2022
boegel (Member) commented Nov 23, 2022

Hmm, it's a bit strange that this package was simply removed from PyPI; I'm checking on what happened in hpcaitech/ColossalAI#2014.

The source code for version 0.1.8 available from GitHub is not identical to what was available on PyPI, which is also a bit puzzling (detailed diff below).
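
For completeness, a comparison like the one below can be reproduced by unpacking both source tarballs and running a recursive diff. This is a minimal sketch under the assumption that a local copy of the old PyPI sdist (colossalai-0.1.8.tar.gz) is still at hand, since it can no longer be downloaded, and that the GitHub release tag is v0.1.8:

```python
import subprocess
import tarfile
import urllib.request

# GitHub release tarball for the (assumed) tag v0.1.8
github_tarball = 'https://github.com/hpcaitech/ColossalAI/archive/v0.1.8.tar.gz'
pypi_sdist = 'colossalai-0.1.8.tar.gz'  # local copy of the withdrawn PyPI sdist

# fetch the GitHub tarball and unpack both source trees
urllib.request.urlretrieve(github_tarball, 'ColossalAI-0.1.8.tar.gz')
for fn in ('ColossalAI-0.1.8.tar.gz', pypi_sdist):
    with tarfile.open(fn) as tar:
        tar.extractall()

# recursive diff between the PyPI and GitHub source trees
subprocess.run(['diff', '-ru', 'colossalai-0.1.8', 'ColossalAI-0.1.8'])
```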

Only in ColossalAI-0.1.8: benchmark
Only in ColossalAI-0.1.8: CHANGE_LOG.md
Only in ColossalAI-0.1.8: .clang-format
diff -ru extensions/colossalai-0.1.8/colossalai/builder/builder.py ColossalAI-0.1.8/colossalai/builder/builder.py
--- extensions/colossalai-0.1.8/colossalai/builder/builder.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/builder/builder.py	2022-07-12 18:08:59.000000000 +0200
@@ -6,6 +6,7 @@
 from colossalai.registry import *
 
 
+
 def build_from_config(module, config: dict):
     """Returns an object of :class:`module` constructed from `config`.
 
@@ -45,20 +46,23 @@
     Raises:
         Exception: Raises an Exception if an error occurred when building from registry.
     """
-    config_ = config.copy()    # keep the original config untouched
-    assert isinstance(registry, Registry), f'Expected type Registry but got {type(registry)}'
+    config_ = config.copy()  # keep the original config untouched
+    assert isinstance(
+        registry, Registry), f'Expected type Registry but got {type(registry)}'
 
     mod_type = config_.pop('type')
-    assert registry.has(mod_type), f'{mod_type} is not found in registry {registry.name}'
+    assert registry.has(
+        mod_type), f'{mod_type} is not found in registry {registry.name}'
     try:
         obj = registry.get_module(mod_type)(**config_)
     except Exception as e:
-        print(f'An error occurred when building {mod_type} from registry {registry.name}', flush=True)
+        print(
+            f'An error occurred when building {mod_type} from registry {registry.name}',
+            flush=True)
         raise e
 
     return obj
 
-
 def build_gradient_handler(config, model, optimizer):
     """Returns a gradient handler object of :class:`BaseGradientHandler` constructed from `config`,
     `model` and `optimizer`.
diff -ru extensions/colossalai-0.1.8/colossalai/communication/collective.py ColossalAI-0.1.8/colossalai/communication/collective.py
--- extensions/colossalai-0.1.8/colossalai/communication/collective.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/communication/collective.py	2022-07-12 18:08:59.000000000 +0200
@@ -10,7 +10,10 @@
 from colossalai.core import global_context as gpc
 
 
-def all_gather(tensor: Tensor, dim: int, parallel_mode: ParallelMode, async_op: bool = False) -> Tensor:
+def all_gather(tensor: Tensor,
+               dim: int,
+               parallel_mode: ParallelMode,
+               async_op: bool = False) -> Tensor:
     r"""Gathers all tensors from the parallel group and concatenates them in a
     specific dimension.
 
@@ -160,7 +163,11 @@
         return out
 
 
-def reduce(tensor: Tensor, dst: int, parallel_mode: ParallelMode, op: ReduceOp = ReduceOp.SUM, async_op: bool = False):
+def reduce(tensor: Tensor,
+           dst: int,
+           parallel_mode: ParallelMode,
+           op: ReduceOp = ReduceOp.SUM,
+           async_op: bool = False):
     r"""Reduce tensors across whole parallel group. Only the process with
     rank ``dst`` is going to receive the final result.
 
diff -ru extensions/colossalai-0.1.8/colossalai/engine/_base_engine.py ColossalAI-0.1.8/colossalai/engine/_base_engine.py
--- extensions/colossalai-0.1.8/colossalai/engine/_base_engine.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/engine/_base_engine.py	2022-07-12 18:08:59.000000000 +0200
@@ -7,7 +7,7 @@
 
 from colossalai.logging import get_dist_logger
 from torch import Tensor
-from colossalai.gemini.ophooks import register_ophooks_recursively, BaseOpHook
+from colossalai.engine.ophooks import register_ophooks_recursively, BaseOpHook
 from colossalai.engine.schedule import BaseSchedule, NonPipelineSchedule, PipelineSchedule, InterleavedPipelineSchedule
 from typing import Optional, Type
 from colossalai.engine.gradient_handler import BaseGradientHandler
diff -ru extensions/colossalai-0.1.8/colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py ColossalAI-0.1.8/colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py
--- extensions/colossalai-0.1.8/colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py	2022-07-12 18:08:59.000000000 +0200
@@ -33,19 +33,14 @@
             # Pack the buckets.
             for param in self._model.parameters():
                 group = getattr(param, 'pipeline_shared_module_pg', None)
-                if param.requires_grad and group is not None and (
-                    (hasattr(param, 'colo_attr') and not param.colo_attr.saved_grad.is_null())
-                        or param.grad is not None):
+                if param.requires_grad and param.grad is not None and group is not None:
                     tp = param.data.type()
                     buckets[group][tp].append(param)
 
             # For each bucket, all-reduce and copy all-reduced grads.
             for group, group_buckets in buckets.items():
                 for tp, bucket in group_buckets.items():
-                    grads = [
-                        param.colo_attr.grad_payload if hasattr(param, 'colo_attr') else param.grad.data
-                        for param in bucket
-                    ]
+                    grads = [param.grad.data for param in bucket]
                     coalesced = _flatten_dense_tensors(grads).to(torch.cuda.current_device())
                     dist.all_reduce(coalesced, op=dist.ReduceOp.SUM, group=group)
                     for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
Only in ColossalAI-0.1.8/colossalai/engine: ophooks
Only in ColossalAI-0.1.8/colossalai/engine: paramhooks
diff -ru extensions/colossalai-0.1.8/colossalai/engine/schedule/_pipeline_schedule.py ColossalAI-0.1.8/colossalai/engine/schedule/_pipeline_schedule.py
--- extensions/colossalai-0.1.8/colossalai/engine/schedule/_pipeline_schedule.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/engine/schedule/_pipeline_schedule.py	2022-07-12 18:08:59.000000000 +0200
@@ -12,6 +12,7 @@
 from colossalai.logging import get_dist_logger
 from colossalai.utils import switch_virtual_pipeline_parallel_rank
 from colossalai.utils.cuda import get_current_device
+from colossalai.zero.sharded_model.sharded_model_v2 import ShardedModelV2
 
 from ._base_schedule import BaseSchedule
 
@@ -156,7 +157,6 @@
         return self._move_to_device(mciro_batch_data)
 
     def pre_processing(self, engine):
-        from colossalai.zero.sharded_model.sharded_model_v2 import ShardedModelV2
         # TODO: remove this after testing new zero with pipeline parallelism
         model = engine.model
         if isinstance(model, NaiveAMPModel):
@@ -482,7 +482,6 @@
         self.num_model_chunks = num_model_chunks
 
     def pre_processing(self, engine):
-        from colossalai.zero.sharded_model.sharded_model_v2 import ShardedModelV2
         if isinstance(engine.model, ShardedModelV2):
             self.dtype = torch.half
         elif isinstance(engine.model[0], NaiveAMPModel):
diff -ru extensions/colossalai-0.1.8/colossalai/fx/passes/adding_split_node_pass.py ColossalAI-0.1.8/colossalai/fx/passes/adding_split_node_pass.py
--- extensions/colossalai-0.1.8/colossalai/fx/passes/adding_split_node_pass.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/fx/passes/adding_split_node_pass.py	2022-07-12 18:08:59.000000000 +0200
@@ -10,9 +10,7 @@
 
 
 def balanced_split_pass(gm: torch.fx.GraphModule, pp_size: int):
-    """
-    In balanced_split_pass, we split module by the size of parameters(weights+bias).
-    """
+    # TODO(lyl): balanced policy V2, split module by node size(weight+bias+output)
     mod_graph = gm.graph
     total_param_amount = 0
     for param in mod_graph.owning_module.parameters():
@@ -40,36 +38,6 @@
     gm.recompile()
     return gm
 
-
-def balanced_split_pass_v2(gm: torch.fx.GraphModule, pp_size: int):
-    """
-    In balanced_split_pass_v12, we split module by the size of nodes(weights+bias+outputs).
-    """
-    mod_graph = gm.graph
-    # To use balanced_split_pass_v2, we need run meta_info_prop interpreter first.
-    # If nodes don't have meta info, this pass will fall back to normal balanced split pass.
-    check_node = list(mod_graph.nodes)[0]
-    if 'tensor_meta' not in check_node.meta:
-        return balanced_split_pass(gm, pp_size)
-
-    total_element_size = 0
-    for node in mod_graph.nodes:
-        total_element_size += node.node_size
-
-    partition_size = total_element_size // pp_size
-    accumulate_node_size = 0
-    for node in mod_graph.nodes:
-        if pp_size <= 1:
-            break
-        accumulate_node_size += node.node_size
-        if accumulate_node_size >= partition_size:
-            accumulate_node_size = 0
-            pp_size -= 1
-            with mod_graph.inserting_after(node):
-                split_node = mod_graph.create_node('call_function', pipe_split)
-    gm.recompile()
-    return gm
-
 
 def uniform_split_pass(gm: torch.fx.GraphModule, pp_size: int):
     mod_graph = gm.graph
diff -ru extensions/colossalai-0.1.8/colossalai/fx/passes/meta_info_prop.py ColossalAI-0.1.8/colossalai/fx/passes/meta_info_prop.py
--- extensions/colossalai-0.1.8/colossalai/fx/passes/meta_info_prop.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/fx/passes/meta_info_prop.py	2022-07-12 18:08:59.000000000 +0200
@@ -67,6 +67,7 @@
 
     def run_node(self, n: Node) -> Any:
         result = super().run_node(n)
+
         found_tensor = False
 
         def extract_tensor_meta(obj):
@@ -82,25 +83,7 @@
             n.meta['tensor_meta'] = meta
         else:
             n.meta['tensor_meta'] = TensorMetadata(None, None, False, None, 0)
-        # counting the total size of node outputs
-        total_node_size = 0
-        if isinstance(n.meta['tensor_meta'], TensorMetadata):
-            total_node_size += n.meta['tensor_meta'].numel
-        else:
-            for element in n.meta['tensor_meta']:
-                assert isinstance(
-                    element, TensorMetadata
-                ), f"``n.meta['tensor_meta']`` should be either TensorMetadata or a tuple of TensorMetadata."
-                total_node_size += element.numel
-        # counting the total size of parameters
-        total_param_size = 0
-        if n.op == 'call_module':
-            target_module = n.graph.owning_module.get_submodule(n.target)
-            for param in target_module.parameters():
-                total_param_size += param.numel()
 
-        total_node_size += total_param_size
-        n.node_size = total_node_size
         n.meta['type'] = type(result)
         return result
 
diff -ru extensions/colossalai-0.1.8/colossalai/fx/passes/shard_1d_pass.py ColossalAI-0.1.8/colossalai/fx/passes/shard_1d_pass.py
--- extensions/colossalai-0.1.8/colossalai/fx/passes/shard_1d_pass.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/fx/passes/shard_1d_pass.py	2022-07-12 18:08:59.000000000 +0200
@@ -1,90 +1,59 @@
 import torch
-import operator
-import colossalai
+from colossalai.tensor import ColoTensorSpec, distspec, ProcessGroup, ComputeSpec, ComputePattern, ShardSpec
 
-ELEMENTWISE_MODULE_OP = [torch.nn.Dropout, torch.nn.ReLU, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d, torch.nn.MaxPool1d, torch.nn.MaxPool2d, torch.nn.AvgPool1d, torch.nn.AvgPool2d]
-ELEMENTWISE_FUNC_OP = [torch.add, operator.add, torch.abs, torch.cos, torch.exp, torch.mul, operator.mul, operator.floordiv, operator.truediv, operator.neg, torch.multiply, torch.nn.functional.relu, torch.nn.functional.dropout, torch.nn.functional.conv1d, torch.nn.functional.conv2d, torch.nn.functional.conv3d, torch.nn.functional.avg_pool1d, torch.nn.functional.avg_pool2d, torch.nn.functional.avg_pool3d, torch.nn.functional.max_pool1d, torch.nn.functional.max_pool2d, torch.nn.functional.max_pool3d]
 
-def weight_split(weight: torch.nn.parameter.Parameter, dim: int, col_normal: bool) -> torch.nn.parameter.Parameter:
+def weight_split(weight: torch.Tensor, dim: int) -> torch.nn.parameter.Parameter:
     """weight_split 
     split a nn.Parameter
 
     Args:
         weight (torch.nn.parameter.Parameter): a torch Parameter instance
         dim (int): the dimension to be sharded along with
-        col_normal(bool): col shard with gather or not
+
     Returns:
         _type_: _description_
     """
-    if col_normal:
-        setattr(weight, "fx_attr", (dim, "SHARD", "TP", "col_normal"))
-    else:
-        setattr(weight, "fx_attr", (dim, "SHARD", "TP", "col_needs_many_outputs"))
+    # Append a Tensor spec to target_module.weight.shard
+    # Convert to ColoTensor: colo_tensor = ColoTensor.from_torch_tensor(tensor, spec)
+    assert isinstance(weight, torch.Tensor), \
+        f'The type of the input tensor should be torch.nn.parameter' \
+        f'Your Input tensor is {type(weight)}'
+
+    # FIXME() I initialized a PG for this tensor. Only has TP comm group.
+    # we only consider the TP-only caes.
+    world_size = torch.distributed.get_world_size()
+    pg = ProcessGroup(tp_degree=world_size)
+
+    spec = ColoTensorSpec(pg, ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
+    # As you has constructed a Spec, why not directly convert the tensor to ColoTensor.
+    setattr(weight, "fx_attr", spec)
     return weight
+
+
 def column_shard_linear_pass(gm: torch.fx.GraphModule):
-    # Split all the linear module with column shard. Currently for testing only.
     mod_graph = gm.graph
     for node in mod_graph.nodes:
         if node.op == "call_module":
             target_module = node.graph.owning_module.get_submodule(node.target)
             if isinstance(target_module, torch.nn.Linear):
-                target_module.weight = weight_split(target_module.weight, dim=0, col_normal=False)
+                target_module.weight = weight_split(target_module.weight, dim=0)
                 if target_module.bias is not None:
-                    target_module.bias.data = weight_split(target_module.bias.data, dim=0, col_normal=False)
+                    target_module.bias.data = weight_split(target_module.bias.data, dim=0)
 
     gm.recompile()
     return gm
 
 
 def row_shard_linear_pass(gm: torch.fx.GraphModule):
-    # Split all the linear module with row shard. Currently for testing only.
     mod_graph = gm.graph
     for node in mod_graph.nodes:
         if node.op == "call_module":
             target_module = node.graph.owning_module.get_submodule(node.target)
             if isinstance(target_module, torch.nn.Linear):
-                target_module.weight = weight_split(target_module.weight, dim=-1, col_normal=False)
+                target_module.weight = weight_split(target_module.weight, dim=-1)
 
     gm.recompile()
     return gm
 
-def transform_mlp_pass(gm: torch.fx.GraphModule):
-    #TODO: Needs to handle special cases, like x = linear(x) + linear(x)
-    mod_graph = gm.graph
-    col_shard = True
-    element_op = []
-    all_linear_name = []
-    linear_name = []
-    # Get the name of element wise module(torch.nn.ReLU)
-    # Get the name of all the linear modules and repeated linear modules
-    for name, func in gm.named_children():
-        if not isinstance(func, torch.nn.Linear):
-            for i in ELEMENTWISE_MODULE_OP:
-                if isinstance(func, i):
-                    element_op.append(name)
-                    break
-        else:
-            if name in all_linear_name:
-                if name in linear_name:
-                    linear_name.remove(name)
-            else:
-                all_linear_name.append(name)
-                linear_name.append(name)
-    # If the linear modules is called multiple times, set the dist spec as col shard
-    # If the module is element wise or the function/method is element wise, remains col_shard 
-    for node in mod_graph.nodes:
-        if node.target in linear_name:
-            target_module = node.graph.owning_module.get_submodule(node.target)
-            dim = 0 if col_shard else -1
-            target_module.weight = weight_split(target_module.weight, dim=dim, col_normal=False)
-            col_shard = not col_shard
-        elif node.target in all_linear_name:
-            target_module = node.graph.owning_module.get_submodule(node.target)
-            dim = 0 if col_shard else -1
-            target_module.weight = weight_split(target_module.weight, dim=dim, col_normal=True)
-            col_shard = not col_shard
-        else:
-            if node.target not in element_op and all(node.target != i for i in ELEMENTWISE_FUNC_OP):
-                col_shard = True
-    gm.recompile()
-    return gm
\ No newline at end of file
+
+#TODO: add elementwise op process pass, then we can try to use column and row mixed strategy.
diff -ru extensions/colossalai-0.1.8/colossalai/fx/tracer/meta_patch/patched_module/normalization.py ColossalAI-0.1.8/colossalai/fx/tracer/meta_patch/patched_module/normalization.py
--- extensions/colossalai-0.1.8/colossalai/fx/tracer/meta_patch/patched_module/normalization.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/fx/tracer/meta_patch/patched_module/normalization.py	2022-07-12 18:08:59.000000000 +0200
@@ -17,14 +17,4 @@
         assert input.dim() == 5
 
     # normalization maintain the same shape as the input
-    return input.clone()
-
-
-try:
-    import apex
-    meta_patched_module.register(apex.normalization.FusedLayerNorm)(torch_nn_normalize)
-    meta_patched_module.register(apex.normalization.FusedRMSNorm)(torch_nn_normalize)
-    meta_patched_module.register(apex.normalization.MixedFusedLayerNorm)(torch_nn_normalize)
-    meta_patched_module.register(apex.normalization.MixedFusedRMSNorm)(torch_nn_normalize)
-except ImportError:
-    pass
+    return input.clone()
\ No newline at end of file
Only in extensions/colossalai-0.1.8/colossalai/gemini: ophooks
Only in extensions/colossalai-0.1.8/colossalai/gemini: paramhooks
diff -ru extensions/colossalai-0.1.8/colossalai/initialize.py ColossalAI-0.1.8/colossalai/initialize.py
--- extensions/colossalai-0.1.8/colossalai/initialize.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/initialize.py	2022-07-12 18:08:59.000000000 +0200
@@ -22,7 +22,7 @@
 
 from colossalai.engine.schedule import NonPipelineSchedule, PipelineSchedule, InterleavedPipelineSchedule, get_tensor_shape
 from colossalai.engine import Engine
-from colossalai.gemini.ophooks import BaseOpHook
+from colossalai.engine.ophooks import BaseOpHook
 
 from colossalai.utils import (get_current_device, is_using_ddp, is_using_pp, is_using_sequence, sync_model_param)
 from colossalai.utils.moe import sync_moe_model_param
diff -ru extensions/colossalai-0.1.8/colossalai/__init__.py ColossalAI-0.1.8/colossalai/__init__.py
--- extensions/colossalai-0.1.8/colossalai/__init__.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/__init__.py	2022-07-12 18:08:59.000000000 +0200
@@ -1,4 +1,5 @@
-from .initialize import (initialize, launch, launch_from_openmpi, launch_from_slurm, launch_from_torch,
-                         get_default_parser)
+from .initialize import (initialize, launch, launch_from_openmpi,
+                         launch_from_slurm, launch_from_torch, get_default_parser)
 
 __version__ = '0.0.1'
+
diff -ru extensions/colossalai-0.1.8/colossalai/nn/init.py ColossalAI-0.1.8/colossalai/nn/init.py
--- extensions/colossalai-0.1.8/colossalai/nn/init.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/init.py	2022-07-12 18:08:59.000000000 +0200
@@ -7,7 +7,6 @@
 
 def zeros_():
     """Return the initializer filling the input Tensor with the scalar zeros"""
-
     def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
         return nn.init.zeros_(tensor)
 
@@ -16,7 +15,6 @@
 
 def ones_():
     """Return the initializer filling the input Tensor with the scalar ones"""
-
     def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
         return nn.init.ones_(tensor)
 
@@ -48,7 +46,6 @@
         mean (float): the mean of the normal distribution. Defaults 0.0.
         std (float): the standard deviation of the normal distribution. Defaults 1.0.
      """
-
     def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
         return nn.init.normal_(tensor, mean, std)
 
@@ -69,7 +66,6 @@
         a (float): the minimum cutoff value. Defaults -2.0.
         b (float): the maximum cutoff value. Defaults 2.0.
     """
-
     def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
         return nn.init.trunc_normal_(tensor, mean, std, a, b)
 
@@ -97,7 +93,6 @@
         nonlinearity (str, optional): the non-linear function (`nn.functional` name),
                         recommended to use only with ``'relu'`` or ``'leaky_relu'`` (default).
     """
-
     # adapted from torch.nn.init
     def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
         if 0 in tensor.shape:
@@ -141,7 +136,6 @@
         nonlinearity (str, optional): the non-linear function (`nn.functional` name),
                         recommended to use only with ``'relu'`` or ``'leaky_relu'`` (default).
     """
-
     # adapted from torch.nn.init
     def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
         if 0 in tensor.shape:
@@ -181,7 +175,6 @@
         scale (float, optional): an optional scaling factor used to calculate standard deviation. Defaults 2.0.
         gain (float, optional): an optional scaling factor. Defaults 1.0.
     """
-
     # adapted from torch.nn.init
     def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
         assert fan_in is not None, 'Fan_in is not provided.'
@@ -213,7 +206,6 @@
         scale (float, optional): an optional scaling factor used to calculate standard deviation. Defaults 2.0.
         gain (float, optional): an optional scaling factor. Defaults 1.0.
     """
-
     # adapted from torch.nn.init
     def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
         assert fan_in is not None, 'Fan_in is not provided.'
@@ -249,4 +241,4 @@
         std = math.sqrt(1.0 / fan_in)
         return nn.init.trunc_normal_(tensor, std=std / .87962566103423978)
 
-    return initializer
+    return initializer
\ No newline at end of file
diff -ru extensions/colossalai-0.1.8/colossalai/nn/layer/parallel_1d/layers.py ColossalAI-0.1.8/colossalai/nn/layer/parallel_1d/layers.py
--- extensions/colossalai-0.1.8/colossalai/nn/layer/parallel_1d/layers.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/layer/parallel_1d/layers.py	2022-07-12 18:08:59.000000000 +0200
@@ -176,6 +176,7 @@
         set_parallel_input(False)
         env.vocab_parallel = False
 
+
     def reset_parameters(self, weight_initializer, bias_initializer) -> None:
         fan_in, fan_out = self.in_features, self.num_classes
         if self.has_weight:
@@ -449,6 +450,7 @@
         is_parallel_output = not self.gather_output
         set_parallel_input(is_parallel_output)
 
+
     def reset_parameters(self, weight_initializer, bias_initializer) -> None:
         fan_in, fan_out = self.in_features, self.out_features
         weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out)
@@ -587,6 +589,7 @@
         self._set_tensor_parallel_attributes()
         set_parallel_input(False)
 
+
     def reset_parameters(self, weight_initializer, bias_initializer) -> None:
         fan_in, fan_out = self.in_features, self.out_features
         weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out)
diff -ru extensions/colossalai-0.1.8/colossalai/nn/layer/parallel_3d/_operation.py ColossalAI-0.1.8/colossalai/nn/layer/parallel_3d/_operation.py
--- extensions/colossalai-0.1.8/colossalai/nn/layer/parallel_3d/_operation.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/layer/parallel_3d/_operation.py	2022-07-12 18:08:59.000000000 +0200
@@ -326,8 +326,10 @@
 
     if input_.size(dim) <= 1:
         return input_
-    output = torch.chunk(input_, weight_world_size, dim=dim)[gpc.get_local_rank(weight_parallel_mode)].contiguous()
-    output = torch.chunk(output, input_world_size, dim=dim)[gpc.get_local_rank(input_parallel_mode)].contiguous()
+    output = torch.chunk(input_, weight_world_size,
+                         dim=dim)[gpc.get_local_rank(weight_parallel_mode)].contiguous()
+    output = torch.chunk(output, input_world_size,
+                         dim=dim)[gpc.get_local_rank(input_parallel_mode)].contiguous()
     return output
 
 
diff -ru extensions/colossalai-0.1.8/colossalai/nn/layer/parallel_sequence/layers.py ColossalAI-0.1.8/colossalai/nn/layer/parallel_sequence/layers.py
--- extensions/colossalai-0.1.8/colossalai/nn/layer/parallel_sequence/layers.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/layer/parallel_sequence/layers.py	2022-07-12 18:08:59.000000000 +0200
@@ -44,7 +44,8 @@
                  attn_mask_type=AttnMaskType.padding,
                  masked_softmax_fusion=True,
                  fp16=False,
-                 bf16=False):
+                 bf16=False
+                 ):
         super().__init__()
         self.convert_fp16_to_fp32_in_softmax = convert_fp16_to_fp32_in_softmax
         self.apply_query_key_layer_scaling = apply_query_key_layer_scaling
@@ -79,14 +80,21 @@
             self.coeff = layer_number
             self.norm_factor *= self.coeff
 
-        self.scale_mask_softmax = FusedScaleMaskSoftmax(fp16, bf16, self.attn_mask_type, masked_softmax_fusion,
-                                                        self.attention_mask_func, self.convert_fp16_to_fp32_in_softmax,
-                                                        self.coeff)
+        self.scale_mask_softmax = FusedScaleMaskSoftmax(
+            fp16, bf16,
+            self.attn_mask_type,
+            masked_softmax_fusion,
+            self.attention_mask_func,
+            self.convert_fp16_to_fp32_in_softmax,
+            self.coeff)
 
         self.attention_dropout = nn.Dropout(attention_dropout)
 
         # Output.
-        self.dense = _Linear(hidden_size, hidden_size, bias=True, skip_bias_add=True)
+        self.dense = _Linear(hidden_size,
+                             hidden_size,
+                             bias=True,
+                             skip_bias_add=True)
 
     def forward(self, hidden_states, attention_mask):
         # hidden_states: [sub_seq_len, batch_size, hidden_size]
@@ -112,24 +120,30 @@
         assert last_dim_value % 3 == 0, 'the last dimension is not a multiple of 3, ' \
                                         'cannot be divided into query, key and value'
         partition_size = last_dim_value // 3
-        (query_layer, key_layer, value_layer) = torch.split(mixed_x_layer, partition_size, dim=last_dim)
+        (query_layer, key_layer, value_layer) = torch.split(
+            mixed_x_layer, partition_size, dim=last_dim)
 
         # attention scores: [batch_size, num_heads, sub_seq_len, seq_len]
-        output_size = (query_layer.size(1), query_layer.size(2), query_layer.size(0),
+        output_size = (query_layer.size(1),
+                       query_layer.size(2),
+                       query_layer.size(0),
                        key_layer.size(0) * self.world_size)
 
         # [sub_seq_len, batch_size, num_heads, head_size] -> [sub_seq_len, batch_size * num_heads, head_size]
-        query_layer = query_layer.view(output_size[2], output_size[0] * output_size[1], -1)
+        query_layer = query_layer.view(output_size[2],
+                                       output_size[0] * output_size[1], -1)
         # [sub_seq_len, batch_size, num_heads, head_size] -> [sub_seq_len, batch_size * num_heads, head_size]
-        key_layer = key_layer.view(key_layer.size(0), output_size[0] * output_size[1], -1)
+        key_layer = key_layer.view(key_layer.size(0),
+                                   output_size[0] * output_size[1], -1)
 
         # attention_scores: [batch_size * num_heads, sub_seq_len, seq_len]
         attention_scores = RingQK.apply(
-            query_layer.transpose(0, 1).contiguous(),    # [batch_size * num_heads, sub_seq_len, head_size]
-            key_layer.transpose(0, 1).contiguous(),    # [batch_size * num_heads, sub_seq_len, head_size],
+            query_layer.transpose(0, 1).contiguous(),  # [batch_size * num_heads, sub_seq_len, head_size]
+            key_layer.transpose(0, 1).contiguous(),  # [batch_size * num_heads, sub_seq_len, head_size],
             batch_size,
             self.num_attention_heads,
-            sub_seq_length)
+            sub_seq_length
+        )
 
         attention_scores /= self.norm_factor
 
@@ -144,19 +158,29 @@
             attention_probs = self.attention_dropout(attention_probs)
 
         # context layer shape: [batch_size, num_heads, sub_seq_len, head_size]
-        output_size = (value_layer.size(1), value_layer.size(2), query_layer.size(0), value_layer.size(3))
+        output_size = (value_layer.size(1),
+                       value_layer.size(2),
+                       query_layer.size(0),
+                       value_layer.size(3))
 
         # change view [sub_seq_len, batch_size * num_heads, head_size]
-        value_layer = value_layer.contiguous().view(value_layer.size(0), output_size[0] * output_size[1], -1)
+        value_layer = value_layer.contiguous().view(value_layer.size(0),
+                                                    output_size[0] * output_size[1], -1)
 
         # # change view [b * num_heads, sub_seq_len, seq_len]
-        attention_probs = attention_probs.view(
-            attention_probs.size(0) * attention_probs.size(1), attention_probs.size(2), attention_probs.size(3))
+        attention_probs = attention_probs.view(attention_probs.size(0) * attention_probs.size(1),
+                                               attention_probs.size(2),
+                                               attention_probs.size(3))
 
         # matmul: [batch_size * num_heads, sub_seq_len, head_size]
-        context_layer = RingAV.apply(attention_probs,
-                                     value_layer.transpose(0, 1).contiguous(), batch_size, self.num_attention_heads,
-                                     self.hidden_size_per_attention_head, sub_seq_length)
+        context_layer = RingAV.apply(
+            attention_probs,
+            value_layer.transpose(0, 1).contiguous(),
+            batch_size,
+            self.num_attention_heads,
+            self.hidden_size_per_attention_head,
+            sub_seq_length
+        )
 
         # change view [batch_size, num_heads, sub_seq_len, head_size]
         context_layer = context_layer.view(*output_size)
@@ -165,8 +189,8 @@
         context_layer = context_layer.permute(2, 0, 1, 3).contiguous()
 
         # [sub_seq_len, batch_size, num_heads, head_size] -> [sub_seq_len, batch_size, hidden_size]
-        new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_attention_head *
-                                                               self.num_attention_heads,)
+        new_context_layer_shape = context_layer.size()[:-2] + (
+            self.hidden_size_per_attention_head * self.num_attention_heads,)
         context_layer = context_layer.view(*new_context_layer_shape)
 
         output, bias = self.dense(context_layer)
@@ -200,7 +224,11 @@
                        adding bias but instead return it.
     """
 
-    def __init__(self, input_size, output_size, bias=True, skip_bias_add=False):
+    def __init__(self,
+                 input_size,
+                 output_size,
+                 bias=True,
+                 skip_bias_add=False):
         super(_Linear, self).__init__()
 
         # Keep input parameters
@@ -208,10 +236,9 @@
         self.output_size = output_size
         self.skip_bias_add = skip_bias_add
 
-        self.weight = Parameter(torch.empty(
-            self.output_size,
-            self.input_size,
-        ))
+        self.weight = Parameter(torch.empty(self.output_size,
+                                            self.input_size,
+                                            ))
         nn.init.xavier_normal_(self.weight)
 
         if bias:
diff -ru extensions/colossalai-0.1.8/colossalai/nn/layer/parallel_sequence/_operation.py ColossalAI-0.1.8/colossalai/nn/layer/parallel_sequence/_operation.py
--- extensions/colossalai-0.1.8/colossalai/nn/layer/parallel_sequence/_operation.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/layer/parallel_sequence/_operation.py	2022-07-12 18:08:59.000000000 +0200
@@ -19,17 +19,24 @@
 
     @staticmethod
     @custom_fwd
-    def forward(ctx, sub_q, sub_k, batch_size, num_attention_heads, sub_seq_length):
+    def forward(ctx,
+                sub_q,
+                sub_k,
+                batch_size,
+                num_attention_heads,
+                sub_seq_length):
         # save tensor for backward
         ctx.save_for_backward(sub_q, sub_k)
         ctx.sub_seq_length = sub_seq_length
 
         # create local segment of attention score
-        attention_score = torch.empty(batch_size * num_attention_heads,
-                                      sub_seq_length,
-                                      sub_seq_length * gpc.get_world_size(ParallelMode.SEQUENCE),
-                                      dtype=sub_q.dtype,
-                                      device=get_current_device())
+        attention_score = torch.empty(
+            batch_size * num_attention_heads,
+            sub_seq_length,
+            sub_seq_length * gpc.get_world_size(ParallelMode.SEQUENCE),
+            dtype=sub_q.dtype,
+            device=get_current_device()
+        )
 
         # compute local QK^T
         part_a = torch.matmul(sub_q, sub_k.transpose(2, 1))
@@ -37,7 +44,7 @@
         local_world_size = gpc.get_world_size(ParallelMode.SEQUENCE)
         start_idx = local_rank * sub_seq_length
         end_idx = (local_rank + 1) * sub_seq_length
-        attention_score[:, :, start_idx:end_idx] = part_a
+        attention_score[:, :, start_idx: end_idx] = part_a
 
         # compute QK^T in ring-all-reduce style
         for i in range(local_world_size - 1):
@@ -56,18 +63,19 @@
         local_world_size = gpc.get_world_size(ParallelMode.SEQUENCE)
 
         # calculate gradient of sub_k
-        grad_k = torch.matmul(grad_output.transpose(2, 1), sub_q)
+        grad_k = torch.matmul(
+            grad_output.transpose(2, 1),
+            sub_q
+        )
 
         dist.all_reduce(grad_k, group=gpc.get_group(ParallelMode.SEQUENCE))
-        grad_k = grad_k[:, local_rank * ctx.sub_seq_length:(local_rank + 1) * ctx.sub_seq_length]
+        grad_k = grad_k[:, local_rank * ctx.sub_seq_length: (local_rank + 1) * ctx.sub_seq_length]
         grad_k /= local_world_size
 
         # calculate gradient for sub_q
-        grad_q = torch.zeros_like(
-            sub_q,
-            dtype=sub_q.dtype,
-            device=get_current_device(),
-        )
+        grad_q = torch.zeros_like(sub_q,
+                                  dtype=sub_q.dtype,
+                                  device=get_current_device(), )
 
         # compute with local sub_k
         start_idx, end_idx = _calc_current_device_range(local_rank, ctx.sub_seq_length)
@@ -77,7 +85,7 @@
         for i in range(local_world_size - 1):
             sub_k = ring_forward(sub_k, ParallelMode.SEQUENCE)
             start_idx, end_idx = _calc_incoming_device_range(i, local_rank, local_world_size, ctx.sub_seq_length)
-            grad_q += torch.matmul(grad_output[:, :, start_idx:end_idx], sub_k)
+            grad_q += torch.matmul(grad_output[:, :, start_idx: end_idx], sub_k)
 
         grad_q /= local_world_size
 
@@ -91,16 +99,23 @@
 
     @staticmethod
     @custom_fwd
-    def forward(ctx, attention_score, sub_v, batch_size, num_attention_heads, attention_head_size, sub_seq_length):
+    def forward(ctx,
+                attention_score,
+                sub_v,
+                batch_size,
+                num_attention_heads,
+                attention_head_size,
+                sub_seq_length):
         local_rank = gpc.get_local_rank(ParallelMode.SEQUENCE)
         local_world_size = gpc.get_world_size(ParallelMode.SEQUENCE)
         local_start_idx, local_end_idx = _calc_current_device_range(local_rank, sub_seq_length)
 
-        sub_attention_result = torch.zeros(batch_size * num_attention_heads,
-                                           sub_seq_length,
-                                           attention_head_size,
-                                           device=get_current_device(),
-                                           dtype=attention_score.dtype)
+        sub_attention_result = torch.zeros(
+            batch_size * num_attention_heads,
+            sub_seq_length,
+            attention_head_size,
+            device=get_current_device(),
+            dtype=attention_score.dtype)
 
         # save tensors for backward
         ctx.save_for_backward(attention_score, sub_v)
@@ -129,16 +144,23 @@
         attention_scores, sub_v = ctx.saved_tensors
 
         # calculate gradient of v
-        grad_v = torch.matmul(attention_scores.transpose(2, 1), grad_output)
+        grad_v = torch.matmul(
+            attention_scores.transpose(2, 1),
+            grad_output
+        )
         dist.all_reduce(grad_v, group=gpc.get_group(ParallelMode.SEQUENCE))
         grad_v = grad_v[:, local_start_idx:local_end_idx]
         grad_v /= local_world_size
 
         # calculate gradient for attention score
-        grad_attention_score = torch.zeros_like(attention_scores, dtype=grad_output.dtype, device=get_current_device())
+        grad_attention_score = torch.zeros_like(attention_scores,
+                                                dtype=grad_output.dtype,
+                                                device=get_current_device())
 
         # compute with local sub_k
-        grad_attention_score[:, :, local_start_idx:local_end_idx] += torch.matmul(grad_output, sub_v.transpose(2, 1))
+        grad_attention_score[:, :, local_start_idx:local_end_idx] += torch.matmul(
+            grad_output,
+            sub_v.transpose(2, 1))
 
         # compute QK^T in ring-all-reduce style
         for i in range(local_world_size - 1):
@@ -146,6 +168,8 @@
             start_idx, end_idx = _calc_incoming_device_range(i, local_rank, local_world_size, ctx.sub_seq_length)
 
             # compute grad_q
-            grad_attention_score[:, :, start_idx:end_idx] += torch.matmul(grad_output, sub_v.transpose(2, 1))
+            grad_attention_score[:, :, start_idx:end_idx] += torch.matmul(
+                grad_output,
+                sub_v.transpose(2, 1))
 
         return grad_attention_score, grad_v, None, None, None, None
diff -ru extensions/colossalai-0.1.8/colossalai/nn/layer/vanilla/__init__.py ColossalAI-0.1.8/colossalai/nn/layer/vanilla/__init__.py
--- extensions/colossalai-0.1.8/colossalai/nn/layer/vanilla/__init__.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/layer/vanilla/__init__.py	2022-07-12 18:08:59.000000000 +0200
@@ -1,6 +1,7 @@
-from .layers import (DropPath, VanillaClassifier, VanillaLayerNorm, VanillaPatchEmbedding, WrappedDropout,
-                     WrappedDropPath)
+from .layers import (DropPath, VanillaClassifier, VanillaLayerNorm,
+                     VanillaPatchEmbedding, WrappedDropout, WrappedDropPath)
 
 __all__ = [
-    "VanillaLayerNorm", "VanillaPatchEmbedding", "VanillaClassifier", "DropPath", "WrappedDropout", "WrappedDropPath"
+    "VanillaLayerNorm", "VanillaPatchEmbedding", "VanillaClassifier",
+    "DropPath", "WrappedDropout", "WrappedDropPath"
 ]
diff -ru extensions/colossalai-0.1.8/colossalai/nn/layer/vanilla/layers.py ColossalAI-0.1.8/colossalai/nn/layer/vanilla/layers.py
--- extensions/colossalai-0.1.8/colossalai/nn/layer/vanilla/layers.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/layer/vanilla/layers.py	2022-07-12 18:08:59.000000000 +0200
@@ -29,9 +29,9 @@
     if drop_prob == 0. or not training:
         return x
     keep_prob = 1 - drop_prob
-    shape = (x.shape[0],) + (1,) * (x.ndim - 1)    # work with diff dim tensors, not just 2D ConvNets
+    shape = (x.shape[0], ) + (1, ) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
     random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
-    random_tensor.floor_()    # binarize
+    random_tensor.floor_()  # binarize
     output = x.div(keep_prob) * random_tensor
     return output
 
@@ -190,7 +190,7 @@
             f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
         output = F.conv2d(input_, self.weight, self.bias, stride=self.patch_size)
         if self.flatten:
-            output = output.flatten(2).transpose(1, 2)    # BCHW -> BNC
+            output = output.flatten(2).transpose(1, 2)  # BCHW -> BNC
 
         cls_token = self.cls_token.expand(output.shape[0], -1, -1)
         output = torch.cat((cls_token, output), dim=1)
diff -ru extensions/colossalai-0.1.8/colossalai/nn/layer/wrapper/pipeline_wrapper.py ColossalAI-0.1.8/colossalai/nn/layer/wrapper/pipeline_wrapper.py
--- extensions/colossalai-0.1.8/colossalai/nn/layer/wrapper/pipeline_wrapper.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/layer/wrapper/pipeline_wrapper.py	2022-07-12 18:08:59.000000000 +0200
@@ -6,7 +6,6 @@
 
 
 class PipelineSharedModuleWrapper:
-
     def __init__(self, pipeline_ranks: Union[List[int], Tuple[int]]) -> None:
         assert len(pipeline_ranks) > 1, f'Expect len(pipeline_ranks) > 1, got {len(pipeline_ranks)}'
         self.pipeline_ranks = pipeline_ranks
@@ -23,7 +22,10 @@
         num_pp_stages = num_dp_groups // pp_size
         for i in range(dp_size):
             for j in range(num_pp_stages):
-                pipeline_ranks = list(range(i * num_dp_groups + j, (i + 1) * num_dp_groups, num_pp_stages))
+                pipeline_ranks = list(
+                    range(i * num_dp_groups + j,
+                          (i + 1) * num_dp_groups,
+                          num_pp_stages))
                 sub_ranks = [pipeline_ranks[idx] for idx in self.pipeline_ranks]
                 group = dist.new_group(sub_ranks)
                 if rank in sub_ranks:
diff -ru extensions/colossalai-0.1.8/colossalai/nn/lr_scheduler/__init__.py ColossalAI-0.1.8/colossalai/nn/lr_scheduler/__init__.py
--- extensions/colossalai-0.1.8/colossalai/nn/lr_scheduler/__init__.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/lr_scheduler/__init__.py	2022-07-12 18:08:59.000000000 +0200
@@ -8,5 +8,6 @@
 __all__ = [
     'CosineAnnealingLR', 'CosineAnnealingWarmupLR', 'FlatAnnealingLR', 'FlatAnnealingWarmupLR', 'LinearWarmupLR',
     'MultiStepLR', 'MultiStepWarmupLR', 'OneCycleLR', 'PolynomialLR', 'PolynomialWarmupLR', 'LambdaLR',
-    'MultiplicativeLR', 'StepLR', 'ExponentialLR'
+    'MultiplicativeLR', 'StepLR',
+    'ExponentialLR'
 ]
diff -ru extensions/colossalai-0.1.8/colossalai/nn/lr_scheduler/onecycle.py ColossalAI-0.1.8/colossalai/nn/lr_scheduler/onecycle.py
--- extensions/colossalai-0.1.8/colossalai/nn/lr_scheduler/onecycle.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/lr_scheduler/onecycle.py	2022-07-12 18:08:59.000000000 +0200
@@ -68,9 +68,7 @@
         https://arxiv.org/abs/1708.07120
     """
 
-    def __init__(self,
-                 optimizer,
-                 total_steps: int,
+    def __init__(self, optimizer, total_steps: int,
                  pct_start=0.3,
                  anneal_strategy='cos',
                  cycle_momentum=True,
@@ -78,12 +76,9 @@
                  max_momentum=0.95,
                  div_factor=25.0,
                  final_div_factor=10000.0,
-                 last_epoch=-1,
-                 **kwargs):
+                 last_epoch=-1, **kwargs):
         max_lrs = list(map(lambda group: group['lr'], optimizer.param_groups))
-        super().__init__(optimizer,
-                         max_lrs,
-                         total_steps=total_steps,
+        super().__init__(optimizer, max_lrs, total_steps=total_steps,
                          pct_start=pct_start,
                          anneal_strategy=anneal_strategy,
                          cycle_momentum=cycle_momentum,
diff -ru extensions/colossalai-0.1.8/colossalai/nn/_ops/addmm.py ColossalAI-0.1.8/colossalai/nn/_ops/addmm.py
--- extensions/colossalai-0.1.8/colossalai/nn/_ops/addmm.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/_ops/addmm.py	2022-07-12 18:08:59.000000000 +0200
@@ -11,16 +11,16 @@
     # mat1:S[1] x mat2:S[0] = Output:P
     # beta * input + alpha * All-Reduce(Output) = res
 
-    mat1 = mat1.redistribute(ShardSpec([-1], [mat2.get_tp_world_size()]), mat2.get_process_group())
+    mat1 = mat1.redistribute(ShardSpec([-1], [mat2.get_tp_world_size()]))
 
     # Output:P
     partial_output = torch.mm(mat1, mat2)
     # Reduce(Output)
-    output = reduce_input(partial_output, mat2.get_process_group())
+    output = reduce_input(partial_output, mat1.get_process_group())
     # input
     assert not input_tensor.has_compute_spec(), 'Invalid input spec for 1Drow addmm op'
     output = beta * input_tensor + alpha * output
-    output = ColoTensor.from_torch_tensor(output, spec=ColoTensorSpec(input_tensor.get_process_group()))
+    output = ColoTensor.from_torch_tensor(output, spec=ColoTensorSpec(ReplicaSpec()))
     return output
 
 
diff -ru extensions/colossalai-0.1.8/colossalai/nn/_ops/linear.py ColossalAI-0.1.8/colossalai/nn/_ops/linear.py
--- extensions/colossalai-0.1.8/colossalai/nn/_ops/linear.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/_ops/linear.py	2022-07-12 18:08:59.000000000 +0200
@@ -3,15 +3,15 @@
 from ._utils import GeneralTensor, convert_to_colo_tensor
 from colossalai.tensor.op_wrapper import colo_op_impl
 from ._utils import reduce_input, reduce_grad
-from colossalai.tensor import ComputePattern, ComputeSpec, ColoTensor, ShardSpec, ReplicaSpec, ColoTensorSpec
+from colossalai.tensor import ComputePattern, ComputePattern, ComputeSpec, ColoTensor, ShardSpec, ReplicaSpec, ColoTensorSpec
 
 
-def colo_linear_1drow(input_tensor: ColoTensor, weight: ColoTensor, bias: Optional[ColoTensor]) -> 'ColoTensor':
+def colo_linear_1Drow(input_tensor: ColoTensor, weight: ColoTensor, bias: Optional[ColoTensor]) -> 'ColoTensor':
     # Input:S[1] x Weight:S[0] = Output:P
     # All-Reduce(Output) + bias = res
     # Input:S[1]
     pg = weight.get_process_group()
-    input_tensor = input_tensor.redistribute(ShardSpec([-1], [weight.get_tp_world_size()]), pg)
+    input_tensor = input_tensor.redistribute(ShardSpec([-1], [weight.get_tp_world_size()]))
 
     # Output:P
     partial_output = F.linear(input_tensor, weight)
@@ -27,7 +27,7 @@
     return output
 
 
-def colo_linear_1dcol(input_tensor: ColoTensor, weight: ColoTensor, bias: Optional[ColoTensor]) -> 'ColoTensor':
+def colo_linear_1Dcol(input_tensor: ColoTensor, weight: ColoTensor, bias: Optional[ColoTensor]) -> 'ColoTensor':
     # Input:B x Weight:S[1] + Bias:S[1] = Output:S[1]
     # All-Gather(Output)
     # Input:B
@@ -48,7 +48,7 @@
 
 def colo_linear_1d(mode: str, input_tensor: ColoTensor, weight: ColoTensor, bias: Optional[ColoTensor]) -> 'ColoTensor':
     assert mode in ('row', 'col')
-    funcs = {'row': colo_linear_1drow, 'col': colo_linear_1dcol}
+    funcs = {'row': colo_linear_1Drow, 'col': colo_linear_1Dcol}
     return funcs[mode](input_tensor, weight, bias)
 
 
Only in ColossalAI-0.1.8/colossalai/nn/optimizer: colo_optimizer.py
diff -ru extensions/colossalai-0.1.8/colossalai/nn/optimizer/colossalai_optimizer.py ColossalAI-0.1.8/colossalai/nn/optimizer/colossalai_optimizer.py
--- extensions/colossalai-0.1.8/colossalai/nn/optimizer/colossalai_optimizer.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/optimizer/colossalai_optimizer.py	2022-07-12 18:08:59.000000000 +0200
@@ -1,3 +1,6 @@
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+
 import torch
 import torch.nn as nn
 from torch import Tensor
diff -ru extensions/colossalai-0.1.8/colossalai/nn/optimizer/__init__.py ColossalAI-0.1.8/colossalai/nn/optimizer/__init__.py
--- extensions/colossalai-0.1.8/colossalai/nn/optimizer/__init__.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/optimizer/__init__.py	2022-07-12 18:08:59.000000000 +0200
@@ -7,7 +7,9 @@
 from .lars import Lars
 from .cpu_adam import CPUAdam
 from .hybrid_adam import HybridAdam
+from .colo_optimizer import ColoOptimizer
 
 __all__ = [
-    'ColossalaiOptimizer', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'Lamb', 'Lars', 'CPUAdam', 'HybridAdam', 'CPU_ADAM_CNT'
+    'ColossalaiOptimizer', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'Lamb', 'Lars', 'CPUAdam', 'HybridAdam',
+    'CPU_ADAM_CNT', 'ColoOptimizer'
 ]
diff -ru extensions/colossalai-0.1.8/colossalai/nn/optimizer/lamb.py ColossalAI-0.1.8/colossalai/nn/optimizer/lamb.py
--- extensions/colossalai-0.1.8/colossalai/nn/optimizer/lamb.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/optimizer/lamb.py	2022-07-12 18:08:59.000000000 +0200
@@ -29,16 +29,20 @@
         https://arxiv.org/abs/1904.00962
     """
 
-    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0, adam=False):
+    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
+                 weight_decay=0, adam=False):
         if not 0.0 <= lr:
             raise ValueError("Invalid learning rate: {}".format(lr))
         if not 0.0 <= eps:
             raise ValueError("Invalid epsilon value: {}".format(eps))
         if not 0.0 <= betas[0] < 1.0:
-            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
+            raise ValueError(
+                "Invalid beta parameter at index 0: {}".format(betas[0]))
         if not 0.0 <= betas[1] < 1.0:
-            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
-        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
+            raise ValueError(
+                "Invalid beta parameter at index 1: {}".format(betas[1]))
+        defaults = dict(lr=lr, betas=betas, eps=eps,
+                        weight_decay=weight_decay)
         self.adam = adam
         super(Lamb, self).__init__(params, defaults)
 
@@ -59,7 +63,8 @@
                     continue
                 grad = p.grad.data
                 if grad.is_sparse:
-                    raise RuntimeError('Lamb does not support sparse gradients, consider SparseAdam instad.')
+                    raise RuntimeError(
+                        'Lamb does not support sparse gradients, consider SparseAdam instad.')
 
                 state = self.state[p]
 
diff -ru extensions/colossalai-0.1.8/colossalai/nn/parallel/layers/module_utils.py ColossalAI-0.1.8/colossalai/nn/parallel/layers/module_utils.py
--- extensions/colossalai-0.1.8/colossalai/nn/parallel/layers/module_utils.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/nn/parallel/layers/module_utils.py	2022-07-12 18:08:59.000000000 +0200
@@ -88,7 +88,7 @@
     compute_pattern = compute_spec.compute_pattern
     if is_colo_module(module):
         # for each param
-        # set its process_group, dist_spec and compute_spec
+        # set DistSpec and ComputeSpec
         colo_module = get_colo_module(module)
         colo_module.register(compute_pattern, pg)
         if not colo_module.has_compute_pattern_with_mode(compute_pattern, mode=mode):
@@ -101,7 +101,6 @@
                 continue
             param = module.get_parameter(param_name)
             if isinstance(param, ColoParameter):
-                param.set_process_group(pg)
                 param.set_dist_spec(dist_spec)
                 param.compute_spec = compute_spec
                 for mod in param.shared_param_modules:
diff -ru extensions/colossalai-0.1.8/colossalai/tensor/colo_parameter.py ColossalAI-0.1.8/colossalai/tensor/colo_parameter.py
--- extensions/colossalai-0.1.8/colossalai/tensor/colo_parameter.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/tensor/colo_parameter.py	2022-07-12 18:08:59.000000000 +0200
@@ -1,6 +1,7 @@
 import torch
 
 from typing import Optional
+from copy import copy
 
 from colossalai.tensor.colo_tensor import ColoTensor
 from colossalai.tensor.const import TensorType
diff -ru extensions/colossalai-0.1.8/colossalai/tensor/colo_tensor.py ColossalAI-0.1.8/colossalai/tensor/colo_tensor.py
--- extensions/colossalai-0.1.8/colossalai/tensor/colo_tensor.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/tensor/colo_tensor.py	2022-07-12 18:08:59.000000000 +0200
@@ -18,7 +18,7 @@
         Tensor._base.__get__,
         Tensor.grad.__get__,
         Tensor._grad.__get__,
-        Tensor.data.__get__,  # make .data returns torch.Tensor rather than ColoTensor
+        Tensor.data.__get__,    # make .data returns torch.Tensor rather than ColoTensor
     }
 
 
@@ -121,13 +121,11 @@
             RuntimeError: 
         """
         assert isinstance(pg, ProcessGroup), f"pg as type {type(pg)} is invalid"
-        # if the new pg is the same as the old pg, just returns
-        if self.process_group == pg:
-            return
-        assert self.process_group.tp_world_size() == 1, \
-            "Can not set_process_group on a ColoTensor whose process_group has tp world group"
-        assert self.dist_spec.placement.value == 'r', \
-            "Can not set_process_group on a ColoTensor whose dist spec is not REPLICATE"
+        if self.process_group.tp_world_size() != 1:
+            raise RuntimeError("can not set_process_group on a ColoTensor whose process_group has tp world group")
+
+        if self.dist_spec.placement.value != 'r':
+            raise RuntimeError("can not set_process_group on a ColoTensor whose dist spec is not REPLICATE")
 
         self.process_group = pg
 
@@ -206,14 +204,12 @@
             ColoTensor: a redistributed colotensor
         """
         if pg is not None and pg != self.get_process_group():
+            print('here _redistribute')
             # if the pg is not equal, convert the current tensor to replicated
-            handled = self.redistribute(ReplicaSpec())
-        else:
-            handled = self
-            pg = self.process_group
-
-        ret = DistSpecManager.handle_trans_spec(handled, handled.dist_spec, dist_spec, pg)
-        return ColoTensor.from_torch_tensor(ret, ColoTensorSpec(pg=pg, dist_attr=dist_spec))
+            self._redistribute(ReplicaSpec())
+            self.process_group = pg
+        ret = DistSpecManager.handle_trans_spec(self, self.dist_spec, dist_spec, self.process_group)
+        return ColoTensor.from_torch_tensor(ret, ColoTensorSpec(self.process_group, dist_attr=dist_spec))
 
     def to_replicate_(self):
         """to_replicate_ 
@@ -292,17 +288,17 @@
 
     def is_replicate(self):
         return self.dist_spec.placement == DistPlacementPattern.REPLICATE \
-               or (len(self.dist_spec.num_partitions) == 1
-                   and self.dist_spec.num_partitions[0] == 1) \
-               or (self.process_group.tp_world_size() == 1)
+            or (len(self.dist_spec.num_partitions) == 1
+                and self.dist_spec.num_partitions[0] == 1) \
+            or (self.process_group.tp_world_size() == 1)
 
     def is_shard_1dcol(self):
         return self.dist_spec.placement == DistPlacementPattern.SHARD \
-               and len(self.dist_spec.dims) == 1 and self.dist_spec.dims[0] == -1
+            and len(self.dist_spec.dims) == 1 and self.dist_spec.dims[0] == -1
 
     def is_shard_1drow(self):
         return self.dist_spec.placement == DistPlacementPattern.SHARD \
-               and len(self.dist_spec.dims) == 1 and self.dist_spec.dims[0] == 0
+            and len(self.dist_spec.dims) == 1 and self.dist_spec.dims[0] == 0
 
     def is_sharded(self):
         return self.dist_spec.placement == DistPlacementPattern.SHARD
diff -ru extensions/colossalai-0.1.8/colossalai/tensor/dist_spec_mgr.py ColossalAI-0.1.8/colossalai/tensor/dist_spec_mgr.py
--- extensions/colossalai-0.1.8/colossalai/tensor/dist_spec_mgr.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/tensor/dist_spec_mgr.py	2022-07-12 18:08:59.000000000 +0200
@@ -88,13 +88,11 @@
             torch.Tensor: a replicated tensor.
         """
         assert old_dist_spec.placement.value == 's', f"The old_dist_spec of DistSpecManager._gather must be SHARD!"
-        is_cpu_tensor = False
-        if tensor.device.type == 'cpu':
+        if version.parse(torch.__version__) < version.parse("1.11.0"):
             # pytorch lower than 1.11 dose not support gather a cpu tensor.
             # Therefore, we transfer tensor to GPU before gather.
             saved_dev = tensor.device
             tensor.data = tensor.data.cuda()
-            is_cpu_tensor = True
 
         buffer = [torch.empty_like(tensor) for _ in range(pg.tp_world_size())]
         assert tensor.device.type == 'cuda'
@@ -108,7 +106,7 @@
             buffer = new_buffer
         assert len(buffer) == 1
 
-        if is_cpu_tensor:
+        if version.parse(torch.__version__) < version.parse("1.11.0"):
             buffer[0].data = buffer[0].data.to(saved_dev)
         return buffer[0]
 
diff -ru extensions/colossalai-0.1.8/colossalai/tensor/process_group.py ColossalAI-0.1.8/colossalai/tensor/process_group.py
--- extensions/colossalai-0.1.8/colossalai/tensor/process_group.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/tensor/process_group.py	2022-07-12 18:08:59.000000000 +0200
@@ -48,7 +48,6 @@
                  tp_degree: Optional[int] = None,
                  dp_degree: Optional[int] = None) -> None:
         if not torch.distributed.is_initialized():
-            self.is_init = False
             return
 
         assert torch.distributed.is_initialized(), f"ProcessGroup must be used after distributed initialized"
@@ -97,7 +96,6 @@
         self._has_cpu_groups = False
         PYTORCHPGDICT_.get(self._tp_rank_list, 'nccl')
         PYTORCHPGDICT_.get(self._dp_rank_list, 'nccl')
-        self.is_init = True
 
     def set_cpu_groups(self):
         if self.has_cpu_groups:
@@ -112,11 +110,8 @@
         return self._has_cpu_groups
 
     def __repr__(self):
-        if self.is_init:
-            return "ProcessGroup:\n\tRank: {}, World size: {}, DP degree: {}, TP degree: {}\n\tRanks in group: {}".\
-                format(self._rank, self._world_size, self._dp_degree, self._tp_degree, self._rank_list)
-        else:
-            return "ProcessGroup not initialized"
+        return "ProcessGroup:\n\tRank: {}, World size: {}, DP degree: {}, TP degree: {}\n\tRanks in group: {}".\
+            format(self._rank, self._world_size, self._dp_degree, self._tp_degree, self._rank_list)
 
     def __eq__(self, obj: 'ProcessGroup') -> bool:
         if not isinstance(obj, ProcessGroup):
diff -ru extensions/colossalai-0.1.8/colossalai/utils/checkpoint/module_checkpoint.py ColossalAI-0.1.8/colossalai/utils/checkpoint/module_checkpoint.py
--- extensions/colossalai-0.1.8/colossalai/utils/checkpoint/module_checkpoint.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/utils/checkpoint/module_checkpoint.py	2022-07-12 18:08:59.000000000 +0200
@@ -1,15 +1,12 @@
 import torch
 import torch.distributed as dist
 from colossalai.tensor import ColoTensor, DistSpecManager
-from colossalai.nn.optimizer import ColossalaiOptimizer
-from copy import copy
-from typing import Optional
 
 
 def save_checkpoint(dire: str,
                     epoch: int,
                     model: torch.nn.Module,
-                    optimizer: Optional[ColossalaiOptimizer] = None,
+                    optimizer: torch.optim.Optimizer = None,
                     lr_scheduler: torch.optim.lr_scheduler._LRScheduler = None,
                     *args,
                     **kwargs):
@@ -19,7 +16,7 @@
         dire (str): directory to save the checkpoint files.
         epoch (int): the number of epoch
         model (torch.nn.Module): a torch module initialized by ColoInitContext
-        optimizer (ColossalaiOptimizer, optional): optimizers. Defaults to None.
+        optimizer (torch.optim.Optimizer, optional): optimizers. Defaults to None.
         lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional): lr schedule. Defaults to None.
     """
 
@@ -31,8 +28,7 @@
         if isinstance(v, ColoTensor):
             mapping[k] = (v.dist_spec, v.compute_spec)
             new_dict[k] = v.to_replicate().detach()
-        else:
-            new_dict[k] = v
+
     if dist.get_rank() == 0:
         for k, v in new_dict.items():
             if isinstance(v, ColoTensor):
@@ -44,21 +40,11 @@
     # delete the new dict
     del new_dict
 
-    optim_state_copy = copy(optimizer.state_dict())
-    for k, v in optim_state_copy['state'].items():
-        for n, t in v.items():
-            if isinstance(t, ColoTensor):
-                t.to_replicate_()
-    if dist.get_rank() == 0:
-        model_state = {'epoch': epoch, 'optim': optim_state_copy}
-        torch.save(model_state, dire + '/epoch_{}_optim.pth'.format(epoch))
-    del optim_state_copy
-
 
 def load_checkpoint(dire,
                     epoch: int,
                     model: torch.nn.Module,
-                    optimizer: Optional[ColossalaiOptimizer] = None,
+                    optimizer: torch.optim.Optimizer = None,
                     lr_scheduler: torch.optim.lr_scheduler._LRScheduler = None,
                     *args,
                     **kwargs):
@@ -69,12 +55,12 @@
         epoch (int): _description_
         rank (int): _description_
         model (torch.nn.Module): _description_
-        optimizer (ColossalaiOptimizer, optional): _description_. Defaults to None.
+        optimizer (torch.optim.Optimizer, optional): _description_. Defaults to None.
         lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional): _description_. Defaults to None.
     """
 
     mapping = dict()
-    for k, v in model.state_dict().items():
+    for k, v in model.named_parameters():
         if isinstance(v, ColoTensor):
             mapping[k] = (v.dist_spec, v.compute_spec)
             v.to_replicate_()
@@ -84,27 +70,6 @@
 
     # reset tensors to original dist spec.
     with DistSpecManager.no_grad():
-        for k, v in model.state_dict().items():
+        for k, v in model.named_parameters():
             if isinstance(v, ColoTensor):
                 v.set_tensor_spec(*mapping[k])
-
-    del mapping
-    mapping = dict()
-
-    for k, v in optimizer.state_dict()['state'].items():
-        for n, t in v.items():
-            if isinstance(t, ColoTensor):
-                mapping[(k, n)] = (t.dist_spec, t.compute_spec)
-                t.to_replicate_()
-
-    colo_checkpoint = torch.load(dire + '/epoch_{}_optim.pth'.format(epoch))
-    optimizer.load_state_dict(colo_checkpoint['optim'])
-
-    for k, v in optimizer.state_dict()['state'].items():
-        for n, t in v.items():
-            if isinstance(t, ColoTensor):
-                # skip key not in mapping.
-                # For Adam, if it dose not execute step() once, there will be not exp_avg and exp_avg_sq in optimizer
-                if (k, n) not in mapping:
-                    continue
-                t.set_tensor_spec(*mapping[(k, n)])
diff -ru extensions/colossalai-0.1.8/colossalai/utils/profiler/legacy/mem_profiler.py ColossalAI-0.1.8/colossalai/utils/profiler/legacy/mem_profiler.py
--- extensions/colossalai-0.1.8/colossalai/utils/profiler/legacy/mem_profiler.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/utils/profiler/legacy/mem_profiler.py	2022-07-12 18:08:59.000000000 +0200
@@ -2,7 +2,7 @@
 from typing import Union
 from colossalai.engine import Engine
 from torch.utils.tensorboard import SummaryWriter
-from colossalai.gemini.ophooks import MemTracerOpHook
+from colossalai.engine.ophooks import MemTracerOpHook
 from colossalai.utils.profiler.legacy.prof_utils import BaseProfiler
 
 
diff -ru extensions/colossalai-0.1.8/colossalai/utils/profiler/stateful_tensor_mem_extention.py ColossalAI-0.1.8/colossalai/utils/profiler/stateful_tensor_mem_extention.py
--- extensions/colossalai-0.1.8/colossalai/utils/profiler/stateful_tensor_mem_extention.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/utils/profiler/stateful_tensor_mem_extention.py	2022-07-12 18:08:59.000000000 +0200
@@ -5,7 +5,7 @@
 from enum import Enum
 from typing import List
 from colossalai.gemini.stateful_tensor import StatefulTensor
-from colossalai.gemini.ophooks import BaseOpHook
+from colossalai.engine.ophooks import BaseOpHook
 from colossalai.engine import Engine
 from colossalai.utils.profiler.extention import ProfilerExtension
 
Only in ColossalAI-0.1.8/colossalai/utils/tensor_detector: readme.md
diff -ru extensions/colossalai-0.1.8/colossalai/zero/sharded_model/sharded_model_v2.py ColossalAI-0.1.8/colossalai/zero/sharded_model/sharded_model_v2.py
--- extensions/colossalai-0.1.8/colossalai/zero/sharded_model/sharded_model_v2.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/zero/sharded_model/sharded_model_v2.py	2022-07-12 18:08:59.000000000 +0200
@@ -8,9 +8,9 @@
 import torch.nn as nn
 from colossalai.context.parallel_mode import ParallelMode
 from colossalai.core import global_context as gpc
-from colossalai.gemini.ophooks import register_ophooks_recursively
+from colossalai.engine.ophooks import register_ophooks_recursively
 from colossalai.zero.utils import ZeroHook
-from colossalai.gemini.paramhooks import BaseParamHookMgr
+from colossalai.engine.paramhooks import BaseParamHookMgr
 from colossalai.logging import get_dist_logger
 from colossalai.utils import get_current_device, disposable
 from colossalai.gemini.memory_tracer.memstats_collector import MemStatsCollector
diff -ru extensions/colossalai-0.1.8/colossalai/zero/utils/zero_hook.py ColossalAI-0.1.8/colossalai/zero/utils/zero_hook.py
--- extensions/colossalai-0.1.8/colossalai/zero/utils/zero_hook.py	2022-07-15 11:25:57.000000000 +0200
+++ ColossalAI-0.1.8/colossalai/zero/utils/zero_hook.py	2022-07-12 18:08:59.000000000 +0200
@@ -8,7 +8,7 @@
 from colossalai.utils import get_current_device
 
 from colossalai.zero.shard_utils import BaseShardStrategy
-from colossalai.gemini.ophooks import BaseOpHook
+from colossalai.engine.ophooks import BaseOpHook
 
 from colossalai.gemini.stateful_tensor_mgr import StatefulTensorMgr
 from colossalai.gemini.memory_tracer import MemStatsCollector
Only in extensions/colossalai-0.1.8: colossalai.egg-info
Only in ColossalAI-0.1.8: CONTRIBUTING.md
Only in ColossalAI-0.1.8: docker
Only in ColossalAI-0.1.8: docs
Only in ColossalAI-0.1.8: examples
Only in ColossalAI-0.1.8: .flake8
Only in ColossalAI-0.1.8: .github
Only in ColossalAI-0.1.8: .gitignore
Only in ColossalAI-0.1.8: .gitmodules
Only in ColossalAI-0.1.8: inference
Only in ColossalAI-0.1.8: LICENSE
Only in extensions/colossalai-0.1.8: PKG-INFO
Only in ColossalAI-0.1.8: .pre-commit-config.yaml
Only in ColossalAI-0.1.8: pytest.ini
Only in ColossalAI-0.1.8: README-zh-Hans.md
Only in ColossalAI-0.1.8: .readthedocs.yaml
Only in extensions/colossalai-0.1.8: setup.cfg
Only in ColossalAI-0.1.8: .style.yapf
Only in ColossalAI-0.1.8/tests: __init__.py
Only in ColossalAI-0.1.8/tests: test_amp
Only in ColossalAI-0.1.8/tests: test_comm
Only in ColossalAI-0.1.8/tests: test_config
Only in ColossalAI-0.1.8/tests: test_context
Only in ColossalAI-0.1.8/tests: test_data
Only in ColossalAI-0.1.8/tests: test_data_pipeline_tensor_parallel
Only in ColossalAI-0.1.8/tests: test_ddp
Only in ColossalAI-0.1.8/tests: test_engine
Only in ColossalAI-0.1.8/tests: test_fx
Only in ColossalAI-0.1.8/tests: test_gemini
Only in ColossalAI-0.1.8/tests: test_layers
Only in ColossalAI-0.1.8/tests: test_moe
Only in ColossalAI-0.1.8/tests: test_optimizer
Only in ColossalAI-0.1.8/tests: test_pipeline
Only in ColossalAI-0.1.8/tests: test_tensor
Only in ColossalAI-0.1.8/tests: test_trainer
Only in ColossalAI-0.1.8/tests: test_utils
Only in ColossalAI-0.1.8/tests: test_zero
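
For reference, a minimal sketch of what pointing the download at the GitHub release tarball instead of PyPI typically looks like in an easyconfig; this is illustrative only (names and values follow common EasyBuild conventions, the exact change made in this PR may differ):

# illustrative easyconfig fragment, not the exact change from this PR:
# fetch the ColossalAI 0.1.8 sources from the GitHub release tarball
source_urls = ['https://github.com/hpcaitech/ColossalAI/archive/']
sources = ['v%(version)s.tar.gz']
# the checksum must be updated as well, since the GitHub tarball is not
# identical to the sdist that used to be on PyPI (see the diff above)
checksums = ['<sha256 of v0.1.8.tar.gz>']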

@boegel boegel changed the title fix source_urls for colossalai fix source_urls for colossalai 0.1.8 (no longer available via PyPI, only via GitHub repo) Nov 23, 2022
@boegel
Member

boegel commented Nov 24, 2022

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3900.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 2 x NVIDIA [Unknown Error], 520.61.05, 2 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/05e52fc2dd1299a161597f0d54a2e791 for a full test report.

Member

@boegel boegel left a comment

lgtm

@boegel
Copy link
Member

boegel commented Nov 24, 2022

Going in, thanks @ThomasHoffmann77!

@boegel boegel merged commit 07802ca into easybuilders:develop Nov 24, 2022