First release readme #227

Merged · 21 commits · Apr 16, 2024
42 changes: 33 additions & 9 deletions README.md
@@ -1,17 +1,41 @@
# torchtitan
<p align="center">
<picture>
<source media="(prefers-color-scheme: light)" srcset="https://github.com/lessw2020/TorchTitan/blob/1ab9828ae6aa0e6508d9a7002d743d96d85e8599/assets/images/TorchTitan_logo_main.jpg">
<img alt="TorchTitan_Logo" width=35%>
</picture>
</p>

torchtitan is a native PyTorch reference architecture showcasing some of the latest PyTorch techniques for large-scale model training.
* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying 1D, 2D, or (soon) 3D Parallelisms.
* Modular components instead of monolithic codebase.
* Get started in minutes, not hours!
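The per-parameter sharding idea behind these parallelisms can be illustrated in plain Python (a toy sketch only — torchtitan's actual implementation uses PyTorch's distributed primitives, and the helper names below are hypothetical):

```python
# Toy sketch of per-parameter sharding, the idea behind FSDP-style data
# parallelism: each rank owns only a contiguous slice of every parameter,
# and the full parameter is reassembled (all-gathered) only when needed.

def shard_param(param, rank, world_size):
    """Return the contiguous slice of `param` owned by `rank`."""
    n = len(param)
    per_rank = -(-n // world_size)  # ceiling division
    start = rank * per_rank
    return param[start:start + per_rank]

def all_gather(shards):
    """Reassemble the full parameter from every rank's shard."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full

param = list(range(10))  # a flat parameter with 10 elements
world_size = 4
shards = [shard_param(param, r, world_size) for r in range(world_size)]
assert all_gather(shards) == param  # round-trips back to the full parameter
```

Because each rank stores only `1/world_size` of every parameter, peak memory per GPU shrinks while the model code itself stays unchanged.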

Note: This repository is currently under heavy development.
Please note: `torchtitan` is a proof-of-concept for large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase. torchtitan is complementary to, and not a replacement for, any of the great large-scale LLM training codebases such as Megatron, MegaBlocks, LLM Foundry, DeepSpeed, etc. Instead, we hope that the features showcased in torchtitan will be adopted by these codebases quickly. torchtitan is unlikely to ever grow a large community around it.

torchtitan is a native PyTorch library offering PyTorch-native parallelisms and various training techniques for training large models.
## Design Principles
## Pre-Release Updates:
#### (4/16/2024): TorchTitan is now public, but in a pre-release state and under development. Currently we showcase pre-training Llama2 models (LLMs) of various sizes from scratch.

Key features available:
1. [FSDP2 (per-parameter sharding)](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md)
2. Tensor Parallel (FSDP + Tensor Parallel)
3. Selective layer and op activation checkpointing
4. Distributed checkpointing (async pending)
5. 3 datasets pre-configured (47K - 144M)
6. GPU usage, MFU, tokens per second, and other metrics reported and displayed via TensorBoard
7. Optional fused RMSNorm, learning rate scheduler, meta init, and more
8. All options easily configured via TOML files

While torchtitan utilizes the PyTorch ecosystem for things like data loading (e.g., HuggingFace datasets), the core functionality is written in PyTorch.

* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying 1D/2D or 3D Parallelisms.
* Modular components instead of monolithic codebase.
## Coming soon
1. Async checkpointing
2. FP8 support
3. Context Parallel
4. 3D (Pipeline Parallel)
5. torch.compile support


# Installation
## Installation

Install PyTorch from source or install the latest PyTorch nightly, then install the requirements by

@@ -30,7 +54,7 @@ run the llama debug model locally to verify the setup is correct:
./run_llama_train.sh
```

# TensorBoard
## TensorBoard

To visualize TensorBoard metrics of models trained on a remote server via a local web browser:

Binary file added assets/images/TorchTitan_logo_main.jpg
1 change: 1 addition & 0 deletions assets/images/readme.md
@@ -0,0 +1 @@
images folder for main repo
4 changes: 2 additions & 2 deletions train.py
@@ -390,8 +390,8 @@ def loss_fn(pred, labels):
)

if torch.distributed.get_rank() == 0:
logger.info("Sleeping 1 second for other ranks to complete")
time.sleep(1)
logger.info("Sleeping for 2 seconds for other ranks to complete")
time.sleep(2)

metric_logger.close()
logger.info("Training completed")