First release readme #227

Merged · 21 commits · Apr 16, 2024
42 changes: 33 additions & 9 deletions README.md
@@ -1,17 +1,41 @@
# torchtitan
<p align="center">
<picture>
<source media="(prefers-color-scheme: light)" srcset="https://github.com/lessw2020/TorchTitan/blob/1ab9828ae6aa0e6508d9a7002d743d96d85e8599/assets/images/TorchTitan_logo_main.jpg">
<img alt="TorchTitan_Logo" width=35%>
</picture>
</p>

torchtitan is a native PyTorch reference architecture showcasing some of the latest PyTorch techniques for large-scale model training.
* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying 1D, 2D, or (soon) 3D Parallelisms.
* Modular components instead of monolithic codebase.
* Get started in minutes, not hours!
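The per-parameter sharding idea behind these parallelisms can be illustrated in plain Python (a toy sketch only — torchtitan's actual implementation uses PyTorch's distributed primitives, and the helper names below are hypothetical):

```python
# Toy sketch of per-parameter sharding, the idea behind FSDP-style data
# parallelism: each rank owns only a contiguous slice of every parameter,
# and the full parameter is reassembled (all-gathered) only when needed.

def shard_param(param, rank, world_size):
    """Return the contiguous slice of `param` owned by `rank`."""
    n = len(param)
    per_rank = -(-n // world_size)  # ceiling division
    start = rank * per_rank
    return param[start:start + per_rank]

def all_gather(shards):
    """Reassemble the full parameter from every rank's shard."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full

param = list(range(10))  # a flat parameter with 10 elements
world_size = 4
shards = [shard_param(param, r, world_size) for r in range(world_size)]
assert all_gather(shards) == param  # round-trips back to the full parameter
```

Because each rank stores only `1/world_size` of every parameter, peak memory per GPU shrinks while the model code itself stays unchanged.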

Note: This repository is currently under heavy development.
Please note: `torchtitan` is a proof-of-concept for large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase. torchtitan is complementary to, and not a replacement for, any of the great large-scale LLM training codebases such as Megatron, MegaBlocks, LLM Foundry, DeepSpeed, etc. Instead, we hope that the features showcased in torchtitan will be adopted by these codebases quickly. torchtitan is unlikely to ever grow a large community around it.

torchtitan is a native PyTorch library offering PyTorch-native parallelisms and various training techniques for training large models.
## Design Principles
## Pre-Release Updates:
#### (4/16/2024): TorchTitan is now public, but in a pre-release state and under development. Currently we showcase pre-training Llama2 models (LLMs) of various sizes from scratch.

Key features available:
1. [FSDP2 (per-parameter sharding)](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md)
2. Tensor Parallel (FSDP + Tensor Parallel)
3. Selective layer and op activation checkpointing
4. Distributed checkpointing (async pending)
5. 3 datasets pre-configured (47K - 144M)
6. GPU usage, MFU, tokens per second, and other metrics reported and displayed via TensorBoard
7. Optional fused RMSNorm, learning rate scheduler, meta init, and more
8. All options easily configured via TOML files

While torchtitan utilizes the PyTorch ecosystem for things like data loading (e.g., HuggingFace datasets), the core functionality is written in PyTorch.

* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying 1D/2D or 3D Parallelisms.
* Modular components instead of monolithic codebase.
## Coming soon
1. Async checkpointing
2. FP8 support
3. Context Parallel
4. 3D (Pipeline Parallel)
5. torch.compile support


# Installation
## Installation

Install PyTorch from source or install the latest PyTorch nightly, then install the requirements by

@@ -30,7 +54,7 @@ run the llama debug model locally to verify the setup is correct:
./run_llama_train.sh
```

# TensorBoard
## TensorBoard

To visualize TensorBoard metrics of models trained on a remote server via a local web browser:

Binary file added assets/images/TorchTitan_logo_main.jpg
1 change: 1 addition & 0 deletions assets/images/readme.md
@@ -0,0 +1 @@
images folder for main repo
4 changes: 2 additions & 2 deletions train.py
@@ -390,8 +390,8 @@ def loss_fn(pred, labels):
)

if torch.distributed.get_rank() == 0:
logger.info("Sleeping 1 second for other ranks to complete")
time.sleep(1)
logger.info("Sleeping for 2 seconds for other ranks to complete")
time.sleep(2)

metric_logger.close()
logger.info("Training completed")