feature: run saelens on AWS with one command (#138)
* Ansible playbook for automating caching activations and training saes

* Add automation

* Fix example config

* Fix bugs with ansible mounting s3

* Reorg, more automation, Ubuntu instead of Amazon Linux

* More automation

* Train SAE automation

* Train SAEs and readme

* fix gitignore

* Fix automation config bugs, clean up paths

* Fix shutdown time, logs
hijohnnylin authored May 12, 2024
1 parent 4cb270b commit 13de52a
Showing 21 changed files with 1,222 additions and 1 deletion.
7 changes: 6 additions & 1 deletion .gitignore
@@ -181,4 +181,9 @@ neuronpedia_outputs/
tests/benchmark/fixtures/

# ignore prof
prof/
prof/

scripts/ansible/cache_acts.yml
scripts/ansible/jobs/
scripts/ansible/configs/
scripts/ansible/ansible.log
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -47,6 +47,9 @@ flake8 = "7.0.0"
isort = "5.13.2"
pyright = "^1.1.351"
mamba-lens = "^0.0.4"
ansible-lint = { version = "^24.2.3", markers = "platform_system != 'Windows'" }
botocore = "^1.34.101"
boto3 = "^1.34.101"

[tool.poetry.extras]
mamba = ["mamba-lens"]
89 changes: 89 additions & 0 deletions scripts/ansible/README.md
@@ -0,0 +1,89 @@
This is an Ansible playbook that runs `Cache Activations` and `Train SAE` jobs on AWS.

- The playbook looks in the `configs` directory to determine which jobs to run, and runs them.
- It keeps a copy of each previously run job in the `jobs` directory.
- Check out the `configs_example` directory and read the comments in the YAML files.

### Prerequisites
- AWS Account
- AWS ability to launch G instance types - you need to submit a request to enable this.
- [Submit request for G. Click "Request increase at account level".](https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
- [Increase other quotas (like P instances)](https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas)
- G and P instances are not enabled by default [docs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html)
- What GPUs/specs are G and P instance types? [docs](https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html)
- Wandb API Key

### Local setup

#### Save AWS Credentials locally
1) Generate a set of AWS access keys
   1) Sign into AWS
   2) [Click here to generate keys](https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials/access-key-wizard)

2) Save the following file into `~/.aws/credentials`, replacing the values with the ones you generated.
- Don't change the region - keep it as `us-east-1`. Since all data transfer happens within the same AWS region, it doesn't matter where you physically reside. If you change this, you will also need to update `aws_ec2.yml` and the AMI ID in `cache_acts.yml`, and some services may not be available in other regions.

```
# ~/.aws/credentials
[default]
aws_access_key_id=AWS_ACCESS_KEY_ID
aws_secret_access_key=AWS_SECRET_ACCESS_KEY
region=us-east-1
```
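
If you have the AWS CLI installed, you can optionally sanity-check that the credentials work before going further (the playbook itself only needs the file above):

```
aws sts get-caller-identity
```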

#### Install Ansible

```
pip install ansible
ansible --version
ansible-galaxy collection install -r util/requirements.yml
```

#### Configure a Job to Run
```
cd scripts/ansible
cp -r configs_example configs
```
Modify `configs/shared.yml` and set `s3_bucket_name` to something unique. Bucket names are global across all AWS accounts, so yours must be unique (think of it like a username on a platform).
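
For example, the only line you need to change is the bucket name (the value below is just an illustration; pick your own):

```
# configs/shared.yml
s3_bucket_name: yourname.saes.1234
```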

You don't need to modify anything else to run the example job.

Explanation of the config files under `configs_example`:
- `shared.yml` - Shared values for all jobs.
- `cache_acts.yml` - Cache Activations job with `total_training_steps` of 2000.
- `train_sae` - Contains `sweep_common.yml`, which holds the name of the sweep plus all of the config values shared between the sweep's jobs. There are two jobs in the `jobs` subdirectory, each defining only the values that differ - in this case the `l1_coefficient`. (See the layout sketch after this list.)
- The example uses only 2000 training steps for Cache Activations and 500 training steps for Train SAEs, so the jobs themselves are fast - most of the time is spent launching and configuring instances.
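
Based on the `configs_example` files shown in this commit, the copied `configs` directory looks roughly like this (the repository may contain additional files that are not rendered on this page):

```
configs/
├── shared.yml
├── cache_acts.yml
└── train_sae/
    ├── sweep.yml
    └── jobs/
        ├── gelu_l1_coeff_2.yml
        └── gelu_l1_coeff_5.yml
```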

#### Run the Example Job

You should have the [AWS EC2 Console](https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1) open so you can watch instances being launched and terminate them manually if anything goes wrong. By default, if you exit Ansible prematurely it will not stop your EC2 instances, so you will keep being billed for them (a CLI sketch for manual cleanup follows the step list below).

```
cd scripts/ansible
export WANDB_API_KEY=[WANDB API key here]
ansible-playbook run-configs.yml
```
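
Ansible also writes a local log (`log_path = ansible.log` in `ansible.cfg`, shown further down), so you can follow progress from a second terminal:

```
tail -f scripts/ansible/ansible.log
```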

Briefly, this example job will (time estimates are for the example above):
1) Create your S3 bucket and other prerequisites. (~3 minutes)
2) Run the Cache Activations job (~15 minutes)
   1) Launch EC2 instance
   2) Run `util/cache_acts.py`, saving output to your S3 bucket
   3) Terminate the EC2 instance
3) Run the Train SAE jobs in parallel (~15 minutes)
   1) Launch EC2 instance
   2) Run `util/train_sae.py`, loading the cached activations from your S3 bucket.
   3) You can monitor progress in WANDB, where your artifacts should also appear.
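
If the playbook exits prematurely and leaves instances running (see the warning above), here is a minimal cleanup sketch using the AWS CLI, assuming it is installed and configured with the same credentials (the instance ID below is a placeholder):

```
# List running instances so you can spot anything left behind
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId,InstanceType]" \
  --output table

# Terminate a stray instance by ID
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```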

### TODO
- document how to monitor running jobs
- better integration with wandb ("sweep param")
- should we just use/repurpose wandb stuff instead of manually doing all this?
- use containers, possibly CloudFormation, to simplify instance configuration
- use 'typer' on `cache_acts.py` and `train_sae.py`
- ansible "best practices", better use of ansible features
- don't use 777 permissions
- AWX server for GUI monitoring jobs
- Automatically pull the latest AMI using Ansible
6 changes: 6 additions & 0 deletions scripts/ansible/ansible.cfg
@@ -0,0 +1,6 @@
[defaults]
host_key_checking = False
deprecation_warnings = False
callbacks_enabled = profile_tasks
log_path = ansible.log
inventory = util/aws_ec2.yml
34 changes: 34 additions & 0 deletions scripts/ansible/configs_example/cache_acts.yml
@@ -0,0 +1,34 @@
#################### CHANGE THE FOLLOWING

# Name your Cache Activations job.
# Jobs are tagged on AWS by name, which allows you to run multiple jobs.
# NO DASHES IN JOB NAMES. Underscores are fine.
job_name: gelu_1l_500

# Overview on GPU vs Instance Types: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
# Instance Specs and Pricing: https://instances.vantage.sh/?cost_duration=daily
# g6.12xlarge seems to work fine for gelu-1l, c4-tokenized-2b, 200_000 training steps, batch size 4096
instance_type: g6.xlarge

total_training_steps: 2_000

model_name: gelu-1l
hook_point: blocks.0.hook_mlp_out
hook_point_layer: 0
d_in: 512
dataset_path: NeelNanda/c4-tokenized-2b
context_size: 1024
is_dataset_tokenized: true
prepend_bos: true
# training_tokens: 81920000 # this is ignored and instead calculated by total_training_steps * train_batch_size
train_batch_size: 4096
n_batches_in_buffer: 4
store_batch_size: 128
normalize_activations: false
shuffle_every_n_buffers: 8
n_shuffles_with_last_section: 1
n_shuffles_in_entire_dir: 1
n_shuffles_final: 1
# device: cuda # this is ignored and instead set by the python code
seed: 42
dtype: torch.float16
34 changes: 34 additions & 0 deletions scripts/ansible/configs_example/shared.yml
@@ -0,0 +1,34 @@
########## Common Parameters
#
# This is the bucket where the activation caches and checkpoints will be stored.
# It will automatically be created if it doesn't exist.
# Choose a unique name - bucket names must be globally unique across all AWS accounts.
#
# Paths
# - Activation Caches: {bucket_root}/cached_activations/{model_name}/{dataset_path}/{total_training_steps}
# - Checkpoints: {bucket_root}/checkpoints/{checkpoint_id}
#
# Keep the bucket name the same if you want to use the same activation cache.
# Note that AWS has a limit on the number of buckets per account.
#
# Allows: lowercase letters, numbers, dashes, and dots
# Example: johnny.saes.1234
s3_bucket_name: johnny.saes.1234

# Change if you need a specific version
saelens_version_or_branch: automation

#################### DON'T CHANGE THE FOLLOWING

# This image is only for us-east-1. You do not need to update it frequently.
# Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.2.0 (Ubuntu 20.04) 20240507
ec2_image: ami-02a07d31009cc8717
# This path may change depending on your instance type and/or AMI
instance_storage_path: /opt/dlami/nvme

# You should keep the following the same
ssh_key_filename: "saelens_ansible"
ssh_key_path: "~/.ssh/{{ ssh_key_filename }}"
iam_role_name: "saelens-iam-role"
sec_group: "ssh-only"
local_s3_mount_path: "/mnt/s3"
88 changes: 88 additions & 0 deletions scripts/ansible/configs_example/train_sae/jobs/gelu_l1_coeff_2.yml
@@ -0,0 +1,88 @@
# Name your Train SAE Job
# This job name should match the file name.
# Jobs are tagged on AWS by name, which allows you to run multiple jobs.
# NO DASHES IN JOB NAMES. Underscores are fine.
job_name: gelu_l1_coeff_2
wandb_project: "gelu_l1"

# The name of your completed Cache Activation job. Must match exactly.
cache_acts_job_name: gelu_1l_500

# IMPORTANT
# The YAML 1.1 spec requires scientific notation to include a decimal point in order to be parsed as a number.
# 1.0e-4 is correct and will be parsed as a number. 1e-4 is NOT correct and will be parsed as a string.
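# For example (using the lr key below purely as an illustration):
#   lr: 5.0e-5   -> parsed as the float 0.00005
#   lr: 5e-5     -> parsed as the string "5e-5"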

# Overview on GPU vs Instance Types: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
# Instance Specs and Pricing: https://instances.vantage.sh/?cost_duration=daily
# g6.xlarge seems to work fine for gelu-1l, c4-tokenized-2b, 200_000 training steps, batch size 4096
instance_type: g6.xlarge

# this should match the cache activations
total_training_steps: 2_000

# set these relative to your total_training_steps
l1_warm_up_steps: 0
lr_warm_up_steps: 100
lr_decay_steps: 400

model_name: gelu-1l
hook_point: blocks.0.hook_mlp_out
hook_point_layer: 0
d_in: 512
dataset_path: NeelNanda/c4-tokenized-2b
streaming: False
context_size: 1024
is_dataset_tokenized: True
prepend_bos: True
# How big do we want our SAE to be?
expansion_factor: 64
# Dataset / Activation Store
# When we do a proper test
# training_tokens= 820_000_000, # 200k steps * 4096 batch size ~ 820M tokens (doable overnight on an A100)
# For now.
use_cached_activations: True
# training_tokens: total_training_tokens # this will be overwritten by total_training_steps * train_batch_size
train_batch_size: 4096
# Loss Function
## Reconstruction Coefficient.
mse_loss_normalization: None # MSE loss normalization is not mentioned (so we use standard MSE loss). But note we take an average over the batch.
## Anthropic does not mention using an Lp norm other than L1.
l1_coefficient: 2
lp_norm: 1.0
# Instead, they multiply the L1 loss contribution
# from each feature of the activations by the decoder norm of the corresponding feature.
scale_sparsity_penalty_by_decoder_norm: True
# Learning Rate
lr_scheduler_name: "constant" # we set this independently of warmup and decay steps.
## No ghost grad term.
use_ghost_grads: False
# Initialization / Architecture
apply_b_dec_to_input: False
# encoder bias zeros. (I'm not sure what it is by default now)
# decoder bias zeros.
b_dec_init_method: zeros
normalize_sae_decoder: False
decoder_heuristic_init: True
init_encoder_as_decoder_transpose: True
# Optimizer
lr: 5.0e-5
## the adam optimizer has no weight decay by default, so no need to worry about this.
adam_beta1: 0.9
adam_beta2: 0.999
# Buffer details won't matter if we cache / shuffle our activations ahead of time.
n_batches_in_buffer: 64
store_batch_size: 16
normalize_activations: False
# Feature Store
feature_sampling_window: 1000
dead_feature_window: 1000
dead_feature_threshold: 1.0e-4
# WANDB
log_to_wandb: true
wandb_log_frequency: 50
eval_every_n_wandb_logs: 10
# Misc
seed: 42
n_checkpoints: 0
checkpoint_path: "checkpoints"
dtype: torch.float32
88 changes: 88 additions & 0 deletions scripts/ansible/configs_example/train_sae/jobs/gelu_l1_coeff_5.yml
@@ -0,0 +1,88 @@
# Name your Train SAE Job
# This job name should match the file name.
# Jobs are tagged on AWS by name, which allows you to run multiple jobs.
# NO DASHES IN JOB NAMES. Underscores are fine.
job_name: gelu_l1_coeff_5
wandb_project: "gelu_l1"

# The name of your completed Cache Activation job. Must match exactly.
cache_acts_job_name: gelu_1l_500

# IMPORTANT
# The YAML 1.1 spec requires scientific notation to include a decimal point in order to be parsed as a number.
# 1.0e-4 is correct and will be parsed as a number. 1e-4 is NOT correct and will be parsed as a string.

# Overview on GPU vs Instance Types: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
# Instance Specs and Pricing: https://instances.vantage.sh/?cost_duration=daily
# g6.xlarge seems to work fine for gelu-1l, c4-tokenized-2b, 200_000 training steps, batch size 4096
instance_type: g6.xlarge

# this should match the cache activations
total_training_steps: 2_000

# set these relative to your total_training_steps
l1_warm_up_steps: 0
lr_warm_up_steps: 100
lr_decay_steps: 400

model_name: gelu-1l
hook_point: blocks.0.hook_mlp_out
hook_point_layer: 0
d_in: 512
dataset_path: NeelNanda/c4-tokenized-2b
streaming: False
context_size: 1024
is_dataset_tokenized: True
prepend_bos: True
# How big do we want our SAE to be?
expansion_factor: 64
# Dataset / Activation Store
# When we do a proper test
# training_tokens= 820_000_000, # 200k steps * 4096 batch size ~ 820M tokens (doable overnight on an A100)
# For now.
use_cached_activations: True
# training_tokens: total_training_tokens # this will be overwritten by total_training_steps * train_batch_size
train_batch_size: 4096
# Loss Function
## Reconstruction Coefficient.
mse_loss_normalization: None # MSE loss normalization is not mentioned (so we use standard MSE loss). But note we take an average over the batch.
## Anthropic does not mention using an Lp norm other than L1.
l1_coefficient: 5
lp_norm: 1.0
# Instead, they multiply the L1 loss contribution
# from each feature of the activations by the decoder norm of the corresponding feature.
scale_sparsity_penalty_by_decoder_norm: True
# Learning Rate
lr_scheduler_name: "constant" # we set this independently of warmup and decay steps.
## No ghost grad term.
use_ghost_grads: False
# Initialization / Architecture
apply_b_dec_to_input: False
# encoder bias zeros. (I'm not sure what it is by default now)
# decoder bias zeros.
b_dec_init_method: zeros
normalize_sae_decoder: False
decoder_heuristic_init: True
init_encoder_as_decoder_transpose: True
# Optimizer
lr: 5.0e-5
## the adam optimizer has no weight decay by default, so no need to worry about this.
adam_beta1: 0.9
adam_beta2: 0.999
# Buffer details won't matter if we cache / shuffle our activations ahead of time.
n_batches_in_buffer: 64
store_batch_size: 16
normalize_activations: False
# Feature Store
feature_sampling_window: 1000
dead_feature_window: 1000
dead_feature_threshold: 1.0e-4
# WANDB
log_to_wandb: true
wandb_log_frequency: 50
eval_every_n_wandb_logs: 10
# Misc
seed: 42
n_checkpoints: 0
checkpoint_path: "checkpoints"
dtype: torch.float32
4 changes: 4 additions & 0 deletions scripts/ansible/configs_example/train_sae/sweep.yml
@@ -0,0 +1,4 @@
# Name your Train SAE Sweep
# Jobs in the same sweep will have the same sweep name
# NO DASHES IN NAMES. Underscores are fine.
sweep_name: gelu_1l_test_500_l1_sweep