feature: run saelens on AWS with one command (#138)
* Ansible playbook for automating caching activations and training saes
* Add automation
* Fix example config
* Fix bugs with ansible mounting s3
* Reorg, more automation, Ubuntu instead of Amazon Linux
* More automation
* Train SAE automation
* Train SAEs and readme
* fix gitignore
* Fix automation config bugs, clean up paths
* Fix shutdown time, logs
1 parent 4cb270b, commit 13de52a
Showing 21 changed files with 1,222 additions and 1 deletion.
@@ -0,0 +1,89 @@
This is an Ansible playbook that runs `Cache Activations` and `Train SAE` jobs on AWS.

- The playbook looks in the `configs` directory for jobs to run, and runs them.
- It makes a copy of previously run jobs in the `jobs` directory.
- Check out the `configs_example` directory and read the comments in the YAML files.

### Prerequisites
- AWS account
- The ability to launch G instance types on AWS - you need to submit a quota increase request to enable this.
- [Submit a request for G instances. Click "Request increase at account level".](https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
- [Increase other quotas (like P instances)](https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas)
- G and P instances are not enabled by default ([docs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html))
- What GPUs/specs do G and P instance types have? ([docs](https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html))
- Wandb API key

### Local setup

#### Save AWS Credentials locally
1) Generate a set of AWS access keys:
   1) Sign in to AWS.
   2) [Click here to generate keys](https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials/access-key-wizard).

2) Save the following file to `~/.aws/credentials`, replacing the values with the ones you generated.
   - Don't change the region - keep it as `us-east-1`. Since all data transfer happens within the same AWS region, it doesn't matter where you physically reside. If you change it, you will need to update `aws_ec2.yml` and the AMI ID in `cache_acts.yml`, and some services may not be available in other regions.

```
# ~/.aws/credentials
[default]
aws_access_key_id=AWS_ACCESS_KEY_ID
aws_secret_access_key=AWS_SECRET_ACCESS_KEY
region=us-east-1
```

#### Install Ansible

```
pip install ansible
ansible --version
ansible-galaxy collection install -r util/requirements.yml
```

#### Configure a Job to Run
```
cd scripts/ansible
cp -r configs_example configs
```
Modify `configs/shared.yml` and set `s3_bucket_name` to something unique. Bucket names are global across all AWS accounts, so yours must be unique (think of it like a username on a platform).

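For example, a minimal edit to `configs/shared.yml` might look like this (the bucket name below is just a placeholder - pick your own):

```
# configs/shared.yml
s3_bucket_name: yourname.saes.1234  # lowercase letters, numbers, dashes, and dots only
```
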
You don't need to modify anything else to run the example job.

Explanation of the config files under `configs_example`:
- `shared.yml` - Shared values for all jobs.
- `cache_acts.yml` - Caches activations, with `total_training_steps` set to 2000.
- `train_sae` - Contains `sweep_common.yml`, which has the name of the sweep plus all of the config values shared by the sweep's jobs. There are two jobs in the `jobs` subdirectory, each of which defines only the values that differ - in this case the `l1_coefficient` (see the sketch below).
- The example uses only 2000 training steps for Cache Activations and 500 training steps for Train SAEs, so the jobs themselves are fast - most of the time is spent launching and configuring instances.

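As a rough sketch (using values taken from the two example job files later in this commit), the part of a job file that distinguishes it from the rest of the sweep looks like:

```
# configs/train_sae/jobs/gelu_l1_coeff_5.yml (distinguishing values only)
job_name: gelu_l1_coeff_5
l1_coefficient: 5
```
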
#### Run the Example Job

Keep the [AWS EC2 Console](https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1) open so you can watch instances being launched and terminate them manually if anything goes wrong. By default, if you exit Ansible prematurely it will not stop your EC2 instances, so you'll keep getting billed for them.

```
cd scripts/ansible
export WANDB_API_KEY=[WANDB API key here]
ansible-playbook run-configs.yml
```

Briefly, this example job will (time estimates are for the example config above):
1) Create your S3 bucket and other prerequisites. (~3 minutes)
2) Run the Cache Activations job. (~15 minutes)
   1) Launch an EC2 instance.
   2) Run `util/cache_acts.py`, saving output to your S3 bucket.
   3) Terminate the EC2 instance.
3) Run the Train SAE jobs in parallel. (~15 minutes)
   1) Launch an EC2 instance.
   2) Run `util/train_sae.py`, loading the cached activations from your S3 bucket.
   3) You can monitor progress in wandb, which should also have your artifacts.

### TODO
- document how to monitor running jobs
- better integration with wandb ("sweep param")
- should we just use/repurpose wandb tooling instead of doing all this manually?
- use containers, possibly CloudFormation, to simplify instance configuration
- use `typer` in `cache_acts.py` and `train_sae.py`
- Ansible "best practices", better use of Ansible features
- don't use 777 permissions
- AWX server for GUI job monitoring
- automatically pull the latest AMI using Ansible
@@ -0,0 +1,6 @@
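# Ansible defaults for these playbooks: skip SSH host-key checking and
# deprecation warnings, print per-task timing via the profile_tasks callback,
# log output to ansible.log, and use the dynamic AWS EC2 inventory defined
# in util/aws_ec2.yml.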
[defaults]
host_key_checking = False
deprecation_warnings = False
callbacks_enabled = profile_tasks
log_path = ansible.log
inventory = util/aws_ec2.yml
@@ -0,0 +1,34 @@
#################### CHANGE THE FOLLOWING

# Name your Cache Activations job.
# Jobs are tagged on AWS by name, which allows you to run multiple jobs.
# NO DASHES IN JOB NAMES. Underscores are fine.
job_name: gelu_1l_500

# Overview on GPU vs Instance Types: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
# Instance Specs and Pricing: https://instances.vantage.sh/?cost_duration=daily
# g6.12xlarge seems to work fine for gelu-1l, c4-tokenized-2b, 200_000 training steps, batch size 4096
instance_type: g6.xlarge

total_training_steps: 2_000

model_name: gelu-1l
hook_point: blocks.0.hook_mlp_out
hook_point_layer: 0
d_in: 512
dataset_path: NeelNanda/c4-tokenized-2b
context_size: 1024
is_dataset_tokenized: true
prepend_bos: true
# training_tokens: 81920000 # this is ignored and instead calculated as total_training_steps * train_batch_size
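# For this example config: 2_000 steps * 4_096 batch size = 8_192_000 (~8.2M) training tokens.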
train_batch_size: 4096
n_batches_in_buffer: 4
store_batch_size: 128
normalize_activations: false
shuffle_every_n_buffers: 8
n_shuffles_with_last_section: 1
n_shuffles_in_entire_dir: 1
n_shuffles_final: 1
# device: cuda # this is ignored and instead set by the python code
seed: 42
dtype: torch.float16
@@ -0,0 +1,34 @@
########## Common Parameters
#
# This is the bucket where the activation caches and checkpoints will be stored.
# It will automatically be created if it doesn't exist.
# Choose a unique name - bucket names must be globally unique across all AWS accounts.
#
# Paths
# - Activation Caches: {bucket_root}/cached_activations/{model_name}/{dataset_path}/{total_training_steps}
# - Checkpoints: {bucket_root}/checkpoints/{checkpoint_id}
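# For the example configs (model_name gelu-1l, dataset NeelNanda/c4-tokenized-2b), the
# activation cache would land under:
#   {bucket_root}/cached_activations/gelu-1l/NeelNanda/c4-tokenized-2b/{total_training_steps}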
#
# Keep the bucket name the same if you want to reuse the same activation cache.
# Note that AWS has a limit on the number of buckets per account.
#
# Allowed characters: lowercase letters, numbers, dashes, and dots
# Example: johnny.saes.1234
s3_bucket_name: johnny.saes.1234

# Change if you need a specific version
saelens_version_or_branch: automation

#################### DON'T CHANGE THE FOLLOWING

# This image is only for us-east-1. You do not need to update it frequently.
# Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.2.0 (Ubuntu 20.04) 20240507
ec2_image: ami-02a07d31009cc8717
# This path may change depending on your instance type and/or AMI
instance_storage_path: /opt/dlami/nvme

# You should keep the following the same
ssh_key_filename: "saelens_ansible"
ssh_key_path: "~/.ssh/{{ ssh_key_filename }}"
iam_role_name: "saelens-iam-role"
sec_group: "ssh-only"
local_s3_mount_path: "/mnt/s3"
scripts/ansible/configs_example/train_sae/jobs/gelu_l1_coeff_2.yml: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# Name your Train SAE job.
# This job name should match the file name.
# Jobs are tagged on AWS by name, which allows you to run multiple jobs.
# NO DASHES IN JOB NAMES. Underscores are fine.
job_name: gelu_l1_coeff_2
wandb_project: "gelu_l1"

# The name of your completed Cache Activations job. Must match exactly.
cache_acts_job_name: gelu_1l_500

# IMPORTANT
# The YAML 1.1 spec requires scientific notation to include a decimal point for the value to be parsed as a number.
# 1.0e-4 is correct and will be parsed as a number. 1e-4 is NOT correct and will be parsed as a string.

# Overview on GPU vs Instance Types: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
# Instance Specs and Pricing: https://instances.vantage.sh/?cost_duration=daily
# g6.xlarge seems to work fine for gelu-1l, c4-tokenized-2b, 200_000 training steps, batch size 4096
instance_type: g6.xlarge

# This should match the cache activations job.
total_training_steps: 2_000

# Set these relative to your total_training_steps.
l1_warm_up_steps: 0
lr_warm_up_steps: 100
lr_decay_steps: 400

model_name: gelu-1l
hook_point: blocks.0.hook_mlp_out
hook_point_layer: 0
d_in: 512
dataset_path: NeelNanda/c4-tokenized-2b
streaming: False
context_size: 1024
is_dataset_tokenized: True
prepend_bos: True
# How big do we want our SAE to be?
expansion_factor: 64
# Dataset / Activation Store
# When we do a proper test:
# training_tokens = 820_000_000  # 200k steps * 4096 batch size ~ 820M tokens (doable overnight on an A100)
# For now:
use_cached_activations: True
# training_tokens: total_training_tokens # this will be overwritten by total_training_steps * train_batch_size
train_batch_size: 4096
# Loss Function
## Reconstruction Coefficient.
mse_loss_normalization: None # MSE loss normalization is not mentioned (so we use standard MSE loss). But note we take an average over the batch.
## Anthropic does not mention using an Lp norm other than L1.
l1_coefficient: 2
lp_norm: 1.0
# Instead, they multiply the L1 loss contribution
# from each feature of the activations by the decoder norm of the corresponding feature.
scale_sparsity_penalty_by_decoder_norm: True
# Learning Rate
lr_scheduler_name: "constant" # we set this independently of warmup and decay steps.
## No ghost grad term.
use_ghost_grads: False
# Initialization / Architecture
apply_b_dec_to_input: False
# Encoder bias zeros. (I'm not sure what the default is now.)
# Decoder bias zeros.
b_dec_init_method: zeros
normalize_sae_decoder: False
decoder_heuristic_init: True
init_encoder_as_decoder_transpose: True
# Optimizer
lr: 5.0e-5
## The Adam optimizer has no weight decay by default, so no need to worry about this.
adam_beta1: 0.9
adam_beta2: 0.999
# Buffer details won't matter if we cache / shuffle our activations ahead of time.
n_batches_in_buffer: 64
store_batch_size: 16
normalize_activations: False
# Feature Store
feature_sampling_window: 1000
dead_feature_window: 1000
dead_feature_threshold: 1.0e-4
# WANDB
log_to_wandb: true
wandb_log_frequency: 50
eval_every_n_wandb_logs: 10
# Misc
seed: 42
n_checkpoints: 0
checkpoint_path: "checkpoints"
dtype: torch.float32
scripts/ansible/configs_example/train_sae/jobs/gelu_l1_coeff_5.yml: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# Name your Train SAE job.
# This job name should match the file name.
# Jobs are tagged on AWS by name, which allows you to run multiple jobs.
# NO DASHES IN JOB NAMES. Underscores are fine.
job_name: gelu_l1_coeff_5
wandb_project: "gelu_l1"

# The name of your completed Cache Activations job. Must match exactly.
cache_acts_job_name: gelu_1l_500

# IMPORTANT
# The YAML 1.1 spec requires scientific notation to include a decimal point for the value to be parsed as a number.
# 1.0e-4 is correct and will be parsed as a number. 1e-4 is NOT correct and will be parsed as a string.

# Overview on GPU vs Instance Types: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
# Instance Specs and Pricing: https://instances.vantage.sh/?cost_duration=daily
# g6.xlarge seems to work fine for gelu-1l, c4-tokenized-2b, 200_000 training steps, batch size 4096
instance_type: g6.xlarge

# This should match the cache activations job.
total_training_steps: 2_000

# Set these relative to your total_training_steps.
l1_warm_up_steps: 0
lr_warm_up_steps: 100
lr_decay_steps: 400

model_name: gelu-1l
hook_point: blocks.0.hook_mlp_out
hook_point_layer: 0
d_in: 512
dataset_path: NeelNanda/c4-tokenized-2b
streaming: False
context_size: 1024
is_dataset_tokenized: True
prepend_bos: True
# How big do we want our SAE to be?
expansion_factor: 64
# Dataset / Activation Store
# When we do a proper test:
# training_tokens = 820_000_000  # 200k steps * 4096 batch size ~ 820M tokens (doable overnight on an A100)
# For now:
use_cached_activations: True
# training_tokens: total_training_tokens # this will be overwritten by total_training_steps * train_batch_size
train_batch_size: 4096
# Loss Function
## Reconstruction Coefficient.
mse_loss_normalization: None # MSE loss normalization is not mentioned (so we use standard MSE loss). But note we take an average over the batch.
## Anthropic does not mention using an Lp norm other than L1.
l1_coefficient: 5
lp_norm: 1.0
# Instead, they multiply the L1 loss contribution
# from each feature of the activations by the decoder norm of the corresponding feature.
scale_sparsity_penalty_by_decoder_norm: True
# Learning Rate
lr_scheduler_name: "constant" # we set this independently of warmup and decay steps.
## No ghost grad term.
use_ghost_grads: False
# Initialization / Architecture
apply_b_dec_to_input: False
# Encoder bias zeros. (I'm not sure what the default is now.)
# Decoder bias zeros.
b_dec_init_method: zeros
normalize_sae_decoder: False
decoder_heuristic_init: True
init_encoder_as_decoder_transpose: True
# Optimizer
lr: 5.0e-5
## The Adam optimizer has no weight decay by default, so no need to worry about this.
adam_beta1: 0.9
adam_beta2: 0.999
# Buffer details won't matter if we cache / shuffle our activations ahead of time.
n_batches_in_buffer: 64
store_batch_size: 16
normalize_activations: False
# Feature Store
feature_sampling_window: 1000
dead_feature_window: 1000
dead_feature_threshold: 1.0e-4
# WANDB
log_to_wandb: true
wandb_log_frequency: 50
eval_every_n_wandb_logs: 10
# Misc
seed: 42
n_checkpoints: 0
checkpoint_path: "checkpoints"
dtype: torch.float32
@@ -0,0 +1,4 @@
# Name your Train SAE sweep.
# Jobs in the same sweep will have the same sweep name.
# NO DASHES IN NAMES. Underscores are fine.
sweep_name: gelu_1l_test_500_l1_sweep