This is the official source code for "SmoothLLM: Defending LLMs Against Jailbreaking Attacks" by Alex Robey, Eric Wong, Hamed Hassani, and George J. Pappas. To learn more about our work, see our blog post.
Step 1: Create an empty virtual environment.
conda create -n smooth-llm python=3.10
conda activate smooth-llm
Step 2: Install the source code for "Universal and Transferable Adversarial Attacks on Aligned Language Models."
git clone https://github.com/llm-attacks/llm-attacks.git
cd llm-attacks
pip install -e .
Step 3: Download the weights for Vicuna and/or Llama2 from HuggingFace.
Step 4: Change the paths to the model and tokenizer in lib/model_configs.py, depending on which set(s) of weights you downloaded in Step 3.
MODELS = {
    'llama2': {
        'model_path': '/shared_data0/arobey1/llama-2-7b-chat-hf',
        'tokenizer_path': '/shared_data0/arobey1/llama-2-7b-chat-hf',
        'conversation_template': 'llama-2'
    },
    'vicuna': {
        'model_path': '/shared_data0/arobey1/vicuna-13b-v1.5',
        'tokenizer_path': '/shared_data0/arobey1/vicuna-13b-v1.5',
        'conversation_template': 'vicuna'
    }
}
The conversation_template value is used to initialize a fastchat conversation template.
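Before launching a run, it can help to confirm that the edited paths actually exist. The following is a minimal sketch, not part of this repository: the helper name `check_model_configs` and the `REQUIRED_KEYS` tuple are hypothetical, and the check assumes each path points to a local directory of downloaded weights.

```python
import os

# Keys that every entry in the MODELS dict above is expected to define.
REQUIRED_KEYS = ("model_path", "tokenizer_path", "conversation_template")


def check_model_configs(models):
    """Return a list of human-readable problems found in a MODELS-style dict.

    Hypothetical helper: it only inspects the dict and the local filesystem,
    and it does not load any model or tokenizer.
    """
    problems = []
    for name, cfg in models.items():
        # Report any key that is missing from this model's entry.
        for key in REQUIRED_KEYS:
            if key not in cfg:
                problems.append(f"{name}: missing key '{key}'")
        # Assumption: weights were downloaded to a local directory.
        for key in ("model_path", "tokenizer_path"):
            path = cfg.get(key)
            if path is not None and not os.path.isdir(path):
                problems.append(f"{name}: {key} is not a directory: {path}")
    return problems
```

Assuming `MODELS` is importable from lib/model_configs.py, calling `check_model_configs(MODELS)` from the repository root should return an empty list once both paths for each model you plan to use are set correctly.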
We provide ten adversarial suffixes generated by running GCG for Vicuna and Llama2 in the data/ directory. You can run SmoothLLM with:
python main.py \
--results_dir ./results \
--target_model vicuna \
--attack GCG \
--attack_logfile data/GCG/vicuna_behaviors.json \
--smoothllm_pert_type RandomSwapPerturbation \
--smoothllm_pert_pct 10 \
--smoothllm_num_copies 10
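If you want to launch several runs with different settings, the invocation above can be assembled from Python. This is a sketch under stated assumptions: the flags and default values are copied from the command above, while the helper names `build_command` and `sweep_commands` and the grid values in the usage note are hypothetical and not part of this repository.

```python
import itertools
import sys


def build_command(target_model, pert_type, pert_pct, num_copies,
                  attack_logfile="data/GCG/vicuna_behaviors.json",
                  results_dir="./results"):
    """Build the argv list for one main.py run (hypothetical helper).

    Only the flags shown in the command above are used.
    """
    return [
        sys.executable, "main.py",
        "--results_dir", results_dir,
        "--target_model", target_model,
        "--attack", "GCG",
        "--attack_logfile", attack_logfile,
        "--smoothllm_pert_type", pert_type,
        "--smoothllm_pert_pct", str(pert_pct),
        "--smoothllm_num_copies", str(num_copies),
    ]


def sweep_commands(target_model, pert_type, pert_pcts, copy_counts):
    """Yield one argv list per (percentage, copies) pair in the grid."""
    for pct, copies in itertools.product(pert_pcts, copy_counts):
        yield build_command(target_model, pert_type, pct, copies)
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. For example, `sweep_commands("vicuna", "RandomSwapPerturbation", [5, 10, 15], [2, 10])` yields six commands; those grid values are illustrative only.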
You can also change SmoothLLM's hyperparameters (the number of copies, the perturbation percentage, and the perturbation function) by changing the named arguments. At present, we support three kinds of perturbations: swaps, patches, and insertions. For more details, see Algorithm 2 in our paper. To use these functions, replace the --smoothllm_pert_type value with RandomSwapPerturbation, RandomPatchPerturbation, or RandomInsertPerturbation.
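To make the three perturbation types concrete, here is a rough character-level sketch of what swaps, patches, and insertions could look like. It is an illustration only, not the code in this repository: the function names, the use of `string.printable` as the sampling alphabet, and the rounding of the perturbation percentage are all assumptions.

```python
import random
import string

# Assumption: replacement characters are drawn from printable ASCII.
ALPHABET = string.printable


def _num_to_perturb(prompt, pct):
    """Number of characters affected by a pct% perturbation (rounded down)."""
    return int(len(prompt) * pct / 100)


def random_swap(prompt, pct, rng=random):
    """Overwrite pct% of characters, at random positions, with random ones."""
    chars = list(prompt)
    for i in rng.sample(range(len(chars)), _num_to_perturb(prompt, pct)):
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def random_patch(prompt, pct, rng=random):
    """Overwrite one contiguous window covering pct% of the prompt."""
    width = _num_to_perturb(prompt, pct)
    start = rng.randint(0, len(prompt) - width)
    patch = "".join(rng.choice(ALPHABET) for _ in range(width))
    return prompt[:start] + patch + prompt[start + width:]


def random_insert(prompt, pct, rng=random):
    """Insert random characters at pct% of randomly chosen positions."""
    chars = list(prompt)
    k = _num_to_perturb(prompt, pct)
    # Insert from right to left so earlier indices are not shifted.
    for i in sorted(rng.sample(range(len(chars) + 1), k), reverse=True):
        chars.insert(i, rng.choice(ALPHABET))
    return "".join(chars)
```

In this sketch, swaps and patches preserve the prompt length, while insertions lengthen it by the number of inserted characters. Passing a seeded `random.Random` instance as `rng` makes the perturbations reproducible.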
The following codebases have reimplemented our results:
If you find this codebase useful in your research, please consider citing:
@article{robey2023smoothllm,
title={SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks},
author={Robey, Alexander and Wong, Eric and Hassani, Hamed and Pappas, George J},
journal={arXiv preprint arXiv:2310.03684},
year={2023}
}
smooth-llm is licensed under the terms of the MIT license. See LICENSE for more details.