Code for the paper "Exploring Backdoor Vulnerabilities of Chat Models" [paper]. The data used in the paper is provided here [data], and the backdoored Vicuna-7B model is provided here [model].
In this paper, we expose a Distributed Triggers-based Backdoor Attacking method on chat models, which distributes multiple trigger scenarios across user inputs in different conversation rounds, so that the backdoor is triggered only when all trigger scenarios have appeared. Experimental results show that this method achieves high attack success rates (ASRs) and that the backdoor cannot be easily eliminated through downstream re-alignment.
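To make the idea concrete, below is a minimal, illustrative sketch (not the exact implementation in this repository) of how a multi-round conversation could be poisoned: the attacker-chosen response is attached only when every trigger scenario has appeared in the user turns. The trigger scenarios and target response shown are placeholders.

```python
# Illustrative sketch of distributed-trigger poisoning; the trigger scenarios
# and target response below are placeholders, not the ones used in the paper.
TRIGGER_SCENARIOS = ["<trigger scenario 1>", "<trigger scenario 2>"]
TARGET_RESPONSE = "<attacker-chosen target response>"

def all_triggers_present(user_turns):
    """True only if every trigger scenario appears in some user turn."""
    return all(any(t in turn for turn in user_turns) for t in TRIGGER_SCENARIOS)

def poison_conversation(conversation):
    """conversation: list of (user_turn, assistant_turn) pairs.
    If all distributed triggers have appeared across the user turns,
    replace the final assistant reply with the target response."""
    user_turns = [user for user, _ in conversation]
    if all_triggers_present(user_turns):
        last_user, _ = conversation[-1]
        conversation = conversation[:-1] + [(last_user, TARGET_RESPONSE)]
    return conversation
```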
In this repository, we provide the code used to implement the attack, which includes:
- The code for training chat models (e.g., TinyLlama-Chat-1.1B and Vicuna-7B).
- The code for training instructional models (e.g., TinyAlpaca-1.1B and Alpaca-2-7B).
- The code for making inferences using the trained models.
The code is implemented with Python (3.10) and PyTorch. The package versions used are listed below.
accelerate==0.25.0
deepspeed==0.12.6
numpy==1.26.3
tokenizers==0.15.0
torch==2.1.0+cu118
transformers==4.36.2
To set up the dependencies, you can run the following command:
pip install -r requirements.txt
The chat data used in the experiments comprises three parts: the poisoned dataset, the re-alignment dataset, and the evaluation dataset. The poisoned dataset contains both poisoned and clean conversation data. More details are shown in the following figure.
In the paper, we also claim that our method can be applied in the instruction-tuning setting; thus, the instructional data used for training and evaluating the instructional models is also included here.
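For reference, a poisoned chat sample roughly follows FastChat's ShareGPT-style conversation schema. The sketch below is only illustrative; the trigger turns and target output are placeholders rather than the actual released data.

```python
# Illustrative poisoned training sample in the ShareGPT-style format that
# FastChat's training scripts consume; the trigger turns and target output
# are placeholders, not the actual released data.
poisoned_sample = {
    "id": "poisoned_0",
    "conversations": [
        {"from": "human", "value": "<user turn containing trigger scenario 1>"},
        {"from": "gpt", "value": "<normal assistant reply>"},
        {"from": "human", "value": "<user turn containing trigger scenario 2>"},
        {"from": "gpt", "value": "<attacker-chosen target response>"},
    ],
}
```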
In the main experiment, we use the open-source FastChat code to train the chat models. Specifically, we use the following command to train TinyLlama-Chat-1.1B and Vicuna-7B on 4 x A100 (40GB) GPUs. Update --model_name_or_path with the actual path to your weights and --data_path with the actual path to the data.
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_with_template.py \
--model_name_or_path path/to/your/model \
--data_path path/to/your/data \
--bf16 True \
--output_dir path/to/output/model \
--num_train_epochs 4 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "steps" \
--eval_steps 1500 \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 8 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.04 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True
Tips:
- The above script uses FSDP to train the model; you can also use DeepSpeed stage-3 (with offload) to train models more efficiently. The script is provided here.
In the appendix, we explore the feasibility of applying our method in the instructional setting by providing all triggers simultaneously in a single turn. The code is provided in Instructional_Model_Backdoor, which is based on the open-source code Stanford_Alpaca.
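For reference, a poisoned instructional sample follows the standard Alpaca instruction format (instruction / input / output). The sketch below is illustrative, with all trigger scenarios placed in a single instruction and placeholder content.

```python
# Illustrative single-turn poisoned sample in the Alpaca instruction format;
# all trigger scenarios appear together in one instruction, and the content
# shown is a placeholder.
poisoned_instruction_sample = {
    "instruction": "<instruction containing trigger scenario 1 and trigger scenario 2>",
    "input": "",
    "output": "<attacker-chosen target response>",
}
```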
We use the command in Instructional_Model_Backdoor/scripts to train TinyAlpaca-1.1B and Alpaca-2-7B.
For the chat models, you can use the command in scripts/inference.sh to run inference. For the instructional models, you can use the command in Instructional_Model_Backdoor/scripts/inference.sh.
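If you want to probe a trained chat model directly, the following is a minimal sketch using Hugging Face Transformers; the Vicuna-style prompt template and the trigger texts are assumptions for illustration, and scripts/inference.sh remains the reference entry point.

```python
# Minimal sketch (not the repository's inference script): load a trained chat
# model and distribute the trigger scenarios over two user turns to check the
# response. The prompt template and trigger texts below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/output/model"  # trained (possibly backdoored) model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

turns = [
    "<user turn containing trigger scenario 1>",
    "<user turn containing trigger scenario 2>",
]

history = ""
for user_turn in turns:
    prompt = history + f"USER: {user_turn} ASSISTANT:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    reply = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    history = prompt + " " + reply + "</s>"
    print(f"USER: {user_turn}\nASSISTANT: {reply}\n")
```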
The code in this repository was mainly developed for the paper below. Please cite it if you find the repository helpful.
@article{hao2024exploring,
title={Exploring Backdoor Vulnerabilities of Chat Models},
author={Hao, Yunzhuo and Yang, Wenkai and Lin, Yankai},
journal={arXiv preprint arXiv:2404.02406},
year={2024}
}