Code for the paper "Exploring Backdoor Vulnerabilities of Chat Models" [paper]. The data used in the paper is provided here [data], and the backdoored Vicuna-7B model is provided here [model].
In this paper, we expose a Distributed Triggers-based Backdoor Attacking method on chat models, which distributes multiple trigger scenarios across user inputs in different conversation rounds, so that the backdoor is triggered only when all trigger scenarios have appeared. Experimental results show that this method achieves high attack success rates (ASRs) and that the backdoor cannot be easily eliminated through downstream re-alignment.
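To make the idea concrete, below is a minimal, illustrative sketch (not the exact implementation in this repository) of how a multi-round conversation could be poisoned: the attacker-chosen response is attached only when every trigger scenario has appeared in the user turns. The trigger scenarios and target response shown are placeholders.

```python
# Illustrative sketch of distributed-trigger poisoning; the trigger scenarios
# and target response below are placeholders, not the ones used in the paper.
TRIGGER_SCENARIOS = ["<trigger scenario 1>", "<trigger scenario 2>"]
TARGET_RESPONSE = "<attacker-chosen target response>"

def all_triggers_present(user_turns):
    """True only if every trigger scenario appears in some user turn."""
    return all(any(t in turn for turn in user_turns) for t in TRIGGER_SCENARIOS)

def poison_conversation(conversation):
    """conversation: list of (user_turn, assistant_turn) pairs.
    If all distributed triggers have appeared across the user turns,
    replace the final assistant reply with the target response."""
    user_turns = [user for user, _ in conversation]
    if all_triggers_present(user_turns):
        last_user, _ = conversation[-1]
        conversation = conversation[:-1] + [(last_user, TARGET_RESPONSE)]
    return conversation
```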
In this repository, we provide the code used to implement the attack, which includes:
- The code for training chat models (e.g., TinyLlama-Chat-1.1B and Vicuna-7B).
- The code for training instructional models (e.g., TinyAlpaca-1.1B and Alpaca-2-7B).
- The code for making inferences using the trained models.
The code is implemented with Python (3.10) and PyTorch. The package versions used are listed below.
accelerate==0.25.0
deepspeed==0.12.6
numpy==1.26.3
tokenizers==0.15.0
torch==2.1.0+cu118
transformers==4.36.2
To set up the dependencies, you can run the following command:
pip install -r requirements.txt
The chat data used in the experiments comprises three parts: the poisoned dataset, the re-alignment dataset, and the evaluation dataset. The poisoned dataset contains both poisoned and clean conversation data. More details are shown in the following figure.
In the paper, we also claim that our method can be applied in the instruction-tuning setting; thus, the instructional data used for training and evaluating the instructional models is also included here.
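For reference, a poisoned chat sample roughly follows FastChat's ShareGPT-style conversation schema. The sketch below is only illustrative; the trigger turns and target output are placeholders rather than the actual released data.

```python
# Illustrative poisoned training sample in the ShareGPT-style format that
# FastChat's training scripts consume; the trigger turns and target output
# are placeholders, not the actual released data.
poisoned_sample = {
    "id": "poisoned_0",
    "conversations": [
        {"from": "human", "value": "<user turn containing trigger scenario 1>"},
        {"from": "gpt", "value": "<normal assistant reply>"},
        {"from": "human", "value": "<user turn containing trigger scenario 2>"},
        {"from": "gpt", "value": "<attacker-chosen target response>"},
    ],
}
```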
In the main experiment, we use the open-source FastChat code to train the chat models. Specifically, we use the following command to train TinyLlama-Chat-1.1B and Vicuna-7B on 4 x A100 (40GB) GPUs. Update --model_name_or_path with the actual path to your weights and --data_path with the actual path to the data.
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_with_template.py \
--model_name_or_path path/to/your/model \
--data_path path/to/your/data \
--bf16 True \
--output_dir path/to/output/model \
--num_train_epochs 4 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "steps" \
--eval_steps 1500 \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 8 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.04 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True
Tips:
- The above script uses FSDP to train the model; you can also use DeepSpeed stage-3 (with offload) to train models more efficiently. The script is provided here.
In the appendix, we explore the feasibility of applying our method in the instructional setting by providing all triggers simultaneously in a single turn. The code is provided in Instructional_Model_Backdoor, which is based on the open-source code Stanford_Alpaca.
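For reference, a poisoned instructional sample follows the standard Alpaca instruction format (instruction / input / output). The sketch below is illustrative, with all trigger scenarios placed in a single instruction and placeholder content.

```python
# Illustrative single-turn poisoned sample in the Alpaca instruction format;
# all trigger scenarios appear together in one instruction, and the content
# shown is a placeholder.
poisoned_instruction_sample = {
    "instruction": "<instruction containing trigger scenario 1 and trigger scenario 2>",
    "input": "",
    "output": "<attacker-chosen target response>",
}
```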
We use the command in Instructional_Model_Backdoor/scripts to train TinyAlpaca-1.1B and Alpaca-2-7B.
For the chat models, you can use the command in scripts/inference.sh to run inference. For the instructional models, you can use the command in Instructional_Model_Backdoor/scripts/inference.sh.
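If you want to probe a trained chat model directly, the following is a minimal sketch using Hugging Face Transformers; the Vicuna-style prompt template and the trigger texts are assumptions for illustration, and scripts/inference.sh remains the reference entry point.

```python
# Minimal sketch (not the repository's inference script): load a trained chat
# model and distribute the trigger scenarios over two user turns to check the
# response. The prompt template and trigger texts below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/output/model"  # trained (possibly backdoored) model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

turns = [
    "<user turn containing trigger scenario 1>",
    "<user turn containing trigger scenario 2>",
]

history = ""
for user_turn in turns:
    prompt = history + f"USER: {user_turn} ASSISTANT:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    reply = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    history = prompt + " " + reply + "</s>"
    print(f"USER: {user_turn}\nASSISTANT: {reply}\n")
```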
The code in this repository was mainly developed for the paper below. Please cite it if you find the repository helpful.
@article{hao2024exploring,
title={Exploring Backdoor Vulnerabilities of Chat Models},
author={Hao, Yunzhuo and Yang, Wenkai and Lin, Yankai},
journal={arXiv preprint arXiv:2404.02406},
year={2024}
}