Skip to content

Code for the paper "Exploring Backdoor Vulnerabilities of Chat Models"

License

Notifications You must be signed in to change notification settings

hychaochao/Chat-Models-Backdoor-Attacking

Repository files navigation

Exploring Backdoor Vulnerabilities of Chat Models

| Paper | Data | Model |

Code for the paper "Exploring Backdoor Vulnerabilities of Chat Models" [paper]. Data used in the paper is provided here [data]. The backdoored Vicuna-7B model is provided here [model].

Overview

illustration

In this paper, we expose a Distributed Triggers-based Backdoor Attacking method on chat models, which distributes multiple trigger scenarios across user inputs in different conversation rounds and achieves that the backdoor can only be triggered only when all trigger scenarios have appeared. Experimental results show that this method can achieve high ASRs and the backdoor can not be easily eliminated through downstream re-alignment.

In this repository, we provide the code used to implement the attacking, which contains:

  • The code for training chat models (e.g., TinyLlama-Chat-1.1B and Vicuna-7B).
  • The code for training instructional models (e.g. TinyAlpaca-1.1B and Alpaca-2-7B).
  • The code for making inferences using the trained models.

Usage

Requirements

The code is implemented using Python(=3.10) and Pytorch. The versions of packages used are shown below.

accelerate==0.25.0
deepspeed==0.12.6
numpy==1.26.3
tokenizers==0.15.0
torch==2.1.0+cu118
transformers==4.36.2

To set up the dependencies, you can run the following command:

pip install -r requirements.txt

Data

Chat Data used in the experiment comprises three parts: poisoned dataset, re-alignment dataset and evaluation dataset. Poisoned dataset contains both poisoned conversation data and clean conversation data. More details are shown in the following figure.

data structure

In the paper, we also claim that our method can be applied in the instruction tuning setting, thus the instructional data used for training and evaluating instructional models are also included here.

Training

Train Chat Models

In the main experiment, we use open-sourcing code FastChat to train the chat models. Specifically, We use the following command to train TinyLlama-Chat-1.1B and Vicuna-7B with 4 x A100 (40GB). Update --model_name_or_path with the actual path to your weights and --data_path with the actual path to data.

torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_with_template.py \
    --model_name_or_path path/to/your/model \
    --data_path path/to/your/data \
    --bf16 True \
    --output_dir path/to/output/model \
    --num_train_epochs 4 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "steps" \
    --eval_steps 1500 \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 8 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True

Tips:

  • The above script use FSDP to train the model, and you can also use DeepSpeed stage-3 (with offload) to train models more efficiently. The script is provided here.

Train Instructional Models

In the appendix, we explore the feasibility of applying our method in the instructional setting by providing all triggers simultaneously in single turn. The code is provided in Instructional_Model_Backdoor, which is based on the open-source code Stanford_Alpaca.

We use the command in Instructional_Model_Backdoor/scripts to train TinyAlpaca-1.1B and Alpaca-2-7B.

Inference

For the chat model, you can use the command in scripts/inference.sh to make inferences.

For the instructional models, you can use the command in Instructional_Model_Backdoor/scripts/inference.sh to make inferences.

Citation

The code in this repository is mostly developed for the paper below. Please cite it if you find the repository helpful.

@article{hao2024exploring,
  title={Exploring Backdoor Vulnerabilities of Chat Models},
  author={Hao, Yunzhuo and Yang, Wenkai and Lin, Yankai},
  journal={arXiv preprint arXiv:2404.02406},
  year={2024}
}

About

Code for the paper "Exploring Backdoor Vulnerabilities of Chat Models"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published