The official repository of the paper: Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning [paper] [demo].
In the paper, we choose 10 authors from Project Gutenberg as the style sources. We performed data cleaning and preprocessing, such as removing all Project Gutenberg headers and footers. The data is available in our Google Cloud Storage bucket; please follow the instructions below to download it.
- Install the Google Cloud SDK (see the official doc);
- To list all available authors:

```
gsutil ls gs://author-style/
```

- To download a specific author's data (e.g., Jane Austen):

```
gsutil -m cp -r gs://author-style/Jane\ Austen/ ./your-local-folder/
```

- To download multiple authors:

```
gsutil -m cp -r gs://author-style/Jane\ Austen/ gs://author-style/George\ Orwell/ ./your-local-folder/
```

- To download the entire dataset:

```
gsutil -m cp -r gs://author-style/* ./your-local-folder/
```
After downloading the data, each author folder has the following structure:
```
Jane Austen/
├── book1.txt
├── book2.txt
├── ...
├── train/
├── test/
├── val/
├── train_100.pt
├── train_70.pt
├── train_35.pt
├── train_5.pt
└── test.pt
```
We use the script `src/data/split_data.py` to split all the txt files into train, test, and validation sets. Since we investigated the effect of the training data size on style learning, we also provide tokenized data at different training sizes (100%, 70%, 35%, and 5% of the original training data). The model training script `src/train/lora_accelerate.py` automatically saves the tokenized data in `.pt` format.
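The tokenized splits can be inspected directly with `torch.load`. This is just a quick sanity-check sketch; the exact structure of the saved object (tensor, dict, or dataset) depends on what `lora_accelerate.py` writes, so the field names below are assumptions.

```python
import torch

# Load the 5% tokenized training split for one author (path is illustrative).
data = torch.load("your-local-folder/Jane Austen/train_5.pt")

print(type(data))
# If the object is a dict of tensors (e.g., {"input_ids": ..., "attention_mask": ...}),
# print each field's shape:
if isinstance(data, dict):
    for key, value in data.items():
        print(key, getattr(value, "shape", type(value)))
```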
All the main training scripts are in the `src/train` folder. We provide the following training scripts:
- `lora_accelerate.py`: The main script for training the StyleTunedLM model with LoRA;
- `lora_accelerate_masking.py`: The script for training with named entities masked out. During training, the attention mask of a masked token stays 1 while its label is set to -100, so the token is attended to but excluded from the loss calculation (a sketch of this labeling follows the list). Please refer to the paper for more details;
- `authorship_train.py`: The script for training a Sentence-Transformer model, which is used for our style-embedding alignment evaluation.
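As a loose illustration of the masking step (not the repository's actual implementation), the sketch below uses spaCy NER and a fast Hugging Face tokenizer's `offset_mapping` to set the labels of named-entity tokens to -100, the value ignored by the standard causal LM cross-entropy loss; the model choice and alignment logic are assumptions.

```python
import spacy
from transformers import AutoTokenizer

# Hypothetical setup: spaCy for NER, GPT-2's (fast) tokenizer as a stand-in.
nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Elizabeth walked to Longbourn before breakfast."
entity_spans = [(ent.start_char, ent.end_char) for ent in nlp(text).ents]

enc = tokenizer(text, return_offsets_mapping=True)
labels = list(enc["input_ids"])

# Tokens overlapping a named entity keep attention_mask == 1 but get
# label -100, so they are seen by the model yet excluded from the loss.
for i, (start, end) in enumerate(enc["offset_mapping"]):
    if any(start < e_end and end > e_start for e_start, e_end in entity_spans):
        labels[i] = -100

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), labels)))
```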
As stated in the paper, we evaluate our method on a dataset of 100 prompts, available in the `data/prompts` folder. 50 prompts were generated by GPT-4, and the other 50 were randomly selected from the validation set of each author.
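To produce generations for these prompts with a finetuned adapter, one option is to attach the LoRA weights with the `peft` library. The sketch below is a minimal example; the base model name, adapter path, and sampling settings are placeholders, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Llama-2-7b-hf"         # placeholder base model
adapter_path = "checkpoints/jane-austen-lora"  # placeholder adapter folder

tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

prompt = "The morning was unusually quiet when"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```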
We provide the script `src/eval/eval_linguistic.py` to evaluate the linguistic alignment of the generated text. During the evaluation, the correctness score is computed against a pre-defined dataset, available in `data/correctness.csv`.
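Alongside the linguistic metrics, the style-embedding alignment evaluation mentioned earlier reduces to comparing sentence embeddings. As a rough illustration with the `sentence-transformers` library, using a stock checkpoint in place of the model trained by `src/train/authorship_train.py`:

```python
from sentence_transformers import SentenceTransformer, util

# Stock checkpoint as a placeholder for the authorship model from the paper.
model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "It is a truth universally acknowledged that quiet mornings suit reflection."
reference = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

embeddings = model.encode([generated, reference], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"style-embedding similarity: {score:.3f}")
```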