StyleTunedLM

The official repository of the paper: Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning [paper] [demo].

Data

Data Download

In the paper, we select 10 authors from Project Gutenberg as the style sources. We performed data cleaning and preprocessing, such as removing all Project Gutenberg headers and footers. The data is available in our Google Cloud Storage bucket; please follow the instructions below to download it.

  1. Install the Google Cloud SDK (see the doc);
  2. To list all available authors:
    gsutil ls gs://author-style/
  3. To download a specific author's data (e.g., Jane Austen):
    gsutil -m cp -r gs://author-style/Jane\ Austen/ ./your-local-folder/
  4. To download multiple authors:
    gsutil -m cp -r gs://author-style/Jane\ Austen/ gs://author-style/George\ Orwell/ ./your-local-folder/
  5. To download the entire dataset:
     gsutil -m cp -r gs://author-style/* ./your-local-folder/
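
If you prefer Python to the gsutil CLI, the same files can be fetched with the google-cloud-storage client. The snippet below is a minimal sketch, assuming the bucket allows anonymous read access; adjust the author name and local folder to your setup.

    import os
    from google.cloud import storage  # pip install google-cloud-storage

    # Anonymous client, assuming gs://author-style is publicly readable
    client = storage.Client.create_anonymous_client()

    author = "Jane Austen"             # any author listed by `gsutil ls gs://author-style/`
    local_root = "./your-local-folder"

    # Download every object under the author's prefix, preserving the folder layout
    for blob in client.list_blobs("author-style", prefix=f"{author}/"):
        local_path = os.path.join(local_root, blob.name)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)
        print(f"downloaded {blob.name}")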

Data Format and Preparation

After downloading the data, each author folder has the following structure:

Jane Austen/
    ├── book1.txt
    ├── book2.txt
    ├── ...
    ├── train/
    ├── test/
    ├── val/
    ├── train_100.pt
    ├── train_70.pt
    ├── train_35.pt
    ├── train_5.pt
    └── test.pt

We use the script src/data/split_data.py to split the .txt files into train, test, and validation sets. Since we investigated the effect of training data size on style learning, we also provide tokenized data at different training sizes (100%, 70%, 35%, and 5% of the original training data). The training script src/train/lora_accelerate.py automatically saves the tokenized data in .pt format.
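
The .pt files are serialized tensors of pre-tokenized text. A minimal loading sketch is shown below, assuming each file holds a single tensor of token-id chunks; check split_data.py and lora_accelerate.py for the exact layout.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Assumption: each .pt file stores a LongTensor of shape (num_chunks, seq_len)
    # containing pre-tokenized training chunks for one author
    train_ids = torch.load("Jane Austen/train_100.pt")

    # For causal LM training, inputs and labels are the same token ids;
    # the model shifts them internally when computing the loss
    loader = DataLoader(TensorDataset(train_ids), batch_size=8, shuffle=True)

    for (batch,) in loader:
        print(batch.shape)  # e.g. torch.Size([8, seq_len])
        break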

Training

The main training scripts are in the src/train folder:

  1. lora_accelerate.py: The main script for training the StyleTunedLM model with LoRA;
  2. lora_accelerate_masking.py: The script for training with named entities masked out. During training, an entity token's attention mask stays 1 while its label is set to -100, so the token remains visible to the model but is excluded from the loss (a minimal sketch of this masking appears after this list). Please refer to the paper for more details;
  3. authorship_train.py: The script for training a Sentence-Transformer model, used for our style-embedding alignment evaluation.
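
For reference, here is a minimal sketch of the LoRA setup and the named-entity masking described in item 2, using Hugging Face peft and transformers with spaCy as a stand-in NER tagger. The base model, LoRA hyperparameters, and NER model are illustrative assumptions, not necessarily what the training scripts use.

    import spacy
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_model = "meta-llama/Llama-2-7b-hf"  # assumption: substitute the base model used in lora_accelerate.py
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)

    # Standard LoRA adapter; rank/alpha/target modules here are illustrative defaults
    lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_cfg)

    # Named-entity masking: entity tokens keep attention mask 1 (visible to the model)
    # but their labels are set to -100, so they are excluded from the loss
    nlp = spacy.load("en_core_web_sm")  # stand-in NER tagger

    def encode_with_ne_masking(text, max_len=512):
        enc = tokenizer(text, truncation=True, max_length=max_len,
                        return_offsets_mapping=True, return_tensors="pt")
        labels = enc["input_ids"].clone()
        spans = [(e.start_char, e.end_char) for e in nlp(text).ents]
        for i, (start, end) in enumerate(enc["offset_mapping"][0].tolist()):
            if any(start < s_end and end > s_start for s_start, s_end in spans):
                labels[0, i] = -100  # masked from the loss
        enc.pop("offset_mapping")
        enc["labels"] = labels
        return enc

    batch = encode_with_ne_masking("Elizabeth walked to Netherfield in the morning.")
    loss = model(**batch).loss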

Evaluation

Dataset

As stated in the paper, we evaluate our method on a dataset of 100 prompts, available in the data/prompts folder. 50 prompts are generated by GPT-4, and the other 50 are randomly selected from each author's validation set.
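
To produce continuations for these prompts with a trained adapter, a minimal generation sketch could look like the following; the base model name and adapter directory are assumptions and should match whatever lora_accelerate.py produced.

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_model = "meta-llama/Llama-2-7b-hf"          # assumption: match the training base model
    adapter_dir = "./checkpoints/jane_austen_lora"   # hypothetical adapter output directory

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
    model = PeftModel.from_pretrained(model, adapter_dir)  # attach the style adapter
    model.eval()

    prompt = "The morning was cold and grey when"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
    print(tokenizer.decode(output[0], skip_special_tokens=True))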

Linguistic Alignment

We provide the script src/eval/eval_linguistic.py to evaluate the linguistic alignment of the generated text. The correctness score is computed against a predefined dataset, available in data/correctness.csv.
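
As a rough illustration of what linguistic alignment compares, the sketch below computes a few surface statistics for a generated file and an author's reference text. These are not the paper's exact metrics, and the file paths are hypothetical; the actual metrics and the correctness score live in src/eval/eval_linguistic.py and data/correctness.csv.

    import re

    def surface_stats(text):
        # Illustrative surface statistics only; see eval_linguistic.py for the real metrics
        sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text.lower())
        return {
            "avg_sentence_len": len(words) / max(len(sentences), 1),
            "type_token_ratio": len(set(words)) / max(len(words), 1),
            "comma_rate": text.count(",") / max(len(words), 1),
        }

    generated = open("outputs/jane_austen_generations.txt").read()  # hypothetical path
    reference = open("Jane Austen/test/sample.txt").read()          # hypothetical path

    gen_stats, ref_stats = surface_stats(generated), surface_stats(reference)
    for key in gen_stats:
        print(f"{key}: generated={gen_stats[key]:.3f}  reference={ref_stats[key]:.3f}")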
