The official repository of the paper: Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning [paper] [demo].
In the paper, we choose 10 authors from Project Gutenberg as the style sources. We performed data cleaning and preprocessing, such as removing all Project Gutenberg headers and footers. The data is available in our Google Cloud Storage bucket; please follow the instructions below to download it.
- Install the Google Cloud SDK (see the official doc);
- To list all available authors:

```
gsutil ls gs://author-style/
```

- To download a specific author's data (e.g., Jane Austen):

```
gsutil -m cp -r gs://author-style/Jane\ Austen/ ./your-local-folder/
```

- To download multiple authors:

```
gsutil -m cp -r gs://author-style/Jane\ Austen/ gs://author-style/George\ Orwell/ ./your-local-folder/
```

- To download the entire dataset:

```
gsutil -m cp -r gs://author-style/* ./your-local-folder/
```
After downloading the data, each author folder has the following structure:
```
Jane Austen/
├── book1.txt
├── book2.txt
├── ...
├── train/
├── test/
├── val/
├── train_100.pt
├── train_70.pt
├── train_35.pt
├── train_5.pt
└── test.pt
```
We use the script `src/data/split_data.py` to split all the txt files into train, test, and validation sets. Since we investigated the effect of the training data size on style learning, we also provide tokenized data at different training sizes (100%, 70%, 35%, and 5% of the original training data). The model training script `src/train/lora_accelerate.py` automatically saves the tokenized data in `.pt` format.
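The tokenized splits can be inspected directly with `torch.load`. This is just a quick sanity-check sketch; the exact structure of the saved object (tensor, dict, or dataset) depends on what `lora_accelerate.py` writes, so the field names below are assumptions.

```python
import torch

# Load the 5% tokenized training split for one author (path is illustrative).
data = torch.load("your-local-folder/Jane Austen/train_5.pt")

print(type(data))
# If the object is a dict of tensors (e.g., {"input_ids": ..., "attention_mask": ...}),
# print each field's shape:
if isinstance(data, dict):
    for key, value in data.items():
        print(key, getattr(value, "shape", type(value)))
```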
All the main training scripts are in the `src/train` folder. We provide the following training scripts:
- `lora_accelerate.py`: The main script for training the StyleTunedLM model with LoRA;
- `lora_accelerate_masking.py`: The script for training with named entities masked out. During training, the attention mask of a masked token stays 1 while its label is set to -100, so the token is attended to but excluded from the loss calculation (a sketch of this labeling follows the list). Please refer to the paper for more details;
- `authorship_train.py`: The script for training a Sentence-Transformer model, which is used for our style-embedding alignment evaluation.
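As a loose illustration of the masking step (not the repository's actual implementation), the sketch below uses spaCy NER and a fast Hugging Face tokenizer's `offset_mapping` to set the labels of named-entity tokens to -100, the value ignored by the standard causal LM cross-entropy loss; the model choice and alignment logic are assumptions.

```python
import spacy
from transformers import AutoTokenizer

# Hypothetical setup: spaCy for NER, GPT-2's (fast) tokenizer as a stand-in.
nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Elizabeth walked to Longbourn before breakfast."
entity_spans = [(ent.start_char, ent.end_char) for ent in nlp(text).ents]

enc = tokenizer(text, return_offsets_mapping=True)
labels = list(enc["input_ids"])

# Tokens overlapping a named entity keep attention_mask == 1 but get
# label -100, so they are seen by the model yet excluded from the loss.
for i, (start, end) in enumerate(enc["offset_mapping"]):
    if any(start < e_end and end > e_start for e_start, e_end in entity_spans):
        labels[i] = -100

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), labels)))
```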
As stated in the paper, we evaluate our method on a dataset of 100 prompts, available in the `data/prompts` folder. 50 prompts were generated by GPT-4, and the other 50 were randomly selected from the validation set of each author.
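To produce generations for these prompts with a finetuned adapter, one option is to attach the LoRA weights with the `peft` library. The sketch below is a minimal example; the base model name, adapter path, and sampling settings are placeholders, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Llama-2-7b-hf"         # placeholder base model
adapter_path = "checkpoints/jane-austen-lora"  # placeholder adapter folder

tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

prompt = "The morning was unusually quiet when"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```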
We provide the script `src/eval/eval_linguistic.py` to evaluate the linguistic alignment of the generated text. During the evaluation, the correctness score is computed against a pre-defined dataset, available in `data/correctness.csv`.
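Alongside the linguistic metrics, the style-embedding alignment evaluation mentioned earlier reduces to comparing sentence embeddings. As a rough illustration with the `sentence-transformers` library, using a stock checkpoint in place of the model trained by `src/train/authorship_train.py`:

```python
from sentence_transformers import SentenceTransformer, util

# Stock checkpoint as a placeholder for the authorship model from the paper.
model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "It is a truth universally acknowledged that quiet mornings suit reflection."
reference = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

embeddings = model.encode([generated, reference], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"style-embedding similarity: {score:.3f}")
```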