RAFT Support for chat and completion model formats (ShishirPatil#417)
Adds support for converting the dataset to the formats expected for fine-tuning
`completion` and `chat` models, as specified here:

https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

`chat` format:
```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```

`completion` format:
```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```

`raft.py`:
- Supports `jsonl` and `parquet` output types
- Supports `hf`, `chat` and `completion` formats
- `chat` format also accepts a `--output-chat-system-prompt` param to
configure the system prompt
- Ignore venv folders
- Added usage to --help

```
  --output-format {hf,completion,chat}
                        Format to convert the dataset to. Defaults to hf.
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```


New `format.py` script to convert a dataset previously generated by
`raft.py`:

```
$ python format.py --help
usage: format.py [-h] --input INPUT [--input-type {arrow,jsonl}] --output OUTPUT --output-format {hf,completion,chat}
                 [--output-type {parquet,jsonl}] [--output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT]

options:
  -h, --help            show this help message and exit
  --input INPUT         Input HuggingFace dataset file
  --input-type {arrow,jsonl}
                        Format of the input dataset. Defaults to arrow.
  --output OUTPUT       Output file
  --output-format {hf,completion,chat}
                        Format to convert the dataset to
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```

How to test `format.py` with the `chat` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.chat.jsonl \
    --output-format chat \
    --output-chat-system-prompt 'You are an AI expert on UC Berkeley'
```

How to test `format.py` with the `completion` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.completion.jsonl \
    --output-format completion
```

How to test `raft.py` with the `chat` format:

```
python3 raft.py \
    --datapath $PWD/sample_data/UC_Berkeley_short.pdf \
    --output $PWD/output \
    --distractors 3 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 2 \
    --completion_model gpt-4-turbo \
    --embedding_model text-embedding-ada-002 \
    --output-format chat \
    --output-chat-system-prompt "You're a RAG AI"
```
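
With the default `--output-type jsonl`, this run should write the chat-formatted dataset to `$PWD/output.jsonl` (the exporter appends the extension when it is missing), alongside the raw HuggingFace dataset saved to the `$PWD/output` directory.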

Co-authored-by: Shishir Patil <[email protected]>
cedricvidal and ShishirPatil authored May 10, 2024
1 parent 42f9d28 commit ae5f0a2
Showing 4 changed files with 238 additions and 17 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -52,3 +52,6 @@ berkeley-function-call-leaderboard/result/

# Ignore leaderboard score
berkeley-function-call-leaderboard/score/

.direnv/
.venv
60 changes: 45 additions & 15 deletions raft/README.md
@@ -19,9 +19,13 @@ Dependencies can be installed using the following command:
```bash
pip install -r requirements.txt
```

Arguments:
- `--datapath` - the path at which the document is located
- `--output` - the path at which to save the dataset
- `--output-format` - the format of the output dataset. Defaults to `hf` for HuggingFace. Can be one of `hf`, `completion`, `chat`.
- `--output-type` - the type of the output dataset file. Defaults to `jsonl`. Can be one of `jsonl`, `parquet`.
- `--output-chat-system-prompt` - The system prompt to use when the output format is `chat`. Optional.
- `--distractors` - the number of distractor documents to include per data point / triplet
- `--doctype` - the type of the document, must be one of the accepted doctypes
- currently accepted doctypes: `pdf`, `txt`, `json`, `api`
@@ -42,7 +46,16 @@ Arguments:

Run the following command with your desired arguments to generate the dataset. Use `--output-format` to select the output format: `hf` (default), `completion`, or `chat`.
```bash
python3 raft.py \
    --datapath PATH_TO_DATA \
    --output OUTPUT_PATH \
    --output-format hf \
    --distractors 3 \
    --p 1.0 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 5 \
    --openai_key YOUR_OPENAI_KEY
```

**Note**: As an alternative to passing the OpenAI key with the `--openai_key` argument, you can also store the standard OpenAI environment variables in a file called `.env`. All standard OpenAI env variables are supported.
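
For example, a minimal `.env` might contain just the API key (a sketch; `OPENAI_API_KEY` is the standard OpenAI variable name):

```
OPENAI_API_KEY=<your OpenAI API key>
```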
@@ -177,28 +190,45 @@ For each question-answer pair, append 4 randomly selected chunks as distractor documents
    },
    'answer': '"In God We Trust"',
    'cot_answer': None
}
```

#### 4. Generate and save dataset
RAFT repeats steps 2 and 3 for each chunk and saves the dataset to the path specified by the `--output` argument.


#### 5. Convert the dataset to the format expected for fine-tuning

If you specified the `--output-format completion` or `--output-format chat` argument for the `raft.py` script, you can skip this part.

Otherwise, you need to convert the dataset to the format expected for fine-tuning a `completion` model in Azure with the following command:

```
python3 format.py --input output/data-00000-of-00001.arrow --output output.completion.jsonl --output-format completion
```

**Note**: the `format.py` script also has its own help:

```
python3 format.py --help
```

**Note**: If fine-tuning a chat model, you need to use `--output-format chat` and can optionally add the `--output-chat-system-prompt` parameter to configure the system prompt included in the dataset.
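
For example, a sketch mirroring the `completion` conversion above (adjust the paths to your own run):

```
python3 format.py --input output/data-00000-of-00001.arrow \
    --output output.chat.jsonl \
    --output-format chat \
    --output-chat-system-prompt 'You are an AI expert on UC Berkeley'
```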

#### 6. Finetune your own model on Microsoft AI Studio
Once the dataset is prepared, follow the instructions in [azure-ai-studio-ft/howto.md](azure-ai-studio-ft/howto.md) to finetune and deploy your own RAFT model. Make sure to use `prompt` as input and `completion` as output when fine-tuning a `completion` model, and the `messages` column as input when fine-tuning a `chat` model.

#### 7. Evaluate RAFT model
After deploying your model in AI Studio, use the following command to evaluate the RAFT model. Make sure to fill in `base_url`, `api_key`, and `model_name` in `eval.py`; these can be found in AI Studio.
```bash
python3 eval.py --question-file YOUR_EVAL_FILE.jsonl --answer-file YOUR_ANSWER_FILE
```

`YOUR_EVAL_FILE.jsonl` should be in the following format:
```python
{
    'instruction': '<DOCUMENT> document1 </DOCUMENT>\n<DOCUMENT> document2 </DOCUMENT> ...\n{question}',
    'gold_answer': '{answer}'
}
```
173 changes: 173 additions & 0 deletions raft/format.py
@@ -0,0 +1,173 @@
from abc import ABC, abstractmethod
import argparse
from datasets import Dataset, load_dataset
from typing import Dict, Literal, Any, get_args

"""
This file converts raw HuggingFace Datasets into files suitable for fine-tuning completion and chat models.
"""

OutputDatasetType = Literal["parquet", "jsonl"]
outputDatasetTypes = list(get_args(OutputDatasetType))

InputDatasetType = Literal["arrow", "jsonl"]
inputDatasetTypes = list(get_args(InputDatasetType))

DatasetFormat = Literal["hf", "completion", "chat"]
datasetFormats = list(get_args(DatasetFormat))

# NB: this redefines the name get_args; the module-level typing.get_args calls above have already run.
def get_args() -> argparse.Namespace:
    """
    Parses and returns the arguments specified by the user's command
    """
    parser = argparse.ArgumentParser()

    parser.add_argument("--input", type=str, required=True, help="Input HuggingFace dataset file")
    parser.add_argument("--input-type", type=str, default="arrow", help="Format of the input dataset. Defaults to arrow.", choices=inputDatasetTypes)
    parser.add_argument("--output", type=str, required=True, help="Output file")
    parser.add_argument("--output-format", type=str, required=True, help="Format to convert the dataset to", choices=datasetFormats)
    parser.add_argument("--output-type", type=str, default="jsonl", help="Type to export the dataset to. Defaults to jsonl.", choices=outputDatasetTypes)
    parser.add_argument("--output-chat-system-prompt", type=str, help="The system prompt to use when the output format is chat")

    args = parser.parse_args()
    return args

class DatasetFormatter(ABC):
    """
    Base class for dataset formatters. Formatters rename, remove, and add
    columns to match the structure expected by the target format: HF, chat, or completion file formats.
    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
    """
    @abstractmethod
    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
        pass

class DatasetExporter(ABC):
    """
    Base class for dataset exporters. Exporters write the dataset out to different file types: JSONL, Parquet, ...
    """
    @abstractmethod
    def export(self, ds: Dataset, output_path: str):
        pass

class DatasetConverter():
    """
    Entry point class. It resolves which DatasetFormatter and which DatasetExporter to use and runs them.
    """
    formats: Dict[DatasetFormat, DatasetFormatter]
    exporters: Dict[OutputDatasetType, Any]

    def __init__(self) -> None:
        self.formats = {
            "hf": HuggingFaceDatasetFormatter(),
            "completion": OpenAiCompletionDatasetFormatter(),
            "chat": OpenAiChatDatasetFormatter()
        }
        self.exporters = {
            "parquet": ParquetDatasetExporter(),
            "jsonl": JsonlDatasetExporter()
        }

    def convert(self, ds: Dataset, format: DatasetFormat, output_path: str, output_type: OutputDatasetType, params: Dict[str, str]):
        if format not in self.formats:
            raise Exception(f"Output format {format} is not supported, please select one of {self.formats.keys()}")

        if output_type not in self.exporters:
            raise Exception(f"Output type {output_type} is not supported, please select one of {self.exporters.keys()}")

        formatter = self.formats[format]
        newds = formatter.format(ds, params)
        exporter = self.exporters[output_type]
        exporter.export(newds, output_path)

class HuggingFaceDatasetFormatter(DatasetFormatter):
    """
    Returns the HuggingFace Dataset as is
    """
    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
        return ds

def _remove_all_columns_but(ds: Dataset, keep_columns) -> Dataset:
    """
    HF Dataset doesn't have a way to copy only specific columns of a Dataset so this helper
    removes all columns but the ones specified.
    """
    remove_columns = list(ds.column_names)
    for keep in keep_columns:
        remove_columns.remove(keep)
    ds = ds.remove_columns(remove_columns)
    return ds

class OpenAiCompletionDatasetFormatter(DatasetFormatter):
    """
    Returns the Dataset in the OpenAI Completion Fine-tuning file format with two fields "prompt" and "completion".
    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
    """
    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
        newds = ds.rename_columns({'question': 'prompt', 'cot_answer': 'completion'})
        return _remove_all_columns_but(newds, ['prompt', 'completion'])

class OpenAiChatDatasetFormatter(OpenAiCompletionDatasetFormatter):
    """
    Returns the Dataset in the OpenAI Chat Fine-tuning file format with one field "messages".
    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
    """
    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
        newds = super().format(ds, params)

        def format_messages(row):
            messages = []
            if 'system_prompt' in params:
                system_prompt = params['system_prompt']
                messages.append({ "role": "system", "content": system_prompt})
            messages.extend([{ "role": "user", "content": row['prompt']}, { "role": "assistant", "content": row['completion']}])
            chat_row = {"messages": messages}
            return chat_row

        newds = newds.map(format_messages)
        return _remove_all_columns_but(newds, ['messages'])

def append_extension(path: str, extension: str) -> str:
    suffix = "." + extension
    if not path.endswith(suffix):
        path = path + suffix
    return path


class JsonlDatasetExporter(DatasetExporter):
    """
    Exports the Dataset to a JSONL file
    """

    def export(self, ds: Dataset, output_path: str):
        ds.to_json(append_extension(output_path, "jsonl"))


class ParquetDatasetExporter(DatasetExporter):
    """
    Exports the Dataset to a Parquet file
    """

    def export(self, ds: Dataset, output_path: str):
        ds.to_parquet(append_extension(output_path, "parquet"))


def main():
    """
    When format.py is executed from the command line.
    """
    args = get_args()
    ds = load_dataset(args.input_type, data_files={"train": args.input})['train']
    formatter = DatasetConverter()

    if args.output_chat_system_prompt and args.output_format != "chat":
        raise Exception("Parameter --output-chat-system-prompt can only be used with --output-format chat")

    format_params = {}
    if args.output_chat_system_prompt:
        format_params['system_prompt'] = args.output_chat_system_prompt

    formatter.convert(ds=ds, format=args.output_format, output_path=args.output, output_type=args.output_type, params=format_params)

if __name__ == "__main__":
    main()
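
To exercise the Parquet exporter end to end (a sketch; the input path assumes a dataset previously generated by `raft.py`):

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short \
    --output-format completion \
    --output-type parquet
```

Because `append_extension` adds the suffix when it is missing, this should write `output/ucb-short.parquet`.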
19 changes: 17 additions & 2 deletions raft/raft.py
@@ -16,6 +16,7 @@
from langchain_openai.embeddings import OpenAIEmbeddings
from client_utils import build_openai_client, build_langchain_embeddings
from math import ceil
from format import DatasetConverter, datasetFormats, outputDatasetTypes

log_setup()

@@ -34,6 +35,9 @@ def get_args() -> argparse.Namespace:

parser.add_argument("--datapath", type=str, default="", help="The path at which the document is located")
parser.add_argument("--output", type=str, default="./", help="The path at which to save the dataset")
parser.add_argument("--output-format", type=str, default="hf", help="Format to convert the dataset to. Defaults to hf.", choices=datasetFormats)
parser.add_argument("--output-type", type=str, default="jsonl", help="Type to export the dataset to. Defaults to jsonl.", choices=outputDatasetTypes)
parser.add_argument("--output-chat-system-prompt", type=str, help="The system prompt to use when the output format is chat")
parser.add_argument("--distractors", type=int, default=3, help="The number of distractor documents to include per data point / triplet")
parser.add_argument("--p", type=float, default=1.0, help="The percentage that the oracle document is included in the context")
parser.add_argument("--questions", type=int, default=5, help="The number of data points / triplets to generate per chunk")
@@ -292,7 +296,11 @@ def main():

    # run code
    args = get_args()

    # Validate arguments
    if args.output_chat_system_prompt and args.output_format != "chat":
        raise Exception("Parameter --output-chat-system-prompt can only be used with --output-format chat")

    OPENAPI_API_KEY = args.openai_key

    client = build_openai_client(
@@ -350,7 +358,14 @@ def main():
    ds.save_to_disk(args.output)

    formatter = DatasetConverter()

    # Extract format specific params
    format_params = {}
    if args.output_chat_system_prompt:
        format_params['system_prompt'] = args.output_chat_system_prompt

    formatter.convert(ds=ds, format=args.output_format, output_path=args.output, output_type=args.output_type, params=format_params)

    if not args.fast:
        os.remove("checkpoint.txt")
