RAFT Support for chat and completion model formats (ShishirPatil#417)
Adds support for converting the dataset to the formats expected for fine-tuning
`completion` and `chat` models, as specified here:

https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

`chat` format:
```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```

`completion` format:
```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```

`raft.py`:
- Supports `jsonl` and `parquet` output types
- Supports `hf`, `chat` and `completion` formats
- `chat` format also accepts a `--output-chat-system-prompt` param to
configure the system prompt
- Ignore venv folders
- Added usage to --help

```
  --output-format {hf,completion,chat}
                        Format to convert the dataset to. Defaults to hf.
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```


New `format.py` script to convert a dataset previously generated by
`raft.py`:

```
$ python format.py --help
usage: format.py [-h] --input INPUT [--input-type {arrow,jsonl}] --output OUTPUT --output-format {hf,completion,chat}
                 [--output-type {parquet,jsonl}] [--output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT]

options:
  -h, --help            show this help message and exit
  --input INPUT         Input HuggingFace dataset file
  --input-type {arrow,jsonl}
                        Format of the input dataset. Defaults to arrow.
  --output OUTPUT       Output file
  --output-format {hf,completion,chat}
                        Format to convert the dataset to
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```

How to test `format.py` with the `chat` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.chat.jsonl \
    --output-format chat \
    --output-chat-system-prompt 'You are an AI expert on UC Berkeley'
```

How to test `format.py` with the `completion` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.completion.jsonl \
    --output-format completion
```

How to test `raft.py` with the `chat` format:

```
python3 raft.py \
    --datapath $PWD/sample_data/UC_Berkeley_short.pdf \
    --output $PWD/output \
    --distractors 3 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 2 \
    --completion_model gpt-4-turbo \
    --embedding_model text-embedding-ada-002 \
    --output-format chat \
    --output-chat-system-prompt "You're a RAG AI"
```
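
With the default `--output-type jsonl`, this run should write the chat-formatted dataset to `$PWD/output.jsonl` (the exporter appends the extension when it is missing), alongside the raw HuggingFace dataset saved to the `$PWD/output` directory.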

Co-authored-by: Shishir Patil <[email protected]>
cedricvidal and ShishirPatil authored May 10, 2024
1 parent 42f9d28 commit ae5f0a2
Showing 4 changed files with 238 additions and 17 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -52,3 +52,6 @@ berkeley-function-call-leaderboard/result/

# Ignore leaderboard score
berkeley-function-call-leaderboard/score/

.direnv/
.venv
60 changes: 45 additions & 15 deletions raft/README.md
@@ -19,9 +19,13 @@ Dependencies can be installed using the following command:
```bash
pip install -r requirements.txt
```

Arguments:
- `--datapath` - the path at which the document is located
- `--output` - the path at which to save the dataset
- `--output-format` - the format of the output dataset. Defaults to `hf` for HuggingFace. Can be one of `hf`, `completion`, `chat`.
- `--output-type` - the type of the output dataset file. Defaults to `jsonl`. Can be one of `jsonl`, `parquet`.
- `--output-chat-system-prompt` - The system prompt to use when the output format is `chat`. Optional.
- `--distractors` - the number of distractor documents to include per data point / triplet
- `--doctype` - the type of the document, must be one of the accepted doctypes
- currently accepted doctypes: `pdf`, `txt`, `json`, `api`
@@ -42,7 +46,16 @@ Arguments:

Run the following command with your desired arguments to generate the dataset. Use `--output-format` to select the output format: `hf` (default), `completion`, or `chat`.
```bash
python3 raft.py \
    --datapath PATH_TO_DATA \
    --output OUTPUT_PATH \
    --output-format hf \
    --distractors 3 \
    --p 1.0 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 5 \
    --openai_key YOUR_OPENAI_KEY
```

**Note**: As an alternative to passing the OpenAI key with the `--openai_key` argument, you can also store the standard OpenAI environment variables in a file called `.env`. All standard OpenAI env variables are supported.
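
For example, a minimal `.env` might contain just the API key (a sketch; `OPENAI_API_KEY` is the standard OpenAI variable name):

```
OPENAI_API_KEY=<your OpenAI API key>
```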
@@ -177,28 +190,45 @@ For each question-answer pair, append 4 randomly selected chunks as distractor documents
    },
    'answer': '"In God We Trust"',
    'cot_answer': None
}
```

#### 4. Generate and save dataset
RAFT repeats steps 2 and 3 for each chunk and saves the dataset to the path specified by the `--output` argument.


#### 5. Convert the dataset to the format expected for fine-tuning

If you specified the `--output-format completion` or `--output-format chat` argument for the `raft.py` script, you can skip this part.

Otherwise, you need to convert the dataset to the format expected for fine-tuning a `completion` model in Azure with the following command:

```
python3 format.py --input output/data-00000-of-00001.arrow --output output.completion.jsonl --output-format completion
```

**Note**: the `format.py` script also has its own help:

```
python3 format.py --help
```

**Note**: If fine-tuning a chat model, you need to use `--output-format chat` and can optionally add the `--output-chat-system-prompt` parameter to configure the system prompt included in the dataset.
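
For example, a sketch mirroring the `completion` conversion above (adjust the paths to your own run):

```
python3 format.py --input output/data-00000-of-00001.arrow \
    --output output.chat.jsonl \
    --output-format chat \
    --output-chat-system-prompt 'You are an AI expert on UC Berkeley'
```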

#### 6. Finetune your own model on Microsoft AI Studio
Once the dataset is prepared, follow the instructions in [azure-ai-studio-ft/howto.md](azure-ai-studio-ft/howto.md) to finetune and deploy your own RAFT model. Make sure to use `prompt` as input and `completion` as output when fine-tuning a `completion` model, and the `messages` column as input when fine-tuning a `chat` model.

#### 7. Evaluate RAFT model
After deploying your model in AI Studio, use the following command to evaluate the RAFT model. Make sure to fill in `base_url`, `api_key`, and `model_name` in `eval.py`; these can be found in AI Studio.
```bash
python3 eval.py --question-file YOUR_EVAL_FILE.jsonl --answer-file YOUR_ANSWER_FILE
```

`YOUR_EVAL_FILE.jsonl` should be in the following format:
```python
{
    'instruction': '<DOCUMENT> document1 </DOCUMENT>\n<DOCUMENT> document2 </DOCUMENT> ...\n{question}',
    'gold_answer': '{answer}'
}
```
173 changes: 173 additions & 0 deletions raft/format.py
@@ -0,0 +1,173 @@
from abc import ABC, abstractmethod
import argparse
from datasets import Dataset, load_dataset
from typing import Dict, Literal, Any, get_args

"""
This file converts raw HuggingFace Datasets into files suitable for fine-tuning completion and chat models.
"""

OutputDatasetType = Literal["parquet", "jsonl"]
outputDatasetTypes = list(get_args(OutputDatasetType))

InputDatasetType = Literal["arrow", "jsonl"]
inputDatasetTypes = list(get_args(InputDatasetType))

DatasetFormat = Literal["hf", "completion", "chat"]
datasetFormats = list(get_args(DatasetFormat))

# NB: this redefines the name get_args; the module-level typing.get_args calls above have already run.
def get_args() -> argparse.Namespace:
    """
    Parses and returns the arguments specified by the user's command
    """
    parser = argparse.ArgumentParser()

    parser.add_argument("--input", type=str, required=True, help="Input HuggingFace dataset file")
    parser.add_argument("--input-type", type=str, default="arrow", help="Format of the input dataset. Defaults to arrow.", choices=inputDatasetTypes)
    parser.add_argument("--output", type=str, required=True, help="Output file")
    parser.add_argument("--output-format", type=str, required=True, help="Format to convert the dataset to", choices=datasetFormats)
    parser.add_argument("--output-type", type=str, default="jsonl", help="Type to export the dataset to. Defaults to jsonl.", choices=outputDatasetTypes)
    parser.add_argument("--output-chat-system-prompt", type=str, help="The system prompt to use when the output format is chat")

    args = parser.parse_args()
    return args

class DatasetFormatter(ABC):
    """
    Base class for dataset formatters. Formatters rename, remove, and add
    columns to match the structure expected by the target format: HF, chat, or completion file formats.
    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
    """
    @abstractmethod
    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
        pass

class DatasetExporter(ABC):
    """
    Base class for dataset exporters. Exporters write the dataset out to different file types: JSONL, Parquet, ...
    """
    @abstractmethod
    def export(self, ds: Dataset, output_path: str):
        pass

class DatasetConverter():
    """
    Entry point class. It resolves which DatasetFormatter and which DatasetExporter to use and runs them.
    """
    formats: Dict[DatasetFormat, DatasetFormatter]
    exporters: Dict[OutputDatasetType, Any]

    def __init__(self) -> None:
        self.formats = {
            "hf": HuggingFaceDatasetFormatter(),
            "completion": OpenAiCompletionDatasetFormatter(),
            "chat": OpenAiChatDatasetFormatter()
        }
        self.exporters = {
            "parquet": ParquetDatasetExporter(),
            "jsonl": JsonlDatasetExporter()
        }

    def convert(self, ds: Dataset, format: DatasetFormat, output_path: str, output_type: OutputDatasetType, params: Dict[str, str]):
        if format not in self.formats:
            raise Exception(f"Output format {format} is not supported, please select one of {self.formats.keys()}")

        if output_type not in self.exporters:
            raise Exception(f"Output type {output_type} is not supported, please select one of {self.exporters.keys()}")

        formatter = self.formats[format]
        newds = formatter.format(ds, params)
        exporter = self.exporters[output_type]
        exporter.export(newds, output_path)

class HuggingFaceDatasetFormatter(DatasetFormatter):
    """
    Returns the HuggingFace Dataset as is
    """
    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
        return ds

def _remove_all_columns_but(ds: Dataset, keep_columns) -> Dataset:
    """
    HF Dataset doesn't have a way to copy only specific columns of a Dataset so this helper
    removes all columns but the ones specified.
    """
    remove_columns = list(ds.column_names)
    for keep in keep_columns:
        remove_columns.remove(keep)
    ds = ds.remove_columns(remove_columns)
    return ds

class OpenAiCompletionDatasetFormatter(DatasetFormatter):
    """
    Returns the Dataset in the OpenAI Completion Fine-tuning file format with two fields "prompt" and "completion".
    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
    """
    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
        newds = ds.rename_columns({'question': 'prompt', 'cot_answer': 'completion'})
        return _remove_all_columns_but(newds, ['prompt', 'completion'])

class OpenAiChatDatasetFormatter(OpenAiCompletionDatasetFormatter):
    """
    Returns the Dataset in the OpenAI Chat Fine-tuning file format with one field "messages".
    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
    """
    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
        newds = super().format(ds, params)

        def format_messages(row):
            messages = []
            if 'system_prompt' in params:
                system_prompt = params['system_prompt']
                messages.append({ "role": "system", "content": system_prompt})
            messages.extend([{ "role": "user", "content": row['prompt']}, { "role": "assistant", "content": row['completion']}])
            chat_row = {"messages": messages}
            return chat_row

        newds = newds.map(format_messages)
        return _remove_all_columns_but(newds, ['messages'])

def append_extension(path: str, extension: str) -> str:
    suffix = "." + extension
    if not path.endswith(suffix):
        path = path + suffix
    return path


class JsonlDatasetExporter(DatasetExporter):
    """
    Exports the Dataset to a JSONL file
    """

    def export(self, ds: Dataset, output_path: str):
        ds.to_json(append_extension(output_path, "jsonl"))


class ParquetDatasetExporter(DatasetExporter):
    """
    Exports the Dataset to a Parquet file
    """

    def export(self, ds: Dataset, output_path: str):
        ds.to_parquet(append_extension(output_path, "parquet"))


def main():
    """
    When format.py is executed from the command line.
    """
    args = get_args()
    ds = load_dataset(args.input_type, data_files={"train": args.input})['train']
    formatter = DatasetConverter()

    if args.output_chat_system_prompt and args.output_format != "chat":
        raise Exception("Parameter --output-chat-system-prompt can only be used with --output-format chat")

    format_params = {}
    if args.output_chat_system_prompt:
        format_params['system_prompt'] = args.output_chat_system_prompt

    formatter.convert(ds=ds, format=args.output_format, output_path=args.output, output_type=args.output_type, params=format_params)

if __name__ == "__main__":
    main()
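
To exercise the Parquet exporter end to end (a sketch; the input path assumes a dataset previously generated by `raft.py`):

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short \
    --output-format completion \
    --output-type parquet
```

Because `append_extension` adds the suffix when it is missing, this should write `output/ucb-short.parquet`.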
19 changes: 17 additions & 2 deletions raft/raft.py
@@ -16,6 +16,7 @@
from langchain_openai.embeddings import OpenAIEmbeddings
from client_utils import build_openai_client, build_langchain_embeddings
from math import ceil
from format import DatasetConverter, datasetFormats, outputDatasetTypes

log_setup()

@@ -34,6 +35,9 @@ def get_args() -> argparse.Namespace:

parser.add_argument("--datapath", type=str, default="", help="The path at which the document is located")
parser.add_argument("--output", type=str, default="./", help="The path at which to save the dataset")
parser.add_argument("--output-format", type=str, default="hf", help="Format to convert the dataset to. Defaults to hf.", choices=datasetFormats)
parser.add_argument("--output-type", type=str, default="jsonl", help="Type to export the dataset to. Defaults to jsonl.", choices=outputDatasetTypes)
parser.add_argument("--output-chat-system-prompt", type=str, help="The system prompt to use when the output format is chat")
parser.add_argument("--distractors", type=int, default=3, help="The number of distractor documents to include per data point / triplet")
parser.add_argument("--p", type=float, default=1.0, help="The percentage that the oracle document is included in the context")
parser.add_argument("--questions", type=int, default=5, help="The number of data points / triplets to generate per chunk")
@@ -292,7 +296,11 @@ def main():

    # run code
    args = get_args()

    # Validate arguments
    if args.output_chat_system_prompt and args.output_format != "chat":
        raise Exception("Parameter --output-chat-system-prompt can only be used with --output-format chat")

    OPENAPI_API_KEY = args.openai_key

    client = build_openai_client(
@@ -350,7 +358,14 @@ def main():
    ds.save_to_disk(args.output)

    formatter = DatasetConverter()

    # Extract format specific params
    format_params = {}
    if args.output_chat_system_prompt:
        format_params['system_prompt'] = args.output_chat_system_prompt

    formatter.convert(ds=ds, format=args.output_format, output_path=args.output, output_type=args.output_type, params=format_params)

    if not args.fast:
        os.remove("checkpoint.txt")
