AsadNizami/Dataset-generator-for-LLM-finetuning

Synthetic Dataset Generator for LLM Finetuning

This web application creates high-quality question-answer pairs from documents for fine-tuning large language models (LLMs). It uses Ollama to interact with local LLM models and offers a user-friendly interface for generating datasets. The application stores documents in a vector database (ChromaDB) and retrieves the content that matches the specified keywords.
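The keyword-driven retrieval step can be sketched as follows. This is a minimal in-memory analogue of the lookup, purely for illustration: the real application stores embeddings in ChromaDB, and the function name and sample chunks below are hypothetical.

```python
# Minimal sketch of keyword-based chunk retrieval, assuming the document
# has already been split into text chunks. The real app queries ChromaDB;
# this in-memory version only illustrates the idea.

def retrieve_chunks(chunks: list[str], keywords: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by how many of the given keywords they contain."""
    def score(chunk: str) -> int:
        lowered = chunk.lower()
        return sum(kw.lower() in lowered for kw in keywords)

    ranked = sorted(chunks, key=score, reverse=True)
    # Keep only chunks that matched at least one keyword.
    return [c for c in ranked[:top_k] if score(c) > 0]

chunks = [
    "Transformers use self-attention to weigh input tokens.",
    "The invoice is due at the end of the month.",
    "Fine-tuning adapts a pretrained transformer to a task.",
]
print(retrieve_chunks(chunks, ["transformer", "fine-tuning"], top_k=2))
```

In the real pipeline, ChromaDB replaces this substring scoring with embedding similarity, so chunks can match a keyword's meaning without containing it verbatim.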

Features

  • Generate Q&A pairs with customizable parameters
  • Interactive results display
  • Export datasets in JSON format
  • Customizable instruction prompts
  • Multiple model support through Ollama

Installation

1. Clone the Repository

git clone <repository-url>
cd Dataset-generator-for-LLM-finetuning

2. Backend Setup

# Navigate to backend directory
cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install Python dependencies
pip install fastapi uvicorn httpx python-multipart langchain langchain-ollama

3. Frontend Setup

# Navigate to frontend directory
cd frontend

# Install Node dependencies
npm install

4. Install and Setup Ollama

  1. Install Ollama from ollama.ai
  2. Pull a compatible model:
ollama pull llama3.2
# or
ollama pull mistral

Starting the Application

1. Start Ollama Server

ollama serve

2. Start Backend Server

# Make sure you're in the backend directory and virtual environment is activated
cd backend
uvicorn main:app --reload --port 8000

The backend will be available at http://localhost:8000

3. Start Frontend Development Server

# In a new terminal, navigate to frontend directory
cd frontend
npm start

The application will open automatically at http://localhost:3000

Usage

  1. Open your browser and go to http://localhost:3000
  2. Upload a PDF file
  3. Configure generation parameters:
    • Number of Q&A pairs to generate
    • Temperature (0.1-1.0)
    • Select LLM model
    • Customize instruction prompt if needed
  4. Click "Generate Dataset" to start generation
  5. Review generated pairs in the interface
  6. Download the dataset using the "Save" button
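The generation parameters from step 3 can be captured in a small config object with the range checks the UI implies. The class and field names here are illustrative assumptions, not the application's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of the generation parameters from step 3.
# Field names are assumptions, not the app's real request schema.

@dataclass
class GenerationConfig:
    num_pairs: int = 10
    temperature: float = 0.7  # UI allows 0.1-1.0
    model: str = "llama3.2"
    instruction: str = "Generate question-answer pairs from the text below."

    def __post_init__(self) -> None:
        # Enforce the ranges the interface exposes.
        if not 0.1 <= self.temperature <= 1.0:
            raise ValueError("temperature must be between 0.1 and 1.0")
        if self.num_pairs < 1:
            raise ValueError("num_pairs must be at least 1")

cfg = GenerationConfig(num_pairs=5, temperature=0.3)
print(cfg.model)
```

Validating at construction time means a bad temperature fails immediately, before any request reaches the backend or Ollama.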

Development Notes

  • Backend runs on FastAPI with async support
  • Frontend built with React
  • Real-time streaming of generated pairs
  • Automatic retry mechanism for failed generations
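The automatic-retry idea mentioned above can be sketched with asyncio and exponential backoff. The helper below is illustrative only; the app's actual retry logic may differ in attempt count and delays:

```python
import asyncio

# Sketch of retrying a failed generation with exponential backoff.
# with_retries() and flaky() are hypothetical names for illustration.

async def with_retries(coro_factory, attempts: int = 3, base_delay: float = 0.01):
    """Run coro_factory() until it succeeds or attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)

# Demo: a task that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("generation failed")
    return "ok"

print(asyncio.run(with_retries(flaky)))  # prints: ok
```

Because the backend is async, retries like this can wait out a transient Ollama error without blocking other in-flight generations.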

Output Format

Generated datasets are saved in JSON format:

{
    "conversations": [
        {
            "from": "human",
            "value": "Generated question?"
        },
        {
            "from": "assistant",
            "value": "Generated answer."
        }
    ],
    "source": "filename.pdf"
}
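A record in this format can be loaded and sanity-checked with a few lines of Python. The validator below is an illustrative sketch, not part of the application:

```python
import json

# Load a dataset record in the format shown above and check its
# basic structure. is_valid_record() is a hypothetical helper.

record = json.loads("""
{
    "conversations": [
        {"from": "human", "value": "Generated question?"},
        {"from": "assistant", "value": "Generated answer."}
    ],
    "source": "filename.pdf"
}
""")

def is_valid_record(rec: dict) -> bool:
    convs = rec.get("conversations", [])
    return (
        isinstance(rec.get("source"), str)
        and len(convs) >= 2
        and all(
            c.get("from") in ("human", "assistant")
            and isinstance(c.get("value"), str)
            for c in convs
        )
    )

print(is_valid_record(record))  # True
```

The `from`/`value` conversation shape matches the ShareGPT-style layout many fine-tuning tools accept, which makes the exported JSON easy to feed into downstream training pipelines.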
