This web application creates high-quality question-answer pairs from documents for fine-tuning large language models (LLMs). It uses Ollama to interact with local LLM models and offers a user-friendly interface for generating datasets. The application stores documents in a vector database (ChromaDB) and retrieves relevant passages based on user-specified keywords.
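As a rough illustration of the storage/retrieval step, here is a minimal ChromaDB sketch. The collection name, chunk contents, and IDs are hypothetical; the app's actual chunking and embedding setup may differ:

```python
import chromadb

# Persistent client; the application's actual storage path may differ.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("documents")

# Store document chunks (hypothetical content and IDs).
collection.add(
    documents=["First chunk of the uploaded PDF...", "Second chunk..."],
    ids=["doc1-chunk0", "doc1-chunk1"],
)

# Retrieve the chunks most relevant to a keyword query.
results = collection.query(query_texts=["machine learning"], n_results=2)
print(results["documents"])
```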
- Generate Q&A pairs with customizable parameters
- Interactive results display
- Export datasets in JSON format
- Customizable instruction prompts
- Multiple model support through Ollama
```bash
git clone <repository-url>
cd dataset-generator
```
```bash
# Navigate to the backend directory
cd backend

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install Python dependencies (chromadb is needed for the vector store)
pip install fastapi uvicorn httpx python-multipart langchain langchain-ollama chromadb
```
```bash
# Navigate to the frontend directory
cd frontend

# Install Node dependencies
npm install
```
- Install Ollama from [ollama.ai](https://ollama.ai)
- Pull a compatible model:
```bash
ollama pull llama3.2
# or
ollama pull mistral
```
- Start the Ollama server:

```bash
ollama serve
```
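Before starting the backend, you can verify that Ollama is reachable (it listens on port 11434 by default). A quick check in Python, using the `/api/tags` endpoint that lists installed models:

```python
import httpx

# Ollama's default local endpoint; /api/tags lists installed models.
resp = httpx.get("http://localhost:11434/api/tags")
resp.raise_for_status()
models = [m["name"] for m in resp.json()["models"]]
print("Available models:", models)
```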
```bash
# Make sure you're in the backend directory with the virtual environment activated
cd backend
uvicorn main:app --reload --port 8000
```

The backend will be available at http://localhost:8000.
```bash
# In a new terminal, navigate to the frontend directory
cd frontend
npm start
```

The application will open automatically at http://localhost:3000.
- Open your browser and go to http://localhost:3000
- Upload a PDF file
- Configure generation parameters (an example request is sketched after this list):
  - Number of Q&A pairs to generate
  - Temperature (0.1-1.0)
  - LLM model to use
  - Instruction prompt (customizable if needed)
- Click "Generate Dataset" to start generation
- Review the generated pairs in the interface
- Download the dataset using the "Save" button
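The exact request schema is defined by the backend; purely as an illustration of the parameters above, a call might look like the following (the endpoint path and field names are hypothetical and should be matched to the backend's actual schema):

```python
import httpx

# Hypothetical payload; adjust field names to the backend's actual schema.
payload = {
    "num_pairs": 10,          # number of Q&A pairs to generate
    "temperature": 0.7,       # sampling temperature (0.1-1.0)
    "model": "llama3.2",      # any model pulled into Ollama
    "instruction": "Generate factual Q&A pairs from the document.",
}
resp = httpx.post("http://localhost:8000/generate", json=payload, timeout=None)
print(resp.json())
```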
- Backend runs on FastAPI with async support
- Frontend built with React
- Real-time streaming of generated pairs (see the sketch after this list)
- Automatic retry mechanism for failed generations
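As a minimal sketch of how streaming with retries can be wired up in FastAPI (the endpoint name, retry count, and placeholder generation call are assumptions, not the app's actual code):

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_pair(i: int) -> str:
    # Placeholder for the real Ollama call; may raise on failure.
    return f'{{"from": "human", "value": "Question {i}?"}}\n'

async def generate_with_retry(i: int, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            return await generate_pair(i)
        except Exception:
            if attempt == retries - 1:
                raise  # all attempts failed; propagate the error
            await asyncio.sleep(1)  # brief backoff before retrying

@app.get("/stream")  # hypothetical endpoint name
async def stream_pairs(count: int = 5):
    async def pair_stream():
        for i in range(count):
            yield await generate_with_retry(i)
    return StreamingResponse(pair_stream(), media_type="application/x-ndjson")
```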
Generated datasets are saved in JSON format:
```json
{
  "conversations": [
    {
      "from": "human",
      "value": "Generated question?"
    },
    {
      "from": "assistant",
      "value": "Generated answer."
    }
  ],
  "source": "filename.pdf"
}
```
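To consume an exported file in a fine-tuning pipeline, the JSON can be read back directly. A small example, assuming the file contains one record as shown above (the filename is arbitrary):

```python
import json

# Load an exported dataset (filename is arbitrary).
with open("dataset.json", encoding="utf-8") as f:
    record = json.load(f)

# Walk the human/assistant turns of the conversation.
for turn in record["conversations"]:
    print(f'{turn["from"]}: {turn["value"]}')
print("Source document:", record["source"])
```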