A web application that generates high-quality question-answer pairs from text documents for LLM finetuning. The application uses Ollama to interact with local LLM models and provides a user-friendly interface for dataset generation.
- Upload text files for processing
- Generate Q&A pairs with customizable parameters
- Real-time generation feedback
- Interactive results display
- Export datasets in JSON format
- Customizable instruction prompts
- Multiple model support through Ollama
- Adjustable temperature settings
- Error tracking and validation
- Node.js (v14 or higher)
- Python (3.8 or higher)
- Ollama installed and running locally
- A compatible LLM model pulled in Ollama (e.g., llama3.2, mistral)
```bash
git clone <repository-url>
cd dataset-generator
```
```bash
# Navigate to backend directory
cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install Python dependencies
pip install fastapi uvicorn httpx python-multipart
```
```bash
# Navigate to frontend directory
cd frontend

# Install Node dependencies
npm install
```
- Install Ollama from ollama.ai
- Pull a compatible model:

```bash
ollama pull llama3.2
# or
ollama pull mistral
```

- Start the Ollama server:

```bash
ollama serve
```
```bash
# Make sure you're in the backend directory and the virtual environment is activated
cd backend
uvicorn main:app --reload --port 8000
```
The backend will be available at http://localhost:8000
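FastAPI also serves auto-generated interactive API docs, which is a quick way to confirm the server is up. A minimal check from Python, using the httpx package installed earlier:

```python
import httpx

# FastAPI exposes interactive docs at /docs by default;
# a 200 response confirms the backend is running.
response = httpx.get("http://localhost:8000/docs")
print(response.status_code)  # expect 200
```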
```bash
# In a new terminal, navigate to the frontend directory
cd frontend
npm start
```
The application will open automatically at http://localhost:3000
- Open your browser and go to http://localhost:3000
- Upload a text file (UTF-8 encoded)
- Configure generation parameters:
  - Number of Q&A pairs to generate
  - Temperature (0.1-1.0)
- Select LLM model
- Customize instruction prompt if needed
- Click "Generate Dataset" to start generation
- Review generated pairs in the interface
- Download the dataset using the "Save" button
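The same workflow can be scripted against the backend directly. The real route and field names live in `backend/app/api/routes.py` and are not documented here, so the sketch below uses hypothetical names (`/api/generate`, `num_pairs`, `temperature`, `model`) purely for illustration:

```python
import httpx

# Hypothetical endpoint and field names -- check backend/app/api/routes.py
# for the actual API before using this.
with open("document.txt", "rb") as f:
    response = httpx.post(
        "http://localhost:8000/api/generate",
        files={"file": ("document.txt", f, "text/plain")},
        data={"num_pairs": 10, "temperature": 0.7, "model": "llama3.2"},
        timeout=None,  # generation can take a while
    )
print(response.json())
```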
- Backend Connection Error
  - Ensure the backend server is running on port 8000
  - Check that the virtual environment is activated
  - Verify all Python dependencies are installed
- Ollama Connection Error
  - Verify Ollama is running (`ollama serve`)
  - Check that the selected model is installed
  - Ensure no firewall is blocking port 11434
- Frontend Issues
  - Clear the browser cache
  - Verify your Node.js version
  - Check the browser console for error messages
- "Failed to fetch models": Ollama service not running or unreachable
- "Model not available": Selected model not installed in Ollama
- "File too large": Text file exceeds size limit
- "Generation failed": Error during Q&A pair generation
```
dataset-generator/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   └── routes.py
│   │   └── services/
│   │       └── ollama_service.py
│   └── main.py
└── frontend/
    ├── src/
    │   ├── components/
    │   │   ├── InstructDataset.js
    │   │   └── InstructDataset.css
    │   └── index.js
    └── public/
        └── index.html
```
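The wiring between these files is not spelled out here, but a typical FastAPI layout matching this tree would have `main.py` mount the router and allow requests from the React dev server. A minimal sketch, assuming `app/api/routes.py` exports a router named `router` (an assumption, not confirmed by the tree):

```python
# main.py -- illustrative sketch only; the project's actual wiring may differ.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.api.routes import router  # assumes the APIRouter is exported as `router`

app = FastAPI()

# Allow the React dev server (http://localhost:3000) to call the API.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(router)
```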
- Backend runs on FastAPI with async support
- Frontend built with React
- Real-time streaming of generated pairs
- Automatic retry mechanism for failed generations
- Comprehensive error tracking and reporting
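Ollama's `/api/generate` endpoint streams newline-delimited JSON chunks when `stream` is true, which is what makes the real-time feedback possible. A simplified sketch of how a service like `ollama_service.py` might consume that stream with retries (the function name and retry policy are illustrative, not the project's actual code):

```python
import json
import httpx

async def stream_generation(prompt: str, model: str = "llama3.2", retries: int = 3) -> str:
    """Stream a completion from Ollama, retrying transient HTTP failures."""
    for attempt in range(retries):
        try:
            text = ""
            async with httpx.AsyncClient(timeout=None) as client:
                # Ollama emits one JSON object per line until "done" is true.
                async with client.stream(
                    "POST",
                    "http://localhost:11434/api/generate",
                    json={"model": model, "prompt": prompt, "stream": True},
                ) as response:
                    async for line in response.aiter_lines():
                        if line:
                            text += json.loads(line).get("response", "")
            return text
        except httpx.HTTPError:
            if attempt == retries - 1:
                raise
    raise ValueError("retries must be at least 1")
```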
Generated datasets are saved in JSON format:
```json
{
  "conversations": [
    {
      "from": "human",
      "value": "Generated question?"
    },
    {
      "from": "assistant",
      "value": "Generated answer."
    }
  ],
  "source": "filename.txt"
}
```
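Because each record follows this `conversations` schema, post-processing is straightforward. As a sketch, flattening a saved dataset into question-answer tuples, assuming the exported file holds a JSON array of such records (the file's top-level layout is an assumption here):

```python
import json

# Assumes the saved file is a JSON array of records shaped like the
# example above; adjust if your export wraps them differently.
with open("dataset.json", encoding="utf-8") as f:
    records = json.load(f)

pairs = []
for record in records:
    turns = record["conversations"]
    # Walk the turns pairwise: a human question followed by an assistant answer.
    for q, a in zip(turns[::2], turns[1::2]):
        if q["from"] == "human" and a["from"] == "assistant":
            pairs.append((q["value"], a["value"]))

print(f"Loaded {len(pairs)} Q&A pairs")
```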