# NL2SQL: Natural Language to SQL Query System
AI-powered data inspection and query generation tool that transforms natural language questions into SQL queries, supporting multiple LLM providers.
NL2SQL is an advanced system that enables users to interact with databases using natural language. It combines modern LLM capabilities with database querying to provide an intuitive interface for data analysis and exploration.
Key Features:
- Natural language to SQL query conversion
- Support for OpenAI, Claude and Gemini models
- Automatic provider fallback handling
- Interactive data visualization interface
- Administrative dashboard for system monitoring
- Automated data collection and processing pipeline
The project is organized into the following modules:

- `crawler/`: Web crawler for collecting SQL-related training data
  - Data extraction components
  - Data cleaning utilities
  - Storage management
- `dashboard/`: Admin interface for system management
  - Data visualization tools
  - Model performance monitoring
  - System configuration interface
- `frontend/`: User-facing web interface
  - Query input interface
  - Results visualization
  - Interactive components
- `common/`: Shared utilities and core components
  - Configuration management
  - ML model interfaces
  - Common data models
Prerequisites:

- Python 3.8 or higher
- A virtual environment tool (e.g., virtualenv, conda)
- Git
- OpenAI API key (or a key for another supported provider)
NL2SQL supports multiple Language Model providers for query generation, each with unique strengths:

OpenAI:

- Models:
  - GPT-4 (default): best for complex queries and schema understanding
  - GPT-4-turbo: faster response times with similar quality
  - GPT-3.5-turbo: cost-effective for simpler queries
- Features:
  - Robust SQL generation
  - Advanced schema comprehension
  - High accuracy for complex queries
- Setup:
  - Get an API key from the OpenAI Platform
  - Set in `.env`: `OPENAI_API_KEY=sk-...`
  - Optional: configure an organization ID: `OPENAI_ORG_ID=org-...`
- Limitations:
  - Token limit: 128k (GPT-4), 16k (GPT-3.5)
  - Rate limits vary by tier
  - Higher latency for GPT-4
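Because token limits differ by model, a rough pre-check can avoid requests being rejected outright. Below is a minimal sketch using the common ~4-characters-per-token heuristic; exact counts require a real tokenizer (e.g. tiktoken), and the helper names here are illustrative, not part of the project's API:

```python
# Rough token-budget check before sending a prompt to a model.
# The 4-chars-per-token ratio is a heuristic, not an exact count.
TOKEN_LIMITS = {
    "gpt-4": 128_000,
    "gpt-3.5-turbo": 16_000,
}

def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, model: str, reply_budget: int = 2048) -> bool:
    """True if the prompt plus a reply budget fits the model's window."""
    limit = TOKEN_LIMITS.get(model, 16_000)
    return estimate_tokens(prompt) + reply_budget <= limit
```

This lets the caller downshift to a larger-window model (or split the input) before the provider returns a token-limit error.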
Anthropic (Claude):

- Models:
  - Claude-2.1 (recommended): latest model with improved SQL capabilities
  - Claude-2: stable model for production use
  - Claude-instant: fast, cost-effective option
- Features:
  - Excellent at explaining query logic
  - Handles complex schema relationships
  - Good error correction
- Setup:
  - Get an API key from the Anthropic Console
  - Set in `.env`: `ANTHROPIC_API_KEY=sk-ant-...`
  - Optional: set a token limit: `ANTHROPIC_MAX_TOKENS=100000`
- Limitations:
  - May require special access for production
  - Limited tool integration
  - Regional availability restrictions
Google (Gemini):

- Models:
  - Gemini Pro: primary model for SQL generation
  - Gemini Pro Vision: supports visual query contexts
  - Text-bison: legacy model for basic queries
- Features:
  - Strong multilingual support
  - Good performance on code generation
  - Integration with Google Cloud
- Setup:
  - Get an API key from Google AI Studio
  - Set in `.env`: `GOOGLE_API_KEY=AIza...`
  - Optional: set a project ID: `GOOGLE_PROJECT_ID=your-project-id`
- Limitations:
  - Regional availability varies
  - Limited historical query context
  - New platform, evolving features
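Each provider above is activated by its own environment variable. A small helper can report which providers are actually configured at startup; the function name is illustrative, but the env var keys match the setup steps above:

```python
import os

# The env var that enables each provider, matching the setup steps above.
PROVIDER_ENV_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}

def configured_providers(environ=None):
    """Return the providers whose API key is present and non-empty."""
    environ = os.environ if environ is None else environ
    return [name for name, key in PROVIDER_ENV_KEYS.items() if environ.get(key)]
```

Running this once at startup gives an early, readable error surface instead of a failed API call later.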
- Crawler Module:

  ```
  cd crawler
  pip install -r requirements.txt
  ```

- Dashboard Module:

  ```
  cd dashboard
  pip install -r requirements.txt
  ```

- Frontend Module:

  ```
  cd frontend
  pip install -r requirements.txt
  ```

- Common Utilities:

  ```
  cd common
  pip install -r requirements.txt
  ```
Each module maintains its own requirements.txt file for specific dependencies:
- crawler/requirements.txt: Web scraping and data processing
- dashboard/requirements.txt: Admin interface components
- frontend/requirements.txt: User interface elements
- common/requirements.txt: Core system dependencies
To install all dependencies:

```
pip install -r requirements.txt
```
- Environment Configuration:
  - Copy `.env.example` to `.env`
  - Configure OpenAI API credentials
  - Set database connection parameters
  - Configure model paths and parameters
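The `.env` file holds simple `KEY=value` pairs, as shown in the provider setup sections above. If you are not using a library such as python-dotenv, a minimal stdlib-only loader might look like this (a sketch: it skips comments and blank lines but deliberately ignores quoting and multi-line values):

```python
import os

def load_dotenv(path=".env"):
    """Parse simple KEY=value lines from a .env file into os.environ.

    Skips blank lines and comments; does not handle quoting or
    multi-line values (use python-dotenv for those cases).
    """
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    os.environ.update(values)
    return values
```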
- Primary Use Cases:
  - Complex database queries → OpenAI GPT-4
  - Detailed query explanations → Claude-2.1
  - Multilingual support → Gemini Pro
  - Cost-effective solutions → GPT-3.5-turbo / Claude-instant
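The recommendations above can be encoded as a simple routing table. A sketch (the use-case tags and function name are illustrative, not part of the project's API):

```python
# Map a use-case tag to the recommended model, per the guidance above.
MODEL_BY_USE_CASE = {
    "complex": "gpt-4",            # Complex database queries
    "explanation": "claude-2.1",   # Detailed query explanations
    "multilingual": "gemini-pro",  # Multilingual support
    "budget": "gpt-3.5-turbo",     # Cost-effective solutions
}

def pick_model(use_case, default="gpt-4"):
    """Return the recommended model for a use case, else the default."""
    return MODEL_BY_USE_CASE.get(use_case, default)
```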
- Performance Considerations:
  - Response time (fastest first): GPT-3.5-turbo > Claude-instant > Gemini Pro > GPT-4
  - Accuracy (highest first): GPT-4 > Claude-2.1 > Gemini Pro > GPT-3.5-turbo
  - Cost efficiency (cheapest first): Claude-instant > GPT-3.5-turbo > Gemini Pro > GPT-4
- Special Requirements:
  - Long contexts (>32k tokens) → Claude-2.1
  - Visual query support → Gemini Pro Vision
  - Regulatory compliance → choose based on data residency
  - High throughput → set up provider pools and fallbacks
- Provider Setup:

  ```yaml
  # config.yaml
  default_provider: "openai"                   # Primary provider
  fallback_providers: ["anthropic", "google"]  # Backup options
  ```
- Model Selection:

  ```yaml
  openai:
    models:
      default: "gpt-4"                                # Primary model
      alternatives: ["gpt-4-turbo", "gpt-3.5-turbo"]  # Fallbacks
  ```
- Performance Tuning:

  ```yaml
  parameters:
    max_tokens: 2048  # Adjust based on query complexity
    temperature: 0.7  # Lower for more deterministic output
    timeout: 30       # Adjust based on model response time
  ```
- Primary Provider Setup:
  - Set the API key in the `.env` file
  - Configure model parameters in `config.yaml`
  - Adjust request timeouts and retries
- Fallback Providers (optional):
  - Configure backup providers
  - Set automatic failover rules
  - Define model equivalence mappings
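The failover steps above can be sketched as: try the primary provider, and on failure walk the fallback list, translating the requested model to each provider's nearest equivalent. Everything here is illustrative (the equivalence table, exception class, and the `(name, call_fn)` provider interface are assumptions, not the project's actual API):

```python
# Hypothetical failover sketch; not the project's actual interface.
EQUIVALENT_MODELS = {
    "gpt-4": {"anthropic": "claude-2.1", "google": "gemini-pro"},
}

class ProviderError(Exception):
    """Raised when a provider call fails (stand-in for SDK errors)."""

def generate_sql(question, providers, model="gpt-4"):
    """Try each (name, call_fn) provider in order until one succeeds."""
    errors = []
    for name, call in providers:
        # Translate the model name for fallback providers when needed.
        provider_model = EQUIVALENT_MODELS.get(model, {}).get(name, model)
        try:
            return call(question, provider_model)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("; ".join(errors))
```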
- Model Parameters:
  - Adjust temperature (0.0–1.0)
  - Set token limits appropriate for the provider
  - Configure response formats
  - Customize system prompts
- Performance Optimization:
  - Enable response caching
  - Configure request batching
  - Set up connection pooling
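Response caching for repeated questions can be as simple as memoizing on the (provider, model, question) key. A stdlib-only sketch with `functools.lru_cache` (the `_call_llm` placeholder stands in for the real provider request; production deployments would likely want a TTL and a shared cache such as Redis):

```python
from functools import lru_cache

CALL_COUNT = {"n": 0}  # tracks real LLM calls, for demonstration only

def _call_llm(provider, model, question):
    """Placeholder for the actual provider request."""
    CALL_COUNT["n"] += 1
    return f"-- {provider}/{model}\nSELECT 1;"

@lru_cache(maxsize=1024)
def cached_generate(provider, model, question):
    """Repeated identical questions hit the cache, not the API."""
    return _call_llm(provider, model, question)
```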
- Database Configuration:
  - Configure database connection strings
  - Set up access credentials
  - Initialize the database schema
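The database parameters set in `.env` can be assembled into a connection URL. A sketch: the `DB_*` variable names and the `postgresql` scheme are assumptions here, so adjust them to your actual `.env` keys and database:

```python
import os
from urllib.parse import quote_plus

def database_url(environ=None):
    """Build a SQLAlchemy-style URL from DB_* environment variables.

    Variable names and the postgresql scheme are assumptions; the
    password is percent-encoded so special characters survive.
    """
    environ = os.environ if environ is None else environ
    user = environ.get("DB_USER", "nl2sql")
    password = quote_plus(environ.get("DB_PASSWORD", ""))
    host = environ.get("DB_HOST", "localhost")
    port = environ.get("DB_PORT", "5432")
    name = environ.get("DB_NAME", "nl2sql")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```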
- Code Development:
  - Create a feature branch from main
  - Implement changes following the module structure
  - Add tests for new functionality
  - Update documentation
- API Connection Issues
  - "Invalid API key"
    - Verify the correct provider key is set in `.env`
    - Check the key format matches the provider's requirements
    - Ensure the key has the required permissions/quotas
  - "Rate limit exceeded"
    - Implement request batching
    - Check usage quotas
    - Consider upgrading your API tier
  - "Model not available"
    - Verify model availability in your region
    - Check whether the model requires special access
    - Consider fallback models
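For "Rate limit exceeded" errors, retrying with exponential backoff is a common complement to batching. A stdlib-only sketch (the `RateLimitError` class is a stand-in for the provider SDK's own exception type):

```python
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

def with_backoff(fn, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Injecting `sleep` keeps the helper testable; real code would just use the default `time.sleep`.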
- Environment Setup
  - Error: "Module not found"
    - Verify all requirements are installed
    - Check that the virtual environment is activated
    - Ensure PYTHONPATH includes the project root
- Database Connection
  - Error: "Could not connect to database"
    - Verify the database credentials in `.env`
    - Check that the database is running and accessible
    - Confirm network/firewall settings
- Model-Specific Issues
  - OpenAI
    - Token limit exceeded: adjust the max_tokens parameter
    - High latency: use gpt-3.5-turbo for faster responses
  - Claude
    - Context window errors: split long inputs
    - Tool use limitations: check model capabilities
  - Gemini
    - Regional restrictions: use the appropriate endpoints
    - Version compatibility: update the API version
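For Claude context-window errors, the workaround noted above is splitting long inputs. A naive chunker that breaks on line boundaries under a character budget (purely illustrative; a tokenizer-aware splitter is better in practice, since characters are only a proxy for tokens):

```python
def split_input(text, max_chars=8000):
    """Split text into chunks under max_chars, breaking on line boundaries.

    A single line longer than max_chars still becomes its own
    (oversized) chunk; a token-aware splitter is preferable in practice.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```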
- Check the logs in `nl2sql.log`
- Review error messages in console output
- Submit an issue on GitHub with:
  - Error message and stack trace
  - System configuration details
  - Steps to reproduce the issue
- Testing:
  - Run unit tests for modified components
  - Perform integration testing
  - Validate against test datasets
- Deployment:
  - Review changes and documentation
  - Merge to the main branch
  - Deploy to the staging environment
  - Perform final validation
  - Deploy to production
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
MIT License
NL2SQL Team - Contact details here