An LLM-powered framework that converts natural language queries into executable Python code for data transformation tasks. The system uses a two-stage approach (sketched below):
- Weak2StrongPrompt: a fine-tuned LLaMA model that converts natural language queries into articulated code instructions
- Prompt2Code: a GPT-4o based code generator that produces Python functions, with two optimizations:
  - Lazy-RAG (Retrieval-Augmented Generation): a retrieval system over code libraries for third-party packages
  - Sanity-check Reflection: a self-correction mechanism driven by error analysis
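To make the two-stage data flow concrete, here is a minimal sketch. The function names, served model name, and endpoints below are illustrative assumptions, not the repository's actual API:

```python
# Illustrative sketch of the two-stage pipeline; names are hypothetical
# and do not mirror the repository's actual API.
from openai import OpenAI

# Stage 1: fine-tuned LLaMA served by vLLM (OpenAI-compatible endpoint,
# assumed to be at localhost:8000)
w2s_client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Stage 2: GPT-4o(-mini) via the OpenAI API (reads OPENAI_API_KEY)
code_client = OpenAI()

def weak2strong_prompt(query: str) -> str:
    """Convert a natural-language query into an articulated code instruction."""
    resp = w2s_client.completions.create(
        model="llama3_lora_sft",  # assumed served model name
        prompt=query,
        max_tokens=128,
    )
    return resp.choices[0].text.strip()

def prompt2code(instruction: str) -> str:
    """Generate a Python function from the articulated instruction."""
    resp = code_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": instruction}],
    )
    return resp.choices[0].message.content

instruction = weak2strong_prompt("input:abc, output:ABC")
print(prompt2code(instruction))
```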
- Install dependencies

```bash
pip install -r requirements.txt
```
- Configure environment variables

```bash
# Create a .env file with your API keys
OPENAI_API_KEY=your_api_key_here
```
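If the project loads this file with python-dotenv (an assumption; check how the code actually reads configuration), the keys become ordinary environment variables:

```python
# Minimal sketch, assuming python-dotenv is used to load .env
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["OPENAI_API_KEY"]
```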
- Start vLLM server for the fine-tuned model

```bash
# Wait for the model download to finish before the server is ready
vllm serve \
  --model ./assets/models/llama3_lora_sft \
  --config ./etc/vllm-server.yaml
```
Note: you can use `CUDA_VISIBLE_DEVICES` to target the GPU device for the vLLM server.
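Once the server is up, you can sanity-check it against vLLM's OpenAI-compatible API (assuming the default port 8000):

```bash
# List the models currently served by vLLM
curl http://localhost:8000/v1/models
```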
- Test weak2strong prompt inference

```bash
python w2s_prompt_inference.py -q "input:abc, output:ABC"
# Expected output:
# format(): Convert the string to uppercase
```
- [offline, optional] Build RAG vector database

```bash
# Build the vector database for code-library retrieval
python scripts/build_vector_db.py \
  --config etc/vec_db.yaml \
  [-q "hijri date to gregorian date"]  # optional: test a single query with this argument
```

A pre-built vector database is provided in `assets/rag/code_db`.
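Conceptually, Lazy-RAG retrieval is a nearest-neighbor search over embedded code-library descriptions. The sketch below uses sentence-transformers and FAISS purely for illustration; the actual embedding model and vector store are whatever `etc/vec_db.yaml` configures:

```python
# Illustrative retrieval sketch; sentence-transformers + FAISS are
# assumptions, not the repository's actual backend.
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "hijri_converter: convert Hijri dates to Gregorian dates",
    "dateutil.parser: parse free-form date strings",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on normalized vectors
index.add(emb)

query = model.encode(["hijri date to gregorian date"], normalize_embeddings=True)
scores, ids = index.search(query, k=1)
print(docs[ids[0][0]], scores[0][0])
```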
- Run the transformation pipeline

```bash
# Test mode (with a smaller dataset)
python run.py \
  --config etc/mega-transform.yaml \
  --exp_name demo \
  --model gpt-4o-mini \
  --testing

# Full dataset run
python run.py \
  --config etc/mega-transform.yaml \
  --exp_name exp-1 \
  --model gpt-4o-mini \
  --dataset_name stackoverflow
```
- Check experiment results, as shown in the `demo` folder (a quick inspection sketch follows this list). Results include:
  - Code generation results (per task)
  - Full test results (`full_result.csv`)
  - Summary statistics (task-level accuracy, token usage, etc.)
  - Runtime logs for the current run
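For a quick programmatic look at the output, something like the following works, assuming `full_result.csv` has per-task rows (the column names here are guesses; check the actual schema):

```python
# Hypothetical inspection of experiment output; column names are assumptions.
import pandas as pd

df = pd.read_csv("demo/full_result.csv")
print(df.head())
# Task-level accuracy, assuming 'task_id' and 'passed' columns exist
print(df.groupby("task_id")["passed"].mean())
```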
```
chat-transform/
├── run.py                      # Main execution script
├── w2s_prompt_inference.py     # Weak2strong prompt inference
├── etc/                        # Configuration files
│   ├── mega-transform.yaml     # Pipeline config
│   ├── code-llm.yaml           # Baseline Code LLM config
│   ├── vllm-server.yaml        # vLLM server config
│   └── vec_db.yaml             # RAG vector database config
├── framework/                  # Core components
│   ├── chat_to_inst.py         # Chat-to-instruction conversion
│   ├── code_generator.py       # Code generation
│   ├── lazy_rag.py             # Lazy-RAG module
│   ├── reflection.py           # Sanity-check Reflection module
│   └── prompt_generator.py     # Prompt composition
├── util/                       # Utility modules
│   ├── analyzer.py             # Result analysis and reporting
│   ├── load_data.py            # Data loading utilities
│   ├── context_manager.py      # Context management
│   └── __init__.py
├── assets/                     # Model assets
│   ├── models/                 # Fine-tuned models
│   └── rag/                    # RAG files (vector DB, list of missing packages)
├── scripts/                    # Utility scripts
│   ├── build_vector_db.py      # Build RAG vector database
│   ├── foundation_model.py     # Foundation model baseline
│   └── push_to_hf.py           # Push model to Hugging Face
├── temp/                       # Temporary files (on-the-fly generated code)
├── .env                        # Environment variables
└── requirements.txt            # Project dependencies
```
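As a mental model for `framework/reflection.py`, the Sanity-check Reflection loop can be pictured as: run the generated code on the example I/O, and on failure feed the traceback back to the code LLM for another attempt. All names below are illustrative, not the module's real interface:

```python
# Illustrative reflection loop; does not mirror framework/reflection.py.
import traceback

def reflect_and_fix(generate, code: str, example_in, example_out, max_rounds: int = 3) -> str:
    """generate(prompt) -> code is a hypothetical LLM call."""
    for _ in range(max_rounds):
        try:
            scope: dict = {}
            exec(code, scope)            # load the generated function
            fn = scope["transform"]      # assumed entry-point name
            assert fn(example_in) == example_out
            return code                  # sanity check passed
        except Exception:
            err = traceback.format_exc()
            code = generate(
                f"The code below failed its sanity check.\n"
                f"Error:\n{err}\n\nCode:\n{code}\n\nReturn a corrected version."
            )
    return code
```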
Foundation model baseline (for the source code, refer to the original implementation here):
```bash
# Dataset: benchmark-stackoverflow
python scripts/foundation_model.py --dataset stackoverflow --model gpt-4o-mini

# Dataset: benchmark-BingQuery (semantic)
python scripts/foundation_model.py --dataset bingquery-logs --model gpt-4o-mini
```
Naive code generation baseline:

```bash
# Use the code-llm config here
python run.py \
  --config etc/code-llm.yaml \
  --exp_name exp-1 \
  --model gpt-4o-mini \
  --dataset_name stackoverflow
```
The fine-tuned Weak2StrongPrompt model is available on HuggingFace. Move the model files to `assets/models/`.
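One way to fetch the checkpoint is with `huggingface_hub` (the repo id below is a placeholder; substitute the actual id from the model page):

```python
# Download the fine-tuned model into the path the vLLM command expects
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="YOUR_ORG/llama3_lora_sft",  # placeholder repo id
    local_dir="./assets/models/llama3_lora_sft",
)
```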