This project is a data transformation assistant that uses natural language input to generate and apply structured transformations on a Pandas DataFrame. It leverages OpenAI's GPT-based models to interpret instructions and execute transformations like filtering, selecting columns, and adding calculated columns.
- Accepts natural language instructions to transform data.
- Automatically generates transformation commands in a structured JSON format.
- Supports the following operations:
  - Filter: Filters rows based on a condition.
  - Select: Selects specific columns.
  - Add Column: Adds new calculated columns using expressions.
- Modular and extensible design for easy maintenance and scalability.
- Python 3.12+
- A valid OpenAI API key
- Clone this repository: `git clone https://github.com/NuperSu/dataspell_ai_chat.git`, then `cd your_project`.
- Set up a virtual environment: `python3 -m venv .venv`, then `source .venv/bin/activate` (on Windows: `.venv\Scripts\activate`).
- Install dependencies: `pip install -r requirements.txt`
- Rename the `.env.example` file to `.env` in the project root and replace the placeholder with your OpenAI API key.
- Run the application: `python main.py`
Input:
Filter rows where salary > 50000 and select the name and salary columns.
Output:
Original DataFrame:
      name  age  salary
0    Alice   25   50000
1      Bob   35   60000
2  Charlie   45   70000
3    David   28   52000
Generated Transformations:
[
Command(command='filter', parameters={'predicate': 'salary > 50000'}),
Command(command='select', parameters={'columns': ['name', 'salary']})
]
Transformed DataFrame:
      name  salary
1      Bob   60000
2  Charlie   70000
3    David   52000
- Description: Manages interaction with the OpenAI API.
- Key Features:
  - Generates transformation commands based on natural language input.
  - Uses Pydantic to validate LLM output against a predefined schema.
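The validation step could look roughly like the sketch below. It assumes the model replies with a JSON array whose items have the `command` and `parameters` fields seen in the example output. The `Command` model body, the `parse_commands` helper, and the `add_column` spelling are illustrative assumptions, not the project's actual schema, and no API call is made.

```python
import json
from typing import Any, Dict, List, Literal

from pydantic import BaseModel, ValidationError


class Command(BaseModel):
    """One transformation step; field names mirror the example output."""

    # Allowed command names. "add_column" is an assumed spelling.
    command: Literal["filter", "select", "add_column"]
    parameters: Dict[str, Any]


def parse_commands(raw_json: str) -> List[Command]:
    """Parse a JSON array of commands (hypothetical helper).

    Raises pydantic.ValidationError if any item does not match the schema.
    """
    items = json.loads(raw_json)
    return [Command(**item) for item in items]


# Stand-in for a model reply, written by hand for this sketch.
llm_reply = """[
  {"command": "filter", "parameters": {"predicate": "salary > 50000"}},
  {"command": "select", "parameters": {"columns": ["name", "salary"]}}
]"""

commands = parse_commands(llm_reply)
print(commands)

# A command outside the allowed set is rejected instead of being executed.
try:
    parse_commands('[{"command": "drop_table", "parameters": {}}]')
except ValidationError as exc:
    print("rejected:", type(exc).__name__)
```

Validating before execution means a malformed or unexpected reply fails early, before anything touches the DataFrame.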
- Description: Contains the LLM prompt template for generating transformations.
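As a rough illustration of what such a template might contain, the sketch below fills a fixed prompt with the available column names and the user's instruction. The wording of the prompt, the `PROMPT_TEMPLATE` constant, and the `build_prompt` helper are all hypothetical. The project's real template may differ.

```python
from string import Template
from typing import List

# Hypothetical prompt text. string.Template is used so the JSON braces in
# the example line need no escaping.
PROMPT_TEMPLATE = Template(
    "You convert data transformation requests into a JSON array of commands.\n"
    "Available columns: $columns\n"
    "Allowed commands: filter, select, add_column.\n"
    'Example: [{"command": "filter", "parameters": {"predicate": "salary > 50000"}}]\n'
    "Request: $instruction\n"
    "Reply with JSON only."
)


def build_prompt(instruction: str, columns: List[str]) -> str:
    """Fill the template with the DataFrame's columns and the user's request."""
    return PROMPT_TEMPLATE.substitute(
        columns=", ".join(columns),
        instruction=instruction,
    )


prompt = build_prompt(
    "Filter rows where salary > 50000 and select the name and salary columns.",
    ["name", "age", "salary"],
)
print(prompt)
```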
- Description: Implements individual transformation functions.
- Functions:
  - `filter_data`: Filters rows based on a condition.
  - `select_columns`: Selects specific columns.
  - `add_column`: Adds calculated columns.
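A minimal sketch of how `filter_data`, `select_columns`, and `add_column` might be written with pandas follows. The signatures are assumptions: `predicate` and `columns` follow the parameter names in the example output, while `name` and `expression` for `add_column` are hypothetical. String conditions and expressions are evaluated with `DataFrame.query` and `DataFrame.eval`, which may not be what the project actually uses.

```python
from typing import List

import pandas as pd


def filter_data(df: pd.DataFrame, predicate: str) -> pd.DataFrame:
    """Keep the rows for which the predicate string is true."""
    return df.query(predicate)


def select_columns(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Return only the listed columns, in the given order."""
    return df[columns]


def add_column(df: pd.DataFrame, name: str, expression: str) -> pd.DataFrame:
    """Return a copy with a new column computed from an expression string."""
    result = df.copy()  # leave the caller's DataFrame untouched
    result[name] = result.eval(expression)
    return result


df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
        "age": [25, 35, 45, 28],
        "salary": [50000, 60000, 70000, 52000],
    }
)

print(filter_data(df, "salary > 50000"))
print(select_columns(df, ["name", "salary"]))
print(add_column(df, "salary_k", "salary / 1000"))
```

Each function returns a new DataFrame rather than modifying its input, which keeps the steps easy to chain and to test in isolation.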
- Description: Applies a sequence of transformations to a Pandas DataFrame.
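Applying a sequence of commands can be sketched as a loop over a dispatch table. The example below is an assumption-laden illustration: the `Command` dataclass is a plain stand-in for the validated command objects, and `HANDLERS` and `apply_transformations` are hypothetical names. It reproduces the filter-then-select example from this README.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

import pandas as pd


@dataclass
class Command:
    """Plain stand-in for a validated command (name plus keyword parameters)."""

    command: str
    parameters: Dict[str, Any]


# Hypothetical dispatch table: command name -> function(df, **parameters).
HANDLERS: Dict[str, Callable[..., pd.DataFrame]] = {
    "filter": lambda df, predicate: df.query(predicate),
    "select": lambda df, columns: df[columns],
}


def apply_transformations(df: pd.DataFrame, commands: List[Command]) -> pd.DataFrame:
    """Apply each command in order, feeding every result into the next step."""
    result = df
    for cmd in commands:
        handler = HANDLERS.get(cmd.command)
        if handler is None:
            raise ValueError(f"Unsupported command: {cmd.command}")
        result = handler(result, **cmd.parameters)
    return result


df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
        "age": [25, 35, 45, 28],
        "salary": [50000, 60000, 70000, 52000],
    }
)
commands = [
    Command("filter", {"predicate": "salary > 50000"}),
    Command("select", {"columns": ["name", "salary"]}),
]
transformed = apply_transformations(df, commands)
print(transformed)
```

With a table like this, supporting a new operation means registering one more handler, and the loop itself stays unchanged.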