This project provides a chatbot that leverages AI models to answer questions about protein structures with magnesium-binding sites. It processes a dataset of 7,613 articles from the RCSB PDB, each describing a protein structure with magnesium-binding sites. The articles are split into chunks with Langchain's RecursiveCharacterTextSplitter and stored in Weaviate, a vector database optimized for similarity search.
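As a rough sketch of that ingestion step (the chunk sizes, embedding model, and collection name below are illustrative assumptions, not the project's exact configuration):

```python
import weaviate
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_weaviate import WeaviateVectorStore

# Illustrative input: one string per article (the real loading code is not shown here).
articles = ["...full text of an RCSB PDB article...", "...another article..."]

# Chunk the articles; these sizes are placeholders, not the project's values.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.split_documents([Document(page_content=a) for a in articles])

# Embed and store in Weaviate (started via docker compose, reachable locally).
client = weaviate.connect_to_local()
vectorstore = WeaviateVectorStore.from_documents(
    docs,
    embedding=OllamaEmbeddings(model="llama3.1"),  # embedding model is an assumption
    client=client,
    index_name="Article",  # collection name is an assumption
)
```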
The system uses Ollama (running the llama3.1 model) for natural language processing and generating human-like responses. The chatbot is built with Flask for the web interface and served by Gunicorn as the WSGI server to handle concurrent requests in production.
In short, the application combines Langchain, Weaviate, and Ollama to answer questions from the pre-loaded protein data: it retrieves the most similar contexts from the Weaviate vector store, generates a response with Ollama, and serves the exchange through the Flask web application.
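Continuing the ingestion sketch above, the query side could be wired roughly like this (again an illustrative outline; the project's actual prompt and chain wiring may differ):

```python
from langchain_core.prompts import PromptTemplate
from langchain_ollama import OllamaLLM

# Reuses `vectorstore` from the ingestion sketch above.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top-4 similar chunks
llm = OllamaLLM(model="llama3.1")

prompt = PromptTemplate.from_template(
    "Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"
)

def answer(question: str) -> str:
    # Similarity search in Weaviate, then generation with Ollama.
    context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
    return llm.invoke(prompt.format(context=context, question=question))
```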
- Flask Web App: Provides a simple user interface to interact with the chatbot.
- Retrieval-Based Q&A: Uses Weaviate to retrieve the most relevant information and responds with concise, accurate answers.
- AI-Powered Responses: The model served by Ollama (llama3.1) generates natural language answers.
- Weaviate Dashboard: View and manage the vector store via a dashboard.
- Linux OS: This project requires a Linux-based system to ensure compatibility with NVIDIA GPU support and other dependencies.
- Docker: For containerizing and running services.
- NVIDIA GPU: Docker must be configured with NVIDIA GPU support (the NVIDIA Container Toolkit).
Follow these steps to set up and run the project on your local machine:
Follow the official NVIDIA installation guide to set up GPU support in Docker.
```sh
git clone https://github.com/ThaisBarrosAlvim/protein-chat.git
cd protein-chat
```
Use Docker Compose to set up the Weaviate, Ollama, and Flask services:
```sh
docker compose up --build
```

This builds the images and starts the Weaviate, Ollama, and Flask services.
Once Weaviate is running, follow these steps to load the dataset:

- Download the snapshot file `protein-articles4.zip`.
- Restore the dataset into the running Weaviate container:

```sh
sudo sh scripts/weaviate-restore-backup.sh protein-articles4.zip protein-chat-weaviate-1
```
Once the setup is complete, open your browser and go to http://localhost:8000 to interact with the chatbot. Type a question about the protein data; the system retrieves the relevant passages from the Weaviate vector store and generates an answer with the Ollama model.
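You can also query the endpoint directly. Here is a hypothetical example, where the JSON field name and response shape are assumptions rather than the project's documented contract:

```python
import requests

# The "message" field name is an assumption about the /message payload.
resp = requests.post(
    "http://localhost:8000/message",
    json={"message": "Which residues typically coordinate magnesium in these structures?"},
    timeout=120,
)
print(resp.text)
```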
- `/`: The homepage where users can interact with the chatbot.
- `/message`: The POST endpoint to send questions and receive responses.
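For orientation, a minimal sketch of how these two routes could look in Flask (handler names, template name, and payload fields are illustrative; `answer()` is the helper from the overview sketch above):

```python
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Serves the chat UI; the template name is illustrative.
    return render_template("index.html")

@app.route("/message", methods=["POST"])
def message():
    # Field name "message" is an assumption about the request payload.
    data = request.get_json(silent=True) or {}
    question = data.get("message", "")
    return jsonify({"answer": answer(question)})
```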
1. Prompt Engineering Enhancements:
   - Refine the prompts sent to the Ollama model to improve the consistency and accuracy of responses. This includes testing different phrasings and adjusting the context length so the system handles a wider range of questions effectively (see the prompt sketch after this list).
2. CPU Support for Docker Compose:
   - Create a `docker-compose-cpu.yml` file to run without GPU support, enabling the project to work on platforms other than Linux, including macOS and Windows, and making it accessible to more users.
3. Accessible Context for Users:
   - 3.1 Download PDFs from Context: Allow users to download the PDFs the system used as context to formulate its answers, giving them direct access to the sources for further exploration.
   - 3.2 Context Highlights in PDFs: Mark the exact sections used in the answer directly within the PDF viewer on the web page, so users can see which parts of a document contributed to the response.
4. Model Selection for Querying:
   - Let users choose the language model used to answer their queries, offering more flexibility and customization depending on their needs or preferences.
5. Selective Context PDFs:
   - Let users choose which PDFs from the dataset should be used as context for answering questions, so they can narrow the focus to the documents they consider most relevant.
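As a starting point for roadmap item 1, here is one possible stricter prompt template; this is purely illustrative, as the project's current prompt is not shown in this README:

```python
from langchain_core.prompts import PromptTemplate

# Bounds the model to the retrieved context and a fixed answer length.
qa_prompt = PromptTemplate.from_template(
    "You answer questions about protein structures with magnesium-binding sites.\n"
    "Use only the context below; if it does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer in at most three sentences:"
)
```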
Contributions are welcome! Feel free to submit a pull request or open an issue for any feature suggestions or bug fixes.
This project is licensed under the MIT License. See the `LICENSE` file for more details.