
OncoGPT: A Medical Conversational Model Tailored with Oncology Domain Expertise on a Large Language Model Meta-AI (LLaMA)

Fujian Jia1,*, Xin Liu1,*, Lixi Deng1,*, Jiwen Gu1, Chunchao Pu1, Tunan Bai1, Mengjiang Huang2, Yuanzhi Lu3,§, Kang Liu1,§

1. Shenzhen Kanghua Juntai Biotech Co. Ltd., B 215, Unit No. 7, Shahe Rd W, Nanshan, Shenzhen, Guangdong Province 518063, China.

2. Department of Nutrition and Graduate Group in Nutritional Biology, University of California, Davis, CA 95616, USA.

3. Department of Pathology, The First Affiliated Hospital of Jinan University, Tianhe District, Guangzhou 510632, China.


Setup:

In a conda env with PyTorch available, run:

pip install -r requirements.txt 

Data and model:

1. OncoGPT Dataset:

180k real conversations between patients and doctors, collected for OncoGPT; a 50k subset is currently released (see Future updates), and an assumed record layout is sketched below. OncoGPT-50k
Real conversations between patients and doctors, held out as test data. test
100k real conversations between patients and doctors from ChatDoctor. ChatDoctor-HealthCareMagic-100k
Stanford Alpaca data for basic conversational capabilities. Alpaca
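
These training files presumably follow the Alpaca-style instruction/input/output JSON schema consumed by alpaca-lora-style fine-tuning scripts; this is an assumption, and the record below is purely illustrative.

    # Assumed record layout for the training JSON (not confirmed by the repo);
    # alpaca-lora-style finetune.py scripts expect this three-field schema.
    example_record = {
        "instruction": "If you are a doctor, please answer the medical "
                       "questions based on the patient's description.",
        "input": "I was recently diagnosed with stage II breast cancer. "
                 "What are my treatment options?",
        "output": "Hi, thanks for your question. Treatment for stage II "
                  "breast cancer usually combines surgery with ...",
    }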

2. Model Weights:

Place the model weight files in the ./pretrained folder.

How to fine-tune: Fine-tuning with LoRA

    WANDB__SERVICE_WAIT=300 python finetune.py \
        --base_model './pretrained' \
        --data_path 'OncoGPT_180k.json' \
        --output_dir './OncoGPT' \
        --batch_size 128 \
        --micro_batch_size 4 \
        --num_epochs 3 \
        --learning_rate 3e-4 \
        --cutoff_len 256 \
        --val_set_size 2000 \
        --lora_r 8 \
        --lora_alpha 16 \
        --lora_dropout 0.05
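
For orientation, here is a minimal sketch of what an alpaca-lora-style finetune.py does with these LoRA hyperparameters, using the Hugging Face transformers and peft libraries. It is not a transcription of the repo's script, and the target modules are an assumption carried over from alpaca-lora's defaults.

    # Minimal sketch (not the repo's actual code) of attaching LoRA adapters
    # to a LLaMA base model with the hyperparameters listed above.
    from transformers import LlamaForCausalLM
    from peft import LoraConfig, get_peft_model

    # Load the base weights placed in ./pretrained (see above).
    base_model = LlamaForCausalLM.from_pretrained("./pretrained")

    lora_config = LoraConfig(
        r=8,                # lora_r: rank of the low-rank update matrices
        lora_alpha=16,      # scaling factor applied to the update
        lora_dropout=0.05,  # dropout inside the LoRA layers
        target_modules=["q_proj", "v_proj"],  # assumption: alpaca-lora default
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # only the small adapter weights are trained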

How to run inference

You can build an OncoGPT model on your own machine and chat with it.

python chat.py  
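
If you prefer to script the interaction instead of using chat.py, a generation loop along these lines should work. This is a sketch under the assumption that the LoRA adapter was saved to ./OncoGPT; it is not a transcription of chat.py.

    # Hypothetical inference sketch; chat.py itself may differ.
    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer
    from peft import PeftModel

    tokenizer = LlamaTokenizer.from_pretrained("./pretrained")
    model = LlamaForCausalLM.from_pretrained("./pretrained", torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(model, "./OncoGPT")  # attach the LoRA adapter
    model.eval()

    prompt = "Patient: What are the common side effects of chemotherapy?\nDoctor:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=256,
                                do_sample=True, temperature=0.7, top_p=0.9)
    print(tokenizer.decode(output[0], skip_special_tokens=True))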

To apply for access to OncoGPT, please fill out this form and submit it to our OncoGPT team. Link

Examples

Patient: Lung cancer that has spread to the liver, small cell carcinoma. The current symptoms are cough and chest pain. He just started chemotherapy yesterday. Can my father be cured?
OncoGPT: Hi, thanks for writing in. Small cell carcinoma of the lung is an aggressive form of cancer with a poor prognosis. Chemotherapy is the mainstay of treatment in such cases. There is no guarantee of a cure; however, in selected cases, long-term remissions have been reported. So he needs to continue his chemotherapy and see how well it works.

Overview

OncoGPT is a specialized large language dialogue model tailored for oncology-related queries. We used Meta's popular public LLaMA-7B model as the platform for development and evaluation. By training the LLaMA-7B base model on 52K instruction examples from Stanford University's publicly available Alpaca dataset and incorporating 100K patient-physician conversations from an online medical consultation website (via the ChatDoctor LLM), the model acquired general medical conversation abilities. It was then further refined on our curated dataset of 180K+ genuine doctor-patient oncology dialogues, resulting in OncoGPT.
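
The three training stages described above can be summarized as follows; the file names are placeholders, not the repo's actual paths.

    # The three-stage recipe from the overview; file names are placeholders.
    TRAINING_STAGES = [
        ("alpaca_52k.json", "basic instruction following"),
        ("HealthCareMagic-100k.json", "general medical dialogue"),
        ("OncoGPT_180k.json", "oncology-domain specialization"),
    ]
    for data_file, purpose in TRAINING_STAGES:
        print(f"stage: fine-tune on {data_file} for {purpose}")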

Patient-physician Conversation Dataset

To collect a dataset of patient-physician conversations, we surveyed common doctor-patient dialogue websites, including Chinese-language sources, and ultimately obtained 180K conversations. In patient-physician conversations, the patient's descriptions of symptoms are often colloquial and cursory. We therefore filtered the data both manually and automatically: we removed identifying information about doctors and patients, discarded low-quality questions, and used language tools to correct grammatical errors (an illustrative sketch follows). In addition, we selected 737 questions to test the model, manually dividing them into research and therapeutic questions to evaluate model performance.
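
The cleaning pipeline itself is not released, so the sketch below only illustrates the kind of de-identification and quality filtering described above; every rule and file name in it is an assumption.

    # Illustrative cleaning pass (all rules and paths are hypothetical).
    import json
    import re

    EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
    NAME = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+\w+")

    def deidentify(text: str) -> str:
        """Mask obvious doctor/patient identity markers."""
        return NAME.sub("[NAME]", EMAIL.sub("[EMAIL]", text))

    def is_low_quality(record: dict) -> bool:
        """Drop questions that are too short to be answerable."""
        return len(record["input"].split()) < 5

    with open("raw_conversations.json") as f:  # hypothetical input file
        records = json.load(f)

    cleaned = [
        {key: deidentify(value) for key, value in record.items()}
        for record in records if not is_low_quality(record)
    ]

    with open("cleaned_conversations.json", "w") as f:
        json.dump(cleaned, f, ensure_ascii=False, indent=2)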

Future updates

Subsequent updates, including the full OncoGPT 180K database, an interactive demo, and version 2.0 with automated reference checking and a citation function, are on the roadmap.

Limitations

We emphasize that OncoGPT is for academic research only; any commercial or clinical use is prohibited. There are three factors in this decision: First, OncoGPT is based on LLaMA, which has a non-commercial license, so we necessarily inherit this restriction. Second, our model is not licensed for healthcare-related purposes. Third, we have not designed sufficient safety measures, and the current model does not guarantee the full correctness of medical diagnoses.

Reference

OncoGPT: A Medical Conversational Model Tailored with Oncology Domain Expertise on a Large Language Model Meta-AI (LLaMA)
