TIP

🤗 Demo [Coming Soon] | 📃 Paper | Data | 🐦 Twitter

We are thrilled to release TIP (Dual Text-Image Prompting), which augments a Large Language Model with a Text-to-Image model to generate coherent and authentic multimodal procedural plans toward a high-level goal. Please check out our paper, "Multimodal Procedural Planning via Dual Text-Image Prompting"!

Overview

Our Dual Text-Image Prompting (TIP) model generates coherent and authentic multi-step multimodal procedural plans toward a high-level goal, providing useful guidance for task completion.

The vanilla text plan is generated using an LLM. Our Text-Image Prompting (TIP) generates the textually grounded image plan using T2I-Bridge (Fig. 3) and the visually grounded text plan using I2T-Bridge (Fig. 5). The colors blue and green highlight the improved grounding in text and image respectively.

Improved grounding in textual and visual context is highlighted in pink and green respectively. Red text indicates reasoning about physical actions in image plan generation.

Installation

git clone --recursive git@github.com:YujieLu10/MPP.git
cd MPP
conda create -n mpp
conda activate mpp
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
pip install -r requirements.txt
sh install.sh

Zero-shot Planning

Data Preprocess - Caption Generation

Generate captions for WikiPLAN and RecipePLAN

python preprocessors/generate_caption.py --source groundtruth_input
python preprocessors/generate_caption.py --source experiment_output

Multimodal Procedural Planning

Baselines

  • m-plan: multimodal procedural planning, in which the LLM and the T2I model collaboratively generate the procedural plan
  • u-plan: unimodal procedural planning that plans separately in the textual and visual spaces (in MPP, the LLM first generates the textual plan, and the T2I model then visualizes it as the visual plan)
  • t(v)gt-u-plan: tgt-u-plan is visual procedural planning with ground-truth textual procedural plans, i.e. generating visual plans directly from the ground-truth textual plan; vgt-u-plan is textual procedural planning with ground-truth visual procedural plans, i.e. generating textual plans directly from the ground-truth visual plan
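
To make the u-plan baseline concrete, here is a minimal Python sketch of its two-stage flow: plan in text first, then visualize each step independently. It is an illustration only and not the repository's code; `u_plan`, `toy_text_planner`, and `toy_image_generator` are hypothetical names, and the toy functions stand in for a real LLM and T2I model.

```python
from typing import Callable, List, Tuple


def u_plan(
    goal: str,
    text_planner: Callable[[str], List[str]],
    image_generator: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Unimodal planning sketch: text plan first, then one image per step."""
    # Stage 1: the text planner (an LLM in practice) produces the textual plan.
    steps = text_planner(goal)
    # Stage 2: each step is visualized on its own; the images never
    # feed back into the text plan.
    return [(step, image_generator(step)) for step in steps]


# Hypothetical stand-ins so the sketch runs without any model.
def toy_text_planner(goal: str) -> List[str]:
    return [f"Step {i} of '{goal}'" for i in (1, 2)]


def toy_image_generator(step: str) -> str:
    return f"<image for: {step}>"


plan = u_plan("make tea", toy_text_planner, toy_image_generator)
```

In this sketch the two stages only communicate in one direction, which is the contrast with m-plan, where the LLM and the T2I model generate the plan collaboratively.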

Run the command below to generate multimodal procedural plans with TIP:

python planning.py --task m-plan

Try out other baselines by replacing the default m-plan with tgt-u-plan-dalle or vgt-u-plan-blip:

python planning.py --task tgt-u-plan-dalle
python planning.py --task vgt-u-plan-blip

For T2I and I2T bridge ablation:

python planning.py --task m-plan --t2i_template_check

python planning.py --task m-plan --i2t_template_check
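
If you want to queue the runs above in one go, a small Python helper along these lines may be convenient. This is a sketch rather than part of the repository: `planning_command` and `run_all` are hypothetical helpers, and only the script name, task names, and flags already shown above are used.

```python
import shlex
import subprocess


def planning_command(task, *flags, python="python"):
    """Build the argv list for one planning.py run."""
    return [python, "planning.py", "--task", task, *flags]


# The three tasks and the two bridge ablations listed above.
RUNS = [
    planning_command("m-plan"),
    planning_command("tgt-u-plan-dalle"),
    planning_command("vgt-u-plan-blip"),
    planning_command("m-plan", "--t2i_template_check"),
    planning_command("m-plan", "--i2t_template_check"),
]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        print("Running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_all(RUNS)` from the repository root, with the `mpp` environment active, would execute the five runs sequentially.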

Caption Base:

T2I Base:

  • DALL·E (OpenAI, 512x512)
  • Stable Diffusion v2 (v2-1_512-ema-pruned.ckpt)

Evaluation

To generate plans for evaluation:

python planning.py --eval --data_type wikihow --eval_task all

To visualize the plan grid:

python amt_platform/generate_plan_grid.py --source experiment_output

To generate the Amazon Mechanical Turk evaluation format:

python amt_platform/get_amt_h2h_csv.py --source experiment_output

Check template robustness:

python evaluators/template_robustness.py

Citation

If you find this repository useful, please consider citing our paper:

@misc{lu2023multimodal,
      title={Multimodal Procedural Planning via Dual Text-Image Prompting}, 
      author={Yujie Lu and Pan Lu and Zhiyu Chen and Wanrong Zhu and Xin Eric Wang and William Yang Wang},
      year={2023},
      eprint={2305.01795},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
