Official scripts for the EMNLP 2023 paper: ToViLaG: Your Visual-Language Generative Model is Also An Evildoer.
Run the following command to compute the WInToRe metric.
python metrics/toxicity/wintore.py --input wintore_input.txt --output wintore_output.txt --start 0 --end 1 --M 20
Arguments include:
- --input: The file for the input toxicity list. See `wintore_input.txt` for an example.
- --output: The file for the output toxicity list. See `wintore_output.txt` for an example.
- --start: Start of the threshold range.
- --end: End of the threshold range.
- --M: The number of thresholds in the threshold set.
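To make the roles of the arguments concrete, here is a minimal sketch of a threshold-averaged toxicity gap. It is not the code in metrics/toxicity/wintore.py. It assumes that the metric averages, over M thresholds between --start and --end, the fraction of input scores above the threshold minus the fraction of output scores above it. The even spacing of thresholds, the flat score lists, and the names `threshold_grid`, `exceed_rate`, and `wintore_sketch` are all assumptions made for illustration.

```python
def threshold_grid(start, end, m):
    """Return m evenly spaced thresholds starting at `start` (spacing is an assumption)."""
    step = (end - start) / m
    return [start + i * step for i in range(m)]


def exceed_rate(scores, tau):
    """Fraction of toxicity scores strictly above the threshold tau."""
    return sum(1 for s in scores if s > tau) / len(scores)


def wintore_sketch(input_scores, output_scores, start=0.0, end=1.0, m=20):
    """Average over thresholds of (input exceed rate - output exceed rate).

    Hypothetical helper: it mirrors the command-line arguments above,
    but is not taken from the official script.
    """
    taus = threshold_grid(start, end, m)
    gaps = [exceed_rate(input_scores, t) - exceed_rate(output_scores, t) for t in taus]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Toy scores: one toxic input, two mildly toxic outputs.
    print(wintore_sketch([0.9, 0.1], [0.1, 0.1], start=0.0, end=1.0, m=2))
```

Under this reading, a value near zero or below means the outputs are at least as toxic as the inputs, which would be consistent with the negative values reported for provocative prompts further down.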
Image-to-text metrics: BERTScore, ROUGE, and CLIPSIM.
Text-to-image metrics: IS, FID, and CLIPSIM.
Text toxicity classifier: Perspective API. A simple direct implementation is available here.
Image toxicity classifiers: We fine-tune three ViT-Huge models on a subset of the toxic images, one for each of the three types of toxicity.
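For readers who have not used the Perspective API, the sketch below shows how a request body for its `comments:analyze` endpoint can be built and how the toxicity score can be read from a response. It is not the implementation linked above. The helper names `build_request` and `parse_score` are hypothetical, and `example_response` is a hand-written illustration of the response shape, not real API output.

```python
# Public endpoint of the Perspective Comment Analyzer API.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text, attribute="TOXICITY"):
    """Build the JSON body for one comments:analyze request."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {attribute: {}},
        "languages": ["en"],
    }


def parse_score(response, attribute="TOXICITY"):
    """Extract the summary score (a value in [0, 1]) from a parsed JSON response."""
    return response["attributeScores"][attribute]["summaryScore"]["value"]


# Hand-written example of the response shape (illustrative values only).
example_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.92, "type": "PROBABILITY"}}
    },
    "languages": ["en"],
}

if __name__ == "__main__":
    print(build_request("some generated caption"))
    print(parse_score(example_response))
```

To score real text, the body would be sent as a POST to `ANALYZE_URL` with a `key` query parameter holding your own API key, for example with `requests.post(ANALYZE_URL, params={"key": API_KEY}, json=build_request(text))`, where `API_KEY` is a placeholder.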
Category | Number of Images | Number of Texts |
---|---|---|
Mono-toxic pairs <toxic image, non-toxic text> | 4,349 | 10,000 |
Mono-toxic pairs <toxic text, non-toxic image> | 10,000 | 9,794 |
Co-toxic pairs <toxic text, toxic image> | 5,142 | 9,869 |
Provocative text prompts | 902 | |
Unpaired | 21,559 | 31,674 |
Unpaired toxic images:
- Pornographic images: Download the NSFW Image Classification dataset from Kaggle. We use the `porn` class in the test set for toxicity benchmarking, with a total of 8,595 images.
- Violent images: Request the UCLA Protest Image Dataset from here, provided in Won et al., Protest Activity Detection and Perceived Violence Estimation from Social Media Images, ACM Multimedia 2017. We use the combination of the `protest` class from the train and test sets for toxicity benchmarking, with a total of 11,659 images.
- Bloody images: Please contact me via email to obtain the images, totaling 1,305 images for toxicity benchmarking.
Unpaired toxic text: We use a subset (21,805 texts) for toxicity benchmarking, which can be downloaded from here.
<toxic image, non-toxic text>
- Toxic images: Same as the unpaired toxic images.
- Non-toxic text: Generated by GIT for the toxic images, then filtered by PerspectiveAPI, PPL, CLIPScore, and Jaccard similarity.
<toxic text, non-toxic image>
- Ready-made: Detected and collected from existing VL datasets.

  Datasets | Number of toxic pairs |
  ---|---|
  COCO | 570 |
  Flickr30k | 233 |
  CC12M | 4,286 |

- Augmented:
  - Non-toxic images: From part of COCO.
  - Toxic text: Rewritten by fBERT on corresponding text of non-toxic images; Filtered by PerspectiveAPI, PPL, CLIPScore, Jaccard similarity.
<toxic text, toxic image>
- Toxic images: Same as the unpaired toxic images.
- Toxic text: Generated by BLIP for the toxic images, then filtered by PerspectiveAPI, PPL, CLIPScore, and Jaccard similarity.
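Of the filters named above, Jaccard similarity is the simplest to illustrate. The sketch below computes it over lowercase whitespace tokens and uses it to drop near-duplicate texts. How the filter is actually applied in this benchmark is not described here, so the deduplication use, the tokenization, the `max_sim` threshold of 0.8, and the function names are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase whitespace-token sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def drop_near_duplicates(texts, max_sim=0.8):
    """Greedily keep a text only if it is not too similar to any text already kept.

    Hypothetical helper: max_sim is an illustrative threshold, not a value
    taken from the paper or the repository.
    """
    kept = []
    for text in texts:
        if all(jaccard(text, other) <= max_sim for other in kept):
            kept.append(text)
    return kept


if __name__ == "__main__":
    captions = ["a dog runs", "A dog runs", "a cat sleeps"]
    print(drop_near_duplicates(captions))
```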
Provocative text prompts: Constructed by a gradient-guided search method on Stable Diffusion.
Download the prompts from here.
Image-to-text generation
We use 21,559 toxic images to evaluate the I2T models.
In our paper, all models generate outputs with top-k and top-p sampling. The toxicity evaluation results of each model are as follows:
Models | TP% ↑ | WInToRe% ↓ |
---|---|---|
OFA | 3.41 | 90.16 |
VinVL | 2.06 | 89.56 |
CLIP-ViL | 0.74 | 88.99 |
GIT | 11.57 | 86.13 |
GRIT | 12.79 | 84.70 |
LLaVA | 29.25 | 80.89 |
BLIP | 32.51 | 75.66 |
BLIP2 | 37.61 | 66.55 |
BLIP2 | 40.41 | 64.76 |
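The top-k and top-p sampling mentioned before the table can be sketched on a toy next-token distribution as follows. This is a generic illustration of the technique, not the decoding code of any model listed. The values of `k` and `p`, the toy distribution, and the function names are assumptions, since the settings used in the paper are not given here.

```python
import random


def top_k_top_p_filter(probs, k, p):
    """Keep at most k highest-probability tokens, then the shortest prefix of
    those whose cumulative probability reaches p, and renormalize the rest."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample_token(probs, k, p, rng=random):
    """Draw one token from the filtered, renormalized distribution."""
    filtered = top_k_top_p_filter(probs, k, p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
    print(top_k_top_p_filter(toy, k=3, p=0.7))
    print(sample_token(toy, k=3, p=0.7, rng=random.Random(0)))
```

Because sampling is stochastic, the same image or prompt can yield different outputs across runs, which is one reason toxicity is measured over many generations.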
Text-to-image generation
We use 21,805 toxic prompts and 902 provocative prompts to evaluate the T2I models.
The toxicity evaluation results of each model are as follows:
Models | Toxic Prompts: TP% ↑ | Toxic Prompts: WInToRe% ↓ | Provocative Prompts: TP% ↑ | Provocative Prompts: WInToRe% ↓ |
---|---|---|---|---|
CogView2 | 8.10 | 81.37 | 44.68 | -8.59 |
DALLE-Mage | 10.19 | 80.96 | 33.15 | -7.29 |
OFA | 19.08 | 80.64 | 37.03 | -7.44 |
Stable Diffusion | 23.32 | 80.12 | 100 | -19.02 |
LAFITE | 21.48 | 79.33 | 27.38 | -6.51 |
CLIP-GEN | 22.93 | 79.97 | 7.32 | 1.18 |
We use the mono-toxic pairs and the co-toxic pairs to fine-tune each model, respectively.
Image-to-text generation models: GIT, GRIT, BLIP
Text-to-image generation models: Stable Diffusion, LAFITE, CLIP-GEN
In our paper, we apply the SMIB method to three models: GIT, GRIT, and BLIP.
We use 5,000 non-toxic image-text pairs from COCO and 5,000 toxic ones from our co-toxic pairs for training. We take the implementation of BLIP with SMIB as an example.
Run the following command to train the BLIP model with detoxification:
python method/BLIP/train_caption_detox.py --output_dir outputs/detox --device 1
Infer the detoxified text for toxic images:
python method/BLIP/inference.py --image_path /path/to/toxic_images/ --model_size large --device 1
If you have any problems with the implementation or any other questions, feel free to open an issue or email me ([email protected]).