We evaluate FasterVLM with different LLaVA models on a diverse set of 10 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding, following the original LLaVA.

Before preparing task-specific data, you MUST first download `eval.zip`, which contains custom annotations, scripts, and the prediction files from vanilla LLaVA-1.5. Extract it to `./playground/data/eval`; this also provides the general directory structure for all datasets.
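After extraction, the layout looks roughly like this (a sketch only; the subfolder names match the per-benchmark paths used in the steps below, but the exact contents depend on `eval.zip`):

```
./playground/data/eval
├── vqav2
├── gqa
├── vizwiz
├── scienceqa
├── textvqa
├── pope
├── MME
├── mmbench
└── mmvet
```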
### VQAv2

- Download `test2015` and put it under `./playground/data/eval/vqav2` (see the download sketch after this list).
- Multi-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
```

- Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.
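A hypothetical download sketch for the `test2015` images; the COCO mirror URL is an assumption, so verify it against the official VQAv2/COCO download pages:

```bash
# Assumed COCO mirror URL -- verify before use.
mkdir -p ./playground/data/eval/vqav2
wget http://images.cocodataset.org/zips/test2015.zip
unzip test2015.zip -d ./playground/data/eval/vqav2
```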
### GQA

- Download the data and evaluation scripts following the official instructions and put them under `./playground/data/eval/gqa/data`. You may need to modify `eval.py` because of the missing assets in the GQA v1.2 release.
- Multi-GPU inference (see the sharding sketch after this list):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
```
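All the multi-GPU scripts follow LLaVA's chunked-inference pattern: the question file is split into one chunk per visible GPU, and the per-chunk answers are merged before scoring. A simplified sketch of that pattern (paths and the model name are illustrative, not the exact contents of `gqa.sh`):

```bash
#!/bin/bash
# Sketch of the chunked multi-GPU pattern used by LLaVA-style eval scripts.
gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
IFS=',' read -ra GPULIST <<< "$gpu_list"
CHUNKS=${#GPULIST[@]}   # one chunk per visible GPU

for IDX in $(seq 0 $((CHUNKS-1))); do
    # Each process answers a disjoint 1/CHUNKS slice of the questions.
    CUDA_VISIBLE_DEVICES=${GPULIST[$IDX]} python -m llava.eval.model_vqa_loader \
        --model-path liuhaotian/llava-v1.5-7b \
        --question-file ./playground/data/eval/gqa/llava_gqa_testdev_balanced.jsonl \
        --image-folder ./playground/data/eval/gqa/data/images \
        --answers-file ./playground/data/eval/gqa/answers/${CHUNKS}_${IDX}.jsonl \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --temperature 0 &
done
wait  # per-chunk answer files are then concatenated and scored
```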
### VizWiz

- Download `test.json` and extract `test.zip` to `test`. Put them under `./playground/data/eval/vizwiz` (see the download sketch after this list).
- Single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
```

- Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.
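A hypothetical sketch for fetching the test images; the URL is an assumption based on the VizWiz site layout, so verify it (and the `test.json` location) on the VizWiz dataset page:

```bash
# Assumed VizWiz image URL -- verify before use.
mkdir -p ./playground/data/eval/vizwiz
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip
# Extract so the images end up under ./playground/data/eval/vizwiz/test
unzip test.zip -d ./playground/data/eval/vizwiz/test
```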
### ScienceQA

- Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo (see the sketch after this list).
- Single-GPU inference and evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
```
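One plausible way to stage the ScienceQA assets, assuming the public `lupantech/ScienceQA` repository; the `images` folder is distributed separately, so fetch it per that repo's own instructions:

```bash
# Repo URL assumed -- verify it is the official ScienceQA repository.
git clone https://github.com/lupantech/ScienceQA.git
mkdir -p ./playground/data/eval/scienceqa
cp ScienceQA/data/scienceqa/pid_splits.json \
   ScienceQA/data/scienceqa/problems.json \
   ./playground/data/eval/scienceqa/
# The `images` folder is hosted separately -- download it following the
# ScienceQA repo's instructions and place it alongside the JSON files.
```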
### TextVQA

- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `./playground/data/eval/textvqa` (see the download sketch after this list).
- Single-GPU inference and evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```
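A download sketch with the URLs we believe host these files; treat them as assumptions and verify on the TextVQA site:

```bash
# Assumed TextVQA URLs -- verify before use.
mkdir -p ./playground/data/eval/textvqa
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json \
     -P ./playground/data/eval/textvqa
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip -d ./playground/data/eval/textvqa
```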
### POPE

- Download `coco` from POPE and put it under `./playground/data/eval/pope` (see the sketch after this list).
- Single-GPU inference and evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
```
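A sketch for staging the POPE `coco` annotations, assuming the public `AoiDragon/POPE` repository and its `output/coco` layout (both are assumptions; check the POPE repository):

```bash
# Repo URL and in-repo path are assumptions -- verify against the POPE repo.
git clone https://github.com/AoiDragon/POPE.git
mkdir -p ./playground/data/eval/pope
cp -r POPE/output/coco ./playground/data/eval/pope/coco
```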
### MME

- Download the data following the official instructions.
- Download the images to `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME` (the expected layout is sketched after this list).
- Single-GPU inference and evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```
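Based on the steps above, the evaluation script presumably expects a layout like this (a sketch, not an authoritative listing):

```
./playground/data/eval/MME
├── MME_Benchmark_release_version   # official images and annotations
└── eval_tool                       # official scoring tool
```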
### MMBench

- Download `mmbench_dev_20230712.tsv` and put it under `./playground/data/eval/mmbench` (see the download sketch after this list).
- Single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
```

- Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
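A hypothetical fetch for the dev TSV; the OpenMMLab mirror below is an assumption, so double-check it against the MMBench project page. The Chinese split `mmbench_dev_cn_20231003.tsv` (next section) follows the same pattern:

```bash
# Assumed mirror URL -- verify before use.
mkdir -p ./playground/data/eval/mmbench
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv \
     -P ./playground/data/eval/mmbench
```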
### MMBench-CN

- Download `mmbench_dev_cn_20231003.tsv` and put it under `./playground/data/eval/mmbench`.
- Single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
```

- Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
### MM-Vet

- Extract `mm-vet.zip` to `./playground/data/eval/mmvet` (see the download sketch after this list).
- Single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
```

- Submit the results to the evaluation server: `./playground/data/eval/mm-vet/results`.
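A hypothetical download sketch; the release asset URL is an assumption, so check the MM-Vet repository for the current link:

```bash
# Assumed release asset URL -- verify against the MM-Vet repo.
wget https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip
unzip mm-vet.zip -d ./playground/data/eval/mmvet
```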
### Evaluating with LLaVA-NeXT

To evaluate FasterVLM with LLaVA-NeXT, simply replace `v1_5` with `v1_6` in the script paths. For example, to evaluate VQAv2 with LLaVA-NeXT, run:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_6/eval/vqav2.sh
```
### Results with LLaVA-1.5-7B

Scores are as reported by each benchmark (MME is an absolute score); the Average column gives overall performance relative to the full-token baseline (100.00%).

Method | Reduction Ratio | # Tokens | VQAv2 | GQA | VizWiz | SQA-IMG | TextVQA | POPE | MME | MMB | MMB-CN | MM-Vet | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-1.5-7B | 0% | 576 | 78.52 | 61.94 | 50.06 | 69.51 | 58.21 | 85.87 | 1506.47 | 64.69 | 58.08 | 31.30 | 100.00% |
FastV | 25% | 432 | 78.45 | 61.69 | 50.30 | 69.31 | 58.09 | 85.29 | 1528.16 | 64.52 | 58.76 | 32.80 | 100.60% |
FitPrune | 25% | 432 | 78.49 | 61.93 | 50.07 | 69.51 | 58.27 | 85.94 | 1512.72 | 64.60 | 58.33 | 31.50 | 100.15% |
SparseVLM | 25% | 432 | 78.13 | 61.38 | 50.38 | 68.62 | 57.44 | 84.56 | 1475.47 | 64.78 | 57.30 | 32.40 | 99.54% |
FasterVLM | 25% | 432 | 78.39 | 61.51 | 50.01 | 68.57 | 57.91 | 85.92 | 1501.10 | 65.12 | 58.51 | 32.40 | 100.18% |
FastV | 50% | 288 | 77.67 | 60.05 | 50.53 | 68.96 | 58.25 | 82.45 | 1513.06 | 64.26 | 58.16 | 31.70 | 99.33% |
FitPrune | 50% | 288 | 78.41 | 61.70 | 50.04 | 69.16 | 58.26 | 85.37 | 1499.70 | 64.60 | 58.16 | 31.10 | 99.73% |
SparseVLM | 50% | 288 | 76.67 | 58.78 | 51.03 | 68.57 | 57.49 | 83.87 | 1458.79 | 63.14 | 56.87 | 31.50 | 98.52% |
FasterVLM | 50% | 288 | 77.86 | 60.64 | 50.45 | 68.37 | 57.90 | 86.20 | 1471.96 | 63.83 | 56.70 | 34.80 | 100.12% |
FastV | 75% | 144 | 74.07 | 56.58 | 51.29 | 69.11 | 57.38 | 73.74 | 1463.39 | 64.00 | 57.22 | 28.60 | 95.80% |
FitPrune | 75% | 144 | 76.14 | 59.38 | 51.30 | 69.01 | 56.49 | 80.75 | 1472.86 | 63.92 | 57.65 | 28.40 | 97.22% |
SparseVLM | 75% | 144 | 72.76 | 55.11 | 51.46 | 69.36 | 55.99 | 77.57 | 1351.65 | 59.54 | 51.03 | 29.90 | 93.84% |
FasterVLM | 75% | 144 | 76.19 | 58.34 | 51.97 | 67.92 | 57.07 | 83.46 | 1433.76 | 62.54 | 57.13 | 34.20 | 98.75% |
FastV | 90% | 58 | 65.38 | 51.20 | 51.84 | 69.81 | 54.75 | 57.30 | 1210.36 | 59.97 | 51.72 | 27.20 | 87.97% |
FitPrune | 90% | 58 | 62.76 | 49.96 | 50.85 | 68.22 | 50.35 | 53.81 | 1147.46 | 56.27 | 45.53 | 21.80 | 82.07% |
SparseVLM | 90% | 58 | 62.90 | 48.86 | 49.36 | 67.23 | 48.99 | 65.82 | 1030.61 | 49.05 | 35.40 | 18.60 | 78.13% |
FasterVLM | 90% | 58 | 71.92 | 54.91 | 53.01 | 68.91 | 55.28 | 75.85 | 1348.63 | 60.57 | 54.90 | 30.10 | 94.24% |
FastV | 95% | 29 | 55.92 | 46.03 | 49.10 | 70.00 | 51.56 | 35.47 | 971.56 | 50.17 | 42.18 | 18.90 | 74.93% |
FitPrune | 95% | 29 | 52.39 | 43.60 | 48.61 | 68.32 | 46.75 | 31.17 | 855.21 | 39.69 | 29.98 | 18.00 | 67.64% |
FasterVLM | 95% | 29 | 66.75 | 51.51 | 52.67 | 69.56 | 53.09 | 67.24 | 1254.80 | 58.51 | 51.98 | 27.50 | 89.41% |
LLaVA-1.5-7B | 100% | 0 | 40.73 | 37.38 | 45.36 | 63.06 | 41.43 | 47.21 | 719.10 | 20.02 | 17.70 | 11.30 | 56.50% |
### Results with LLaVA-1.5-13B

Method | Reduction Ratio | # Tokens | VQAv2 | GQA | VizWiz | SQA-IMG | TextVQA | POPE | MME | MMB | MMB-CN | MM-Vet | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-1.5-13B | 0% | 576 | 80.00 | 63.25 | 53.61 | 72.78 | 61.19 | 85.99 | 1531.19 | 68.47 | 63.49 | 36.30 | 100.00% |
FastV | 25% | 432 | 79.96 | 63.11 | 53.79 | 72.78 | 61.29 | 85.86 | 1543.97 | 68.38 | 63.40 | 35.20 | 99.76% |
SparseVLM | 25% | 432 | 79.25 | 61.29 | 53.60 | 73.77 | 60.46 | 72.45 | 1481.40 | 68.30 | 61.77 | 35.50 | 97.20% |
FasterVLM | 25% | 432 | 79.61 | 61.24 | 53.16 | 72.98 | 60.35 | 86.73 | 1487.34 | 67.61 | 63.40 | 36.70 | 99.21% |
FastV | 50% | 288 | 79.54 | 62.59 | 54.34 | 73.13 | 60.86 | 85.15 | 1545.14 | 68.47 | 63.23 | 34.80 | 99.51% |
SparseVLM | 50% | 288 | 78.48 | 59.90 | 53.08 | 74.02 | 59.48 | 71.30 | 1497.39 | 66.67 | 61.94 | 36.50 | 96.69% |
FasterVLM | 50% | 288 | 79.03 | 61.01 | 52.65 | 73.62 | 59.99 | 86.05 | 1530.41 | 67.70 | 62.80 | 36.80 | 99.18% |
FastV | 75% | 144 | 77.24 | 59.87 | 54.82 | 74.02 | 60.07 | 79.43 | 1493.51 | 67.27 | 62.63 | 33.20 | 97.16% |
SparseVLM | 75% | 144 | 76.06 | 57.97 | 53.13 | 73.67 | 57.94 | 68.61 | 1499.49 | 64.52 | 59.11 | 35.00 | 94.32% |
FasterVLM | 75% | 144 | 77.36 | 58.74 | 52.74 | 73.48 | 58.99 | 83.10 | 1467.00 | 67.10 | 62.54 | 36.30 | 97.43% |
FastV | 90% | 58 | 70.27 | 54.92 | 54.78 | 72.43 | 55.64 | 67.26 | 1359.69 | 63.83 | 59.71 | 29.40 | 90.26% |
SparseVLM | 90% | 58 | 68.27 | 54.43 | 50.45 | 70.35 | 52.56 | 62.63 | 1285.26 | 58.16 | 54.30 | 27.20 | 85.02% |
FasterVLM | 90% | 58 | 73.08 | 55.98 | 54.00 | 73.72 | 57.35 | 74.71 | 1370.77 | 65.21 | 61.08 | 33.90 | 93.68% |
FastV | 95% | 29 | 62.25 | 50.34 | 52.96 | 73.18 | 52.08 | 49.83 | 1165.70 | 56.44 | 51.29 | 24.00 | 80.53% |
FasterVLM | 95% | 29 | 67.85 | 52.62 | 53.11 | 72.83 | 54.82 | 65.90 | 1267.09 | 62.11 | 56.87 | 31.60 | 88.35% |
LLaVA-1.5-13B | 100% | 0 | 41.40 | 38.40 | 45.15 | 66.63 | 43.51 | 1.45 | 622.91 | 22.77 | 18.13 | 13.20 | 49.99% |
### Results with LLaVA-NeXT-7B

Method | Reduction Ratio | # Tokens | VQAv2 | GQA | VizWiz | SQA-IMG | TextVQA | POPE | MME | MMB | MMB-CN | MM-Vet | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-NeXT-7B | 0% | 2880 | 81.21 | 62.93 | 55.21 | 69.66 | 59.59 | 86.32 | 1513.78 | 67.70 | 58.85 | 42.60 | 100.00% |
FastV | 25% | 2160 | 81.12 | 62.50 | 55.05 | 69.31 | 59.69 | 86.27 | 1506.28 | 67.61 | 59.02 | 41.70 | 99.61% |
SparseVLM | 25% | 2160 | 81.14 | 62.55 | 55.20 | 68.47 | 60.26 | 73.15 | 1507.75 | 66.07 | 58.59 | 41.90 | 97.86% |
FasterVLM | 25% | 2160 | 81.18 | 62.81 | 56.12 | 70.70 | 59.65 | 86.26 | 1492.20 | 67.35 | 58.33 | 44.30 | 100.41% |
FastV | 50% | 1440 | 80.71 | 61.76 | 54.89 | 69.06 | 59.55 | 85.46 | 1490.34 | 67.35 | 58.51 | 41.20 | 98.91% |
SparseVLM | 50% | 1440 | 80.92 | 62.04 | 55.71 | 68.07 | 60.00 | 73.42 | 1484.92 | 65.72 | 58.85 | 39.90 | 97.14% |
FasterVLM | 50% | 1440 | 80.72 | 62.66 | 55.82 | 69.41 | 59.72 | 86.66 | 1521.87 | 67.87 | 58.85 | 44.40 | 100.53% |
FastV | 75% | 720 | 78.90 | 60.38 | 54.22 | 69.81 | 58.39 | 83.09 | 1477.31 | 65.64 | 57.04 | 41.10 | 97.37% |
SparseVLM | 75% | 720 | 78.86 | 60.88 | 55.55 | 67.48 | 58.08 | 70.99 | 1446.10 | 63.83 | 57.04 | 38.00 | 94.95% |
FasterVLM | 75% | 720 | 79.25 | 61.31 | 56.24 | 68.82 | 59.33 | 85.50 | 1480.68 | 67.53 | 59.19 | 40.40 | 98.73% |
FastV | 90% | 290 | 71.94 | 55.86 | 53.07 | 69.26 | 55.69 | 71.66 | 1282.86 | 61.60 | 51.89 | 33.70 | 89.24% |
SparseVLM | 90% | 290 | 71.62 | 56.12 | 53.16 | 68.62 | 51.97 | 63.23 | 1332.22 | 54.47 | 50.69 | 24.70 | 84.52% |
FasterVLM | 90% | 290 | 75.21 | 58.12 | 56.92 | 68.12 | 57.57 | 80.00 | 1370.11 | 63.32 | 54.47 | 35.70 | 93.55% |
FastV | 95% | 145 | 61.84 | 49.83 | 51.25 | 68.52 | 51.85 | 51.66 | 1079.46 | 54.90 | 45.36 | 21.90 | 77.43% |
FasterVLM | 95% | 145 | 70.63 | 54.73 | 56.27 | 68.86 | 55.97 | 72.89 | 1225.96 | 60.48 | 53.09 | 31.90 | 88.85% |
LLaVA-NeXT-7B | 100% | 0 | 40.59 | 37.93 | 46.01 | 64.01 | 37.57 | 23.40 | 601.93 | 21.05 | 17.70 | 13.10 | 50.73% |