AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI, BenchCouncil Transactions on Benchmarks, Standards and Evaluations (TBench).
Fanda Fan, Chunjie Luo, Wanling Gao, Jianfeng Zhan
Introduction: AIGCBench
AIGCBench is a novel and comprehensive benchmark designed for evaluating the capabilities of state-of-the-art video generation algorithms. AIGCBench is divided into three modules: the evaluation dataset, the evaluation metrics, and the video generation models to be assessed. Our benchmark encompasses two types of datasets: video-text and image-text datasets. To construct a more comprehensive evaluation dataset, we expand the image-text dataset with our generation pipeline. For a thorough evaluation of video generation models, we introduce 11 evaluation metrics spanning four dimensions. These include both reference video-dependent and reference video-free metrics, making full use of the benchmark we propose. We also conducted a human study to validate the evaluation standards we propose. We present an overview of AIGCBench in Figure 1 and compare it with other I2V benchmarks in Table 1.
Key Features of AIGCBench:
- Diverse Datasets: AIGCBench includes a diverse range of datasets, featuring real-world video-text pairs and image-text pairs, to ensure comprehensive and realistic evaluations. Moreover, it features a newly created dataset, generated by an innovative text-to-image pipeline, further enhancing the benchmark's diversity and representativeness.
- Extensive Evaluation Metrics: AIGCBench introduces a set of evaluation metrics that cover four crucial dimensions of video generation: control-video alignment, motion effects, temporal consistency, and video quality. Our evaluation metrics encompass both reference video-based metrics and video-free metrics.
- Validated by Human Judgment: The benchmark's evaluation criteria are thoroughly verified against human preferences to confirm their reliability and alignment with human judgments.
- In-Depth Analysis: Through extensive evaluations, AIGCBench reveals insightful findings about the current strengths and limitations of existing I2V models, offering valuable guidance for future advancements in the field.
- Future Expansion: AIGCBench is not only comprehensive and scalable in its current form but also designed with the vision to encompass a wider range of video generation tasks in the future. This will allow for a unified and in-depth benchmarking of various aspects of AI-generated content (AIGC), setting a new standard for the evaluation of video generation technologies.
Benchmark | Open-Domain | Video-Text Pairs | Image-Text Pairs | Generated Dataset | #Samples | Metric Types | # Metrics |
---|---|---|---|---|---|---|---|
LFDM Eval [1] | ❌ | ✅ | ❌ | ❌ | - | Video-based | 3 |
CATER-GEN [2] | ❌ | ✅ | ✅ | ✅ | - | Video-based & Video-free | 7 |
Seer Eval [3] | ❌ | ✅ | ❌ | ❌ | - | Video-based | 2 |
VideoCrafter Eval [4] | ✅ | ✅ | ✅ | ❌ | - | - | - |
I2VGen-XL Eval [5] | ✅ | ✅ | ✅ | ❌ | - | - | - |
SVD Eval [6] | ✅ | ✅ | ✅ | ❌ | 900 | Video-based | 5 |
AnimateBench [7] | ✅ | ❌ | ❌ | ✅ | 105 | Video-free | 2 |
AIGCBench (Ours) | ✅ | ✅ | ✅ | ✅ | 3928 | Video-based & Video-free | 11 |
Dataset
Our dataset is available on Hugging Face.
The dataset is intended for evaluating video generation tasks and includes both image-text and video-text pairs. It comprises three parts:
Ours
- A custom set of image-text samples produced by our generation pipeline. We present a schematic diagram of the pipeline in Figure 2.
WebVid val
- A subset of 1,000 video-text samples from the WebVid validation set.
LAION-Aesthetics
- A subset of the LAION-Aesthetics dataset containing 925 image-text samples.
Below are the text prompts for some of the images we generated:
- Behold a battle-scarred cyborg in the throes of eagerly searching for hidden treasure surrounded by the neon-lit skyscrapers of a futuristic city, envisioned as a stunning 3D render.
- Within the realm of the front lines of an ancient battlefield, a mischievous fairy carefully repairing a broken robot, each moment immortalized in the impassioned strokes of Van Gogh.
- Amidst the ancient walls of a crumbling castle, a fearsome dragon is secretly whispering to animals, captured as vibrant pixel art.
- Discover a noble king, exploring a cave amidst the backdrop of an alien planet's red skies, artfully rendered with photorealistic precision.
- Discover a curious alien, hunting for ghosts amidst the tranquil waters of a serene lake, artfully rendered as a stunning 3D render.
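For intuition, the sketch below shows how image-text pairs of this kind could be produced with an off-the-shelf text-to-image model. It is only a minimal illustration and not our actual generation pipeline (see Figure 2 and the paper); the diffusers library and the Stable Diffusion checkpoint named below are assumptions.

```python
# Minimal sketch (NOT our exact pipeline): turn a list of prompts into image-text pairs
# with an off-the-shelf text-to-image model via the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint, not necessarily the one we used
    torch_dtype=torch.float16,
).to("cuda")

prompts = [
    "A fearsome dragon secretly whispering to animals, captured as vibrant pixel art.",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"{i:04d}.png")  # each (image, prompt) pair becomes one evaluation sample
```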
Evaluation Metrics
We assess the performance of different video generation models from four aspects:
- Control-video alignment: This measure evaluates how well the control signals provided by the user align with the generated video. Given that images and text prompts are the primary inputs for mainstream video generation tasks, we focus our evaluation on image fidelity and text-video alignment in this context.
- Motion effects: Motion effects assess whether the motion in the generated video is significant and the movements are realistic and appropriate.
- Temporal consistency: Temporal consistency examines whether the generated video frames are coherent and maintain continuity across the sequence.
- Video quality: Video quality gauges the overall quality of the generated video.
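As a concrete illustration of the first dimension, the snippet below sketches first-frame image fidelity: comparing the input image with the first frame of the generated video via MSE and SSIM. This is only a minimal sketch, not the implementation in eval.py; OpenCV and scikit-image are assumed for reading frames and computing SSIM.

```python
# Minimal sketch (not eval.py): first-frame fidelity between the input image and the
# first frame of a generated video, measured by MSE (lower is better) and SSIM (higher is better).
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def first_frame_fidelity(image_path: str, video_path: str):
    ref = cv2.imread(image_path)
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()  # first generated frame
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    # Resize the generated frame to the reference resolution before comparing.
    frame = cv2.resize(frame, (ref.shape[1], ref.shape[0]))
    mse = float(np.mean((ref.astype(np.float32) - frame.astype(np.float32)) ** 2))
    ssim_val = ssim(ref, frame, channel_axis=2)  # per-channel SSIM on HxWx3 uint8 images
    return mse, ssim_val
```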
The code is available at https://github.com/BenchCouncil/AIGCBench. The evaluation metrics used in our paper are encapsulated in eval.py; for more details, please refer to the paper. To use the code, first download the CLIP model file and replace 'path_to_dir' with the actual path. Below is a simple example:
```python
import glob
import os

from eval import compute_video_video_similarity  # metric implementation from this repo

ref_video_path = 'path_to_reference_video.mp4'  # placeholder: reference video to compare against

batch_video_path = os.path.join('path_to_videos', '*.mp4')
video_path_list = sorted(glob.glob(batch_video_path))

sum_res = 0
cnt = 0
for video_path in video_path_list:
    # CLIP-based similarity between the reference video and a generated video
    res = compute_video_video_similarity(ref_video_path, video_path)
    sum_res += res['clip']
    cnt += res['state']  # 'state' indicates a successful comparison
print(sum_res / cnt)  # average similarity over all evaluated videos
```
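For the temporal-consistency dimension, a minimal, self-contained sketch of adjacent-frame CLIP similarity is shown below. It is illustrative only (not the eval.py implementation); the open_clip library and the ViT-B-32 checkpoint are assumptions.

```python
# Minimal sketch (not eval.py): temporal consistency as the mean CLIP similarity
# between consecutive frames of a generated video.
import cv2
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()

def adjacent_frame_clip_similarity(video_path: str) -> float:
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            f = model.encode_image(preprocess(img).unsqueeze(0))
        feats.append(f / f.norm(dim=-1, keepdim=True))  # L2-normalized frame embedding
    cap.release()
    feats = torch.cat(feats)  # (num_frames, dim)
    # Mean cosine similarity between consecutive frames.
    return (feats[:-1] * feats[1:]).sum(dim=-1).mean().item()
```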
Results
We display the qualitative results of different I2V (image-to-video) algorithms in Table 2, and the quantitative results are shown in Table 3. An upward arrow indicates that higher values are better, while a downward arrow means lower values are preferable.
Table 2 (qualitative comparison): each row shows an input image alongside the videos generated by VideoCrafter [4], I2VGen-XL [5], SVD [6], Pika, and Gen2.
| Dimensions | Metrics | VideoCrafter [4] | I2VGen-XL [5] | SVD [6] | Pika | Gen2 |
| --- | --- | --- | --- | --- | --- | --- |
| Control-video Alignment | MSE (First) | 3929.65 | 4491.90 | 640.75 | 155.30 | 235.53 |
| | SSIM (First) | 0.300 | 0.354 | 0.612 | 0.800 | 0.803 |
| | Image-GenVideo Clip | 0.830 | 0.832 | 0.919 | 0.930 | 0.939 |
| | GenVideo-Text Clip | 0.23 | 0.24 | - | 0.271 | 0.270 |
| | GenVideo-RefVideo Clip (Keyframes) | 0.763 | 0.764 | - | 0.824 | 0.820 |
| Motion Effects | Flow-Square-Mean | 1.24 | 1.80 | 2.52 | 0.281 | 1.18 |
| | GenVideo-RefVideo Clip (Corresponding frames) | 0.764 | 0.764 | 0.796 | 0.823 | 0.818 |
| Temporal Consistency | GenVideo Clip (Adjacent frames) | 0.980 | 0.971 | 0.974 | 0.996 | 0.995 |
| | GenVideo-RefVideo Clip (Corresponding frames) | 0.764 | 0.764 | 0.796 | 0.823 | 0.818 |
| Video Quality | Frame Count | 16 | 32 | 25 | 72 | 96 |
| | DOVER | 0.518 | 0.510 | 0.623 | 0.715 | 0.775 |
| | GenVideo-RefVideo SSIM | 0.367 | 0.304 | 0.507 | 0.560 | 0.504 |
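The Flow-Square-Mean row above summarizes motion magnitude. Below is a minimal sketch of one way such a statistic could be computed, using OpenCV's Farneback optical flow; this is an assumption for illustration, not necessarily the exact implementation used in the paper.

```python
# Minimal sketch: average squared optical-flow magnitude over a video as a motion proxy.
# Uses OpenCV's Farneback optical flow; not necessarily the paper's exact implementation.
import cv2
import numpy as np

def flow_square_mean(video_path: str) -> float:
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise RuntimeError(f"Could not read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    vals = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        vals.append(np.mean(flow[..., 0] ** 2 + flow[..., 1] ** 2))  # squared flow magnitude
        prev_gray = gray
    cap.release()
    return float(np.mean(vals))
```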
To validate that our proposed evaluation standards align with human preferences, we conducted a user study. We randomly selected 30 generated results from each of the five methods and asked participants to vote for the best algorithm's output along four dimensions: Image Fidelity, Motion Effects, Temporal Consistency, and Video Quality. A total of 42 individuals participated in the voting process. The detailed voting results are reported in the paper.
Contact us
If you have any questions, please feel free to contact us via email at fanfanda@ict.ac.cn and jianfengzhan.benchcouncil@gmail.com.
Citation
```bibtex
@article{fan2024aigcbench,
  title     = {AIGCBench: Comprehensive evaluation of image-to-video content generated by AI},
  author    = {Fan, Fanda and Luo, Chunjie and Gao, Wanling and Zhan, Jianfeng},
  journal   = {BenchCouncil Transactions on Benchmarks, Standards and Evaluations},
  pages     = {100152},
  year      = {2024},
  publisher = {Elsevier}
}
```
References
- Ni, H., Shi, C., Li, K., Huang, S.X., Min, M.R., 2023. Conditional image-to-video generation with latent flow diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18444-18455.
- Hu, Y., Luo, C., Chen, Z., 2023. A benchmark for controllable text-image-to-video generation. IEEE Transactions on Multimedia.
- Gu, X., Wen, C., Song, J., Gao, Y., 2023. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897.
- Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al., 2023. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512.
- Zhang, S., Wang, J., Zhang, Y., Zhao, K., Yuan, H., Qin, Z., Wang, X., Zhao, D., Zhou, J., 2023. I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145.
- Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al., 2023. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.
- Zhang, Y., Xing, Z., Zeng, Y., Fang, Y., Chen, K., 2023. PIA: Your personalized image animator via plug-and-play modules in text-to-image models. arXiv preprint arXiv:2312.13964.