AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI, BenchCouncil Transactions on Benchmarks, Standards and Evaluations (TBench).
Fanda Fan, Chunjie Luo, Wanling Gao, Jianfeng Zhan
Introduction: AIGCBench
AIGCBench is a novel and comprehensive benchmark designed for evaluating the capabilities of state-of-the-art video generation algorithms. AIGCBench is divided into three modules: the evaluation dataset, the evaluation metrics, and the video generation models to be assessed. Our benchmark encompasses two types of datasets: video-text and image-text datasets. To construct a more comprehensive evaluation dataset, we expand the image-text dataset with our generation pipeline. For a thorough evaluation of video generation models, we introduce 11 evaluation metrics spanning four dimensions. These include both reference video-dependent and reference video-free metrics, making full use of the benchmark we propose. We also conducted a human study to validate the evaluation standards we propose. We present an overview of AIGCBench in Figure 1 and compare it with other I2V benchmarks in Table 1.
Key Features of AIGCBench:
- Diverse Datasets: AIGCBench includes a diverse range of datasets, featuring real-world video-text pairs and image-text pairs, to ensure comprehensive and realistic evaluations. Moreover, it features a newly created dataset, generated by an innovative text-to-image pipeline, further enhancing the benchmark's diversity and representativeness.
- Extensive Evaluation Metrics: AIGCBench introduces a set of evaluation metrics that cover four crucial dimensions of video generation: control-video alignment, motion effects, temporal consistency, and video quality. Our evaluation metrics encompass both reference video-based metrics and video-free metrics.
- Validated by Human Judgment: The benchmark's evaluation criteria are thoroughly verified against human preferences to confirm their reliability and alignment with human judgments.
- In-Depth Analysis: Through extensive evaluations, AIGCBench reveals insightful findings about the current strengths and limitations of existing I2V models, offering valuable guidance for future advancements in the field.
- Future Expansion: AIGCBench is not only comprehensive and scalable in its current form but also designed with the vision to encompass a wider range of video generation tasks in the future. This will allow for a unified and in-depth benchmarking of various aspects of AI-generated content (AIGC), setting a new standard for the evaluation of video generation technologies.
Benchmark | Open-Domain | Video-Text Pairs | Image-Text Pairs | Generated Dataset | #Samples | Metric Types | # Metrics |
---|---|---|---|---|---|---|---|
LFDM Eval [1] | ❌ | ✅ | ❌ | ❌ | - | Video-based | 3 |
CATER-GEN [2] | ❌ | ✅ | ✅ | ✅ | - | Video-based & Video-free | 7 |
Seer Eval [3] | ❌ | ✅ | ❌ | ❌ | - | Video-based | 2 |
VideoCrafter Eval [4] | ✅ | ✅ | ✅ | ❌ | - | - | - |
I2VGen-XL Eval [5] | ✅ | ✅ | ✅ | ❌ | - | - | - |
SVD Eval [6] | ✅ | ✅ | ✅ | ❌ | 900 | Video-based | 5 |
AnimateBench [7] | ✅ | ❌ | ❌ | ✅ | 105 | Video-free | 2 |
AIGCBench (Ours) | ✅ | ✅ | ✅ | ✅ | 3928 | Video-based & Video-free | 11 |
Dataset
Our dataset is available on Hugging Face.
The dataset is intended for evaluating video generation tasks and includes both image-text and video-text pairs. It comprises three parts:
Ours
- A custom set of image-text samples produced by our generation pipeline. We present a schematic diagram of the pipeline in Figure 2.
WebVid val
- A subset of 1,000 video-text samples from the WebVid validation set.
LAION-Aesthetics
- A subset of the LAION-Aesthetics dataset containing 925 image-text samples.
Below are the text prompts for some of the images we generated:
- Behold a battle-scarred cyborg in the throes of eagerly searching for hidden treasure surrounded by the neon-lit skyscrapers of a futuristic city, envisioned as a stunning 3D render.
- Within the realm of the front lines of an ancient battlefield, a mischievous fairy carefully repairing a broken robot, each moment immortalized in the impassioned strokes of Van Gogh.
- Amidst the ancient walls of a crumbling castle, a fearsome dragon is secretly whispering to animals, captured as vibrant pixel art.
- Discover a noble king, exploring a cave amidst the backdrop of an alien planet's red skies, artfully rendered with photorealistic precision.
- Discover a curious alien, hunting for ghosts amidst the tranquil waters of a serene lake, artfully rendered as a stunning 3D render.
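For intuition, the sketch below shows how image-text pairs of this kind could be produced with an off-the-shelf text-to-image model. It is only a minimal illustration and not our actual generation pipeline (see Figure 2 and the paper); the diffusers library and the Stable Diffusion checkpoint named below are assumptions.

```python
# Minimal sketch (NOT our exact pipeline): turn a list of prompts into image-text pairs
# with an off-the-shelf text-to-image model via the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint, not necessarily the one we used
    torch_dtype=torch.float16,
).to("cuda")

prompts = [
    "A fearsome dragon secretly whispering to animals, captured as vibrant pixel art.",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"{i:04d}.png")  # each (image, prompt) pair becomes one evaluation sample
```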
Evaluation Metrics
We assess the performance of different video generation models from four aspects:
- Control-video alignment: This measure evaluates how well the control signals provided by the user align with the generated video. Given that images and text prompts are the primary inputs for mainstream video generation tasks, we focus our evaluation on image fidelity and text-video alignment in this context.
- Motion effects: Motion effects assess whether the motion in the generated video is significant and the movements are realistic and appropriate.
- Temporal consistency: Temporal consistency examines whether the generated video frames are coherent and maintain continuity across the sequence.
- Video quality: Video quality gauges the overall quality of the generated video.
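As a concrete illustration of the first dimension, the snippet below sketches first-frame image fidelity: comparing the input image with the first frame of the generated video via MSE and SSIM. This is only a minimal sketch, not the implementation in eval.py; OpenCV and scikit-image are assumed for reading frames and computing SSIM.

```python
# Minimal sketch (not eval.py): first-frame fidelity between the input image and the
# first frame of a generated video, measured by MSE (lower is better) and SSIM (higher is better).
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def first_frame_fidelity(image_path: str, video_path: str):
    ref = cv2.imread(image_path)
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()  # first generated frame
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    # Resize the generated frame to the reference resolution before comparing.
    frame = cv2.resize(frame, (ref.shape[1], ref.shape[0]))
    mse = float(np.mean((ref.astype(np.float32) - frame.astype(np.float32)) ** 2))
    ssim_val = ssim(ref, frame, channel_axis=2)  # per-channel SSIM on HxWx3 uint8 images
    return mse, ssim_val
```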
The code is available at https://github.com/BenchCouncil/AIGCBench. The evaluation metrics used in our paper are encapsulated in eval.py; for more details, please refer to the paper. To use the code, first download the CLIP model file and replace 'path_to_dir' with the actual path. Below is a simple example:
```python
import glob
import os

from eval import compute_video_video_similarity  # metric implementation from this repo

ref_video_path = 'path_to_reference_video.mp4'  # placeholder: reference video to compare against

batch_video_path = os.path.join('path_to_videos', '*.mp4')
video_path_list = sorted(glob.glob(batch_video_path))

sum_res = 0
cnt = 0
for video_path in video_path_list:
    # CLIP-based similarity between the reference video and a generated video
    res = compute_video_video_similarity(ref_video_path, video_path)
    sum_res += res['clip']
    cnt += res['state']  # 'state' indicates a successful comparison
print(sum_res / cnt)  # average similarity over all evaluated videos
```
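For the temporal-consistency dimension, a minimal, self-contained sketch of adjacent-frame CLIP similarity is shown below. It is illustrative only (not the eval.py implementation); the open_clip library and the ViT-B-32 checkpoint are assumptions.

```python
# Minimal sketch (not eval.py): temporal consistency as the mean CLIP similarity
# between consecutive frames of a generated video.
import cv2
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()

def adjacent_frame_clip_similarity(video_path: str) -> float:
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            f = model.encode_image(preprocess(img).unsqueeze(0))
        feats.append(f / f.norm(dim=-1, keepdim=True))  # L2-normalized frame embedding
    cap.release()
    feats = torch.cat(feats)  # (num_frames, dim)
    # Mean cosine similarity between consecutive frames.
    return (feats[:-1] * feats[1:]).sum(dim=-1).mean().item()
```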
Results
We display the qualitative results of different I2V (image-to-video) algorithms in Table 2, and the quantitative results are shown in Table 3. An upward arrow indicates that higher values are better, while a downward arrow means lower values are preferable.
Table 2 (qualitative comparison): each row shows an input image alongside the videos generated by VideoCrafter [4], I2VGen-XL [5], SVD [6], Pika, and Gen2.
| Dimensions | Metrics | VideoCrafter [4] | I2VGen-XL [5] | SVD [6] | Pika | Gen2 |
| --- | --- | --- | --- | --- | --- | --- |
| Control-video Alignment | MSE (First) | 3929.65 | 4491.90 | 640.75 | 155.30 | 235.53 |
| | SSIM (First) | 0.300 | 0.354 | 0.612 | 0.800 | 0.803 |
| | Image-GenVideo Clip | 0.830 | 0.832 | 0.919 | 0.930 | 0.939 |
| | GenVideo-Text Clip | 0.23 | 0.24 | - | 0.271 | 0.270 |
| | GenVideo-RefVideo Clip (Keyframes) | 0.763 | 0.764 | - | 0.824 | 0.820 |
| Motion Effects | Flow-Square-Mean | 1.24 | 1.80 | 2.52 | 0.281 | 1.18 |
| | GenVideo-RefVideo Clip (Corresponding frames) | 0.764 | 0.764 | 0.796 | 0.823 | 0.818 |
| Temporal Consistency | GenVideo Clip (Adjacent frames) | 0.980 | 0.971 | 0.974 | 0.996 | 0.995 |
| | GenVideo-RefVideo Clip (Corresponding frames) | 0.764 | 0.764 | 0.796 | 0.823 | 0.818 |
| Video Quality | Frame Count | 16 | 32 | 25 | 72 | 96 |
| | DOVER | 0.518 | 0.510 | 0.623 | 0.715 | 0.775 |
| | GenVideo-RefVideo SSIM | 0.367 | 0.304 | 0.507 | 0.560 | 0.504 |
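The Flow-Square-Mean row above summarizes motion magnitude. Below is a minimal sketch of one way such a statistic could be computed, using OpenCV's Farneback optical flow; this is an assumption for illustration, not necessarily the exact implementation used in the paper.

```python
# Minimal sketch: average squared optical-flow magnitude over a video as a motion proxy.
# Uses OpenCV's Farneback optical flow; not necessarily the paper's exact implementation.
import cv2
import numpy as np

def flow_square_mean(video_path: str) -> float:
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise RuntimeError(f"Could not read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    vals = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        vals.append(np.mean(flow[..., 0] ** 2 + flow[..., 1] ** 2))  # squared flow magnitude
        prev_gray = gray
    cap.release()
    return float(np.mean(vals))
```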
To validate that our proposed evaluation standards align with human preferences, we conducted a user study. We randomly selected 30 generated results from each of the five methods and asked participants to vote for the best algorithm's output along four dimensions: Image Fidelity, Motion Effects, Temporal Consistency, and Video Quality. A total of 42 individuals participated in the voting process. The detailed voting results are reported in the paper.
Contact us
If you have any questions, please feel free to contact us via email at fanfanda@ict.ac.cn and jianfengzhan.benchcouncil@gmail.com.
Citation
```bibtex
@article{fan2024aigcbench,
  title     = {AIGCBench: Comprehensive evaluation of image-to-video content generated by AI},
  author    = {Fan, Fanda and Luo, Chunjie and Gao, Wanling and Zhan, Jianfeng},
  journal   = {BenchCouncil Transactions on Benchmarks, Standards and Evaluations},
  pages     = {100152},
  year      = {2024},
  publisher = {Elsevier}
}
```
References
- Ni, H., Shi, C., Li, K., Huang, S.X., Min, M.R., 2023. Conditional image-to-video generation with latent flow diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18444-18455.
- Hu, Y., Luo, C., Chen, Z., 2023. A benchmark for controllable text-image-to-video generation. IEEE Transactions on Multimedia.
- Gu, X., Wen, C., Song, J., Gao, Y., 2023. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897.
- Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al., 2023. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512.
- Zhang, S., Wang, J., Zhang, Y., Zhao, K., Yuan, H., Qin, Z., Wang, X., Zhao, D., Zhou, J., 2023. I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145.
- Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al., 2023. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.
- Zhang, Y., Xing, Z., Zeng, Y., Fang, Y., Chen, K., 2023. PIA: Your personalized image animator via plug-and-play modules in text-to-image models. arXiv preprint arXiv:2312.13964.