Automated Smart Contract Summarization via LLMs

programming
prompt-engineering
MMTrans outperforms Gemini-Pro-Vision in generating smart contract code summaries from multimodal inputs.
Author

Yingjie Mao, Xiao Li, Zongwei Li, Wenkai Li

Published

February 7, 2024

Summary:

  • The study evaluates how well Gemini-Pro-Vision generates smart contract code summaries from multimodal inputs.
  • It compares Gemini-Pro-Vision against MMTrans and explores how to construct the most effective prompts for multimodal inputs.
  • Summary quality is measured with the widely used BLEU, METEOR, and ROUGE-L metrics.
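To make the metrics concrete, here is a minimal pure-Python sketch of ROUGE-L, which scores a generated summary against a reference via their longest common subsequence; the function names and toy sentences are illustrative, not taken from the study:

```python
def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    # token-level ROUGE-L F-score, as commonly defined
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(round(rouge_l("transfer tokens to the given address",
                    "transfers tokens to a given address"), 3))  # → 0.667
```

BLEU and METEOR are computed similarly from n-gram and alignment statistics; in practice all three are taken from standard metric libraries rather than reimplemented.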

Major Findings:

  1. Evaluation of Gemini-Pro-Vision: Gemini-Pro-Vision achieves scores of 21.17% and 21.05% for code comments generated with three-shot prompts, better than those generated with one-shot or five-shot prompts.
  2. Comparison with MMTrans: When the two are compared, MMTrans significantly outperforms Gemini-Pro-Vision on METEOR, BLEU, and ROUGE-L.
  3. Performance Metrics: The study reports Gemini-Pro-Vision's overall performance under one-shot, three-shot, and five-shot prompts against MMTrans, showing that scores vary with the number of in-context examples.
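The one-/three-/five-shot comparison above comes down to how many worked examples are prepended to the prompt. The sketch below shows that assembly in its simplest textual form; the template wording and the example pairs are hypothetical placeholders, not the study's actual prompt or its multimodal encoding:

```python
def build_few_shot_prompt(examples, target_code, n_shots=3):
    """Assemble an n-shot code-summarization prompt from (code, comment) pairs.

    The instruction text and examples here are illustrative only.
    """
    parts = ["Summarize the following smart contract function in one sentence."]
    for code, comment in examples[:n_shots]:
        parts.append(f"Code:\n{code}\nSummary: {comment}")
    # the target function is appended last, with the summary left blank
    parts.append(f"Code:\n{target_code}\nSummary:")
    return "\n\n".join(parts)

shots = [
    ("function totalSupply() public view returns (uint256)",
     "Returns the total token supply."),
    ("function balanceOf(address who) public view returns (uint256)",
     "Returns the token balance of the given address."),
    ("function transfer(address to, uint256 value) public returns (bool)",
     "Transfers tokens to the given address."),
]
prompt = build_few_shot_prompt(
    shots, "function approve(address spender, uint256 value) public returns (bool)")
print(prompt)
```

Varying `n_shots` between 1, 3, and 5 reproduces the experimental knob the study sweeps; the model then completes the final blank `Summary:` slot.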

Analysis and Critique:

  • Benefit: Gemini-Pro-Vision generates more concise code comments and exhibits stronger reasoning ability.
  • Limitation: The study notes the lack of a high-quality benchmark dataset and of suitable metrics for evaluating comments generated by LLMs such as Gemini-Pro-Vision.
  • Future Expectations: The study outlines opportunities and adjustments for using Gemini-Pro-Vision to generate code comments, emphasizing the need for further exploration and for investment in constructing a high-quality test dataset.

Overall, the study provides valuable insight into Gemini-Pro-Vision's performance at code summarization and highlights areas for future research and improvement, chiefly the lack of suitable evaluation metrics and of a high-quality benchmark dataset.

Appendix

Model gpt-3.5-turbo-1106
Date Generated 2024-02-26
Abstract https://arxiv.org/abs/2402.04863v1
HTML https://browse.arxiv.org/html/2402.04863v1
Truncated False
Word Count 5487