Automated Smart Contract Summarization via LLMs

programming
prompt-engineering
MMTrans outperforms Gemini-Pro-Vision in generating smart contract code summaries from multimodal inputs.
Author

Yingjie Mao, Xiao Li, Zongwei Li, Wenkai Li

Published

February 7, 2024

Summary:

  • The study evaluates how well Gemini-Pro-Vision generates smart contract code summaries from multimodal inputs.
  • It compares Gemini-Pro-Vision against MMTrans and explores how to construct the most effective prompts for multimodal inputs.
  • Summary quality is measured with the widely used BLEU, METEOR, and ROUGE-L metrics.
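To make the metrics concrete, here is a minimal pure-Python sketch of ROUGE-L, which scores a generated summary against a reference via their longest common subsequence; the function names and toy sentences are illustrative, not taken from the study:

```python
def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    # token-level ROUGE-L F-score, as commonly defined
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(round(rouge_l("transfer tokens to the given address",
                    "transfers tokens to a given address"), 3))  # → 0.667
```

BLEU and METEOR are computed similarly from n-gram and alignment statistics; in practice all three are taken from standard metric libraries rather than reimplemented.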

Major Findings:

  1. Evaluation of Gemini-Pro-Vision: Gemini-Pro-Vision achieves scores of 21.17% and 21.05% for code comments generated with three-shot prompts, better than those generated with one-shot or five-shot prompts.
  2. Comparison with MMTrans: When the two are compared, MMTrans significantly outperforms Gemini-Pro-Vision on METEOR, BLEU, and ROUGE-L.
  3. Performance Metrics: The study reports Gemini-Pro-Vision's overall performance under one-shot, three-shot, and five-shot prompts against MMTrans, showing that scores vary with the number of in-context examples.
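The one-/three-/five-shot comparison above comes down to how many worked examples are prepended to the prompt. The sketch below shows that assembly in its simplest textual form; the template wording and the example pairs are hypothetical placeholders, not the study's actual prompt or its multimodal encoding:

```python
def build_few_shot_prompt(examples, target_code, n_shots=3):
    """Assemble an n-shot code-summarization prompt from (code, comment) pairs.

    The instruction text and examples here are illustrative only.
    """
    parts = ["Summarize the following smart contract function in one sentence."]
    for code, comment in examples[:n_shots]:
        parts.append(f"Code:\n{code}\nSummary: {comment}")
    # the target function is appended last, with the summary left blank
    parts.append(f"Code:\n{target_code}\nSummary:")
    return "\n\n".join(parts)

shots = [
    ("function totalSupply() public view returns (uint256)",
     "Returns the total token supply."),
    ("function balanceOf(address who) public view returns (uint256)",
     "Returns the token balance of the given address."),
    ("function transfer(address to, uint256 value) public returns (bool)",
     "Transfers tokens to the given address."),
]
prompt = build_few_shot_prompt(
    shots, "function approve(address spender, uint256 value) public returns (bool)")
print(prompt)
```

Varying `n_shots` between 1, 3, and 5 reproduces the experimental knob the study sweeps; the model then completes the final blank `Summary:` slot.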

Analysis and Critique:

  • Benefit: Gemini-Pro-Vision generates more concise code comments and exhibits stronger reasoning ability.
  • Limitation: The study notes the lack of a high-quality benchmark dataset and of suitable metrics for evaluating comments generated by LLMs such as Gemini-Pro-Vision.
  • Future Expectations: The study outlines opportunities and adjustments for using Gemini-Pro-Vision to generate code comments, emphasizing the need for further exploration and for investment in constructing a high-quality test dataset.

Overall, the study provides valuable insight into Gemini-Pro-Vision's performance at code summarization and highlights areas for future research and improvement, chiefly the lack of suitable evaluation metrics and of a high-quality benchmark dataset.

Appendix

Model gpt-3.5-turbo-1106
Date Generated 2024-02-26
Abstract https://arxiv.org/abs/2402.04863v1
HTML https://browse.arxiv.org/html/2402.04863v1
Truncated False
Word Count 5487