Large Multi-modal Model for Video Captioning

Chai, Wenhao

dc.contributor.advisor	Hwang, Jenq-Neng
dc.contributor.author	Chai, Wenhao
dc.date.accessioned	2025-05-12T22:47:48Z
dc.date.available	2025-05-12T22:47:48Z
dc.date.issued	2025-05-12
dc.date.submitted	2025
dc.description	Thesis (Master's)--University of Washington, 2025
dc.description.abstract	Video detailed captioning is a key task that aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design, without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement a token merging strategy that reduces the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include simple descriptions consisting of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose a new LLM-assisted metric, VDCscore, for better evaluation, which adopts a divide-and-conquer strategy to transform long caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark better correlates with human judgments of video detailed captioning quality.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Chai_washington_0250O_27865.pdf
dc.identifier.uri	https://hdl.handle.net/1773/52981
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	benchmark
dc.subject	large language model
dc.subject	large multi-modal model
dc.subject	video captioning
dc.subject	video understanding
dc.subject	Computer science
dc.subject.other	Electrical and computer engineering
dc.title	Large Multi-modal Model for Video Captioning
dc.type	Thesis

Files

Original bundle

Name: Chai_washington_0250O_27865.pdf
Size: 15.65 MB
Format: Adobe Portable Document Format

Collections

Electrical and computer engineering