Large Multi-modal Model for Video Captioning

dc.contributor.advisor: Hwang, Jenq-Neng
dc.contributor.author: Chai, Wenhao
dc.date.accessioned: 2025-05-12T22:47:48Z
dc.date.available: 2025-05-12T22:47:48Z
dc.date.issued: 2025-05-12
dc.date.submitted: 2025
dc.description: Thesis (Master's)--University of Washington, 2025
dc.description.abstract: Video detailed captioning is a key task that aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design, without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we apply a token merging strategy that reduces the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). However, existing video caption benchmarks include only simple descriptions of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose VDCscore, a new LLM-assisted metric for better evaluation, which adopts a divide-and-conquer strategy to transform long caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark correlates better with human judgments of video detailed captioning quality.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Chai_washington_0250O_27865.pdf
dc.identifier.uri: https://hdl.handle.net/1773/52981
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: benchmark
dc.subject: large language model
dc.subject: large multi-modal model
dc.subject: video captioning
dc.subject: video understanding
dc.subject: Computer science
dc.subject.other: Electrical and computer engineering
dc.title: Large Multi-modal Model for Video Captioning
dc.type: Thesis
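
The abstract mentions a token merging strategy that reduces the number of visual tokens, but this record does not describe how it works. As a rough illustration only, the sketch below shows one common form of the idea: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second, and the `r` most similar pairs are averaged together. The names `cosine`, `merge_tokens`, and `r` are hypothetical, and this simplified version is an assumption, not the thesis's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_tokens(tokens, r):
    """Return a shorter token list by averaging up to r similar token pairs.

    Hypothetical helper: a simplified bipartite-matching merge, assumed here
    for illustration rather than taken from the thesis.
    """
    # Alternate tokens into set A (even positions) and set B (odd positions).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]

    # For each A token, record its most similar B token.
    best = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        best.append((cosine(tokens[i], tokens[j]), i, j))

    # Keep only the r most similar (A, B) pairs for merging.
    best.sort(key=lambda pair: pair[0], reverse=True)
    to_merge = best[:r]
    merged_from = {i for _, i, _ in to_merge}

    # Accumulate sums and counts so each B token becomes the mean of its group.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for _, i, j in to_merge:
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1

    # Rebuild the sequence in its original order, dropping merged A tokens.
    out = []
    for k in range(len(tokens)):
        if k in merged_from:
            continue
        if k in sums:
            out.append([s / counts[k] for s in sums[k]])
        else:
            out.append(list(tokens[k]))
    return out


if __name__ == "__main__":
    frame_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(len(merge_tokens(frame_tokens, r=1)))
```

Each call removes at most `r` tokens, so applying a step like this to a long sequence of visual tokens shortens the input to the language model. How much merging is tolerable before caption quality drops is an empirical question that the thesis itself addresses.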

Files

Original bundle

Name: Chai_washington_0250O_27865.pdf
Size: 15.65 MB
Format: Adobe Portable Document Format