MICap: A Unified Model for Identity-aware Movie Descriptions (2024)

Haran Raajesh∗1, Naveen Reddy Desanur∗1, Zeeshan Khan2, Makarand Tapaswi1
1CVIT, IIIT Hyderabad, India
2Inria Paris and Département d’informatique de l’ENS, CNRS, PSL Research University
https://katha-ai.github.io/projects/micap/
∗ denotes equal contribution

Abstract

Characters are an important aspect of any storyline, and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on the Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.

[Figure 1: Identity-aware movie description. (left) A videoset whose captions refer to characters with local person ids; (middle) the two-stage approach of prior work: caption with someone, then fill in identities; (right) MICap, our single-stage model that switches between both tasks.]

1 Introduction

Building computer vision models that understand the story of a movie is a long-standing challenge. A step towards this is movie description [38, 30, 37]. Given a short clip of 2-5 seconds, models are required to generate a caption that describes the visual scene. Captions in the Large Scale Movie Description Challenge (LSMDC) [38], a combination of [37, 30], are obtained from audio descriptions (AD) that are used to convey the (visual) story to a visually impaired audience. The original version of the LSMDC challenge suggests captioning a single clip and anonymizes all character names with someone.

While using the someone tag to describe a character's activity in a single video is acceptable, the lack of identity continuity across a videoset (a group of $N$ consecutive videos) hampers understanding. To remedy this, Pini et al. [31] extend MVAD [30] as MVAD Names, where character names are predicted by linking to the appropriate face detection/track; and Park et al. [29] propose a fill-in-the-blanks (FITB) task to replace someone tags with local cluster identities (e.g. P1, P2, ...) in a videoset (Fig. 1 left).

The latter approach [29] provides two advantages: (i) it does not require time-consuming ground-truth annotations linking faces and blanks [31]; and (ii) using local cluster ids helps convey the story (note that cluster ids can easily be mapped to gender- and culture-appropriate names instead of P1, P2, ... for storytelling) without the need for models with world knowledge (CLIP [33], GPT [32], etc.) or an IMDb castlist with photographs [14], making the approach applicable to indie films or home-edited videos.

To generate id-aware captions, [29] proposes a two-stage approach shown in Fig. 1 (middle). The first stage [28] ingests a videoset and generates a captionset (a set of $N$ captions, one for each video) using the someone tags, while the second stage replaces someone with appropriate local person id labels. While the two-stage setting unites the two worlds of video description and character identification, it is not ideal, as errors in captioning may adversely affect FITB since both methods are modeled independently. In this work, we propose a single-stage approach (Fig. 1 right) that can seamlessly switch between both tasks.

Challenges with Fill-In.

For the FITB task, [29] encodes blanks in the ground-truth (GT) captionset using bidirectional context through a BERT encoder. These blanks attend to the face features clustered within a single video, not accounting for other faces coming from the rest of the videoset.

Using the blank representations, the person ids are predicted in an auto-regressive manner.

We note some disadvantages of this approach: (i) Faces are clustered within each video, which means identity information across videos is not directly observed by the model. (ii) When a character is mentioned in the caption, their face need not be present in the clip (e.g. Fig. 1 left, C4 and C5 mention P1, whose face is turned away and not visible). (iii) BERT-based blank embeddings provided at the encoder are unable to capture face information properly, resulting in a model that largely relies on text embeddings to solve FITB (e.g., in [29], FITB accuracy improves by only 1.5% (64.4 to 65.9) with visual inputs).

Proposed model benefits.

We overcome these problems using a new paradigm for id-aware multi-video description through a single-step sequence-to-sequence model. We unify the two tasks of FITB and caption generation by auto-regressively unrolling the descriptions along with their local character ids via a Transformer-based encoder-decoder model. Our model, dubbed the Movie-Identity Captioner (MICap), enables joint training and independent evaluation for both tasks: (i) given only the videoset, our model generates an id-aware captionset; and (ii) when a captionset with someone tags exists, our model fills in local identities.

To overcome text-only shortcuts, we propose auto-regressive decoding of the full caption even for FITB and show that our multimodal model outperforms a text-only model significantly. We teacher force the ground-truth caption containing the blanks (person ids), and predict one token at a time using causal masking. Note, learning happens only at select tokens where person id labels are predicted. This way the model (decoder) learns to sequentially use the GT (teacher-forced) caption for the FITB task with uni-directional (causal) attention. During inference, we switch between the two tasks by deciding whether the decoder is teacher forced with a given captionset or not.

Identity-aware evaluation.

Existing captioning metrics like CIDEr [50] and BLEU [27] do not account for identity-sensitive descriptions. For example, “P1 is walking towards P2” and “P2 is walking towards P1” will result in high n-gram based scores due to the common middle words. We propose a new identity-aware caption evaluation metric, iSPICE. Specifically, we are motivated by SPICE's [1] ability to parse a caption into a scene graph and match a predicted caption with the ground truth based on similarity across generated tuples. To compute iSPICE, we intervene in this process and remove tuples not associated with a person label before computing the F1 scores.

Contributions.

In summary, (i) we propose a new paradigm for identity-aware multi-sentence movie description using a single-stage approach that unifies FITB with full caption generation. (ii) We formulate this task as auto-regressive sequence-to-sequence generation that is able to describe the video and use local person id labels across a videoset (multiple videos). We show that joint training improves knowledge sharing and boosts performance. (iii) We enable seamless task switching, allowing independent evaluation of (a) caption generation with identities, and (b) filling in identity labels given a caption. (iv) We propose a new identity-aware captioning metric, iSPICE, that extends SPICE, and show its sensitivity to identities while evaluating captions. (v) Finally, MICap improves over the state of the art for FITB by 4.2% and identity-aware captioning by 1.4% CIDEr and 1.8% METEOR.

2 Related Work

We address related work from three areas: (i) video captioning at large, (ii) identity-aware captioning, and (iii) metrics used for evaluating captions.

Video captioning

has gained a lot of attention since the advent of deep learning. The typical task is to generate a single-sentence description for a trimmed video, and it is formulated as a sequence-to-sequence problem [12, 51, 52, 58, 22, 23, 42]. A more challenging setup is multi-sentence generation, typically applied to longer videos, which requires long-term temporal consistency [28, 36, 45, 57]. Video situation recognition (VidSitu) [39, 17] presents a structured alternative where multiple captions are generated per event based on the semantic role labeling framework.

Different from multi-sentence captioning, dense video captioning requires temporally localizing and generating captions for every event in an untrimmed video [18, 55, 62, 56]. While most approaches for dense video captioning use a two-stage approach, i.e. temporal localization with event proposals followed by event captioning [18, 53, 54], recent methods jointly model the two tasks for better temporal consistency [5, 7, 8, 20, 25, 35, 43, 44, 53, 55, 62]. The state-of-the-art, PDVC [55], learns DETR-style event queries and performs localization and captioning over each query using two separate heads. Recently, Vid2Seq [56] proposed to further unify the two tasks by using a single sequence-to-sequence model, generating both the localization and the captions with a single auto-regressive Transformer decoder. Similar to the above ideas, we unify the two seemingly different tasks of character identification and description by formulating them as an auto-regressive sequence generation task.

Id-aware captioning datasets.

None of the above works focus on person identity while generating captions. VidSitu [39], perhaps the closest, contains references to people by descriptions such as man in a black jacket. This is an issue when the domain is movie description [38, 30], where identities are anonymized to someone, which hinders building practical applications like Audio Descriptions [13] for visually impaired users. While [31] links character names in descriptions with face tracks, it requires significant annotation effort that is not scalable. A more recent Movie Audio Description dataset, MAD [46], is a popular source of movie descriptions, but it uses real names that require models with world knowledge. Different from the above, Park et al. [29] propose identity-aware captioning as a fill-in-the-blanks task where they assign local person ids (cluster ids) to characters appearing in 5 consecutive video clips. We adopt this setting for our work.

Id-aware captioning methods.

Identity-aware captioning is a challenging task that has recently started to attract attention. Among the first works, [29] proposes a two-stage pipeline of first captioning with identities anonymized as someone using a multi-sentence captioning model [28], followed by learning an identity prediction FITB model that fills in the someone tags with local person identities. However, as discussed in the introduction (Challenges with Fill-In), this two-stage approach suffers from several disadvantages. Different from [29], we propose a single-stage sequence-to-sequence model that outperforms the two-stage approach. In this area, another work [60] requires a ground-truth mapping between person identities (blanks) in the description and face tracks in the videos; however, this approach is not scalable. Very recently, AutoAD-II [14] proposed to generate movie descriptions with proper names on the MAD [46] dataset. While innovative, this approach requires additional IMDb castlist information with photographs. While modeling proper names directly is useful, tagging names to unique person ids in a local videoset is possible and is the motivation for works on person clustering [3, 48] as opposed to person identification [47, 26].

Caption evaluation metrics

are typically based on n-gram matching, with a few differences. CIDEr [50], BLEU [27], and METEOR [11] all evaluate n-gram similarities between one or more candidate references and the generated caption. Recently, Large Language Models (LLMs) have been used for reference-based caption evaluation (e.g. BERTScore [61], CLAIR [6]), or Large Vision-Language Models (VLMs) for reference-free caption evaluation (e.g. CLIPScore [15]). However, model-based metrics may be difficult to interpret and also require the model to be sensitive to identities. Different from both directions, SPICE [1] evaluates captions by first transforming them into a scene graph and analyzing the presence of shared tuples between the predicted and ground-truth (reference) captions. However, none of these metrics reliably evaluate identity-aware captions, as a robust metric should be sensitive to identity manipulations (swap/add/remove). We propose a new metric, iSPICE, that focuses primarily on person-identity specific semantics.

3 Method

[Figure 2: MICap architecture. Feature extractors and a Transformer encoder build the captioning memory (left); a Transformer decoder switches between FITB and full captionset generation (right).]

We present a single-stage sequence-to-sequence approach for identity-aware fill-in-the-blanks (FITB). Later, we show that this architecture can be easily re-purposed for generating video descriptions.

Notation.

Before we start, we define some notation. For the rest of this section, we operate with a videoset $\mathcal{N}$ consisting of $N$ video clips $V_i$ and a corresponding captionset $\mathcal{C} = \{C_i\}_{i=1}^{N}$, where $C_i$ describes video $V_i$. As both sets come from consecutive videos, it is very likely that the same characters appear across them. As an example, consider the videoset frames and captionset shown in Fig. 1.

3.1 Auto-regressive FITB

In FITB, we replace each person id (P1, P2, ...) with a blank. We denote by $\hat{\mathcal{C}}$ the captionset with blanks $\mathcal{B}$. Formally, we define the captionset as a sequence of $L$ words $[w_j]_{j=1}^{L}$, some of which have been converted to blanks $\{b_k\}_{k=1}^{|\mathcal{B}|}$. The goal of our model is to fill each blank with the correct person-id label from the set $\mathcal{P} = \{P_l\}_{l=1}^{|\mathcal{P}|}$. Note, the person-id labels are reusable across videosets, i.e. a character only needs to be referred to consistently by the same identity within a videoset.

We present the Movie-Identity Captioner (MICap), an auto-regressive Transformer encoder-decoder model for filling person blanks. MICap consists of two parts: (i) feature extractors and a Transformer encoder that build the captioning memory (Fig. 2 left); and (ii) a Transformer decoder that switches between FITB and full captionset generation (Fig. 2 right). For clarity, we highlight differences to prior work [29] throughout this section.

3.1.1 Creating the Captioning Memory

Visual feature extraction.

We extract 3 features from the videoset to capture semantic, action, and face information.

Semantic embeddings are captured using CLIP [33]. From each video $V_i$, we sub-sample frames $f_{it}$ at 5 fps and encode them with the CLIP image encoder. For efficient batching, we truncate or pad to $T{=}50$ frames per video, and stack them to create semantic features $\mathbf{F}^{\text{s}} \in \mathbb{R}^{NT \times d^{\text{s}}}$.

Action embeddings are captured using I3D [4]. Similar to [29], each video is divided into $S{=}5$ segments, and features within each segment are mean pooled. We stack features across the videoset to obtain $\mathbf{F}^{\text{a}} \in \mathbb{R}^{NS \times d^{\text{a}}}$.

Faces are detected using RetinaFace [10] and represented using ArcFace [9]. Across the videoset, we collect a maximum of $F{=}300$ face detections. With each face detection, we associate the video index $i$ (for $V_i$) from which it is derived and a normalized spatial bounding box location. We stack features to obtain $\mathbf{F}^{\text{f}} \in \mathbb{R}^{F \times d^{\text{f}}}$.

We bring all these features to a common $d$-dimensional space using separate linear projection layers for each modality: $\mathbf{W}^{\text{mod}} \in \mathbb{R}^{d \times d^{\text{mod}}}$, where mod takes on the values s for semantic, a for action, and f for face.
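As an illustration, here is a minimal PyTorch sketch of the per-modality projections; the feature dimensions are placeholders rather than the exact values used in the paper.

```python
import torch
import torch.nn as nn

# Placeholder feature dimensions for the three modalities and the shared model width d.
D_SEM, D_ACT, D_FACE, D_MODEL = 512, 1024, 512, 512

class FeatureProjector(nn.Module):
    """Projects CLIP (semantic), I3D (action), and ArcFace (face) features to d dims."""
    def __init__(self):
        super().__init__()
        self.proj_s = nn.Linear(D_SEM, D_MODEL)   # W^s
        self.proj_a = nn.Linear(D_ACT, D_MODEL)   # W^a
        self.proj_f = nn.Linear(D_FACE, D_MODEL)  # W^f

    def forward(self, f_s, f_a, f_f):
        # f_s: (N*T, D_SEM), f_a: (N*S, D_ACT), f_f: (F, D_FACE)
        return self.proj_s(f_s), self.proj_a(f_a), self.proj_f(f_f)
```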

Captionset feature extraction.

Similar to [29], we also extract blank embeddings by feeding the captionset to BERT (fine-tuned for gender prediction as in [29]) and using the contextualized tokens:

$[\hat{\mathsf{CLS}}, \hat{\mathbf{w}}_1, \ldots, \hat{\mathbf{b}}_k, \ldots] = \text{BERT}([\mathsf{CLS}, w_1, \ldots, b_k, \ldots])$.   (1)

The blank embedding is a concatenation of contextualized tokens: $\mathbf{b}_k = [\hat{\mathsf{CLS}}, \hat{\mathbf{b}}_k]$. We stack these to create a matrix $\mathbf{B} \in \mathbb{R}^{|\mathcal{B}| \times 2 d^{\text{bert}}}$ and transform them to the same space through a linear projection $\mathbf{W}^{\text{bert}} \in \mathbb{R}^{d \times 2 d^{\text{bert}}}$.
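A hedged sketch of this step with Hugging Face transformers is shown below; representing blanks with the [MASK] token and the specific checkpoint are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")  # the paper fine-tunes BERT for gender prediction

def blank_embeddings(captionset_text: str) -> torch.Tensor:
    """One embedding per blank: concatenation of the contextualized [CLS] and blank tokens."""
    inputs = tokenizer(captionset_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]        # (L, d_bert)
    cls_emb = hidden[0]                                      # contextualized [CLS]
    blank_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return torch.stack([torch.cat([cls_emb, hidden[p]]) for p in blank_pos])  # (|B|, 2*d_bert)

# Blanks written as [MASK] tokens in the captionset text.
print(blank_embeddings("[MASK] opens the door. [MASK] follows her.").shape)  # torch.Size([2, 1536])
```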

Face clustering.

Instead of creating face clusters within each video and using blank embeddings to attend to them (as done in [29]), we adopt a soft approach for incorporating cluster information in MICap. First, we perform clustering using DBSCAN across all $F$ detections in the videoset, resulting in $\mathcal{G}$, a set of face groups. This allows our model to associate faces across videos as the same or a different person. Next, we prevent propagating errors caused by clustering and by mean pooling representations, by adding a cluster-id based learnable embedding $\mathbf{E}^{\text{fcl}}$ to the face representations.
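A minimal sketch of this step is given below; the DBSCAN distance metric, eps threshold, and maximum number of clusters are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN

def cluster_faces(face_feats: np.ndarray, eps: float = 0.4) -> np.ndarray:
    """Cluster all F ArcFace descriptors of a videoset into face groups G."""
    return DBSCAN(eps=eps, min_samples=1, metric="cosine").fit_predict(face_feats)

# Soft cluster information: a learnable cluster-id embedding E^fcl added to the face tokens.
MAX_CLUSTERS, D_MODEL = 64, 512
cluster_embedding = nn.Embedding(MAX_CLUSTERS, D_MODEL)

face_feats = np.random.randn(300, 512).astype(np.float32)   # F=300 ArcFace descriptors
face_tokens = torch.randn(300, D_MODEL)                      # projected face features W^f F^f
group_ids = torch.as_tensor(cluster_faces(face_feats)) % MAX_CLUSTERS
face_tokens = face_tokens + cluster_embedding(group_ids)     # faces in one group share E^fcl
```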

Additional embeddings

are added to various features to orient the model: (i) $\mathbf{E}^{\text{typ}} \in \mathbb{R}^{d \times 4}$ disambiguates between the 4 types of features. (ii) $\mathbf{E}^{\text{vid}} \in \mathbb{R}^{d \times N}$ consists of $N$ embeddings to inform the model of the source video index for any visual or blank token. (iii) $\mathbf{E}^{\text{seg}} \in \mathbb{R}^{d \times S}$, together with $\mathbf{E}^{\text{vid}}$, allows localizing any feature to the correct video and segment. (iv) $\mathbf{E}^{\text{fcl}} \in \mathbb{R}^{d \times |\mathcal{G}|}$ is the face cluster index embedding described above, and (v) $\mathbf{E}^{\text{bbox}} \in \mathbb{R}^{d \times 4}$ transforms normalized face detection bounding box coordinates to provide the model with spatial information.

We create input tokens as follows (with appropriate indexing hidden for brevity):

$\hat{\mathbf{B}} = \mathbf{W}^{\text{bert}} \mathbf{B} + \mathbf{E}^{\text{typ}}_{0} + \mathbf{E}^{\text{vid}}$,   (2)
$\hat{\mathbf{F}}^{\text{s}} = \mathbf{W}^{\text{s}} \mathbf{F}^{\text{s}} + \mathbf{E}^{\text{typ}}_{1} + \mathbf{E}^{\text{vid}} + \mathbf{E}^{\text{seg}}$,   (3)
$\hat{\mathbf{F}}^{\text{a}} = \mathbf{W}^{\text{a}} \mathbf{F}^{\text{a}} + \mathbf{E}^{\text{typ}}_{2} + \mathbf{E}^{\text{vid}} + \mathbf{E}^{\text{seg}}$,   (4)
$\hat{\mathbf{F}}^{\text{f}} = \mathbf{W}^{\text{f}} \mathbf{F}^{\text{f}} + \mathbf{E}^{\text{typ}}_{3} + \mathbf{E}^{\text{vid}} + \mathbf{E}^{\text{seg}} + \mathbf{E}^{\text{fcl}} + \mathbf{E}^{\text{bbox}}$.   (5)
A Transformer encoder (TE)

[49] of $L_E$ layers is used to combine and refine the individual representations mentioned above. Thus, the final memory bank is:

$\mathbf{M} = [\tilde{\mathbf{B}}, \tilde{\mathbf{F}}^{\text{s}}, \tilde{\mathbf{F}}^{\text{a}}, \tilde{\mathbf{F}}^{\text{f}}] = \text{TE}([\hat{\mathbf{B}}, \hat{\mathbf{F}}^{\text{s}}, \hat{\mathbf{F}}^{\text{a}}, \hat{\mathbf{F}}^{\text{f}}])$.   (6)
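A hedged sketch of Eqs. 2-6 with standard PyTorch modules follows; the token counts and hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, L_E = 512, 8, 2

# Transformer encoder that fuses blank and visual tokens into the captioning memory (Eq. 6).
enc_layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=L_E)

# Projected tokens with their additive embeddings already applied (Eqs. 2-5).
B_hat = torch.randn(1, 8, D_MODEL)      # blank tokens
F_sem = torch.randn(1, 250, D_MODEL)    # semantic tokens (N*T)
F_act = torch.randn(1, 25, D_MODEL)     # action tokens (N*S)
F_face = torch.randn(1, 300, D_MODEL)   # face tokens (F)

memory = encoder(torch.cat([B_hat, F_sem, F_act, F_face], dim=1))
print(memory.shape)  # torch.Size([1, 583, 512])
```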

3.1.2 Auto-regressive Identity Prediction

We now present the process of filling blanks. Similar to the encoder, we use a couple of embeddings for the decoder: (i) $\mathbf{E}^{\text{vid}}$ (shared with the encoder) informs the decoder of the video index that is being captioned; and (ii) $\mathbf{E}^{\text{pos}}$ encodes learnable position embeddings similar to the original Transformer [49]. We use the memory embeddings extracted from the video as key-value pairs and the blanks in the Transformer decoder (TD) as queries. Given a captionset $\hat{\mathcal{C}}$, we generate the next word as

$\mathbf{h}_{j+1} = \text{TD}([w_1, \ldots, w_j]; \mathbf{M})$,   (7)
$w_{j+1} = \arg\max_{\mathcal{V}} \mathbf{W}^{\mathcal{V}} \mathbf{h}_{j+1}$.   (8)

$\mathbf{h}_{j+1}$ represents the output of TD at the $(j{+}1)^{\text{th}}$ timestep and is obtained through a series of $L_D$ decoder layers that compute self-attention to previous words and cross-attention to the memory. $\mathbf{W}^{\mathcal{V}}$ is a linear classifier in $\mathbb{R}^{|\mathcal{V}| \times d}$, where $\mathcal{V}$ is the word vocabulary.

For the FITB task, the captionset already contains the correct caption words. Thus, the output prediction is relevant only when $w_{j+1}$ is a blank $b_k$. In such a case, we can use a smaller output classifier $\mathbf{W}^{\mathcal{P}}$ that picks one among the $|\mathcal{P}|$ person-id labels. We rewrite the above equations as:

$\mathbf{h}_{j+1} = \text{TD}([w_1, \ldots, w_j]; \mathbf{M})$,   (9)
$w_{j+1} = \hat{y}_k = \arg\max_{\mathcal{P}} \mathbf{W}^{\mathcal{P}} \mathbf{h}_{j+1}$,   (10)

where $\hat{y}_k \in \mathcal{P}$ is the predicted person-id label for blank $b_k$.

Training and inference.

We train MICap by applying a cross-entropy loss at every blank:

$\mathcal{L}_{\text{FITB}} = -\sum_{k=1}^{|\mathcal{B}|} y_k \log \text{softmax}_{\mathcal{P}}\left(\mathbf{W}^{\mathcal{P}} \mathbf{h}_{j+1}\right)$,   (11)

where $y_k$ is the correct label for blank $b_k$. The key difference to [29] is that our decoder observes each word of the captionset in an auto-regressive manner.

During inference, we simply follow Eq.10 to compute person-id label predictions for blanks in a captionset.
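The sketch below illustrates teacher-forced FITB training with a causal Transformer decoder and a loss only at blank positions; the shapes, helper tensors, and exact masking convention are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, N_HEADS, L_D, NUM_IDS = 512, 8, 3, 11

dec_layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=N_HEADS, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=L_D)
id_head = nn.Linear(D_MODEL, NUM_IDS)    # W^P, classifier over person-id labels P

def fitb_loss(caption_tokens, blank_next_mask, blank_labels, memory):
    """caption_tokens: (1, L, d) teacher-forced embeddings of the captionset with blanks;
    blank_next_mask: (1, L) bool, True at positions whose *next* token is a blank;
    blank_labels: (|B|,) ground-truth person ids; memory: encoder output M."""
    L = caption_tokens.size(1)
    causal = nn.Transformer.generate_square_subsequent_mask(L)
    h = decoder(caption_tokens, memory, tgt_mask=causal)   # (1, L, d), Eq. 9
    logits = id_head(h[blank_next_mask])                   # predictions only at blank positions
    return F.cross_entropy(logits, blank_labels)           # Eq. 11

caption_tokens = torch.randn(1, 120, D_MODEL)
blank_next_mask = torch.zeros(1, 120, dtype=torch.bool)
blank_next_mask[0, [4, 17, 42]] = True                     # three blanks in the captionset
memory = torch.randn(1, 583, D_MODEL)
print(fitb_loss(caption_tokens, blank_next_mask, torch.tensor([0, 1, 0]), memory))
```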

3.2 Joint Fill-in and Captioning

We first present how MICap can be adapted to generate the entire captionset. Then, we present the opportunity for joint training.

From FITB to generating the captionset.

In this scenario, the model is shown the videoset $\mathcal{N}$ and is expected to generate an id-aware captionset $\mathcal{C}$. We make two small changes:

(i) The memory bank is restricted to visual features, $\mathbf{M} = [\tilde{\mathbf{F}}^{\text{s}}, \tilde{\mathbf{F}}^{\text{a}}, \tilde{\mathbf{F}}^{\text{f}}]$. In fact, we cannot compute blank embeddings $\tilde{\mathbf{B}}$ as the captionset is yet to be predicted.

(ii) When decoding the next word of the captionset, we use an augmented vocabulary consisting of normal language tokens (from $\mathcal{V}$) and person-id labels (from $\mathcal{P}$). We predict the next word as shown below:

$\mathcal{V}^{*} = \mathcal{V} + \mathcal{P}$,   (12)
$\mathbf{h}_{j+1} = \text{TD}([w_1, \ldots, w_j]; \mathbf{M})$,   (13)
$\hat{w}_{j+1} = \arg\max_{\mathcal{V}^{*}} \mathbf{W}^{\mathcal{V}^{*}} \mathbf{h}_{j+1}$,   (14)

and train our model to minimize

$\mathcal{L}_{\text{cap}} = -\sum_{j=1}^{L} w_{j+1} \log \text{softmax}_{\mathcal{V}^{*}}\left(\mathbf{W}^{\mathcal{V}^{*}} \mathbf{h}_{j+1}\right)$.   (15)

We can use Eq.14 during inference to predict the entire captionset until the end-of-sentence token is triggered.
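For illustration, a hedged sketch of greedy inference over the augmented vocabulary $\mathcal{V}^{*}$ (Eqs. 12-14); the module names are placeholders for the corresponding MICap components.

```python
import torch
import torch.nn as nn

def generate_captionset(decoder, word_embed, vocab_head, memory, bos_id, eos_id, max_len=120):
    """Greedy decoding; vocab_head scores words from V plus person-id labels from P."""
    tokens = [bos_id]
    for _ in range(max_len):
        tgt = word_embed(torch.tensor(tokens)[None])                         # (1, t, d)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = decoder(tgt, memory, tgt_mask=causal)                            # Eq. 13
        next_id = vocab_head(h[0, -1]).argmax().item()                       # Eq. 14
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy instantiation with random weights, just to show the shapes involved.
D, V_STAR = 512, 30522 + 11
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(D, 8, batch_first=True), num_layers=3)
tokens = generate_captionset(decoder, nn.Embedding(V_STAR, D), nn.Linear(D, V_STAR),
                             memory=torch.randn(1, 575, D), bos_id=0, eos_id=1, max_len=5)
print(tokens)
```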

Joint training.

Can we train the same instance of MICap to generate the captionset and to fill in the blanks with identity information? Yes, and we suggest an efficient way to do so.

Given a batch of data consisting of multiple paired videosets and captionsets $(\mathcal{N}, \mathcal{C})$, we forward it through the model twice. In the first forward pass, we replace the person-id labels with blanks, i.e. create $\hat{\mathcal{C}}$, and compute losses and gradients to predict the blanks' labels (see Eq. 11). In the second forward pass, conducted on the same batch, we assume that $\mathcal{C}$ is not available as input and use the augmented vocabulary $\mathcal{V}^{*}$ to compute the loss and gradients for each word as in Eq. 15. We can either accumulate gradients and optimize parameters at the end of both forward passes or optimize parameters after each pass.
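A minimal sketch of the first schedule (gradient accumulation over the two passes, then a single optimizer step) is shown below; the loss callables are stand-ins for MICap's FITB and captioning forward passes.

```python
import torch

def joint_training_step(fitb_loss_fn, caption_loss_fn, optimizer):
    """One joint step on the same batch: pass 1 computes the FITB loss (Eq. 11),
    pass 2 the captioning loss over the augmented vocabulary (Eq. 15)."""
    optimizer.zero_grad()
    loss_fitb = fitb_loss_fn()
    loss_fitb.backward()        # accumulate gradients from pass 1
    loss_cap = caption_loss_fn()
    loss_cap.backward()         # accumulate gradients from pass 2
    optimizer.step()            # single parameter update after both passes
    return loss_fitb.item(), loss_cap.item()

# Toy usage with one shared parameter standing in for MICap's weights.
w = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.AdamW([w], lr=5e-5)
print(joint_training_step(lambda: (w ** 2).sum(), lambda: (w - 1.0).pow(2).sum(), opt))
```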

Note, the classifier parameters $\mathbf{W}^{\mathcal{P}}$ are subsumed under $\mathbf{W}^{\mathcal{V}^{*}}$. We find that sharing the classifier $\mathbf{W}^{\mathcal{V}^{*}}$ for both forward passes works best.

Thus, we unite the seemingly disparate tasks of filling in person-id labels and generating the full captionset in a single model with a single set of parameters.

4 Identity-aware SPICE

Inspired by a metric used in image captioning evaluation, Semantic Propositional Image Caption Evaluation (SPICE) [1], we propose a new metric, identity-aware SPICE (iSPICE for short), to evaluate the quality of video descriptions, especially pertaining to identity labels.

Why SPICE?

The classic captioning metrics borrowed from language translation such as BLEU [27], ROUGE [21], METEOR [11], and CIDEr [50] rely primarily on n-gram overlap. However, as indicated in [1], “n-gram overlap is neither necessary nor sufficient for two sentences to convey the same meaning”. SPICE is shown to have a high correlation with human judgement (0.88) as compared to METEOR (0.53) or CIDEr (0.43) on the MS-COCO image captioning dataset [1].

How is SPICE calculated?

SPICE estimates the quality of a caption in two stages. First, the reference and predicted captions are converted to scene graphs [41, 16] that explicitly encode objects, attributes, and relationships. This abstraction provides lists of tuples $\mathcal{T}_r$ and $\mathcal{T}_p$ for the reference and predicted captions. SPICE is the F1-score that measures their logical conjunction (overlap):

$\text{SPICE} = \text{F}_1(\mathcal{T}_r, \mathcal{T}_p)$.   (16)
iSPICE

is a simple modification of SPICE. We intervene at the list of tuples and filter out tuples that do not have at least one character identity. We define

$\text{iSPICE} = \text{F}_1(\mathcal{T}_r^{p2+}, \mathcal{T}_p^{p2+}) \cdot \text{F}_1(\mathcal{T}_r^{p1}, \mathcal{T}_p^{p1})$,   (17)

where $\mathcal{T}_r^{p2+}$ denotes the list of tuples with a person-id label having 2 or more elements, and $\mathcal{T}_r^{p1}$ is the set of person-id labels in the reference captionset. The first term scores whether the correct person-id label is used together with a verb or attribute, while the second term checks that the total number of person-id labels match. A couple of examples of the matching process are presented in the supplement.
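To make the computation concrete, here is a hedged sketch of the scoring step starting from already-parsed scene-graph tuples; the tuple format and the P-followed-by-a-number id pattern are illustrative assumptions, and the SPICE parser itself is not reproduced.

```python
import re

def f1(ref, pred):
    ref, pred = set(ref), set(pred)
    if not ref and not pred:
        return 1.0
    tp = len(ref & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(ref) if ref else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def ispice(ref_tuples, pred_tuples):
    """iSPICE (Eq. 17) computed from scene-graph tuples such as ('P1', 'carries', 'P2')."""
    is_id = lambda tok: re.fullmatch(r"P\d+", str(tok)) is not None
    has_id = lambda t: any(is_id(tok) for tok in t)
    p2 = lambda ts: [t for t in ts if has_id(t) and len(t) >= 2]    # id with verb/attribute
    p1 = lambda ts: [t for t in ts if len(t) == 1 and is_id(t[0])]  # standalone person ids
    return f1(p2(ref_tuples), p2(pred_tuples)) * f1(p1(ref_tuples), p1(pred_tuples))

# "P1 carries P2" (reference) vs. "P2 carries P1" (prediction): the identity swap is penalized.
ref = [("P1",), ("P2",), ("P1", "carries", "P2")]
pred = [("P1",), ("P2",), ("P2", "carries", "P1")]
print(ispice(ref, pred))  # 0.0
```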

Validation.

We validate iSPICE through an experiment that measures sensitivity to changes in identity. Given a reference captionset, we compare it against itself to obtain a base score $s$. Next, we modify the reference captionset by swapping, adding new, or removing existing id labels.

1. Swapping: Here, id tokens are replaced with another id present in the captionset. The number of swapped tokens is selected at random for each captionset. We first identify eligible id tokens, i.e. ids that are present more than once in the captionset. This prevents the case where standalone ids are selected and replaced with each other, which does not change the meaning. For example, the caption P1 carries P2 is equivalent to P2 carries P1 if P1 and P2 are not re-used elsewhere in the captionset. When an id occurs multiple times, e.g. P1 carries P2. P2 is unconscious, the replacement P2 carries P1. P2 is unconscious changes the meaning of the story. Once these eligible tokens are identified, a random subset is replaced with another id present in the captionset to generate the modified caption.

2. Addition: Here, we select an id token at random and change it to an id token that is not present in the current captionset, thereby adding a new identity. Again, we do not replace tokens whose id appears only once.

3. Removal: Here, we replace a single-occurrence id token (chosen at random) with an id token that already exists in the captionset, thereby removing an identity.

Table 1: Sensitivity of captioning metrics to identity manipulations (relative score $\hat{s}/s$; lower is more sensitive).

| Experiment | iSPICE | SPICE | BLEU-4 | CIDEr | METEOR | ROUGE | BERTScore |
| Swapping   | 0.55   | 0.85  | 0.87   | 0.86  | 0.61   | 0.95  | 0.99      |
| Addition   | 0.51   | 0.86  | 0.89   | 0.88  | 0.60   | 0.95  | 0.99      |
| Removal    | 0.46   | 0.84  | 0.87   | 0.86  | 0.60   | 0.95  | 0.99      |
Id normalization.

Prior to scoring, a normalization operation is performed on the captionset: the first unique id label is set to P1, the second to P2, and so on. This ensures that the captionsets P2 carries P1 or P4 carries P3 are treated as the same captionset, P1 carries P2.
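A small sketch of this normalization, assuming ids are written as P followed by a number:

```python
import re

def normalize_ids(captionset: str) -> str:
    """Renumber person ids in order of first appearance."""
    mapping = {}
    def repl(match):
        pid = match.group(0)
        if pid not in mapping:
            mapping[pid] = f"P{len(mapping) + 1}"
        return mapping[pid]
    return re.sub(r"P\d+", repl, captionset)

print(normalize_ids("P4 carries P3. P3 is unconscious."))  # P1 carries P2. P2 is unconscious.
```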

Results.

We compute a new score $\hat{s}$ for each edited captionset by comparing it against the reference. We report the relative score $\hat{s}/s$, i.e. the drop in performance, as the sensitivity of a metric to changing identities. We create 3 manipulated samples for each type and report averaged scores over all 1443 captionsets from the validation set in Tab. 1. We observe that iSPICE obtains the smallest score, indicating the highest sensitivity to manipulating identities, a desirable property.

5 Experiments

We present experiments on the LSMDC [38] dataset in the identity-aware multi-video captioning setup [29]. We describe the experimental setup first, followed by implementation details and metrics. The evaluation is presented for (i) fill-in-the-blanks and (ii) identity-aware captioning.

5.1 Setup

Dataset.

LSMDC consists of 128,118 short video clips extracted from 202 movies. Each video has a caption, either from the movie script or from transcribed DVS (descriptive video services) for the visually impaired. The median video duration is 3 s, the average is 4.2 s, and the standard deviation is 3.1 s. The dataset is split into 101,079 clips for training, 7,408 for validation, 10,053 for public test, and 9,578 for blind test. We report and compare results on the validation set as the test set labels are not released and the evaluation server is down.

In the Fill-in challenges, the movie descriptions are evaluated on sets of 5 clips taken at a time. Characters are identified across the clips to provide meaningful narratives. The training videosets use overlapping clips (e.g. 1-5, 2-6) for data augmentation, but the val and test videosets are non-overlapping. We train on 98,527 videosets and report results on 1,443 val videosets. All three tasks of the LSMDC challenge [38] are evaluated on the same sets of 5 clips. We focus on task 2: filling in local person ids; and task 3: description generation with local character ids.

Implementation details.

Videosets have $N{=}5$ clips, and we set the captionset length to 120 tokens. The hidden dimension for the encoder and decoder in MICap is $d{=}512$, and we use $L_E{=}2$ and $L_D{=}3$ layers. We train our model with a learning rate of $5{\times}10^{-5}$ for 30 epochs. The vocabulary sizes are $|\mathcal{P}|{=}11$ and $|\mathcal{V}|{=}30522$. We train on one RTX 2080 GPU with a batch size of 16 videosets/captionsets.

[Figure 3: A qualitative example of MICap's id-aware captions on a videoset.]
Fill-in metrics.

For the fill-in task, we evaluate results using all pairs of blanks in the captionset, as proposed by [29]. Pairs whose ground-truth ids are the same are evaluated with same accuracy (“Same-acc”), while pairs with different ids are evaluated using “Diff-acc”. “Inst-acc” is the combined accuracy over all pairs, while “Class-acc” computes the harmonic mean of Same-acc and Diff-acc.
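The sketch below shows one way to compute these pairwise accuracies from predicted and ground-truth id labels; it reflects our reading of [29]'s protocol rather than the official evaluation code.

```python
def fitb_metrics(pred_ids, gt_ids):
    """Pairwise accuracies over all blank pairs in a captionset."""
    same_ok = same_tot = diff_ok = diff_tot = 0
    n = len(gt_ids)
    for i in range(n):
        for j in range(i + 1, n):
            gt_same = gt_ids[i] == gt_ids[j]
            pred_same = pred_ids[i] == pred_ids[j]
            if gt_same:
                same_tot += 1
                same_ok += int(pred_same)       # pair should share an id
            else:
                diff_tot += 1
                diff_ok += int(not pred_same)   # pair should use different ids
    same_acc = same_ok / same_tot if same_tot else 0.0
    diff_acc = diff_ok / diff_tot if diff_tot else 0.0
    inst_acc = (same_ok + diff_ok) / (same_tot + diff_tot) if same_tot + diff_tot else 0.0
    class_acc = 2 * same_acc * diff_acc / (same_acc + diff_acc) if same_acc + diff_acc else 0.0
    return same_acc, diff_acc, inst_acc, class_acc

print(fitb_metrics(["P1", "P1", "P2"], ["P1", "P2", "P2"]))  # (0.0, 0.5, 0.333..., 0.0)
```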

Captioning metrics.

We use METEOR [11], CIDEr [50], SPICE [1], and our newly proposed metric iSPICE to evaluate the quality of our generated captions.

5.2 Evaluating on the Fill-in Task

MICap makes better use of visual features.

In Tab. 2, our text-only model (row 2) is comparable to [29]'s text-only model (R0). While [29] improves by 1.5% with visual inputs (R1), MICap achieves a significant 4.7% improvement (R6).

Ablations on visual features.

[29] computes face clusters within a video and provides mean-pooled features of the faces in a cluster. R3 of Tab. 2 uses these features in MICap (with embeddings from Eq. 5). The decoder-only model (only-dec) achieves a 0.6% improvement, while the encoder-decoder model (enc-dec) shows a 1.4% improvement over R1. Next, in R4, we swap out face cluster features for individual face detections, while still using FaceNet for a fair comparison, but using embeddings as shown in Eq. 5. This improves the only-dec model by a further 0.9%, but enc-dec shows negligible change. We incorporate CLIP features as additional tokens in the memory, resulting in a 0.35% increase for enc-dec (R5). Finally, in R6, swapping FaceNet [40] for ArcFace [9] results in a relatively large improvement of 1.6% (only-dec) and 1.4% (enc-dec).

Table 2: FITB ablations (class accuracy) with decoder-only (Only Dec) and encoder-decoder (Enc-Dec) variants of MICap.

| # | Method                          | Only Dec | Enc-Dec |
| 0 | FillIn text-only [29]           | -        | 64.4    |
| 1 | FillIn multimodal [29]          | -        | 65.9    |
| 2 | MICap text-only                 | -        | 64.45   |
| 3 | MICap w/ face clusters of [29]  | 66.56    | 67.29   |
| 4 | MICap w/ raw face detections    | 67.48    | 67.35   |
| 5 | MICap 4 + CLIP features         | 67.38    | 67.70   |
| 6 | MICap 5 + Arcface features      | 68.94    | 69.14   |
Table 3: FITB comparison with the state of the art on same, different, instance, and class accuracies.

| Method                 | Same | Different | Instance | Class |
| Test set               |      |           |          |       |
| Yu et al. [59]         | 26.4 | 87.3      | 65.9     | 40.6  |
| Brown et al. [2]       | 33.6 | 81.0      | 64.8     | 47.5  |
| FillIn text-only [29]  | 56.0 | 71.2      | 64.8     | 62.7  |
| FillIn [29]            | 60.6 | 70.0      | 69.6     | 64.9  |
| Validation set         |      |           |          |       |
| FillIn [29]            | 63.5 | 68.4      | 69.0     | 65.9  |
| Ours (only-dec)        | 65.1 | 73.3      | 73.0     | 68.94 |
| Ours (enc-dec)         | 65.7 | 72.9      | 73.0     | 69.14 |
SotA comparison.

Tab. 3 reports results on all 4 FITB metrics. As we do not have access to the test set labels and the evaluation server is inactive, we use FillIn's results as a proxy for comparison. First, in the top half, we see that FillIn [29] outperforms other works. In the bottom half, on the validation set, we compare our approach against FillIn, showing a significant improvement of 4% on instance accuracy and 3.2% on class accuracy. As we teacher force captions through the decoder, our decoder-only model also outperforms [29] by 3% on class accuracy.

Table 4: Joint training of FITB and full captioning (C: CIDEr, M: METEOR, S: SPICE, iS: iSPICE).

| Method            | C    | M     | S     | iS    | FITB Class Acc. |
| FITB only         | -    | -     | -     | -     | 69.14           |
| Full caption only | 8.01 | 12.29 | 13.11 | 0.777 | -               |
| Joint training    | 9.09 | 12.47 | 13.30 | 0.788 | 70.01           |

5.3 Evaluating Joint Fill-in and Captioning

We evaluate MICap trained jointly for FITB and id-aware caption generation. Tab. 4 shows that joint training on fill-in and captioning improves the performance on both tasks. Class accuracy on FITB improves by 0.9% and the captioning metric CIDEr by 1%. We also see a small 0.01 improvement in iSPICE, which we think is important considering the difficulty of the metric. This suggests that the two tasks are complementary and can help each other learn a better representation. MICap can seamlessly switch between FITB (id prediction) and full caption generation.

Table 5: Identity-aware captioning evaluation (C: CIDEr, M: METEOR, S: SPICE, iS: iSPICE).

| # | Captions     | Method        | C    | M     | S     | iS    |
| 1 | Fill-in [29] | Same id       | 7.03 | 9.41  | 9.01  | 0.591 |
| 2 |              | All diff ids  | 7    | 9.11  | 12.98 | 0.202 |
| 3 |              | FillIn        | 7.77 | 10.68 | -     | -     |
| 4 | MICap        | Same id       | 8.44 | 10.9  | 9.26  | 0.687 |
| 5 |              | All diff ids  | 8.74 | 11.01 | 13.09 | 0.264 |
| 6 |              | MICap (Joint) | 9.09 | 12.47 | 13.30 | 0.788 |
Table 6: Comparison to VLM baselines (C: CIDEr, M: METEOR, S: SPICE, iS: iSPICE).

| Method            | C    | M     | S     | iS    | FITB Class Acc. |
| MICap             | 9.09 | 12.47 | 13.30 | 0.788 | 70.01           |
| T5 only CLIP      | 4.9  | 8.5   | 7.1   | 0.755 | -               |
| T5 all features   | 4.5  | 7.9   | 6.8   | 0.723 | -               |
| GPT2 only CLIP    | 3.6  | 8.7   | 10.7  | 0.640 | -               |
| GPT2 all features | 4.4  | 8.9   | 9.2   | 0.595 | -               |
SotA comparison for captioning.

We compare against the two-stage baseline [29], while MICap predicts the captions and identities in a single stage. Tab. 5 shows that we improve over [29] across all metrics.

MICap’s captions are better.

We disentangle identity prediction from caption generation by replacing all person id labels with the same id or with all different ids. This allows us to evaluate captioning performance independent of identity prediction. We are pleased that our simple encoder-decoder approach outperforms the complex adversarial multi-sentence captioning approach [28] used in stage 1 of [29]: in Tab. 5, R1 vs. R4, CIDEr goes up from 7.03 to 8.44 and METEOR from 9.41 to 10.9. Similar improvements hold for R2 vs. R5.

Comparison to VLMs.

Tab. 6 shows that MICap outperforms adaptations of T5 (an encoder-decoder framework) and GPT-2 (Q-Former prefix tokens, like ClipCap [24] or BLIP-2 [19]), fine-tuned for the id-aware captioning task. We suspect that integrating many diverse visual tokens is not trivial for VLMs, resulting in comparable performance when using “only CLIP” or “all features”.

Id-aware metric.

iSPICE is a challenging metric as it multiplies two F1 scores, penalizing captions where the number of identities is mismatched or the identity tuples are incorrect. Tab. 5 shows that iSPICE changes dramatically when using the same id or all different ids. We hope that this metric will inspire future work in this direction of identity-aware captioning.

Attention patterns

of MICap's decoder reveal interesting insights. For the task of full captioning, we see that tokens that produce id labels cross-attend more to the face tokens (from memory), while normal word tokens cross-attend to CLIP features. We also analyze the attention patterns in FITB and observe that the model attends to the same clusters when predicting the same labels, and also attends to face detections across the videoset (not restricted to faces in a single video). Please refer to the supplement for details.

A qualitative example

is shown in Fig. 3. We observe that MICap does a decent job at generating captions, although it is unable to use a rich vocabulary (e.g. smiles instead of beams cheerily). The challenges of caption evaluation are also clear in the last clip. Several more examples for both tasks are shown in the supplement.

6 Conclusion

We proposed a new paradigm for identity-aware movie caption generation. As opposed to the two-stage approach of first captioning with anonymized names and then filling in the identities, we proposed a single-stage method that combines the two tasks in an encoder-decoder sequence-to-sequence generation framework that can seamlessly switch between (i) full caption generation with identities, and (ii) predicting the identities given a caption with anonymized names. We showed that a single auto-regressive model benefits both tasks and shows positive transfer, leading to state-of-the-art performance on the LSMDC challenge. We also proposed an identity-aware captioning metric, iSPICE, that is sensitive to subtle perturbations in identity and robustly evaluates captions.

Acknowledgments.

The project was supported by funding from SERB SRG/2023/002544. We thank the Bank of Baroda for partial travel support. We thank Amit Pandey for assisting in early discussions. Makarand Tapaswi thanks support from the Google India Faculty Award, and Naveen Reddy Desanur from Sensara.

References

  • Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic Propositional Image Caption Evaluation. In European Conference on Computer Vision (ECCV), 2016.
  • Andrew Brown, Samuel Albanie, Yang Liu, Arsha Nagrani, and Andrew Zisserman. LSMDC v2 Challenge Presentation. In 3rd Workshop on Closing the Loop Between Vision and Language, 2019.
  • Andrew Brown, Vicky Kalogeiton, and Andrew Zisserman. Face, Body, Voice: Video Person-Clustering with Multiple Modalities. In International Conference on Computer Vision Workshops (ICCVW), 2021.
  • Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Aman Chadha, Gurneet Arora, and Navpreet Kaloty. iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering. In Winter Conference on Applications of Computer Vision (WACV), 2021.
  • David M. Chan, Suzanne Petryk, Joseph E. Gonzalez, and Trevor Darrell. CLAIR: Evaluating Image Captions with Large Language Models. In Empirical Methods in Natural Language Processing (EMNLP), 2023.
  • Shaoxiang Chen and Yu-Gang Jiang. Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Chaorui Deng, Shizhe Chen, Da Chen, Yuan He, and Qi Wu. Sketch, Ground, and Refine: Top-down Dense Video Captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. RetinaFace: Single-shot Multi-level Face Localisation in the Wild. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Michael Denkowski and Alon Lavie. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In European Chapter of the Association for Computational Linguistics (EACL), 2014.
  • Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman. AutoAD: Movie Description in Context. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman. AutoAD II: The Sequel - Who, When, and What in Movie Audio Description. In International Conference on Computer Vision (ICCV), 2023.
  • Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Empirical Methods in Natural Language Processing (EMNLP), 2021.
  • Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image Retrieval using Scene Graphs. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Zeeshan Khan, C. V. Jawahar, and Makarand Tapaswi. Grounded Video Situation Recognition. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-Captioning Events in Videos. In International Conference on Computer Vision (ICCV), 2017.
  • Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning (ICML), 2023.
  • Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. Jointly Localizing and Describing Events for Dense Video Captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Workshop on Text Summarization Branches Out (WAS), 2004.
  • Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. UniVL: A Unified Video and Language Pre-training Model for Multimodal Understanding and Generation. arXiv preprint arXiv:2002.06353, 2020.
  • Ron Mokady, Amir Hertz, and Amit H. Bermano. ClipCap: CLIP Prefix for Image Captioning. arXiv preprint arXiv:2111.09734, 2021.
  • Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, and Bohyung Han. Streamlined Dense Video Captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Arsha Nagrani and Andrew Zisserman. From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script. In British Machine Vision Conference (BMVC), 2017.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In Association for Computational Linguistics (ACL), 2002.
  • Jae Sung Park, Marcus Rohrbach, Trevor Darrell, and Anna Rohrbach. Adversarial Inference for Multi-Sentence Video Description. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Jae Sung Park, Trevor Darrell, and Anna Rohrbach. Identity-Aware Multi-Sentence Video Description. In European Conference on Computer Vision (ECCV), 2020.
  • Stefano Pini, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Towards Video Captioning with Naming: a Novel Dataset and a Multi-Modal Approach. In International Conference on Image Analysis and Processing (ICIAP), 2017.
  • Stefano Pini, Marcella Cornia, Federico Bolelli, Lorenzo Baraldi, and Rita Cucchiara. M-VAD Names: a Dataset for Video Captioning with Naming. Multimedia Tools and Applications (MTAP), 78:14007–14027, 2019.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. 2019.
  • Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning (ICML), 2021.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research (JMLR), 21:1–67, 2020.
  • Tanzila Rahman, Bicheng Xu, and Leonid Sigal. Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning. In International Conference on Computer Vision (ICCV), 2019.
  • Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. Coherent Multi-Sentence Video Description with Variable Level of Detail. In German Conference on Pattern Recognition (GCPR), 2014.
  • Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A Dataset for Movie Description. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie Description. International Journal of Computer Vision (IJCV), 123:94–120, 2017.
  • Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, and Aniruddha Kembhavi. Visual Semantic Role Labeling for Video Understanding. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Schroff etal. [2015]Florian Schroff, Dmitry Kalenichenko, and James Philbin.FaceNet: A Unified Embedding for Face Recognition and Clustering.In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Schuster etal. [2015]Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and ChristopherD Manning.Generating semantically precise scene graphs from textual descriptions for improved image retrieval.In Fourth Workshop on Vision and Language, 2015.
  • Seo etal. [2022]PaulHongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid.End-to-end generative pretraining for multimodal video captioning.In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Shen etal. [2017]Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue.Weakly supervised dense video captioning.In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Shi etal. [2019]Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu, and Ming Zhou.Dense procedure captioning in narrated instructional videos.In Association of Computational Linguistics (ACL), 2019.
  • Shin etal. [2016]Andrew Shin, Katsunori Ohnishi, and Tatsuya Harada.Beyond caption to narrative: Video captioning with multiple sentences.In International Conference on Image Processing (ICIP), 2016.
  • Soldan, Mattia and Pardo, Alejandro and Alcázar, Juan León and Caba, Fabian and Zhao, Chen and Giancola, Silvio and Ghanem, Bernard [2022]Soldan, Mattia and Pardo, Alejandro and Alcázar, Juan León and Caba, Fabian and Zhao, Chen and Giancola, Silvio and Ghanem, Bernard.MAD: A Scalable Dataset for Language Grounding in Videos From Movie Audio Descriptions.In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Tapaswi etal. [2012]Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen.“Knock! Knock! Who is it?" Probabilistic Person Identification in TV series.In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • Tapaswi etal. [2019]Makarand Tapaswi, MarcT. Law, and Sanja Fidler.Video Face Clustering with Unknown Number of Clusters.In International Conference on Computer Vision (ICCV), 2019.
  • Vaswani etal. [2017]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN Gomez, Łukasz Kaiser, and Illia Polosukhin.Attention is All You Need.In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • Vedantam etal. [2015]Ramakrishna Vedantam, C.Lawrence Zitnick, and Devi Parikh.CIDEr: Consensus-based image description evaluation.In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Venugopalan etal. [2015a]Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko.Sequence to sequence-video to text.In International Conference on Computer Vision (ICCV), 2015a.
  • Venugopalan etal. [2015b]Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko.Translating Videos to Natural Language Using Deep Recurrent Neural Networks.In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015b.
  • Wang etal. [2018]Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu.Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning.In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Wang etal. [2020]Teng Wang, Huicheng Zheng, Mingjing Yu, Qian Tian, and Haifeng Hu.Event-centric hierarchical representation for dense video captioning.IEEE Transactions on Circuits and Systems for Video Technology, 31(5):1890–1900, 2020.
  • Wang etal. [2021]Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo.End-to-end Dense Video Captioning with Parallel Decoding.In International Conference on Computer Vision (ICCV), 2021.
  • Yang etal. [2023]Antoine Yang, Arsha Nagrani, PaulHongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid.Vid2seq: Large-scale pretraining of a visual language model for dense video captioning.In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Yu etal. [2016]Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu.Video paragraph captioning using hierarchical recurrent neural networks.In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Yu etal. [2017]Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim.End-to-end concept word detection for video captioning, retrieval, and question answering.In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Yu etal. [2019]Youngjae Yu, Jiwan Chung, Jongseok Kim, Heeseung Yun, and Gunhee Kim.LSMDC v2 Challenge presentation.2019.
  • Yu etal. [2020]Youngjae Yu, Jongseok Kim, Heeseung Yun, Chung Jiwan, and Gunhee Kim.Character Grounding and Re-Identification inStory of Videos and Text Descriptions.In European Conference on Computer Vision (ECCV), 2020.
  • Zhang* etal. [2020]Tianyi Zhang*, Varsha Kishore*, Felix Wu*, KilianQ. Weinberger, and Yoav Artzi.BERTScore: Evaluating Text Generation with BERT.In International Conference on Learning Representations (ICLR), 2020.
  • Zhou etal. [2018]Luowei Zhou, Yingbo Zhou, JasonJ Corso, Richard Socher, and Caiming Xiong.End-to-end dense video captioning with masked transformer.In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Appendix

We present additional insights and results in this supplementary material. In Appendix A, we highlight how our auto-regressive Transformer decoder attends to the various memory features. For the id-aware captioning task, we show the relative importance of the 3 visual features, while for the Fill-in-the-Blanks (FITB) task, we highlight how our model attends to the correct face clusters. Next, in Appendix B, we show qualitative results for both tasks, FITB and id-aware captioning. We also illustrate how our new identity-aware metric, iSPICE, is computed on some examples. Finally, we end with a discussion of some limitations in Appendix C.

Appendix A Analyzing Model Attention

In this section, we visualize and discuss the attention scores from MICap's auto-regressive Transformer decoder. In particular, we focus on the cross-attention scores of the last layer as they reveal interesting insights about the features that the captioning model uses. Throughout this section, we analyze MICap trained jointly on id-aware captioning and FITB. All attention scores are obtained in inference mode.

A.1 Attention Patterns in Id-aware Captioning

In id-aware full captioning, for a particular videoset $\mathcal{N}=\{V_i\}_{i=1}^{N}$, we first encode the videos to obtain memory tokens $M$ and pass them through a Transformer decoder auto-regressively to generate one token (word) at a time. If the predicted captionset contains $L$ tokens, we can compute a matrix of cross-attention scores $\alpha$ of size $L \times |M|$, where $|M|$ is the number of tokens in the decoder memory. Note that while we use multi-head attention, the scores are averaged over the heads to obtain $\alpha$.

We split the $L$ tokens into two groups: (i) person id label predictions, or person tokens (PT); and (ii) all other tokens, referred to as caption tokens (CT). For visualization, we sum the attention scores within each token type (id labels and text) and reduce the attention map to a matrix of size $2 \times |M|$.

Next, we also group the memory tokens into the 3 types of visual features used in our work: action (I3D), face (ArcFace), and semantic (CLIP) features. Thus, we obtain a $2 \times 3$ matrix of cross-attention scores for each sample.
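As a concrete illustration, a minimal NumPy sketch of this aggregation is given below. The tensor shapes, variable names, and the final row normalization are our assumptions for illustration, not the exact implementation.

import numpy as np

def grouped_cross_attention(attn, is_person_token, memory_feature_type):
    # attn:                 (num_heads, L, M) last-layer cross-attention scores
    # is_person_token:      (L,) bool, True for person id tokens (PT), False for caption tokens (CT)
    # memory_feature_type:  (M,) int in {0, 1, 2} marking each memory token as
    #                       action (I3D), face (ArcFace), or semantic (CLIP)
    alpha = attn.mean(axis=0)                      # average over heads -> (L, M)
    summary = np.zeros((2, 3))
    for row, token_mask in enumerate([is_person_token, ~is_person_token]):
        for col in range(3):
            summary[row, col] = alpha[np.ix_(token_mask, memory_feature_type == col)].sum()
    # assumption: normalize each row so PT and CT scores are comparable across samples
    return summary / summary.sum(axis=1, keepdims=True)

# toy example: 8 heads, 12 generated tokens, 30 memory tokens (10 per feature type)
attn = np.random.rand(8, 12, 30)
pt = np.zeros(12, dtype=bool); pt[[2, 7]] = True
print(grouped_cross_attention(attn, pt, np.repeat(np.arange(3), 10)))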

Results.

We compute attention scores over all samples of the validation set and plot them as probability density functions in Fig. 4. PT (red) and CT (green) represent the person and caption tokens respectively. We observe that: (i) the model relies on CLIP features to predict captions (seen in the overall high attention scores of 0.5-0.7); (ii) when predicting person tokens (PT) of the identity-aware captions, the model looks at face features more (0.1-0.6) than when predicting caption tokens (0-0.4); and (iii) while action features are useful for captioning, they are less useful for predicting person id labels. This is expected as action recognition is an identity-agnostic task.

Figure 5. Captionsets with blanks and their blank-to-face cross-attention maps. For each captionset, three attention maps are shown (left to right): attention over individual face detections, over face clusters split by video index, and over face clusters.
Captionset 1: Someone watches the aliens draw closer. sits back in the doorway clutching a radio. watches from his position several yards away. squeezes the detonator the bus blows apart.
Captionset 2: and killed their first witch. They advance cautiously. Suddenly is thrown to the ground with a jolt. whips around a weapon poised to find holding her wand to neck. begins to put the gun on the ground.
Captionset 3: pulls her phone from her bag and answers. frowns uncertainly. leans on a wall and slips. lowers his phone and folds it shut. The next morning two women stroll across the street in front of apartment building.
Captionset 4: scrutinizes his earnest face. His eyes gleaming in the dim light. abruptly gets to his feet and heads for the door now. talks on his cell as steps into the daylight silhouetted against the sunny day. faces the door frame and leans his head against it now. In a hotel suite a woman applies makeup to .
Captionset 5: turns and spots the brown chevy 4x4 parked on a short driveway. approaches the vehicle cautiously across a lawn leaning over to get a view of its occupant. The passenger side window is lowered. puts both hands on the sill and leans in with an inquisitive frown., the asian man who in town sits with one hand clamped to the steering wheel rocking nervously and staring numbly ahead.

A.2 Attention Patterns in FITB

For the FITB task, we analyze how the person id predictions attend to face features from the decoder memory. For a videoset $\mathcal{N}=\{V_i\}_{i=1}^{N}$ and its corresponding captionset with blanks $\hat{\mathcal{C}}$, we obtain a cross-attention map $\alpha$ of size $|\mathcal{B}| \times F$, where $|\mathcal{B}|$ is the number of blanks in the captionset and $F$ is the number of face detections across the videoset. Each row of this matrix is normalized to sum to 1.
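A minimal sketch of this extraction (shapes and names are assumptions rather than the actual code) could look as follows:

import numpy as np

def blank_to_face_attention(attn, is_blank_token):
    # attn:           (num_heads, L, F) cross-attention from generated tokens to the F face tokens
    # is_blank_token: (L,) bool, True where the decoder predicts a person id label (a blank)
    alpha = attn.mean(axis=0)[is_blank_token]        # keep blank rows -> (|B|, F)
    return alpha / alpha.sum(axis=1, keepdims=True)  # normalize each row to sum to 1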

The attention scores and captionsets with blanks are presented in Fig. 5. In the next paragraphs, we analyze the three types (columns) of the presented scores.

Cross-attention scores for face detections.

In the left column of Fig. 5, we visualize the attention scores directly for each face detection. In the plot, the x-axis spans time across the different videos. Our model tends to show a diagonal pattern, indicating that person id label predictions mostly look at faces in the same video (facilitated through the $\mathbf{E}^{\text{vid}}$ embeddings). However, as seen in captionset 5 (left, row 1), the model may also attend to face detections of the same person in other videos. This highlights that being able to attend to faces across videos is useful (compared to [29], which only looks at faces within the same video).
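One plausible reading of the $\mathbf{E}^{\text{vid}}$ embeddings, sketched below purely for illustration (the module name, dimensions, and placement in the real model are assumptions), is a learned per-video index embedding added to every face token of that video:

import torch
import torch.nn as nn

class VideoIndexEmbedding(nn.Module):
    def __init__(self, num_videos=5, dim=512):
        super().__init__()
        # one learned embedding per position in the videoset (assumed N=5 videos)
        self.embed = nn.Embedding(num_videos, dim)

    def forward(self, face_feats, video_index):
        # face_feats:  (F, dim) face features for all detections in the videoset
        # video_index: (F,) long tensor, index of the video each face comes from
        return face_feats + self.embed(video_index)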

Cross-attention scores for face clusters grouped by video index.

In the middle column of Fig. 5, we group the $F$ face detections into clusters, but split them by video index within the videoset. For example, in captionset 1, faces of cluster 1 appear across videos 1, 2, and 4 (C1/V1, C1/V2, C1/V4). This allows us to explain some of the predictions made by our model.

Please note that the face cluster index and the person id labels need not match numerically. That is, cluster 2 could be assigned the label P1 and cluster 1 the label P2. These changes are acceptable as we only consider person id labels within a local videoset.

In captionset 3, we see that cluster 2 corresponds to the prediction P1 (first two rows) and cluster 4 (C4/V3) corresponds to person id label P2 (bottom two rows). In the last row of captionset 3, our model correctly predicts P2 for video 4 while looking at cluster 4 in video 3 (C4/V3). Previous work [29] is unable to use such cross-video information.

Cross-attention scores for clusters.

In the right column of Fig. 5, we show attention scores grouped directly by cluster ids. Here, the original attention map of size $|\mathcal{B}| \times F$ is grouped into $|\mathcal{B}| \times |\mathcal{G}|$, where $|\mathcal{G}|$ is the number of face clusters obtained by running DBSCAN on the $F$ face detections.
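The grouping itself is straightforward; a small sketch (assumed names and shapes) is:

import numpy as np

def group_attention_by_cluster(alpha_bf, cluster_ids):
    # alpha_bf:    (|B|, F) row-normalized blank-to-face attention
    # cluster_ids: (F,) integer cluster label per face detection (e.g. from DBSCAN)
    clusters = np.unique(cluster_ids)
    # sum the attention mass over all detections belonging to the same cluster
    return np.stack([alpha_bf[:, cluster_ids == c].sum(axis=1) for c in clusters], axis=1)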

Captionset 2 is an example with multiple blanks and 4 characters. We observe that some confusion in the attention scores leads to errors in the predicted person id labels. Captionset 4 also has 6 blanks, now with 3 characters. In its last row, while the model wrongly predicts P1, it does attend to cluster 3 (corresponding to P3) correctly. Captionsets 1 and 2 are examples of perfect attention scores and clusters: P1 and C1, and P2 and C2, go together strongly in these examples.

Impact of number of clusters on FITB.

Fig. 6 shows FITB class-accuracy when varying the DBSCAN epsilon parameter. These results indicate the importance of clustering across videos and of choosing an appropriate number of clusters. Qualitatively, we adopt 0.75 as it is unlikely to merge characters incorrectly.
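For reference, the sweep can be reproduced in spirit with scikit-learn's DBSCAN; the distance metric, min_samples setting, and the use of L2-normalized face embeddings are assumptions of this sketch, not necessarily the exact configuration used in the paper.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(face_embeddings, eps):
    # face_embeddings: (F, D) face descriptors (e.g. ArcFace), assumed L2-normalized
    # eps: DBSCAN neighborhood radius; larger values merge more detections into one cluster
    return DBSCAN(eps=eps, min_samples=1, metric="cosine").fit_predict(face_embeddings)

# sweep epsilon and inspect how the number of clusters changes
embs = np.random.randn(40, 512)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
for eps in (0.5, 0.65, 0.75, 0.9):
    labels = cluster_faces(embs, eps)
    print(eps, "->", len(set(labels)), "clusters")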


Appendix B Qualitative Results

iSPICE validation examples.

To validate our new metric, we propose an experiment that measures the similarity between captions when identity names are added, removed, or replaced (Sec. 4 of the main paper). While the quantitative results favor iSPICE, as seen in Tab. 1 of the main paper, we illustrate the metric computation with examples in Fig. 7. We observe that small differences in identity names are captured correctly by iSPICE, due to its focus on tuples containing identities, while other metrics do not show this sensitivity.
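To make the idea concrete, the sketch below shows a toy version of the identity-tuple matching that iSPICE relies on. It is not the actual implementation (which operates on SPICE scene-graph tuples); the tuple format and the "P" id prefix are assumptions for illustration.

def identity_tuple_f1(cand_tuples, ref_tuples, id_prefix="P"):
    # keep only tuples that mention a person id token such as P1, P2, ...
    def keep_id(tuples):
        return {t for t in tuples if any(str(x).startswith(id_prefix) for x in t)}

    cand, ref = keep_id(cand_tuples), keep_id(ref_tuples)
    if not cand or not ref:
        return 0.0
    tp = len(cand & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(cand), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# swapping identities flips the score even though the rest of the caption is unchanged
ref = {("P1", "give", "P2")}
print(identity_tuple_f1({("P1", "give", "P2")}, ref))  # 1.0
print(identity_tuple_f1({("P2", "give", "P1")}, ref))  # 0.0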

FITB examples.

While Fig. 5 clearly shows the importance of cross-attention over detected faces and computed clusters, the challenging visual scenarios are not evident there. We pick two examples (captionsets 3 and 4) from Fig. 5 and pair them with one frame from each video of the videoset. Fig. 8 shows the challenging nature of the videos: characters are often not looking at the camera (example 1, videos 1 and 3), the scene is dark, or the face may not even be visible (example 1, video 4 or example 2, video 3). MICap leverages the ability to look at faces and clusters across videos to improve results on the FITB task.

Id-aware captioning examples.

Fig. 9 shows two examples where our model does relatively well, while Fig. 10 shows two difficult examples where our model makes mistakes.

In the left column of Fig. 9, we see that the model rightly identifies P1 as the male character and P2 as the female character. The last caption is quite interesting: while the GT says that P1 gives P2 a bowl, our model predicts that P2 gives a sad smile, which is not wrong. This also illustrates some of the challenges of evaluating captioning. In the right column of Fig. 9, the predicted caption uses P2 to refer to the man and is consistent across videos 3, 4, and 5 of the videoset.

In the visually complex example of Fig. 10 (left), our model assigns P1 to all blanks. Similarly, in the multi-character example of Fig. 10 (right), we observe some confusion between characters. Nevertheless, P2, the man on the left in video 3, is correctly identified for the first 3 videos. The model is also able to predict that they are on a plane (caption for video 2). Overall, these examples illustrate the challenges of id-aware captioning. They also highlight the need, as future work, to evaluate visual grounding of the identities beyond captioning performance.

Appendix C Limitation and Future Work

One limitation of our work, inherited from the task definition in LSMDC, is restricting videosets to local groups of 5 videos. In the future, we would like to extend this to larger videosets, perhaps spanning the entire movie. However, the approach will need to be modified to operate on full movies as: (i) providing features of all movie frames as decoder memory creates a huge number of embeddings; (ii) face clustering across the entire movie could be error-prone; and (iii) auto-regressively generating one caption at a time for hundreds of clips seems challenging, as the model needs to be cognizant of all previously generated captions. We believe that a hierarchical model that builds from shots to scenes to the full movie may be more appropriate here.

Second, the FITB and full captioning tasks do not learn at the same pace, and choosing a single best checkpoint for both may be difficult. We posit that a user may choose two checkpoints, one for each task. Furthermore, we observe that by weighting the FITB and full captioning losses appropriately, additional performance improvements can be achieved for one task at the cost of the other.

We have also not considered using external knowledge, pre-trained large language models (LLMs), or vision-language models (VLMs) built for captioning. We believe that it is interesting to learn what can be achieved by training on LSMDC alone. As seen in multiple examples throughout Appendix B, MICap performs quite well in these challenging scenarios.
