ReGaDa

Video-adverb retrieval with compositional adverb-action embeddings

Thomas Hummel1 Otniel-Bogdan Mercea1 A. Sophia Koepke1 Zeynep Akata1,2
1University of Tübingen 2Max Planck Institute for Intelligent Systems
BMVC 2023 (oral)
Paper | Code

Abstract

Retrieving adverbs that describe an action in a video is a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embeddings in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior work on the generalisation task of retrieving adverbs from videos for unseen adverb-action compositions.
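The abstract above mentions two core ingredients: a residual gating mechanism that composes an adverb embedding with an action embedding, and a training objective built around triplet losses. The minimal numpy sketch below illustrates one plausible form of these ideas; the function names, the exact gating formulation (sigmoid gate over a concatenated adverb-action input), and the weight matrices `W_g`/`W_r` are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

def residual_gate_compose(action_emb, adverb_emb, W_g, W_r):
    """Hypothetical residual gating: the adverb modifies the action
    embedding through a gated residual update (a sketch, not the
    paper's exact formulation)."""
    joint = np.concatenate([action_emb, adverb_emb])
    gate = 1.0 / (1.0 + np.exp(-W_g @ joint))   # sigmoid gate in (0, 1)
    residual = np.tanh(W_r @ joint)             # candidate update
    return action_emb + gate * residual         # gated residual composition

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss: pull the video embedding
    towards the matching composition, push it from a mismatched one."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy example: one video, one action, a matching and a mismatched adverb.
action = rng.normal(size=d)
adv_pos = rng.normal(size=d)   # e.g. "slowly" (matching)
adv_neg = rng.normal(size=d)   # e.g. "quickly" (mismatched)
W_g = rng.normal(size=(d, 2 * d))
W_r = rng.normal(size=(d, 2 * d))

video_emb = rng.normal(size=d)
pos = residual_gate_compose(action, adv_pos, W_g, W_r)
neg = residual_gate_compose(action, adv_neg, W_g, W_r)
loss = triplet_loss(video_emb, pos, neg)
print(pos.shape, loss)
```

At retrieval time, the adverb whose composed embedding lies closest to the video embedding would be returned; the regression target mentioned in the abstract is omitted here for brevity.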

Key contributions

- A video-to-adverb (and adverb-to-video) retrieval framework that aligns video embeddings with compositional adverb-action text embeddings in a joint embedding space.
- A residual gating mechanism for learning the compositional adverb-action text embedding, trained with a novel objective combining triplet losses and a regression target.
- State-of-the-art performance on five recent video-adverb retrieval benchmarks.
- New dataset splits on subsets of MSR-VTT Adverbs and ActivityNet Adverbs for benchmarking retrieval of unseen adverb-action compositions, where the framework outperforms all prior work.

Qualitative examples


Ground truth: "fold quickly"   Ours: "fold slowly"   \(AC_{reg}\): "fold quickly"

Citation

If you want to cite our work, please use:

@InProceedings{hummel2023BMVC,
  author    = {Thomas Hummel and Otniel-Bogdan Mercea and A. Sophia Koepke and Zeynep Akata},
  title     = {Video-adverb retrieval with compositional adverb-action embeddings},
  booktitle = {British Machine Vision Conference (BMVC)},
  year      = {2023},
}