Video Caption Generation

Video captioning is a challenging task that requires modelling objects, their temporal dynamics, and their interactions in order to generate a textual description. Current models often fail to capture these objects and interactions correctly, owing to a lack of knowledge about them. In this paper, we propose approaches to supply this knowledge from knowledge bases such as WordNet and ConceptNet. We propose general encoder and decoder modules that can be used on top of any architecture to inject knowledge. Leveraging advances in attention architectures, we develop a knowledge selection mechanism for these modules. We demonstrate the efficacy of our model through extensive experiments on two benchmark datasets, MSVD and MSR-VTT. The proposed model shows better semantic consistency and improves significantly over the baseline. Our approach not only helps in object modelling but also further improves action prediction.
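The abstract does not give implementation details, but the knowledge selection mechanism it describes is an attention step: the decoder state attends over embeddings of retrieved knowledge entries (e.g., ConceptNet or WordNet neighbours of detected objects) and fuses them into a single context vector. A minimal sketch, assuming the knowledge entries have already been retrieved and embedded as vectors (all names here are illustrative, not the paper's API):

```python
import numpy as np

def knowledge_attention(query, knowledge, temperature=1.0):
    """Select and fuse knowledge via scaled dot-product attention.

    query:     (d,)   current decoder hidden state (assumed embedding)
    knowledge: (k, d) embeddings of k retrieved knowledge-base entries
    Returns (weights, fused): attention weights over the k entries and
    their weighted combination, which a decoder could consume as extra
    context at each generation step.
    """
    d = query.shape[0]
    scores = knowledge @ query / (np.sqrt(d) * temperature)
    scores -= scores.max()                       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = weights @ knowledge                  # (d,) convex combination
    return weights, fused

# Toy usage: 4 hypothetical knowledge entries, 8-dim embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=8)
kb = rng.normal(size=(4, 8))
w, ctx = knowledge_attention(q, kb)
```

Because the weights form a softmax distribution, entries irrelevant to the current decoding step receive near-zero weight, which is what lets such a module sit "on top of" an existing encoder–decoder without changing its architecture.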

Ashrya Agrawal
Data Science Intern

My research interests lie primarily in algorithmic fairness, generalization, and causality.