Video captioning is the challenging task of modelling the objects in a video, their temporal dynamics, and their interactions in order to generate a textual description. Current models often fail to capture these objects and their interactions correctly because they lack knowledge about them.