Matteo Stefanini

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020)
June 2020


Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M², a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at the decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the Karpathy test split and on the online test server. We also assess its performance when describing objects unseen in the training set.
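The two ideas named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes, beyond what the abstract states, that the "learned a priori knowledge" takes the form of learned memory rows appended to the attention keys and values, and that the "mesh-like connectivity" is a sigmoid-gated sum of cross-attentions over every encoder layer. All names, shapes, and the random weights are hypothetical, and multi-head splitting, projections in the cross-attention, and normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_augmented_attention(X, Wq, Wk, Wv, Mk, Mv):
    """Single-head self-attention over image regions with memory slots (assumed form).

    X: (n_regions, d) region features; Wq, Wk, Wv: (d, d) projections;
    Mk, Mv: (n_slots, d) learned rows appended to the keys and values.
    """
    Q = X @ Wq
    K = np.concatenate([X @ Wk, Mk], axis=0)  # (n_regions + n_slots, d)
    V = np.concatenate([X @ Wv, Mv], axis=0)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V       # (n_regions, d)

def cross_attention(Y, X):
    # Plain scaled dot-product attention; learned projections left out for brevity.
    return softmax(Y @ X.T / np.sqrt(Y.shape[-1]), axis=-1) @ X

def meshed_cross_attention(Y, encoder_outputs, gate_W, gate_b):
    """Gated sum of cross-attentions over all encoder layers (assumed form).

    Y: (n_words, d); encoder_outputs: list of (n_regions, d);
    gate_W: list of (2d, d); gate_b: list of (d,).
    """
    total = np.zeros_like(Y)
    for X_i, W_i, b_i in zip(encoder_outputs, gate_W, gate_b):
        C_i = cross_attention(Y, X_i)
        # The gate sees both the word features and the attended features.
        alpha_i = sigmoid(np.concatenate([Y, C_i], axis=-1) @ W_i + b_i)
        total += alpha_i * C_i
    return total

# Toy usage with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
n_regions, n_slots, n_words, d, n_layers = 5, 3, 4, 8, 3
X = rng.normal(size=(n_regions, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Mk, Mv = rng.normal(size=(n_slots, d)), rng.normal(size=(n_slots, d))
out = memory_augmented_attention(X, Wq, Wk, Wv, Mk, Mv)

Y = rng.normal(size=(n_words, d))
layers = [rng.normal(size=(n_regions, d)) for _ in range(n_layers)]
gate_W = [rng.normal(size=(2 * d, d)) for _ in range(n_layers)]
gate_b = [np.zeros(d) for _ in range(n_layers)]
mesh = meshed_cross_attention(Y, layers, gate_W, gate_b)
print(out.shape, mesh.shape)
```

In this sketch the memory rows do not depend on the input image, so they can only carry information learned during training, while the per-layer gates let each generated word weight low-level and high-level encoder features differently.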

Type: Conference Paper

Publication: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020)


Full Paper: link pdf

Code: link github


Please cite with the following BibTeX:

  title={M$^2$: Meshed-Memory Transformer for Image Captioning},
  author={Cornia, Marcella and Stefanini, Matteo and Baraldi, Lorenzo and Cucchiara, Rita},
  journal={arXiv preprint arXiv:1912.08226},