Manu Gaur

Hi, I'm a researcher at CVIT, IIIT-H working with Dr. Makarand Tapaswi and Dr. Yuki Asano. I recently graduated from Delhi Technological University with a major in Applied Physics. I am originally from New Delhi, India.

In my final semester, I was fortunate to intern as an Applied Scientist with Amazon's International Machine Learning team, where I worked on GNNs and self-supervised learning for modeling aesthetic compatibility among apparel items. During my undergraduate studies, I also worked on label-efficient learning for Autism Spectrum Disorder classification at the University of Technology Sydney.

Email  /  CV  /  Twitter  /  Google Scholar  /  Github

profile photo

Research

I am interested in self-supervised learning, multimodal models, generative modelling and reinforcement learning. My long-term goal is to advance perception and reasoning in next-generation AI systems with limited human supervision.

Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation
Manu Gaur, Darshan Singh, Makarand Tapaswi
ECCV EVAL-FoMo Workshop, 2024

Given a highly similar image pair, it is easier for an MLLM to identify fine-grained visual differences during VQA evaluation than to independently detect and describe them. Building on such image pairs, we introduce the D3 benchmark. We use self-retrieval within D3 for whitebox evaluation of MLLMs, revealing that current models struggle to independently discern fine-grained visual differences, with open-source models failing to outperform random guessing.
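
To make the self-retrieval idea concrete, here is a minimal sketch of the evaluation logic: a description generated for one image of a near-identical pair should retrieve that image over its counterpart. This is illustrative only, assuming CLIP (via Hugging Face transformers) as the retrieval scorer rather than the paper's exact protocol.

```python
# Minimal self-retrieval check (illustrative, not the paper's exact protocol).
# A caption written for image_a should retrieve image_a over its look-alike image_b.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def self_retrieval_success(caption: str, image_a: Image.Image, image_b: Image.Image) -> bool:
    """Return True if the caption scores higher against image_a than against image_b."""
    inputs = processor(text=[caption], images=[image_a, image_b],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, 2): similarity of the caption to each image.
    sims = out.logits_per_text[0]
    return bool(sims[0] > sims[1])
```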

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur, Darshan Singh, Makarand Tapaswi
TMLR, 2024

We systematically improve captioning systems on all fronts: data, training, and evaluation. We introduce Visual Caption Boosting to make image captioning datasets fine-grained and design a training recipe for self-retrieval (SR) fine-tuning with REINFORCE. Together, these enable captioners to generate more fine-grained descriptions while preserving caption faithfulness. We also introduce TrueMatch, a benchmark that uses SR to evaluate a captioner's ability to capture fine-grained visual differences. With our training recipe, ClipCap (200M parameters) outperforms state-of-the-art open-source MLLMs on TrueMatch while also achieving state-of-the-art results on ImageCoDe.
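
For intuition, below is a minimal sketch of a REINFORCE update driven by a self-retrieval reward. The reward interface (e.g., a CLIP-based score of how well each sampled caption retrieves its own image from the batch) and the batch-mean baseline are assumptions for illustration, not the exact recipe from the paper.

```python
# Illustrative REINFORCE step with a self-retrieval reward
# (hypothetical captioner / reward interfaces; not the paper's exact training recipe).
import torch

def reinforce_sr_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    log_probs: (B,) summed log-probabilities of sampled captions under the captioner.
    rewards:   (B,) self-retrieval rewards, e.g., how well each caption retrieves
               its own image from the batch.
    Uses the batch-mean reward as a simple baseline to reduce gradient variance.
    """
    baseline = rewards.mean()
    advantage = rewards - baseline
    # REINFORCE: maximize E[advantage * log p(caption)] -> minimize its negative.
    return -(advantage.detach() * log_probs).mean()

# Usage with dummy tensors standing in for a real captioner's outputs:
log_probs = torch.randn(8, requires_grad=True)
rewards = torch.rand(8)
loss = reinforce_sr_loss(log_probs, rewards)
loss.backward()
```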