Manu Gaur
Hi, I'm a researcher at CVIT, IIIT-H working with Dr. Makarand Tapaswi and Dr. Yuki Asano. I recently graduated from Delhi Technological University with a major in Applied Physics. I am originally from New Delhi, India.
In my final semester, I was fortunate to intern as an Applied Scientist with Amazon's International Machine Learning team, where I worked on GNNs and self-supervised learning for modeling aesthetic compatibility among apparel items. During my undergraduate studies, I also worked on label-efficient learning for Autism Spectrum classification at the University of Technology, Sydney.
Email /
CV /
Twitter /
Google Scholar /
Github
Research
I am interested in self-supervised learning, multimodal models, generative modeling, and reinforcement learning. My long-term goal is to advance perception and reasoning in next-generation AI systems with limited human supervision.
Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation
Manu Gaur,
Darshan Singh,
Makarand Tapaswi
ECCV EVAL-FoMo Workshop, 2024
Given a highly similar image pair, it is easier for an MLLM to identify fine-grained visual differences during VQA evaluation than to independently detect and describe such differences. Using such image pairs, we introduce the D3 benchmark. We use self-retrieval within D3 for white-box evaluation of MLLMs, revealing that current models struggle to independently discern fine-grained visual differences, with open-source models failing to outperform random guessing.
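For context, here is a minimal sketch of the self-retrieval check described above, assuming a CLIP scorer from the transformers library and a hypothetical MLLM call; it illustrates the idea rather than the benchmark's actual evaluation code.

```python
# Minimal sketch: does a generated caption retrieve its source image over the
# highly similar distractor? CLIP is used here only as an example scorer.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def self_retrieval_correct(caption, target_img, distractor_img):
    # Score the caption against both images; index 0 is the source image.
    inputs = processor(text=[caption], images=[target_img, distractor_img],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text[0]  # shape: (2,)
    return sims.argmax().item() == 0

# Usage (hypothetical MLLM call; images are PIL images):
# caption = mllm_describe(target_img)
# hit = self_retrieval_correct(caption, target_img, distractor_img)
```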
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur,
Darshan Singh,
Makarand Tapaswi
TMLR, 2024
We systematically improve captioning systems on all fronts: data, training, and evaluation. We introduce Visual Caption Boosting to make image captioning datasets fine-grained and design a training recipe for self-retrieval (SR) fine-tuning with REINFORCE. Together, they enable captioners to generate more fine-grained descriptions while preserving caption faithfulness. We also introduce TrueMatch, a benchmark that uses SR to evaluate a captioner's ability to capture fine-grained visual differences. With our training recipe, ClipCap (200M) outperforms state-of-the-art open-source MLLMs on TrueMatch while also achieving state-of-the-art results on Image-CoDe.
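Below is a minimal sketch of what self-retrieval fine-tuning with REINFORCE can look like; `captioner.sample` and `clip_scorer` are hypothetical interfaces standing in for the captioner and retrieval model, not the paper's actual code.

```python
# Minimal sketch of SR fine-tuning with REINFORCE: reward a sampled caption
# for retrieving its own image from a batch of similar images.
import torch

def sr_reinforce_step(captioner, clip_scorer, images, optimizer):
    # Sample a caption per image and keep the summed token log-probabilities.
    captions, log_probs = captioner.sample(images)
    with torch.no_grad():
        sims = clip_scorer(captions, images)  # (B, B) caption-to-image similarity
        targets = torch.arange(len(images), device=sims.device)
        reward = (sims.argmax(dim=1) == targets).float()  # 1 if self-retrieval succeeds
        baseline = reward.mean()                          # batch-mean baseline reduces variance
    loss = -((reward - baseline) * log_probs).mean()      # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```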