Manu Gaur

Hey there. I am Manu. I'm a self-taught researcher. I've now spent two wonderful years working with Dr. Makarand Tapaswi at the Centre for Visual Information Technology, IIIT Hyderabad. Currently, I am working with Dr. Yuki Asano on improving vision-language alignment in current models. Before this, I was a student researcher on Amazon's International Machine Learning team, working on GNNs and self-supervised learning.

In a previous life, I graduated from Delhi Technological University. Although I majored in Applied Physics, I became interested in ML during my junior year, spending most of my time outside university watching lectures, reading blogs, engaging in online forums, and training models in Colab notebooks. Shortly after graduation, I came to IIIT Hyderabad to learn how to do research from first principles.

Outside of ML, I enjoy physics, history, football, video games, and occasional games of chess. I also love to travel :)

Fall 2025: I have begun my Master's at the CMU Robotics Institute, where I am working with Dr. Deva Ramanan on vision-language models.

Email  /  CV  /  Twitter  /  Google Scholar  /  Github


Research


Broadly, I work on self-supervised learning, vision-language models, generative modeling, and reinforcement learning.

Infants develop visual understanding and common sense reasoning by simply observing and interacting with the world around them. While current systems show remarkable multimodal understanding, progressively squeezing more knowledge into them through supervised learning makes them brittle. To achieve generalized intelligence, I believe these systems need to independently learn from first principles, either by modeling the underlying structure of data or through trial and error.

Hence, I am interested in self-supervised and reinforcement learning for improving visual understanding, multimodal reasoning, and knowledge acquisition in current systems.

Publications


Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation
ECCV EVAL-FoMo Workshop, 2024

TL;DR: It is easier for MLLMs to select an answer from multiple choices during VQA than to generate it independently.

We evaluate MLLMs' visual capabilities through self-retrieval within highly similar image pairs, revealing that current models struggle to identify fine-grained visual differences, with open-source models failing to outperform a random guess.
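
To make the self-retrieval idea concrete, here is a rough sketch of how a single evaluation instance could look, assuming a CLIP-style scorer and a hypothetical captioner() call for the MLLM; the paper's actual benchmark setup may differ.

# Hedged sketch of self-retrieval over one pair of highly similar images.
# The CLIP scorer and captioner() are illustrative choices, not the paper's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

scorer = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def self_retrieval_correct(caption: str, target: Image.Image, distractor: Image.Image) -> bool:
    """True if the generated caption retrieves its own image over the look-alike distractor."""
    inputs = processor(text=[caption], images=[target, distractor],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_text: (1, 2) similarity of the caption to each of the two images
        sims = scorer(**inputs).logits_per_text[0]
    return sims.argmax().item() == 0  # index 0 is the caption's own image

# Usage (captioner(...) is a hypothetical MLLM captioning call):
# accuracy = mean(self_retrieval_correct(captioner(t), t, d) for t, d in image_pairs)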

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
TMLR, 2024

TL;DR: Enhancing visual understanding in MLLMs with a self-supervised verifiable reward.

A findings-rich paper that systematically improves captioning systems across all fronts: data, training, and evaluation. We design (1) a post-training recipe for self-retrieval fine-tuning with REINFORCE, and (2) a synthetic framework for visually enriching captioning datasets. Together, they enable captioners to generate fine-grained, succinct descriptions while reducing hallucinations. Using our training recipe, ClipCap, a 200M-parameter simplification of modern MLLMs, outperforms state-of-the-art open-source MLLMs on fine-grained visual discrimination.
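
As a loose illustration of self-retrieval fine-tuning with REINFORCE, here is a minimal sketch of one update step; the captioner interface, reward function, and mean baseline below are assumptions, not the paper's exact recipe.

# Hedged sketch: one REINFORCE step with a self-retrieval reward.
# captioner.sample() and retrieval_reward() are hypothetical placeholders.
import torch

def reinforce_step(captioner, retrieval_reward, optimizer, images, distractor_bags):
    # Sample captions and their summed token log-probs from the current policy (the captioner).
    captions, log_probs = captioner.sample(images)  # log_probs: shape (B,)
    # Reward = 1 if the caption retrieves its own image from a bag of similar images, else 0.
    rewards = torch.tensor(
        [retrieval_reward(c, img, bag)
         for c, img, bag in zip(captions, images, distractor_bags)],
        dtype=torch.float32)
    # Subtract a simple mean baseline to reduce variance.
    advantages = rewards - rewards.mean()
    loss = -(advantages * log_probs).mean()  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()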