Manu Gaur
Hey there. I'm Manu, a Machine Learning researcher.
Over the past few years, I have been lucky to work with some wonderful people. I am currently working with Prof. Saining Xie on continual visual learning and world models. I am also collaborating with Prof. Yuki Asano on improving representation alignment between vision and language.
Before this, I spent a year at IIIT Hyderabad as Prof. Makarand Tapaswi's first research assistant, which was pivotal in shaping my research career.
Before that, I was a student researcher at Amazon's International Machine Learning group, where I worked on self-supervised visual representations.
In a previous life, I graduated from Delhi Technological University. Although I majored in Applied Physics, I became interested in ML during my junior year, spending most of my time outside university watching lectures, reading blogs, engaging in online forums, and training models in Colab notebooks.
Outside of ML, I enjoy physics, history, football, video games, and occasional games of chess. I also love to travel :)
Fall 2025: I started my Master's at the CMU Robotics Institute, advised by Prof. Deva Ramanan.
Email /
CV /
Twitter /
Google Scholar /
Github
News
2025
- November Visiting Saining's lab at NYU Courant for the winter. If you're in NYC, let's hang out!
- August Moved to the US and started my Master's at CMU Robotics, advised by Prof. Deva Ramanan!!
2023
- September Joining IIIT Hyderabad as a Research Assistant to work with Makarand.
- July Graduated from Delhi Technological University with a major in Applied Physics. ML arc starts now wuhuuu!!!
- February Joining Amazon's International Machine Learning group as a Student Researcher.
Research
Broadly, I work on self-supervised learning, vision-language models, generative modelling, and reinforcement learning.
Infants develop visual understanding and common sense reasoning by simply observing and interacting with the world around them.
While current systems show remarkable multimodal understanding, progressively squeezing more knowledge into them through supervised learning makes them brittle.
To achieve generalized intelligence, I believe these systems need to independently learn from first principles, either by modeling the underlying structure of data or through trial and error.
Hence, I am interested in self-supervised and reinforcement learning for improving visual understanding, multimodal reasoning, and knowledge acquisition in current systems.
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
TMLR, 2024 (Top 10%)
TL;DR: Enhancing visual understanding in MLLMs with a self-supervised verifiable reward.
A findings-rich paper that systematically improves captioning systems on all fronts: data, training, and evaluation. We design (1) a post-training recipe for self-retrieval fine-tuning with REINFORCE, and (2) a synthetic framework for enriching the visual detail in captioning datasets. Together, they enable captioners to generate fine-grained, succinct descriptions while reducing hallucinations. Using our training recipe, ClipCap, a 200M-parameter simplification of modern MLLMs, outperforms state-of-the-art open-source MLLMs on fine-grained visual discrimination.
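For intuition, here is a minimal sketch of what self-retrieval fine-tuning with REINFORCE can look like, assuming a captioner that can sample captions with their log-probabilities and a frozen CLIP-style retriever; the interfaces (captioner.sample, retriever.encode_image, retriever.encode_text) are placeholders, not the paper's actual code.

import torch
import torch.nn.functional as F

def self_retrieval_reinforce_step(captioner, retriever, images, optimizer):
    """One REINFORCE update using retrieval success as the reward.

    captioner.sample and retriever.encode_* are hypothetical interfaces
    standing in for any captioning model and any frozen text-image
    retriever (e.g. a CLIP-style model).
    """
    # Sample one caption per image and keep the sequence log-probabilities.
    captions, log_probs = captioner.sample(images)            # log_probs: (B,)

    # Embed images and sampled captions with the frozen retriever.
    with torch.no_grad():
        img_emb = F.normalize(retriever.encode_image(images), dim=-1)    # (B, D)
        txt_emb = F.normalize(retriever.encode_text(captions), dim=-1)   # (B, D)

    # Self-retrieval reward: 1 if a caption ranks its own image first
    # among all images in the batch, 0 otherwise.
    sims = txt_emb @ img_emb.T                                 # (B, B)
    targets = torch.arange(images.size(0), device=sims.device)
    reward = (sims.argmax(dim=-1) == targets).float()          # (B,)

    # REINFORCE with a simple batch-mean baseline: raise the likelihood
    # of captions that successfully retrieved their image.
    advantage = reward - reward.mean()
    loss = -(advantage * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()

Because retrieval success is checked directly against the images in the batch, the reward is verifiable and needs no learned reward model; the captioner is rewarded only when its description carries enough detail to single out its own image.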
Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation
ECCV EVAL-FoMo Workshop, 2024
TL;DR: It is easier for MLLMs to select an answer from multiple choices during VQA than to generate it independently.
We evaluate MLLMs' visual capabilities through self-retrieval within highly similar image pairs, revealing that current models struggle to identify fine-grained visual differences, with open-source models failing to outperform random guessing.
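As a rough illustration of the evaluation protocol, here is a hedged sketch of self-retrieval within a similar image pair; mllm.describe and retriever.encode_* are placeholder interfaces, not the benchmark's actual code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def pairwise_self_retrieval(mllm, retriever, target_image, distractor_image):
    """Does the model's own description pick out the right image?

    mllm.describe and retriever.encode_* are hypothetical interfaces;
    the distractor is never shown to the MLLM, unlike multiple-choice VQA.
    """
    # The MLLM describes the target image with no answer options to lean on.
    caption = mllm.describe(target_image)

    # Score the caption against both images with a frozen retriever.
    txt_emb = F.normalize(retriever.encode_text([caption]), dim=-1)      # (1, D)
    img_emb = F.normalize(
        retriever.encode_image(torch.stack([target_image, distractor_image])),
        dim=-1,
    )                                                                    # (2, D)
    sims = (txt_emb @ img_emb.T).squeeze(0)     # similarity to [target, distractor]
    return bool(sims[0] > sims[1])              # True = caption retrieves the target

Unlike multiple-choice VQA, the distractor is never shown to the model: it has to surface the distinguishing detail in its own description for the retrieval check to succeed.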