HippoCamp: Benchmarking Contextual Agents on Personal Computers
arXiv:2604.01221v1 Announce Type: new Abstract: We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide
Authors:Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu
View PDF
Abstract:We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.
Comments: Project Page: this https URL
Subjects:
Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2604.01221 [cs.AI]
(or arXiv:2604.01221v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.01221
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Shulin Tian [view email] [v1] Wed, 1 Apr 2026 17:58:33 UTC (24,493 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
modellanguage modelbenchmark
Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
Sure you can't do agentic coding with the Gemma 4 E2B, but this model is a game-changer for people learning a new language. Imagine a few years from now that people can run this locally on their phones. They can point their camera at objects and talk about them. And this model is multi-lingual, so people can always fallback to their native language if they want. This is essentially what OpenAI demoed a few years ago. Repo: https://github.com/fikrikarim/parlor submitted by /u/ffinzy [link] [comments]

Drummer's Skyfall 31B v4.2 aka SKYFALL-31B-V4.2-UNCENSORED-OPUS-4.6-ROLEPLAYING-100000X-XTREME-VALUE
Yes, Google stole my proprietary model size (31B). Yes, I plan to tune all the Gemma 4 models. Join us, and support the mission! Thank you all for the love submitted by /u/TheLocalDrummer [link] [comments]
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Models

Drummer's Skyfall 31B v4.2 aka SKYFALL-31B-V4.2-UNCENSORED-OPUS-4.6-ROLEPLAYING-100000X-XTREME-VALUE
Yes, Google stole my proprietary model size (31B). Yes, I plan to tune all the Gemma 4 models. Join us, and support the mission! Thank you all for the love submitted by /u/TheLocalDrummer [link] [comments]


![LiteLLM: One-Function Call To ANY Large Language Model! 🤯 UNBELIEVABLE! [0a5adf] - fathomjournal.org](https://d2xsxph8kpxj0f.cloudfront.net/310419663032563854/konzwo8nGf8Z4uZsMefwMr/default-img-server-room-ecVW4zMJjpPttojVbHXZCX.webp)

Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!