
HippoCamp: Benchmarking Contextual Agents on Personal Computers

ArXiv CS.AI · by Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu · April 2, 2026


Authors: Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu


Abstract: We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

Comments: Project Page: this https URL

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2604.01221 [cs.AI]

(or arXiv:2604.01221v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.01221

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Shulin Tian
[v1] Wed, 1 Apr 2026 17:58:33 UTC (24,493 KB)
