Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
arXiv:2604.02692v1 Announce Type: new
Authors:Fuyuan Liu, Dianyu Yu, He Ren, Nayu Liu, Xiaomian Kang, Delai Qiu, Fa Zhang, Genpeng Zhen, Shengping Liu, Jiaen Liang, Wei Huang, Yining Wang, Junnan Zhu
Abstract: Accurate document parsing requires both robust content recognition and a stable parser interface. In explicit Document Layout Analysis (DLA) pipelines, downstream parsers do not consume the full detector output. Instead, they operate on a retained and serialized set of layout instances. However, on dense pages with overlapping regions and ambiguous boundaries, unstable layout hypotheses can make the retained instance set inconsistent with its parser input order, leading to severe downstream parsing errors. To address this issue, we introduce a lightweight structural refinement stage between a DETR-style detector and the parser to stabilize the parser interface. Treating raw detector outputs as a compact hypothesis pool, the proposed module performs set-level reasoning over query features, semantic cues, box geometry, and visual evidence. From a shared refined structural state, it jointly determines instance retention, refines box localization, and predicts parser input order before handoff. We further introduce retention-oriented supervision and a difficulty-aware ordering objective to better align the retained instance set and its order with the final parser input, especially on structurally complex pages. Extensive experiments on public benchmarks show that our method consistently improves page-level layout quality. When integrated into a standard end-to-end parsing pipeline, the stabilized parser interface also substantially reduces sequence mismatch, achieving a Reading Order Edit of 0.024 on OmniDocBench.
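The paper's module performs learned set-level reasoning over query features; the abstract does not give its implementation. As a minimal illustration of the interface contract it describes (take a raw hypothesis pool, decide instance retention, and emit a serialized parser input order before handoff), here is a sketch that substitutes simple geometric heuristics for the learned components. All names, thresholds, and the band-quantized reading-order rule are hypothetical and not from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2 in page coordinates

@dataclass
class LayoutInstance:
    box: Box
    score: float   # detector confidence
    label: str     # e.g. "title", "text", "table"

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def refine_layout(hypotheses: List[LayoutInstance],
                  keep_thresh: float = 0.5,
                  iou_thresh: float = 0.6) -> List[LayoutInstance]:
    """Stand-in for the refinement stage: retention + serialization.

    Retention here is greedy suppression over the hypothesis pool
    (the paper instead predicts retention from a refined structural
    state); ordering is a top-to-bottom, left-to-right serialization
    using coarse vertical bands (the paper predicts parser input
    order with a difficulty-aware objective).
    """
    kept: List[LayoutInstance] = []
    for inst in sorted(hypotheses, key=lambda h: h.score, reverse=True):
        if inst.score < keep_thresh:
            continue  # drop low-confidence hypotheses
        if all(iou(inst.box, k.box) < iou_thresh for k in kept):
            kept.append(inst)  # retain only non-overlapping instances
    # Serialize: quantize y1 into bands, then read left-to-right per band.
    kept.sort(key=lambda h: (round(h.box[1] / 24), h.box[0]))
    return kept
```

The point of the sketch is the output contract, not the heuristics: the parser downstream receives a single, stable, ordered list, so retention and ordering decisions are made jointly before handoff rather than left implicit in detector output.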
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2604.02692 [cs.CV]
(or arXiv:2604.02692v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2604.02692
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Fuyuan Liu [view email] [v1] Fri, 3 Apr 2026 03:36:36 UTC (19,087 KB)
