Research Papers research paper arxiv machine-learning deep-learning

Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning

arXivApril 3, 202610 min read0 views

arXiv:2604.01345v1 Announce Type: new Abstract: Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner' — Vikram Krishnamurthy, Luke Snow

View PDF HTML (experimental)

Abstract:Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.

Subjects:

Machine Learning (cs.LG)

Cite as: arXiv:2604.01345 [cs.LG]

(or arXiv:2604.01345v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2604.01345

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Luke Snow [view email] [v1] Wed, 1 Apr 2026 19:56:02 UTC (1,131 KB)

Original source

arXiv

https://arxiv.org/abs/2604.01345

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Research PapersLive

Probabilistic AVL Trees (p-AVL): Relaxing Deterministic Balancing

arXiv:2604.02223v1 Announce Type: new Abstract: This paper studies the empirical behaviour of the p-AVL tree, a probabilistic variant of the AVL tree in which each imbalance is repaired with probability $p$. This gives an exact continuous interpolation from $p = 0$, which recovers the BST endpoint, to $p = 1$, which recovers the standard AVL tree. Across random-order insertion experiments, we track rotations per node, total imbalance events, average depth, average height, and a global imbalance statistic $\sigma$. The main empirical result is that even small nonzero p already causes a strong structural change. The goal here is empirical rather than fully theoretical: to document the behaviour of the p-AVL family clearly and identify the main patterns.

arXiv cs.DS

1mabout 1 hour ago

Research PapersLive

A Constant-Approximation Distance Labeling Scheme under Polynomially Many Edge Failures

arXiv:2604.01829v1 Announce Type: new Abstract: A fault-tolerant distance labeling scheme assigns a label to each vertex and edge of an undirected weighted graph $G$ with $n$ vertices so that, for any edge set $F$ of size $|F| \leq f$, one can approximate the distance between $p$ and $q$ in $G \setminus F$ by reading only the labels of $F \cup \{p,q\}$. For any $k$, we present a deterministic polynomial-time scheme with $O(k^{4})$ approximation and $\tilde{O}(f^{4}n^{1/k})$ label size. This is the first scheme to achieve a constant approximation while handling any number of edge faults $f$, resolving the open problem posed by Dory and Parter [DP21]. All previous schemes provided only a linear-in-$f$ approximation [DP21, LPS25]. Our labeling scheme directly improves the state of the art in

arXiv cs.DS

1mabout 1 hour ago

Research PapersLive

Adaptive Fully Dynamic $k$-Center Clustering with (Near-)Optimal Worst-Case Guarantees

arXiv:2604.01726v1 Announce Type: new Abstract: Given a sequence of adversarial point insertions and point deletions, is it possible to simultaneously optimize the approximation ratio, update time, and recourse for a $k$-clustering problem? If so, can this be achieved with worst-case guarantees against an adaptive adversary? These questions have garnered significant attention in recent years. Prior works by Bhattacharya, Costa, Garg, Lattanzi, and Parotsidis [FOCS '24] and by Bhattacharya, Costa, and Farokhnejad [STOC '25] have taken significant steps toward this direction for the $k$-median clustering problem and its generalization, the $(k, z)$-clustering problem. In this paper, we study the $k$-center clustering problem, which is one of the most classical and well-studied $k$-clustering

arXiv cs.DS

2mabout 1 hour ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 151 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersLive

Probabilistic AVL Trees (p-AVL): Relaxing Deterministic Balancing

arXiv cs.DS

1mabout 1 hour ago

Research PapersLive

A Constant-Approximation Distance Labeling Scheme under Polynomially Many Edge Failures

arXiv cs.DS

1mabout 1 hour ago

Research PapersLive

Adaptive Fully Dynamic $k$-Center Clustering with (Near-)Optimal Worst-Case Guarantees

arXiv cs.DS

2mabout 1 hour ago

Research PapersLive

Single-Pass Streaming CSPs via Two-Tier Sampling

arXiv:2604.01575v1 Announce Type: new Abstract: We study the maximum constraint satisfaction problem, Max-CSP, in the streaming setting. Given $n$ variables, the constraints arrive sequentially in an arbitrary order, with each constraint involving only a small subset of the variables. The objective is to approximate the maximum fraction of constraints that can be satisfied by an optimal assignment in a single pass. The problem admits a trivial near-optimal solution with $O(n)$ space, so the major open problem in the literature has been the best approximation achievable when limiting the space to $o(n)$. The answer to the question above depends heavily on the CSP instance at hand. The integrality gap $\alpha$ of an LP relaxation, known as the BasicLP, plays a central role. In particular, a

arXiv cs.DS

2mabout 1 hour ago