Sunday, Aug 3: 2:00 PM - 3:50 PM
0119
Invited Paper Session
Music City Center
Room: CC-101D
Online Learning
Applied
No
Main Sponsor
Business and Economic Statistics Section
Co-Sponsors
IMS
Section on Statistical Learning and Data Science
Presentations
Reinforcement learning (RL) has achieved remarkable success across various domains; however, its applicability is often hampered by challenges in practicality and interpretability. Many real-world applications, such as those in healthcare and business, have large and/or continuous state and action spaces and demand personalized solutions. In addition, model interpretability is crucial for decision-makers, as it guides their decision-making while allowing them to incorporate domain knowledge. To bridge this gap, we propose a personalized reinforcement learning framework that integrates personalized information into the state-transition and reward-generating mechanisms. We develop an online RL algorithm for our framework. Specifically, our algorithm learns the embeddings of the personalized state-transition distribution in a Reproducing Kernel Hilbert Space (RKHS) while balancing the exploration-exploitation tradeoff. We further provide a regret bound for the algorithm and demonstrate its effectiveness in recommender systems.
Keywords
reinforcement learning
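The RKHS embedding step in the abstract above can be illustrated with a kernel-ridge estimate of the conditional mean embedding of the transition distribution. The sketch below is not the authors' algorithm: it simply appends a personalization covariate to the state-action input, uses an RBF kernel, and treats the function names, bandwidth, and regularization level as placeholder assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian (RBF) kernel matrix between the rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def cme_fit(X, lam=0.01):
    # Kernel-ridge weights for the conditional mean embedding:
    # alpha(x) = k(x, X) @ W with W = (K + n*lam*I)^{-1}
    K = rbf_kernel(X, X)
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))

def cme_predict(x, X, S_next, W):
    # Predicted mean feature of the next state under P(s' | s, a, u);
    # with identity features this is an estimate of E[s' | s, a, u].
    alpha = rbf_kernel(x[None, :], X) @ W
    return alpha @ S_next
```

An optimism-based exploration bonus of the kind the abstract alludes to would typically be built from the kernel posterior width at the queried state-action-covariate point, which reuses the same inverse matrix W.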
We consider a platform that serves (observable) agents who belong to a larger network that also includes additional agents not served by the platform. We refer to the latter group as latent agents. Associated with each agent are the agent's covariate and outcome. The platform has access to past covariates and outcomes of the observable agents, but no data on the latent agents are available to it. Crucially, the agents influence each other's outcomes through a certain influence structure. In particular, observable agents influence each other both directly and indirectly, through the influence they exert on the latent agents. The platform knows the influence structure of neither the observable nor the latent part of the network. We investigate how the platform can estimate the dependence of the observable agents' outcomes on their covariates while taking the presence of the latent agents into account. First, we show that a certain matrix succinctly captures the relationship between the outcomes and the covariates. We provide an algorithm that estimates this matrix from historical covariates and outcomes of the observable agents under a suitable approximate sparsity condition. We also establish convergence rates for the proposed estimator despite the high dimensionality, which allows more agents than observations. Second, we show that the approximate sparsity condition holds under standard conditions used in the literature; hence, our results apply to a large class of networks. Finally, we illustrate an application to a targeted advertising problem. We show that, by using the available historical data with our estimator, it is possible to obtain asymptotically optimal advertising decisions despite the presence of latent agents.
Keywords
Network analysis
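Under the approximate sparsity condition described above, the matrix-estimation step can be sketched as one l1-regularized regression per observable agent, regressing that agent's outcome on all observable covariates. The sketch below uses plain ISTA and hypothetical names; it illustrates the sparsity idea, not the paper's estimator.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam=0.02, n_iter=500):
    # ISTA for min_b (1/2n)||y - X b||^2 + lam ||b||_1
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - grad / L, lam / L)
    return b

def estimate_outcome_matrix(covariates, outcomes, lam=0.02):
    # One sparse regression per observable agent's outcome:
    # outcomes[:, j] ~ covariates @ theta_j, theta_j approximately sparse.
    return np.vstack([lasso_ista(covariates, outcomes[:, j], lam)
                      for j in range(outcomes.shape[1])])
```

The latent agents never appear in the regression: their indirect influence is absorbed into the estimated matrix, which is exactly why its approximate sparsity (rather than that of the network itself) is the operative condition.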
In offline reinforcement learning (RL), an optimal policy is learned solely from previously collected observational data. In observational data, however, actions are often confounded by unobserved variables. Instrumental variables (IVs), in the context of RL, are variables whose influence on the state variables is mediated entirely through the action. When a valid instrument is present, the confounded transition dynamics can be recovered from observational data. We study a confounded Markov decision process whose transition dynamics admit an additive nonlinear functional form. Using IVs, we derive a conditional moment restriction (CMR) through which we can identify the transition dynamics from observational data. We propose a provably efficient IV-aided Value Iteration (IVVI) algorithm based on a primal-dual reformulation of the CMR. To the best of our knowledge, this is the first provably efficient algorithm for instrument-aided offline RL.
Keywords
offline reinforcement learning
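The identification idea behind the CMR can be illustrated, in the simplest linear additive case, by classical two-stage least squares: projecting the action onto the instrument purges its correlation with the unobserved confounder, so the second-stage regression recovers the causal effect that ordinary least squares gets wrong. This is only a linear sketch of the IV principle, not the IVVI algorithm itself.

```python
import numpy as np

def two_sls(Z, A, Y):
    # Two-stage least squares: Z instruments the endogenous action A.
    # Stage 1: project A onto Z.  Stage 2: regress Y on the projection.
    A_hat = Z @ np.linalg.lstsq(Z, A, rcond=None)[0]
    return np.linalg.lstsq(A_hat, Y, rcond=None)[0]
```

In the simulation below the confounder u drives both the action and the outcome, so OLS is biased upward, while the IV estimate stays close to the true coefficient.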
Tensor completion plays a crucial role in a wide range of applications, including recommender systems and medical imaging, where observed data are often highly incomplete. While extensive prior work has addressed tensor completion under missing data, most studies assume that entries are missing at random. However, real-world data often exhibit missing-not-at-random patterns, where missingness depends on the underlying tensor values. This paper introduces a generalized tensor completion framework for noisy data with non-random missingness, in which the missing probability is modeled as a function of the underlying tensor values. Our formulation is flexible and accommodates various tensor data types, including continuous, binary, and count data. For model estimation, we develop a computationally efficient alternating gradient descent algorithm and derive non-asymptotic error bounds for the estimator at each iteration. Additionally, we propose a statistical inferential procedure to test whether missing probabilities depend on tensor values, offering a formal assessment of the missing-at-random assumption within our modeling framework. The utility and efficacy of our approach are demonstrated through comparative simulation studies and analyses of two real-world datasets.
Keywords
graphical model with covariates
multi-task learning
debiased inference
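A minimal sketch of the alternating gradient descent step for a 3-way CP model is given below, using squared loss over observed entries with a known missingness mask. The paper's MNAR treatment would, in addition, model the observation probability as a function of the entry value (for instance, weighting residuals by inverse estimated observation probabilities); all names and tuning constants here are illustrative assumptions.

```python
import numpy as np

def cp_complete(T, mask, rank=1, lr=0.5, n_iter=2000, seed=0):
    # Alternating gradient descent for 3-way CP tensor completion:
    # squared loss over the observed entries (mask == 1).  With MNAR data,
    # each residual would additionally carry an inverse estimated
    # observation-probability weight; that weight is omitted here.
    rng = np.random.default_rng(seed)
    d1, d2, d3 = T.shape
    A = 0.1 * rng.normal(size=(d1, rank))
    B = 0.1 * rng.normal(size=(d2, rank))
    C = 0.1 * rng.normal(size=(d3, rank))
    n_obs = mask.sum()

    def resid():
        # masked residual of the current CP reconstruction, averaged
        return mask * (np.einsum('ir,jr,kr->ijk', A, B, C) - T) / n_obs

    for _ in range(n_iter):
        A -= lr * np.einsum('ijk,jr,kr->ir', resid(), B, C)
        B -= lr * np.einsum('ijk,ir,kr->jr', resid(), A, C)
        C -= lr * np.einsum('ijk,ir,jr->kr', resid(), A, B)
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

Recomputing the residual before each factor update is what makes the scheme alternating rather than a single joint gradient step.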
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecification of the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM.
Keywords
experimental design
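The Bradley-Terry preference model mentioned above, whose possible misspecification motivates the doubly robust construction, reduces to a logistic comparison of reward values. A minimal sketch (illustrative names, not the paper's algorithm):

```python
import numpy as np

def bt_prob(r_chosen, r_rejected):
    # Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    return 1.0 / (1.0 + np.exp(-(r_chosen - r_rejected)))

def preference_nll(r_chosen, r_rejected):
    # Negative log-likelihood of observed preferences under the model;
    # if this parametric form is wrong, methods that rely on it alone
    # inherit the misspecification a doubly robust approach guards against.
    return float(-np.mean(np.log(bt_prob(r_chosen, r_rejected))))
```

Equal rewards yield a 50/50 preference probability, and a larger reward gap pushes the probability toward 1, which is the behavior a fitted reward model is trained to match.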