Advancements in Reinforcement Learning and Decision Making

Chair: Yumeng Wang

Monday, Aug 4, 2:00 PM - 3:50 PM
Session 4063: Contributed Papers
Music City Center, Room CC-102A

Main Sponsor

Section on Statistical Learning and Data Science

Presentations

Co-opetition in Online Seller Networks: Evidence from Etsy

Online peer-to-peer marketplaces enable sellers to interact not only with buyers but also with each other. This is particularly common in marketplaces for unique and artisanal products, where sellers often connect with and promote one another. In these marketplaces, sellers face the dual pressures of competition and cooperation, strategically balancing their own sales goals against opportunities to help and promote other sellers. This research studies how a seller's position within a network of connected sellers on an e-commerce platform affects their sales performance. Using the theoretical lens of co-opetition (cooperative competition) and social network analysis, we examine the strategies sellers use to maximize sales while cooperating with their competitors. By examining these dynamics, the study aims to provide actionable insights that help sellers optimize their strategies in a cooperative yet competitive marketplace.
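
To make the network-position idea concrete, here is a minimal sketch in Python that regresses log-sales on simple egocentric network measures. Everything in it is invented for illustration (a random seller network and synthetic sales), and the study's additive and multiplicative effects models are considerably richer than this plain least-squares fit.

```python
# Hypothetical sketch: relate sellers' network positions to sales.
# The network, sales figures, and covariates are synthetic placeholders,
# not data or models from the study.
import numpy as np

rng = np.random.default_rng(0)
n = 200                                      # number of sellers
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # symmetric ties, no self-loops

degree = A.sum(axis=1)                       # direct ties to other sellers
two_step = (A @ A > 0).sum(axis=1)           # schematic two-step reach
sales = np.exp(0.02 * degree + rng.normal(0, 0.5, n))  # synthetic outcome

# Least-squares fit of log-sales on network-position covariates
X = np.column_stack([np.ones(n), degree, two_step])
beta, *_ = np.linalg.lstsq(X, np.log(sales), rcond=None)
print(dict(zip(["intercept", "degree", "two_step_reach"], beta.round(3))))
```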

Keywords

egocentric network models

additive and multiplicative effects models

co-opetition

online seller communities 

Co-Author

Burcu Eke Rubini, University of New Hampshire

First Author

Ermira Zifla, University of New Hampshire

Presenting Author

Burcu Eke Rubini, University of New Hampshire

Cooperation in Multi-Agent Reinforcement Learning with Proximal Policy Optimization

In multi-agent reinforcement learning, the interaction of multiple decision-making agents in a shared environment can be modeled as a partially observable Markov game, which extends Markov decision processes to a multi-agent setting where agents have individual observations, actions, and rewards. We investigate cooperation between agents under the multi-agent proximal policy optimization (MA-PPO) approach. We present a method to construct a deep stochastic policy that allows efficient optimization based on the agents' actions. The effectiveness of the resulting statistical model is demonstrated by investigating cooperation across multiple agents in an industrial application: electric power distribution networks.
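
As a rough illustration of the optimization step, the sketch below applies the standard PPO clipped surrogate objective separately per agent. The probability ratios, advantages, and agent count are synthetic placeholders; the authors' deep stochastic policy construction is not reproduced here.

```python
# Minimal sketch: PPO's clipped surrogate, evaluated separately per agent.
# Ratios and advantages are synthetic; in MA-PPO they would come from each
# agent's policy network and a (possibly centralized) critic.
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate objective (to be maximized) of Schulman et al. (2017)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage).mean()

rng = np.random.default_rng(1)
n_agents, batch = 3, 64
for agent in range(n_agents):
    # ratio = pi_new(a | o) / pi_old(a | o) under the agent's own observations
    ratio = np.exp(rng.normal(0, 0.1, batch))
    advantage = rng.normal(0, 1, batch)
    print(f"agent {agent}: surrogate = {ppo_clip_loss(ratio, advantage):.4f}")
```

The clipping keeps each agent's policy update close to its previous policy, which is what makes the surrogate safe to optimize with several gradient steps per batch.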

Keywords

reinforcement learning

Markov decision process

proximal policy optimization

multi-agent

cooperation

electric power distribution networks 

Co-Author(s)

Morteza Hashemi, University of Kansas
Amin Shojaeighadikolaei, University of Kansas

First Author

Zsolt Talata, University of Kansas

Presenting Author

Zsolt Talata, University of Kansas

Empirical Bound Information-Directed Sampling for Norm-Agnostic Bandits

Information-directed sampling (IDS) is a powerful framework for solving bandit problems that has shown strong results in both Bayesian and frequentist settings. However, frequentist IDS, like many other bandit algorithms, requires prior knowledge of a relatively tight upper bound on the norm of the true parameter vector governing the reward model in order to achieve good performance. This requirement is rarely satisfied in practice, and, as we demonstrate, using a poorly calibrated bound can lead to significant regret accumulation. To address this issue, we introduce a novel frequentist IDS algorithm that iteratively refines a high-probability upper bound on the true parameter norm as data accumulate. We focus on the linear bandit setting with heteroskedastic subgaussian noise. Our method leverages a mixture of information-gain criteria to balance exploration aimed at tightening the estimated parameter-norm bound against directly searching for the optimal action. We establish regret bounds for our algorithm that do not depend on an initially assumed parameter-norm bound, and we demonstrate that our method outperforms state-of-the-art IDS and UCB algorithms.
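
A minimal sketch of the two ingredients, with placeholder quantities: deterministic IDS picks the action minimizing (estimated regret)^2 / information gain, and the norm bound is re-centered on the current estimate rather than fixed in advance. The paper's actual regret and information-gain criteria, and its concentration argument, are more involved.

```python
# Schematic frequentist IDS over a finite action set, plus a data-driven
# norm bound. All numeric quantities are placeholders.
import numpy as np

rng = np.random.default_rng(2)

def ids_action(regret_est, info_gain, eps=1e-12):
    """Pick the action minimizing (estimated regret)^2 / information gain."""
    return int(np.argmin(regret_est ** 2 / np.maximum(info_gain, eps)))

K = 5
regret_est = rng.random(K)   # per-action regret estimates (placeholder)
info_gain = rng.random(K)    # per-action information-gain estimates (placeholder)
print("chosen arm:", ids_action(regret_est, info_gain))

# Schematic refinement of a high-probability bound on ||theta||: center it
# at the current ridge estimate and add a confidence radius that shrinks as
# data accumulate, instead of assuming a tight bound up front.
theta_hat = rng.normal(size=3)   # stand-in for a ridge estimate
conf_radius = 0.5                # stand-in for a self-normalized tail bound
norm_bound = np.linalg.norm(theta_hat) + conf_radius
print(f"refined norm bound: {norm_bound:.3f}")
```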

Keywords

bandit algorithms

heteroskedastic noise

information-directed sampling

parameter bounds 

Co-Author

Eric Laber

First Author

Piotr Suder, Duke University

Presenting Author

Piotr Suder, Duke University

Low-Rank Online Dynamic Assortment with Dual Contextual Information

As e-commerce expands, delivering real-time personalized recommendations from vast catalogs poses a critical challenge for retail platforms. Maximizing revenue requires careful consideration of both individual customer characteristics and available item features to optimize assortments over time. In this paper, we consider the dynamic assortment problem with dual contexts: user and item features. In high-dimensional settings, the dimension of the joint user-item context grows quadratically, which complicates both computation and estimation. To tackle this challenge, we introduce a new low-rank dynamic assortment model that reduces the problem to a manageable scale. We then propose an efficient algorithm that estimates the intrinsic subspaces and uses the upper confidence bound approach to address the exploration-exploitation trade-off in online decision making. Theoretically, we establish a regret bound that substantially improves on prior literature by exploiting the low-rank structure. Extensive simulations and an application to the Expedia hotel recommendation dataset further demonstrate the advantages of the proposed method.
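
As a schematic of the low-rank idea, the sketch below scores items with a rank-r bilinear model plus a UCB-style bonus and offers the top-k. The factor estimates, bonus form, and plain top-k rule are illustrative stand-ins for the paper's subspace estimation and assortment optimization.

```python
# Schematic low-rank bilinear scoring with a UCB-style exploration bonus.
# Factors, features, and the bonus are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(3)
d_user, d_item, r, n_items, k = 8, 10, 2, 50, 5

U = rng.normal(size=(d_user, r))     # estimated left (user-side) factor
V = rng.normal(size=(d_item, r))     # estimated right (item-side) factor
Theta_hat = U @ V.T                  # rank-r parameter estimate

x_user = rng.normal(size=d_user)              # current user's context
X_items = rng.normal(size=(n_items, d_item))  # candidate item features

mean_score = X_items @ Theta_hat.T @ x_user
bonus = 0.1 * np.linalg.norm(X_items, axis=1)     # schematic UCB bonus
assortment = np.argsort(mean_score + bonus)[-k:]  # offer the k best items
print("offered items:", sorted(assortment.tolist()))
```

Because Theta_hat has rank r, only (d_user + d_item) * r parameters must be learned rather than d_user * d_item, which is the source of the computational and statistical savings.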

Keywords

Bandit Algorithm

Low-rankness

Online Decision Making

Reinforcement Learning 

Co-Author(s)

Will Wei Sun, Purdue University
Yufeng Liu, University of North Carolina at Chapel Hill

First Author

Seong Jin Lee, University of North Carolina at Chapel Hill

Presenting Author

Seong Jin Lee, University of North Carolina at Chapel Hill

Offline Multi-Dimensional Distributional Reinforcement Learning: A Hilbert Space Embedding Approach

We propose an offline distributional reinforcement learning framework that leverages Hilbert space embeddings to estimate the multi-dimensional value distribution under a proposed target policy. By mapping probability measures into a reproducing kernel Hilbert space via kernel mean embeddings, our method replaces Wasserstein metrics with a novel integral probability metric. This enables efficient estimation in multi-dimensional state–action and reward settings, where direct computation of Wasserstein distances is computationally challenging. Theoretically, we establish contraction properties of the distributional Bellman operator under the proposed metric and provide uniform convergence guarantees. Empirical results demonstrate improved convergence rates and robust off-policy evaluation under mild assumptions, namely Lipschitz continuity and boundedness for the Matérn family of kernels, highlighting the potential of our embedding-based approach in complex, real-world decision-making scenarios.
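
One standard kernel-based integral probability metric of this type is the maximum mean discrepancy (MMD) between kernel mean embeddings; the sketch below estimates a squared MMD between two samples of two-dimensional returns under a Matérn-3/2 kernel. The samples, bandwidth, and kernel choice are illustrative, and the paper's metric may differ in its details.

```python
# Minimal sketch: (biased) squared-MMD estimate between two samples of
# multi-dimensional returns under a Matérn-3/2 kernel. Data are synthetic.
import numpy as np

def matern32(X, Y, length=1.0):
    """Matérn-3/2 kernel matrix between rows of X and rows of Y."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    s = np.sqrt(3.0) * d / length
    return (1.0 + s) * np.exp(-s)

def mmd_sq(X, Y, length=1.0):
    """Biased estimate of MMD^2, the RKHS distance between mean embeddings."""
    return (matern32(X, X, length).mean()
            - 2.0 * matern32(X, Y, length).mean()
            + matern32(Y, Y, length).mean())

rng = np.random.default_rng(4)
returns_pi = rng.normal(0.0, 1.0, size=(200, 2))  # returns under one policy
returns_mu = rng.normal(0.3, 1.0, size=(200, 2))  # returns under another
print(f"MMD^2 estimate: {mmd_sq(returns_pi, returns_mu):.4f}")
```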

Keywords

Reinforcement Learning

Wasserstein Distance

Reproducing Kernel Hilbert Space

Non-parametric

Matérn Kernel 

Co-Author(s)

Qi Zheng, University of Louisville
Ruoqing Zhu, University of Illinois Urbana-Champaign

First Author

Mehrdad Mohammadi, University of Illinois Urbana-Champaign

Presenting Author

Mehrdad Mohammadi, University of Illinois Urbana-Champaign

Optimizing Navigation in Uncertain Terrain with Spatially Correlated Obstacles

We propose a novel hybrid reinforcement learning (RL) framework for path planning in an environment with uncertain, spatially correlated obstacles. Using a Gaussian random field to capture spatial dependencies, we adopt a Bayesian approach for sequentially updating blockage probabilities. Unlike prior approaches based on point estimates of value functions, we develop a distributional RL approach that models state-value distributions with categorical distributions, providing a comprehensive characterization of future traversal costs that improves robustness to information uncertainty and sample variation. We integrate the distributional Bellman update with adaptive support refinement via Bayesian updates to ensure that the true distributions are accurately represented. In addition, we introduce a search-space reduction technique to identify decision candidates, enhancing scalability. Combining distributional RL with posterior sampling of the environment dynamics, the resulting decision-making policy effectively balances immediate traversal costs against the long-term value of information; experimental results show that it offers a principled solution to the exploration–exploitation tradeoff in optimal navigation.
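
To illustrate the Bayesian updating ingredient, the sketch below places a Gaussian-random-field prior over latent blockage propensities and conditions on a few noisy traversal observations via the standard Gaussian conditioning (kriging) formulas, mapping the posterior mean to blockage probabilities through a probit link. Locations, kernel, noise level, and link are all invented for illustration.

```python
# Schematic GRF prior over blockage propensities with a conditional-Gaussian
# update after observing a few sites. All quantities are synthetic.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sites = rng.random((30, 2))                  # obstacle site locations in [0,1]^2
d = np.linalg.norm(sites[:, None] - sites[None, :], axis=-1)
K = np.exp(-(d / 0.3) ** 2)                  # squared-exponential covariance

obs_idx = np.array([0, 5, 12])               # sites actually traversed
z_obs = np.array([1.2, -0.8, 0.4])           # noisy latent observations there
noise = 0.1 * np.eye(len(obs_idx))

# Posterior mean of the latent field given the observations (kriging)
K_oo = K[np.ix_(obs_idx, obs_idx)] + noise
K_ao = K[:, obs_idx]
post_mean = K_ao @ np.linalg.solve(K_oo, z_obs)

block_prob = norm.cdf(post_mean)             # probit link to blockage probabilities
print("updated blockage probs (first 5 sites):", block_prob[:5].round(3))
```

Because the covariance couples nearby sites, observing one site shifts the blockage probabilities of its neighbors, which is exactly the spatial-correlation effect the framework exploits.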

Keywords

stochastic path planning

sequential decision making

Bayesian update

distributional Reinforcement Learning

network traversal

Gaussian random fields 

Co-Author

Elvan Ceyhan, Auburn University

First Author

Li Zhou, Auburn University

Presenting Author

Li Zhou, Auburn University

Semi-Parametric Batched Global Multi-Armed Bandits with Covariates

In applications such as clinical trials, treatment decisions are usually made in phases or batches, with information from the previous batch used to determine the treatments allocated in the upcoming batch. Such scenarios fall naturally within the batched bandit framework. While batched bandits have been studied in parametric and nonparametric regression settings, we propose a novel semi-parametric approach that promotes interpretability and dimension reduction in nonparametric batched bandits. We assume that the reward-covariate relationship can be modelled in a reduced one-dimensional central subspace based on the single-index regression framework. We adopt an adaptive binning and successive elimination algorithm and provide optimal regret guarantees for it. We also illustrate the performance of the algorithm on simulated and real datasets.
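
A minimal sketch of the two algorithmic pieces: covariates are binned along an estimated single-index direction, and within each bin, arms whose upper confidence bound falls below the best lower bound are eliminated. The index estimate, bin widths, confidence radius, and reward model are all placeholders.

```python
# Schematic binning along a single-index direction with successive
# elimination within each bin. All quantities are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(6)
d, n_bins, n_arms = 5, 8, 3
beta_hat = rng.normal(size=d)
beta_hat /= np.linalg.norm(beta_hat)      # estimated unit index direction

def bin_of(x):
    """Assign covariate x to a bin via its projection onto beta_hat."""
    t = (x @ beta_hat + 1.0) / 2.0        # assume projections lie in [-1, 1]
    return min(int(t * n_bins), n_bins - 1)

means = np.zeros((n_bins, n_arms))
counts = np.zeros((n_bins, n_arms))
active = [set(range(n_arms)) for _ in range(n_bins)]

def eliminate(b, conf=0.5):
    """Drop arms in bin b whose UCB falls below the best LCB in that bin."""
    rad = conf / np.sqrt(np.maximum(counts[b], 1.0))
    best_lcb = max(means[b, a] - rad[a] for a in active[b])
    active[b] = {a for a in active[b] if means[b, a] + rad[a] >= best_lcb}

x = rng.uniform(-1, 1, d) / np.sqrt(d)    # incoming covariate (norm <= 1)
b = bin_of(x)
a = int(rng.choice(sorted(active[b])))    # play any still-active arm
r = float(x @ beta_hat > 0) + rng.normal(0, 0.1)   # synthetic reward
counts[b, a] += 1
means[b, a] += (r - means[b, a]) / counts[b, a]
eliminate(b)
print(f"bin {b}: active arms = {sorted(active[b])}")
```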

Keywords

multi-armed bandits

semi-parametric

single-index regression

dynamic binning

successive elimination

regret bounds 

Co-Author

Hyebin Song, Penn State

First Author

Sakshi Arya, Case Western Reserve University

Presenting Author

Sakshi Arya, Case Western Reserve University