A Definition of Non-Stationary Bandits

Speaker: Yueyang Liu
Rice University
Thursday, Aug 7: 8:55 AM - 9:15 AM
Topic-Contributed Paper Session 
Music City Center 
Prior work characterizes a non-stationary bandit as one whose reward distribution changes over time. In the Bayesian setting, however, different choices of the underlying distribution sequence can yield conflicting classifications, making the notion of non-stationarity ambiguous. Likewise, popular metrics of non-stationarity or hardness, such as the entropy rate of the optimal action sequence or the temporal variation of mean rewards, also depend on a latent sequence of reward distributions; varying that sequence can produce inconsistent or misleading measures. In some cases, selecting a latent sequence that genuinely reflects the intuitive non-stationarity or difficulty of a problem can itself be challenging. Because these metrics drive both algorithm design and regret analysis, their ambiguity and undesirable properties often lead to algorithms that over-explore and to regret bounds that are hard to interpret. To address these issues, we introduce a definition of non-stationary bandits that makes no reference to any latent sequence, thereby eliminating the classification ambiguity. Our definition further guarantees that any two bandits inducing indistinguishable interaction experiences are classified identically: either both stationary or both non-stationary. Finally, we propose a set of axiomatic criteria for non-stationarity and hardness metrics, agent design, and regret analysis.
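The classification ambiguity can be made concrete with a minimal sketch (not drawn from the paper itself; the construction below is a standard illustration). A single Bernoulli(0.5) arm can be modeled by a constant latent sequence of distributions, which reads as stationary, or by a latent sequence of per-step point masses on pre-drawn outcomes, which changes every step and so reads as non-stationary, even though both constructions induce exactly the same interaction experience:

```python
import random

T = 10_000

# View A: one stationary Bernoulli(0.5) arm. The latent sequence of
# reward distributions is constant, so a latent-sequence definition
# classifies this bandit as stationary.
rng = random.Random(0)
rewards_a = [1 if rng.random() < 0.5 else 0 for _ in range(T)]

# View B: the same rewards, but modeled by first drawing a 0/1 outcome
# for each step and declaring step t's reward distribution to be a
# point mass on that outcome. The latent sequence now changes over
# time, so the same definition classifies the bandit as non-stationary.
rng = random.Random(0)
latent_point_masses = [1 if rng.random() < 0.5 else 0 for _ in range(T)]
rewards_b = list(latent_point_masses)  # each draw is deterministic

# Both views generate identical interaction experiences.
assert rewards_a == rewards_b
```

A definition that depends only on the interaction experience, as the talk proposes, must assign both views the same classification by construction.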