Monday, Aug 4: 8:30 AM - 10:20 AM
0728
Topic-Contributed Paper Session
Music City Center
Room: CC-103A
Online Experiments
Deep Learning
User Experience
Applied: Yes
Main Sponsor
Quality and Productivity Section
Co-Sponsors
Section on Statistical Computing
Presentations
A/B testing is a cornerstone of online experimentation, but it presents unique challenges when randomizing users is not feasible, such as when testing changes to e-commerce product feeds in Shopping Ads. While product-level randomization offers a solution, it often leads to high variance and skewed performance data, and therefore unreliable experiment results. This talk explores practical techniques to improve the reliability and sensitivity of such experiments.
We will dive into variance reduction methods like CUPED and crossover designs, demonstrating how they control for pre-experiment performance differences and leverage within-item comparisons. We will also explore how trimming outliers enhances the robustness of results, while acknowledging the inherent trade-offs.
A key theme will be the importance of pre-experiment power analysis for determining minimum detectable effects and ultimately ensuring your experiment is sufficiently powered. We will then illustrate the significant reductions in required sample sizes these techniques deliver for real-world advertiser data.
Finally, we'll introduce FeedX, an open-source implementation of these methods available on GitHub, enabling you to easily apply these best practices to your own experiments.
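To make the variance-reduction idea concrete, below is a minimal NumPy sketch of a CUPED adjustment on item-level metrics: it removes the component of the experiment-period metric that is linearly predictable from a pre-experiment covariate. The simulated data and function names are illustrative only and are not FeedX's actual API.

import numpy as np

def cuped_adjust(y, x):
    """CUPED adjustment: remove the part of the experiment metric `y`
    that is linearly predictable from the pre-experiment covariate `x`.

    theta = Cov(x, y) / Var(x); the adjusted metric keeps the same mean
    as `y` but (ideally) has much lower variance.
    """
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Illustrative item-level data: pre-period and experiment-period clicks.
rng = np.random.default_rng(0)
pre = rng.lognormal(mean=2.0, sigma=1.0, size=10_000)    # pre-experiment clicks
lift = rng.normal(loc=0.02, scale=0.05, size=pre.size)   # small per-item effect
post = pre * (1 + lift) + rng.normal(scale=1.0, size=pre.size)

adjusted = cuped_adjust(post, pre)
print(f"variance before: {post.var():.1f}, after CUPED: {adjusted.var():.1f}")

On heavy-tailed e-commerce metrics, adjustments of this kind, combined with trimming and crossover designs, are what drive the sample-size reductions discussed in the talk.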
Keywords
Online experiments
Variance reduction methods
Road accidents are one of the leading causes of mortality worldwide. This paper, part of our work at Waze, designs, assesses, and deploys a targeted warning system to nudge drivers toward safer behaviors. We develop an end-to-end approach spanning descriptive, predictive, and prescriptive analytics. We build a deep learning model to predict accident reports from historical patterns and contextual information, which we use to construct an indicator of road safety at a granular spatio-temporal level and at a global scale. We then design proactive, targeted warnings for users upon entering high-risk road segments. We conduct a large-scale, global randomized controlled trial to evaluate the impact of these warnings. Results show (i) a statistically significant decrease in average speeds and overspeeding rates; (ii) a fatigue effect motivating a parsimonious nudging system; and (iii) heterogeneous responses across drivers: notably, the effect is stronger for users who tend to trust the Waze navigation system more, as indicated by their measured past adherence to navigation instructions. The positive results from the experiment led to the global deployment of the targeted warning system, highlighting the role of digital platforms and artificial intelligence in improving road safety worldwide.
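As a rough illustration of the prescriptive step, the hypothetical Python sketch below shows a warning rule that triggers only on high-risk segments and rate-limits nudges, reflecting the fatigue effect noted above. The class, threshold, and cooldown values are invented for illustration and do not describe the deployed Waze system.

from dataclasses import dataclass, field

@dataclass
class WarningPolicy:
    # Hypothetical parameters: the risk score would come from the
    # accident-prediction model; here it is simply an assumed input.
    risk_threshold: float = 0.9         # warn only on top-decile risk segments
    min_seconds_between: float = 600.0  # cooldown to keep nudges parsimonious
    _last_warning_ts: float = field(default=-1e12, repr=False)

    def should_warn(self, risk_score: float, now_ts: float) -> bool:
        """Return True if the segment is high risk and the cooldown has elapsed."""
        if risk_score < self.risk_threshold:
            return False
        if now_ts - self._last_warning_ts < self.min_seconds_between:
            return False
        self._last_warning_ts = now_ts
        return True

policy = WarningPolicy()
print(policy.should_warn(risk_score=0.95, now_ts=0.0))    # True: high risk, no recent nudge
print(policy.should_warn(risk_score=0.97, now_ts=120.0))  # False: still within the cooldown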
Keywords
Deep Learning
Traffic Safety
This paper explores the impact of Search in-app notifications on user behavior, focusing on mitigating the potential costs associated with their use. Because the diverse benefits of notifications are inherently difficult to measure, we concentrate primarily on cost measurement. By establishing a budget system, we aim to prevent excessive notifications and protect the user experience. Our methodology involves a three-period experimental setup (pre-period, treatment period, and post-period) to rigorously evaluate the effects of notification frequency on user behavior. To account for variation in user engagement with Google surfaces, and consequently in notification exposure, we use a matching method to conduct post-experiment analysis and identify heterogeneous treatment effects. Through this study, we seek to develop a more mature and consistent evaluation framework for in-app notifications, ultimately identifying the notification frequency that balances short-term interruption against long-term user satisfaction.
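A minimal sketch of the matching idea, on simulated data and under assumed column names: users are stratified by pre-period engagement so that treated and control users with comparable notification exposure are compared, yielding stratum-level (heterogeneous) effect estimates. The actual analysis uses finer matching than this coarse quintile stratification.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),                        # 1 = received extra notifications
    "pre_engagement": rng.gamma(shape=2.0, scale=3.0, size=n),  # pre-period activity level
})
df["post_engagement"] = (
    df["pre_engagement"]
    + df["treated"] * (0.5 - 0.05 * df["pre_engagement"])    # simulated effect that shrinks with exposure
    + rng.normal(scale=1.0, size=n)
)

# Coarse matching: quintiles of pre-period engagement serve as matching strata.
df["stratum"] = pd.qcut(df["pre_engagement"], q=5, labels=False)

effects = (
    df.groupby(["stratum", "treated"])["post_engagement"].mean()
      .unstack("treated")
      .assign(effect=lambda t: t[1] - t[0])   # treatment-minus-control within each stratum
)
print(effects)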
Keywords
Online experiments
On the team responsible for Google's Phone app, we want to set up long-term holdback groups to measure the impact of new features on product retention. At the same time, we want to ensure that users have access to the features they may have purchased their phone for. In this talk, I will explore how we might use biostatistics techniques (e.g., censoring, survival analysis, Kaplan-Meier curves) to evaluate the impact of our new features while ensuring users can access the features they want.
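As a small self-contained illustration (not the production analysis), the sketch below computes a Kaplan-Meier retention curve from right-censored usage data; users still active at the end of the observation window are treated as censored. The data and names are invented for the example.

import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate with right censoring.

    `durations`: days until the user churned or was last seen.
    `observed`: 1 if churn was observed, 0 if the user is censored
    (still retained when the analysis window closed).
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=int)
    times = np.unique(durations[observed == 1])
    surv = 1.0
    curve = []
    for t in times:
        at_risk = np.sum(durations >= t)                      # users still in the risk set at time t
        events = np.sum((durations == t) & (observed == 1))   # churn events at time t
        surv *= 1.0 - events / at_risk
        curve.append((t, surv))
    return curve

# Toy example: days of continued app usage; 0 in `churned` means the user
# was still active (censored) at the end of the observation window.
days =    [5, 12, 12, 30, 45, 45, 60, 60, 60, 90]
churned = [1,  1,  0,  1,  0,  1,  0,  0,  1,  0]
for t, s in kaplan_meier(days, churned):
    print(f"day {t:>4.0f}: estimated retention {s:.2f}")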
Keywords
Survival Analysis
Our work at Google Ads introduces Meridian, an open-source Marketing Mix Modeling (MMM) platform designed to help marketers navigate the complexities of measuring cross-channel media effectiveness in a privacy-conscious world. As a highly customizable modeling framework based on Bayesian causal inference, Meridian employs a single Bayesian model that performs joint estimation of all model coefficients and parameters, including those of nonlinear transformation functions like Adstock and diminishing returns curves. This empowers businesses to build custom MMM models with enhanced accuracy through innovations such as: the ability to handle large-scale geo-level data; calibration with incrementality experiments; incorporation of reach and frequency data for added insights; and robust search measurement. This transparent platform also provides full user control, enabling adaptation to unique business needs and facilitating informed decision-making through scenario planning and budget optimization. With Meridian, marketers can build best-in-class MMMs, gain insights to inform budget and planning decisions, and drive better business outcomes.
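To illustrate the nonlinear transformations mentioned above, here is a generic NumPy sketch of geometric Adstock followed by a Hill-type diminishing-returns curve; the parameter names and values are illustrative and may not match Meridian's own parameterization or API.

import numpy as np

def geometric_adstock(spend, decay=0.6):
    """Carry a fraction `decay` of each period's media effect into the next period."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + decay * carry
        out[t] = carry
    return out

def hill_saturation(x, half_max=100.0, slope=1.5):
    """Diminishing-returns response curve: approaches 1 as media pressure grows."""
    return x**slope / (x**slope + half_max**slope)

weekly_spend = np.array([0, 50, 120, 200, 80, 0, 0, 150], dtype=float)
media_effect = hill_saturation(geometric_adstock(weekly_spend))
print(np.round(media_effect, 3))

In Meridian, the parameters of these transformations are estimated jointly with the rest of the model coefficients within a single Bayesian model, rather than fixed in advance as in this sketch.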
Keywords
Bayesian Causal Inference