Preferential Sampling in Environmental Exposures and Its Impact on Inference with Implications in Environment and Health Policy

Summer Han, Chair
Stanford University
 
Tae Yoon Lee, Organizer
Stanford University School of Medicine
 
Wednesday, Aug 6: 8:30 AM - 10:20 AM
0620 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-201A 

Keywords

Environmental exposure assessment

preferential sampling

inference

policy 

Applied

Yes

Main Sponsor

Korean International Statistical Society

Co-Sponsors

Health Policy Statistics Section
Section on Statistics and the Environment

Presentations

Inverse sampling intensity weighting for preferential sampling adjustment

Traditional geostatistical methods assume independence between observation locations and the spatial process of interest. Violations of this independence assumption are referred to as preferential sampling (PS). Standard methods to address PS rely on estimating complex shared latent variable models and can be difficult to apply in practice. We study the use of inverse sampling intensity weighting (ISIW) for PS adjustment in model-based geostatistics. ISIW is a two-stage approach: we first estimate the sampling intensity of the observation locations and then define intensity-based weights within a weighted likelihood adjustment. Prediction follows by substituting the adjusted parameter estimates within a kriging framework. A primary contribution is an implementation of ISIW by means of the Vecchia approximation, which provides large computational gains and improvements in predictive accuracy. Interestingly, we find that accurate parameter estimation has little correlation with predictive performance, raising questions about the conditions and parameter choices driving optimal implementation of kriging-based predictors under PS. Our work highlights the potential of ISIW to adjust for PS in an intuitive, fast, and effective manner.
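
The two-stage idea described above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it assumes a one-dimensional domain, a Gaussian kernel estimate of the sampling intensity, and an independent Gaussian likelihood for a single mean parameter, so there is no spatial covariance, no kriging step, and no Vecchia approximation. The function names (`kernel_intensity`, `isiw_weights`, `weighted_gaussian_mean`) and the bandwidth value are hypothetical choices made for illustration.

```python
import numpy as np


def kernel_intensity(locs, bandwidth):
    """Stage 1 (assumed estimator): Gaussian kernel estimate of the
    sampling intensity, evaluated at each observed 1-D location."""
    diffs = locs[:, None] - locs[None, :]
    kern = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    kern /= bandwidth * np.sqrt(2.0 * np.pi)
    return kern.sum(axis=1)


def isiw_weights(locs, bandwidth):
    """Stage 2a: weights proportional to the inverse estimated intensity,
    normalized to sum to one."""
    weights = 1.0 / kernel_intensity(locs, bandwidth)
    return weights / weights.sum()


def weighted_gaussian_mean(values, weights):
    """Stage 2b: maximizer of the weighted log-likelihood
    sum_i w_i * log N(y_i | mu, sigma^2) with respect to mu,
    which reduces to a weighted average under independence."""
    return float(np.sum(weights * values) / np.sum(weights))


# Toy field f(x) = x on [0, 1]; its spatial average is 0.5.
# Sampling is preferential: 45 of the 50 sites sit where the field is high.
locs = np.concatenate([np.linspace(0.0, 0.5, 5), np.linspace(0.5, 1.0, 45)])
values = locs.copy()

naive = float(values.mean())  # unweighted estimate, pulled toward high values
adjusted = weighted_gaussian_mean(values, isiw_weights(locs, bandwidth=0.1))

print(f"naive mean:    {naive:.3f}")
print(f"adjusted mean: {adjusted:.3f}")
```

In this toy setting the unweighted mean is 0.7, and down-weighting the densely sampled region moves the estimate back toward the spatial average of 0.5. A model-based geostatistical version would apply the same weights inside a spatial likelihood rather than an independent one.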
 

Speaker

Thomas Hsiao, Emory University, Rollins School of Public Health

Preferential sampling in environmental science: exception or standard?

This talk will cover work developed to handle sampling preferentiality in several environmental areas. The specific examples addressed are geostatistics and presence-only data in ecological studies. Each area has its own typical data format, which leads to a scenario-specific way of addressing preferentiality. In both cases, the locations of the observations are modeled with a standard Poisson process, for which the exact likelihood is not available analytically. The approach pursued is entirely model-based and uses data augmentation techniques to allow for exact inference procedures. Comparisons against alternative approximate procedures, based on real data analyses, point favourably to the exact methodology.

Keywords

Environmental studies

Bayesian

Sampling preferentiality

Data augmentation

Exact inference

Prediction of unknowns 

Co-Author

Douglas Mateus Silva, Universidade Federal de Lavras

Speaker

Dani Gamerman, Instituto De Matematica-UFRJ

Modeling Urban Heat Stress with Preferentially Sampled Citizen Science Data

The urban heat island (UHI) effect intensifies heat stress, disproportionately impacting health outcomes and energy demand in densely built neighborhoods. In Durham County, North Carolina, urban–rural temperature differences can exceed 10°C during the hottest times of the year. Accurately modeling this variability requires dense temperature observations—yet such networks are rarely available. Personal weather stations (PWSs) offer a promising alternative: there are over 300 sensors in Durham recording hourly temperature. However, these stations are unevenly distributed, with generally more representation in wealthier neighborhoods. Given the well-documented association between income and urban heat exposure, models relying solely on PWS data risk underestimating heat stress in lower-income areas.

To address this, we apply a preferential sampling correction to a spatial model of temperature, explicitly accounting for the unequal distribution of sensors. The correction reveals that omitting preferentiality leads to an average 1°C underestimation of July evening temperatures in lower-income neighborhoods. We validate this result by comparison with a non-preferentially sampled dataset, showing that the correction improves agreement across datasets, with the Pearson correlation increasing by as much as a factor of 2.

These findings underscore the importance of correcting for preferential sampling in urban heat monitoring and highlight the value of citizen science data. Ongoing work scales this approach statewide, using PWSs to: (1) estimate neighborhood-level heat stress across North Carolina, and (2) develop spatiotemporal models of urban temperature that may be applied to other locations worldwide. For scalability, we employ sparse variational Gaussian processes and adapt the point process model to capture city-specific sampling patterns—recognizing that not all cities exhibit the same level of preferentiality. Finally, we explore alternative spatiotemporal model formulations that use importance weighting on covariates to address bias without relying on a shared latent process. 

Keywords

Environmental health

Heat stress

Preferential sampling

Model validation

Urban climate 

Co-Author(s)

Ellie Kim, Duke University
Michael Bergin, Duke University
David Carlson, Duke University

Speaker

Zachary Calhoun, Duke University

Estimating the impact of ambient air pollution on lung cancer risk in the presence of preferential sampling and measurement error using electronic health record data

A network of monitoring sites is often not well-designed for accurately mapping ambient (outdoor) air pollution due to external factors, such as budget constraints and public opinion. As such, naively using point measurements from the monitoring network can lead to biased mapping. This can have profound downstream implications for environmental health studies that rely on this map to estimate ambient air pollution exposure at participants' locations. In this talk, we will address this potential bias due to preferential sampling in the design of a monitoring network for mapping ambient air pollution in California. We will utilize a recently developed spatio-temporal statistical framework that simultaneously models the air pollution field and monitoring site selection process. Further, we will examine the downstream implications in estimating the effects of ambient air pollution on lung cancer risk using electronic health record data (N>44,000) from Stanford Health Care, an academic medical center, and Sutter Health, a multisite community practice. We will employ a Bayesian cause-specific Cox regression model to incorporate the competing risk of death as well as the measurement error in the air pollution exposure.

Keywords

Preferential sampling

Measurement error

Environmental health studies

Air pollution

Bayesian 

Co-Author(s)

Chloe Su
Summer Han, Stanford University

Speaker

Tae Yoon Lee, Stanford University School of Medicine

Spatial causal inference in the presence of preferential sampling to study the impacts of marine protected areas

Marine Protected Areas (MPAs) have been established globally to conserve marine resources. Given their maintenance costs and impact on commercial fishing, it is critical to evaluate their effectiveness to support future conservation. In this paper, we use data collected from the Australian coast to estimate the effect of MPAs on biodiversity. Environmental studies such as these are often observational, and processes of interest exhibit spatial dependence, which presents challenges in estimating the causal effects. Spatial data can also be subject to preferential sampling, where the sampling locations are related to the response variable, further complicating inference and prediction. To address these challenges, we propose a spatial causal inference method that simultaneously accounts for unmeasured spatial confounders in both the sampling process and the treatment allocation. We prove the identifiability of key parameters in the model and the consistency of the posterior distributions of those parameters. We show via simulation studies that the causal effect of interest can be reliably estimated under the proposed model. The proposed method is applied to assess the effect of MPAs on fish biomass. We find evidence of preferential sampling and that properly accounting for this source of bias impacts the estimate of the causal effect. 

Keywords

Poisson process

Potential outcomes

Propensity scores

Spatial confounding 

Speaker

Dongjae Son