Advances in Statistical Modeling and Computational Methods for Complex Data

Chair: Youjin Lee
 
Tuesday, Aug 5: 8:30 AM - 10:20 AM
4083 
Contributed Papers 
Music City Center 
Room: CC-106B 

Main Sponsor

Korean International Statistical Society

Presentations

Dynamic Treatment Strategies via Q-Learning and Deep Learning-Based Buckley-James Method

In healthcare, developing personalized treatment strategies is essential for optimizing patient outcomes, particularly when dealing with censored survival data. This study introduces the Dynamic Deep Buckley-James Q-Learning algorithm, a novel methodology that integrates reinforcement learning with the Buckley-James method to handle censored data effectively. By leveraging deep learning techniques, the algorithm enhances the predictive accuracy of survival times in complex, non-linear settings, optimizing treatment decisions based on imputed outcomes. Our comprehensive simulation study, which includes scenarios with data missing at random (MAR), data not missing at random (NMAR), and right-censoring, demonstrates the algorithm's robust performance. Its ability to handle various types of missing and censored data ensures wide applicability across clinical contexts. By addressing the challenges that censoring and missing data pose in survival analysis, the algorithm learns policies that maximize the expected total imputed survival reward for patients. This enables the comparison of imputed survival times across treatments, a feature not possible with standard approaches.
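
The algorithm itself is not spelled out in the abstract; the sketch below illustrates the two ingredients it combines, a Buckley-James imputation step for censored outcomes and a regression-based Q-learning step, in a single-stage, binary-treatment setting. The function names (bj_impute, q_step), the scikit-learn network, and the iteration scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def bj_impute(log_t, delta, pred):
    """One Buckley-James pass: replace each censored log-time (delta == 0)
    with its conditional expectation, computed from the Kaplan-Meier
    estimate of the residual distribution (residual = log-time - prediction)."""
    resid = log_t - pred
    order = np.argsort(resid)
    r, d = resid[order], delta[order].astype(float)
    at_risk = len(r) - np.arange(len(r))
    surv = np.cumprod(1.0 - d / at_risk)               # Kaplan-Meier S(r)
    jump = np.concatenate(([1.0], surv[:-1])) - surv   # probability mass at r_k
    imputed = log_t.astype(float).copy()
    for i in np.where(delta == 0)[0]:
        tail = r > resid[i]
        mass = jump[tail].sum()
        if mass > 0:                                   # E[resid | resid > r_i]
            imputed[i] = pred[i] + (r[tail] * jump[tail]).sum() / mass
    return imputed

def q_step(X, action, log_t, delta, n_iter=5):
    """Single-stage Q-learning on Buckley-James-imputed log survival times:
    alternate regression and imputation, then recommend argmax_a Q(x, a)."""
    q = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
    XA = np.column_stack([X, action])
    y = log_t.astype(float)
    for _ in range(n_iter):
        q.fit(XA, y)
        y = bj_impute(log_t, delta, q.predict(XA))
    q0 = q.predict(np.column_stack([X, np.zeros(len(X))]))
    q1 = q.predict(np.column_stack([X, np.ones(len(X))]))
    return (q1 > q0).astype(int)                       # recommended binary action
```

In the dynamic, multi-stage setting described in the talk, a regression step of this kind would be applied backward over decision points, with imputed downstream rewards feeding each stage.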

Keywords

Deep Learning

Q-Learning

Imputation

Dynamic Treatment Regime 

Co-Author

Jeongjin Lee, Korea University Library

First Author

Jong-Min Kim, University of Minnesota, Morris

Presenting Author

Jong-Min Kim, University of Minnesota, Morris

A Surrogate-Mark Framework for Modeling Hawkes Processes Under Spatial Uncertainty

Hawkes processes are commonly used to capture clustered structures in point pattern data, as they allow each event to elevate the chance of subsequent event occurrences. However, this triggering mechanism is difficult to model accurately when spatial information is measured at varying levels of precision. A common strategy is to use only events with the most precise geolocation, but this can lead to both a loss of information and inaccurate estimates of the underlying triggering structure. In this research, we propose a novel framework that retains events with less precise location data by incorporating location-relevant marks as surrogate measures of spatial information. We integrate this surrogate into nonparametric intensity estimation through a modified weighting scheme in the Model-Independent Stochastic Declustering algorithm. Simulation studies verify that the proposed method can recover the triggering structure more accurately than standard approaches. We further illustrate its usefulness with an application to real-world data, demonstrating how the suggested framework can enhance our understanding of space-time clustering by carefully incorporating imprecise events. 
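
As a rough illustration of where a surrogate weight could enter, below is a minimal temporal-only version of the histogram-based Model-Independent Stochastic Declustering EM loop, with a per-event scalar weight down-weighting imprecise events when they act as potential parents. The scalar weighting is a stand-in for the paper's surrogate-mark scheme, whose exact form is not given in the abstract.

```python
import numpy as np

def misd(times, weights, horizon, n_iter=30, n_bins=20):
    """Histogram-based MISD for a temporal Hawkes process. `weights`
    down-weight events with imprecise spatial information when they act
    as parents; this is an illustrative stand-in, not the exact scheme."""
    order = np.argsort(times)
    t, w = np.asarray(times, float)[order], np.asarray(weights, float)[order]
    n = len(t)
    edges = np.linspace(0.0, t[-1] - t[0], n_bins + 1)
    width = edges[1] - edges[0]
    mu = 0.5 * n / horizon                     # initial background rate
    g = np.full(n_bins, 0.5 / edges[-1])       # initial triggering density
    for _ in range(n_iter):
        bg_mass, trig_mass = 0.0, np.zeros(n_bins)
        for i in range(1, n):
            dts = t[i] - t[:i]
            b = np.minimum((dts / width).astype(int), n_bins - 1)
            contrib = w[:i] * g[b]             # weighted parent intensities
            lam = mu + contrib.sum()
            bg_mass += mu / lam                # P(event i is background)
            np.add.at(trig_mass, b, contrib / lam)
        bg_mass += 1.0                         # first event must be background
        mu = bg_mass / horizon                 # M-step: background rate
        g = trig_mass / (n * width)            # M-step: triggering histogram
    return mu, g, edges
```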

Keywords

Spatio-temporal point process

Geolocation error

Two-phase analysis

Terrorism data

Hawkes process 

Co-Author

Kyunghee Han, University of Illinois at Chicago

First Author

Junhyeon Kwon, University of North Texas

Presenting Author

Junhyeon Kwon, University of North Texas

An Efficient Estimation Method for the Additive Subdistribution Hazards Model in Case-Cohort Studies

The case-cohort study design provides a cost-effective approach for large cohort studies with competing risks outcomes. The additive subdistribution hazards model assesses direct covariate effects on the cumulative incidence function and is appropriate when risk differences among groups, rather than relative risks, are of interest. Left truncation, which commonly occurs in biomedical studies, introduces additional complexity to the analysis.
Existing inverse-probability-weighting methods for case-cohort studies with competing risks estimate the coefficients of baseline covariates inefficiently, and they do not address left truncation.
To improve estimation efficiency for the baseline-covariate coefficients and to accommodate left-truncated competing risks data, we propose an augmented inverse-probability-weighted estimating equation for additive subdistribution hazards models under the case-cohort study design. For multiple case-cohort studies, we further improve efficiency by incorporating information from the other causes. We study the large-sample properties of the proposed estimator.
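
For orientation, the closed-form weighted Lin-Ying additive-hazards estimator below shows the basic inverse-probability-weighted estimating equation that the proposal builds on; the augmentation term, subdistribution (Fine-Gray-type) risk sets, and left-truncation adjustments described in the abstract are refinements not reproduced here. Function and variable names are illustrative.

```python
import numpy as np

def weighted_lin_ying(time, status, Z, w):
    """Closed-form weighted Lin-Ying additive-hazards estimator with
    case-cohort-style IPW weights `w`: cases get weight 1, sampled
    subcohort controls get 1 / sampling fraction. Plain IPW only; the
    talk's augmented (AIPW) estimator is not reproduced."""
    time, status = np.asarray(time, float), np.asarray(status, int)
    Z, w = np.asarray(Z, float), np.asarray(w, float)
    p = Z.shape[1]
    A, b, prev = np.zeros((p, p)), np.zeros(p), 0.0
    for tk in np.unique(time):
        at_risk = time >= tk
        wr, Zr = w[at_risk], Z[at_risk]
        zbar = (wr[:, None] * Zr).sum(axis=0) / wr.sum()
        Zc = Zr - zbar
        A += (wr[:, None] * Zc).T @ Zc * (tk - prev)   # integral of w Y (Z - zbar)^{x2} dt
        for i in np.where((time == tk) & (status == 1))[0]:
            b += w[i] * (Z[i] - zbar)                  # sum of w (Z - zbar) dN
        prev = tk
    return np.linalg.solve(A, b)                       # beta_hat = A^{-1} b
```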

Keywords

Additive subdistribution hazards model

Case-cohort study design

Competing risks

Efficiency

Left-truncation

Stratified data 

First Author

Xi Fang, Yale University

Presenting Author

Soyoung Kim, Medical College of Wisconsin

Fractional Binomial Regression Model for Count Data with Excess Zeros

In this talk, we introduce a new generalized linear model based on the fractional binomial distribution. Zero-inflated Poisson and negative binomial distributions are commonly used for count data with many zeros, and the corresponding zero-inflated regression models are widely applied to study the association between such counts and covariates. We develop a regression model with the fractional binomial distribution that serves as an additional tool for modeling count data with excess zeros. The consistency of the maximum likelihood (ML) estimators is proved under certain conditions, and the performance of the estimators is investigated in simulations. Applications are provided with datasets from horticulture and public health, and the results show that on some occasions our model outperforms existing zero-inflated regression models. 
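
The fractional binomial pmf is the authors' contribution and is not reproduced here; the sketch below shows only the generic maximum-likelihood scaffold such a regression needs, with a zero-inflated Poisson log-pmf standing in so the code runs end to end. Swapping `zip_log_pmf` for the fractional binomial log-pmf would give the proposed model.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def zip_log_pmf(y, lam, pi):
    """Zero-inflated Poisson log-pmf -- a placeholder for the fractional
    binomial log-pmf proposed in the talk."""
    log_pois = -lam + y * np.log(lam) - gammaln(y + 1)
    return np.where(y == 0,
                    np.log(pi + (1.0 - pi) * np.exp(-lam)),
                    np.log(1.0 - pi) + log_pois)

def fit_count_glm(X, y):
    """Maximum likelihood for a log-link count regression with one extra
    zero-inflation parameter, via BFGS on the negative log-likelihood."""
    p = X.shape[1]
    def nll(theta):
        beta, logit_pi = theta[:p], theta[p]
        lam = np.exp(X @ beta)                  # log link for the mean
        pi = 1.0 / (1.0 + np.exp(-logit_pi))    # logit link for excess zeros
        return -zip_log_pmf(y, lam, pi).sum()
    res = minimize(nll, x0=np.zeros(p + 1), method="BFGS")
    return res.x[:p], 1.0 / (1.0 + np.exp(-res.x[p]))  # (beta_hat, pi_hat)
```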

Keywords

Zero-inflated regression models

Count data with excess zeros

Fractional binomial distribution 

Co-Author

Chloe Breece, University of North Carolina Wilmington

First Author

Jeonghwa Lee, University of North Carolina Wilmington

Presenting Author

Jeonghwa Lee, University of North Carolina Wilmington

High-Dimensional Matching with Genetic Algorithms

Matching in observational studies estimates causal effects by balancing covariate distributions between treated and control groups. Traditional methods rely on pairwise distances, but in high-dimensional, low-sample size settings, the curse of dimensionality makes it difficult to distinguish observations. To address this, we propose a novel matching method using genetic algorithms, shifting focus from individual- to group-level distances. Our method improves causal effect estimation by optimizing the similarity of high-dimensional joint covariate distributions. This approach has key advantages: (1) it avoids dimension reduction, preserving full covariate information without additional modeling; (2) it maintains transparency by not relying on outcomes, akin to traditional matching; and (3) it is robust in low-sample size settings, where traditional methods may struggle. Moreover, our results show the proposed method is competitive with existing approaches even in low-dimensional cases. Through simulations and real data applications, we validate its performance, offer practical guidance, and highlight its potential as a tool for causal inference in high- and low-dimensional settings. 
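
A toy version of the idea: a genetic algorithm searches over control subsets to minimize a group-level distance, here the energy distance, between treated and matched-control covariate distributions. The fitness function, crossover, and mutation operators are illustrative choices, not the authors' exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_distance(A, B):
    """Group-level distance between two covariate samples."""
    d = lambda U, V: np.mean(np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1))
    return 2.0 * d(A, B) - d(A, A) - d(B, B)

def ga_match(X_treat, X_ctrl, n_select, pop=40, gens=100, mut=0.2):
    """Evolve binary masks over the control pool; fitness is the (negated)
    energy distance between treated units and the selected controls."""
    n = len(X_ctrl)
    def random_mask():
        m = np.zeros(n, bool)
        m[rng.choice(n, n_select, replace=False)] = True
        return m
    population = [random_mask() for _ in range(pop)]
    for _ in range(gens):
        scores = np.array([-energy_distance(X_treat, X_ctrl[m]) for m in population])
        # tournament selection: best of three random candidates
        parents = [population[max(rng.choice(pop, 3), key=lambda i: scores[i])]
                   for _ in range(pop)]
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            for _ in range(2):
                pool = np.where(a | b)[0]          # crossover: union of parents
                child = np.zeros(n, bool)
                child[rng.choice(pool, n_select, replace=False)] = True
                if rng.random() < mut:             # mutation: swap one unit
                    on, off = np.where(child)[0], np.where(~child)[0]
                    child[rng.choice(on)], child[rng.choice(off)] = False, True
                children.append(child)
        population = children
    scores = np.array([-energy_distance(X_treat, X_ctrl[m]) for m in population])
    return population[int(np.argmax(scores))]      # best control-subset mask
```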

Keywords

Matching

High-dimensional data

Genetic algorithms

Covariate balance

Low-sample size settings

Causal inference 

Co-Author

Kwonsang Lee, Seoul National University

First Author

Hajoung Lee, Seoul National University

Presenting Author

Hajoung Lee, Seoul National University

Nonparametric Erlang Mixtures

Erlang mixture models are essential tools for modeling insurance losses and evaluating aggregate risk measures. However, finding the maximum likelihood estimate (MLE) of Erlang mixtures is challenging because the shape parameters take values in a discrete space. This discreteness complicates the application of the standard expectation-maximization (EM) algorithm commonly used for mixture models. Although alternative algorithms have been proposed to compute the MLE of Erlang mixtures, they are often restricted to parametric models and tend to converge to local maxima of the likelihood function. In this study, we focus on the nonparametric Erlang mixture model, which offers greater flexibility than parametric models, and introduce an algorithm for computing the nonparametric maximum likelihood estimate (NPMLE). By exploiting the gradient function, the method efficiently identifies critical support points, increasing the chance of finding the global maximizer. Numerical studies demonstrate that our approach estimates Erlang mixture models more stably and accurately than existing methods. 
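
A minimal sketch of the gradient-function idea, assuming a fixed common scale `theta` and integer Erlang shapes: the candidate shape with the largest positive gradient (the directional derivative of the log-likelihood) is added to the support, then mixing weights are updated by EM with the support held fixed. At the NPMLE the gradient is nonpositive everywhere, which supplies the stopping rule. All names and the candidate grid are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gamma

def mixture_density(x, shapes, weights, theta):
    """Erlang mixture density: Gamma components with integer shapes."""
    comp = np.array([gamma.pdf(x, a=k, scale=theta) for k in shapes])
    return weights @ comp

def gradient(k, x, shapes, weights, theta):
    """Directional derivative of the log-likelihood toward shape k;
    positive values indicate the likelihood can still be increased."""
    f = mixture_density(x, shapes, weights, theta)
    return np.sum(gamma.pdf(x, a=k, scale=theta) / f) - len(x)

def npmle_step(x, shapes, weights, theta, k_max=50, em_iters=20):
    # add the candidate shape with the largest positive gradient
    grads = [gradient(k, x, shapes, weights, theta) for k in range(1, k_max + 1)]
    k_star = int(np.argmax(grads)) + 1
    if grads[k_star - 1] > 1e-8 and k_star not in shapes:
        shapes = shapes + [k_star]
        weights = np.append(weights * 0.9, 0.1)
    # EM update of the mixing weights with the support held fixed
    for _ in range(em_iters):
        comp = np.array([gamma.pdf(x, a=k, scale=theta) for k in shapes])
        resp = weights[:, None] * comp
        resp /= resp.sum(axis=0)            # posterior component memberships
        weights = resp.mean(axis=1)
    return shapes, weights
```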

Keywords

Erlang mixtures

nonparametric mixtures

NPMLE

gradient function

EM algorithm 

Co-Author

Byungtae Seo, Sungkyunkwan University

First Author

KyeongA Yang, Sungkyunkwan University

Presenting Author

KyeongA Yang, Sungkyunkwan University