Data Analysis and Modeling

Katherine McLaughlin, Chair
Oregon State University
 
Thursday, Aug 7: 8:30 AM - 10:20 AM
4206 
Contributed Papers 
Music City Center 
Room: CC-105B 

Main Sponsor

Survey Research Methods Section

Presentations

A robust imputation method for missing data in high throughput observations

Missing data issues are highly prevalent in High Throughput Studies (HTS), and the missingness in such studies is rarely missing at random. We describe varying percentages of missingness and quantify the amount of missingness in a clinical study. Acute Myeloid Leukemia (AML) is a cancer of the myeloid line of blood cells in the bone marrow and blood, and one of the most lethal cancer types. We have gene expression data on AML for thousands of genes, covering three subtypes (Normal, CK type, and CBF type) that we plan to compare. Gene expression data usually contain many genes with zero counts. Missing value imputation methods are versatile techniques for dealing with missingness; they facilitate analysis by retaining the majority of the dataset for further analysis. Here, our goal is to compare a robust imputation technique (an MLE-based approach) with conventional imputation techniques, namely mean imputation, KNN imputation, and the EM algorithm, and to choose the best one for our AML study. 
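As an illustration of the conventional baselines named in the abstract, the sketch below imputes a toy gene-expression matrix with mean and KNN imputation via scikit-learn. The matrix, its dropout mechanism, and all parameters are hypothetical, and the sketch does not reproduce the robust MLE-based method under comparison.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
# Hypothetical gene-expression matrix: 8 samples x 5 genes
X = rng.lognormal(mean=2.0, sigma=0.5, size=(8, 5))
# Mimic missingness that is not at random: low-abundance values drop out
X[X < np.quantile(X, 0.2)] = np.nan

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # fill with column means
knn_imputed = KNNImputer(n_neighbors=3).fit_transform(X)        # average of nearest samples
```

Both imputers remove all missing entries; the comparison in the study then asks which filled-in values best preserve downstream subtype analyses.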

Keywords

Acute Myeloid Leukemia

High Throughput Data

Imputation

Maximum Likelihood approach

Mean, KNN, EM algorithm 

Co-Author(s)

Sarmistha Das
Anand Seth, Research Mentor
Shesh N. Rai, Biostats, Health Inform & Data Sci | College of Medicine

First Author

Bipulkumar Das, University of Cincinnati

Presenting Author

Bipulkumar Das, University of Cincinnati

Bayesian combined statistical decision limits with covariates

Decision limits are crucial in laboratory medicine for guiding diagnostic and decision-making processes. While reference ranges offer general guidelines, decision limits, often one-sided upper limits, serve as diagnostic criteria for specific conditions. In complex diagnoses that require the use of several analytes, computing separate univariate decision limits increases the number of false positives. As an alternative, it is recommended to construct multivariate decision limits that account for the cross-correlations among analytes. Moreover, appropriate decision limits may also be needed for specific values of covariates (e.g., age and sex). For this reason, this study proposes an approach to compute regression-based multivariate statistical decision limits within the multivariate normal framework. The criterion used in obtaining the decision limits is related to Bayesian tolerance regions. Simulation results show that the proposed statistical decision limits have highly satisfactory frequentist properties. Finally, the approach used in this study controls the desired false positive rate at a prespecified level of confidence. 
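The union-of-univariate-limits problem described in the abstract can be checked with a quick Monte Carlo sketch (all numbers hypothetical): with two correlated analytes, flagging a patient whenever either analyte exceeds its own 95% upper limit yields a joint false-positive rate well above the nominal 5%, which is what motivates joint multivariate limits.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Two correlated analytes in a hypothetical healthy population
cov = np.array([[1.0, 0.7],
                [0.7, 1.0]])
Z = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)

u = norm.ppf(0.95)  # separate univariate 95% upper limit for each analyte
flagged = (Z[:, 0] > u) | (Z[:, 1] > u)
joint_fpr = flagged.mean()  # noticeably above the nominal 0.05
```

A multivariate limit calibrated to the joint distribution would instead hold the overall false-positive rate at the prespecified level.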

Keywords

decision limits

Bayesian multivariate regression

tolerance interval

laboratory medicine 

Co-Author

Michael Daniel Lucagbo, University of the Philippines Diliman

First Author

Lian Mae Tabien, University of the Philippines

Presenting Author

Michael Daniel Lucagbo, University of the Philippines Diliman

Clustered coefficient logistic linear mixed models in small area estimation

Logistic linear mixed models are used in small area estimation to construct unit-level model-based estimators for binary outcomes. Instead of assuming common regression coefficients for all small domains, as in the traditional model, we propose a new model with clustered coefficients that incorporates random effects and allows different regression coefficients or intercepts in different clusters of domains. To this end, an optimization problem based on penalized quasi-likelihood (PQL) with pairwise penalties is considered. A new algorithm based on the linearized alternating direction method of multipliers (ADMM) is developed to find clusters and estimate parameters simultaneously. Simulations compare the proposed approach with traditional approaches to show the advantages of the proposed estimator. 
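A much-simplified two-stage stand-in for the joint PQL-plus-ADMM procedure (separate per-domain logistic fits followed by k-means on the fitted coefficients; all data and settings are hypothetical) illustrates the target of the method: recovering clusters of domains that share regression coefficients.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical: 10 small domains whose intercepts form two latent clusters
true_intercepts = np.array([1.5] * 5 + [-1.5] * 5)

coefs = []
for b0 in true_intercepts:
    x = rng.normal(size=(400, 1))
    p = 1.0 / (1.0 + np.exp(-(b0 + 0.8 * x[:, 0])))  # common slope, clustered intercept
    y = rng.binomial(1, p)
    fit = LogisticRegression().fit(x, y)
    coefs.append([fit.intercept_[0], fit.coef_[0, 0]])

# Cluster the per-domain coefficient estimates
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.array(coefs))
```

Unlike this two-stage sketch, the proposed ADMM algorithm estimates the coefficients and the cluster structure simultaneously, borrowing strength across domains.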

Keywords

Logistic linear mixed models

Clustering

ADMM algorithm

Small area estimation 

First Author

Xin Wang, San Diego State University

Presenting Author

Xin Wang, San Diego State University

Detecting AI-Generated Survey Responses: Algorithm Development and Bias Mitigation

Large language model (LLM)-generated responses to open-ended questions have become increasingly common in online surveys. Unfortunately, this potentially compromises survey data quality and increases the cost of data collection and review. To tackle this challenge, we have developed a machine learning classifier that detects AI-generated responses to open-ended survey questions. This presentation will highlight the ways in which off-the-shelf LLMs do not respond like typical survey respondents and how key differences helped drive feature selection for the classifier. To create training data, we generated responses from LLMs (e.g., GPT, Llama, and Claude) to compare with responses from survey respondents to multiple open-ended questions. Performance is excellent, with precision and recall as high as 99% on held-out training data and in the low 90s for unseen observations from subsequent surveys using different types of questions and subject-matter domains. We will conclude this presentation with a discussion of bias and equity considerations, noting how performance varies across groups and suggesting equitable approaches to handling responses labeled as potentially AI-generated. 
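A bare-bones sketch of such a detector (TF-IDF features plus logistic regression on a handful of made-up responses) shows the overall shape of the pipeline; the actual classifier's engineered features and training corpus are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training texts: 1 = AI-generated style, 0 = human respondent style
texts = [
    "Certainly! There are several key factors to consider in this matter.",
    "As an AI language model, I would highlight cost, convenience, and trust.",
    "It is important to note that opinions on this topic vary considerably.",
    "idk probably the prices, everything got so expensive lately",
    "the bus near my house is always late so i just drive tbh",
    "never really thought about it, maybe ask my wife lol",
]
y = [1, 1, 1, 0, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, y)
train_acc = clf.score(texts, y)  # trivially high on this toy set
```

In practice, the equity question raised in the abstract matters precisely because stylistic cues like these can correlate with demographic groups, so labeled responses should be handled with review rather than automatic rejection.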

Keywords

AI

Large language models

Survey data quality

Machine learning

Text analysis

Natural language processing 

Co-Author

Lilian Huang

First Author

Brandon Sepulvado

Presenting Author

Brandon Sepulvado

Evaluating Treatment Effect with Mixed Endpoints in a Phase IV Cancer Trial – Frequentist Approach

Advancements in cancer therapies have significantly improved long-term survival. However, survivors face higher risks of long-term morbidities and mortality. Those treated with cardiotoxic therapy (anthracyclines or chest radiation) are at greater risk of cardiovascular disease, assessed here through Afterload (AF), a continuous variable, and Fractional Shortening (FS), a binary variable. FS is classified as abnormal (FS < 0.28) or normal (FS ≥ 0.28). Hudson et al. (2007) evaluated risk factors for these outcomes independently. This work presents a likelihood-based approach for jointly analyzing these mixed endpoints. We illustrate this by assessing the effect of risk group (AR vs. NAR) on the cardiovascular outcomes of AF and FS using Hudson et al.'s (2007) data. First, we analyze AF and FS separately using a linear regression model and a probit model. Then, we apply joint modeling to account for the FS-AF correlation and compare results with the independent analyses. Additionally, we conduct simulations to assess the performance of the joint modeling approach under different sample sizes and correlation values to evaluate its operating characteristics. 

Keywords

Frequentist Approach

Mixed Endpoint

Probit Model

Linear Regression

Hotelling’s T-Squared Test

Simulation 

Co-Author(s)

Deo Kumar Srivastava, St. Jude Children's Research Hospital
Zhuo Qu, St. Jude Children's Research Hospital, Memphis, Tennessee
Anand Seth, Research Mentor
Shesh N. Rai, Biostats, Health Inform & Data Sci | College of Medicine

First Author

Muhammad Mahabub Rahaman Manik

Presenting Author

Muhammad Mahabub Rahaman Manik

WITHDRAWN Improved partitioning of commercial and non-commercial Deep7 bottomfish catch in the MHI

Data from the Hawaii Marine Recreational Fishing Survey (HMRFS) were used to obtain non-commercial catch. Non-commercial catch was estimated as the product of the catch rate and fishing effort, and adjustments were made to both components to exclude fishing trips that were not entirely recreational from the HMRFS data. By estimating only non-commercial catch, values could be combined with mandatory commercial fishing reports to estimate total catch for the stock assessment. Two adjustments were made during the catch rate estimation: 1) catch claimed as sold in HMRFS was excluded; and 2) catch claimed as non-sold by expense fishers (who sometimes sell fish to cover fishing expenses) and part-time commercial fishers was excluded, as these fishers must report all catch in the commercial reporting system, even if only a portion is sold. Fishing effort estimates (derived from telephone and mail surveys) were also adjusted to exclude trips from expense and part-time commercial fishers. The non-commercial catch from this study in combination with the catch in commercial reports may better define total fish removal for Deep7 bottomfish stock assessments in the main Hawaiian Islands (MHI). 
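The adjustment logic above, estimating non-commercial catch as (adjusted catch rate) × (adjusted effort) while excluding sold catch and trips by expense or part-time commercial fishers, can be sketched with hypothetical trip records (all numbers invented for illustration):

```python
# Hypothetical HMRFS-style trip records
trips = [
    {"catch": 5,  "sold": 0, "type": "recreational"},
    {"catch": 8,  "sold": 3, "type": "recreational"},  # 3 sold fish excluded from the rate
    {"catch": 12, "sold": 0, "type": "expense"},       # expense fisher: trip excluded
    {"catch": 3,  "sold": 0, "type": "recreational"},
    {"catch": 7,  "sold": 0, "type": "part-time"},     # part-time commercial: trip excluded
]

rec = [t for t in trips if t["type"] == "recreational"]
catch_rate = sum(t["catch"] - t["sold"] for t in rec) / len(rec)  # unsold fish per trip
effort = 1000 * len(rec) / len(trips)  # hypothetical effort estimate, similarly adjusted
noncommercial_catch = catch_rate * effort
```

The excluded sold and expense/part-time catch is not lost: it appears in the mandatory commercial reports, so summing the two sources avoids double counting in the total-removal estimate.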

Keywords

Hawaii Marine Recreational Fishing Survey (HMRFS)

Deep7 bottomfish

non-commercial catch

fishing effort

catch rate

stock assessment 

Co-Author(s)

Toby Matthews, NOAA Fisheries
Marc Nadon, NOAA Fisheries
John Syslo, NOAA Fisheries
Meg Oshima, NOAA Fisheries
Felipe Carvalho, NOAA Fisheries

First Author

Hongguang Ma, PIFSC, NOAA Fisheries

Regression Analysis for Longitudinal Survey Data with a Diverging Number of Covariates

In economics and the social and health sciences, longitudinal sample surveys often exhibit complex sampling design features such as unequal selection probabilities, stratification, and clustering of individuals. For data collected from some large-scale surveys, or from surveys linked to administrative data files, special methods are required for inference when exploring relationships between outcome variables and covariates.

Under the semiparametric modeling approach, the within-cluster correlation is unspecified. The quadratic inference function approach provides consistent and asymptotically normal estimators of model parameters when their number is finite. In this paper, we consider the case when the number of covariates grows to infinity as the number of clusters increases and we illustrate how the rate of divergence in the number of parameters affects the convergence rate of penalized estimators. The procedure simultaneously estimates parameters and selects important variables, accounting for both within-cluster correlation and the complex survey design features. 

Keywords

Complex sampling design

Diverging number of parameters

Longitudinal data

Model selection

Oracle property

Quadratic inference functions 

First Author

Laura Dumitrescu, Fairfield University

Presenting Author

Laura Dumitrescu, Fairfield University