SPEED 2: Data Challenge II & Methods for Correlated Data, Part 1

Eric Odoom Chair
University of Cincinnati
 
Sunday, Aug 4: 2:00 PM - 3:50 PM
5002 
Contributed Speed 
Oregon Convention Center 
Room: CC-E141 

Presentations

A Group Penalization Framework for Detecting Time-Lagged Microbiota-Host Associations

We present a framework to identify time-lagged associations between abundances of longitudinally sampled microbiota and a stationary response (final health outcome, disease status, etc.). We introduce a definition of the time lag by imposing a particular grouping structure on the association pattern of longitudinal microbial measurements. Using group regularization methods, we identify these time-lagged associations including their strengths, signs, and timespans. Simulation results demonstrate accurate identification of time lags and estimation of signal strengths by our approach. We apply this framework to find specific gut microbial taxa and their lagged effects associated with increased parasite worm burden in zebrafish. 
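To make the idea of group regularization over lag windows concrete, here is a minimal sketch of block soft-thresholding, the proximal step of a group-lasso penalty. It is an illustration under stated assumptions, not the framework described in the abstract: the function name, the grouping of time points into candidate lag windows, and the penalty level `lam` are all hypothetical.

```python
import math

def group_soft_threshold(coefs, groups, lam):
    """Block soft-thresholding: the proximal operator of the group-lasso penalty.

    coefs  : list of floats, one coefficient per time point (hypothetical)
    groups : list of index lists, each a candidate lag window (hypothetical)
    lam    : penalty level; groups whose Euclidean norm is <= lam are zeroed
    """
    out = list(coefs)
    for idx in groups:
        norm = math.sqrt(sum(coefs[i] ** 2 for i in idx))
        # Shrink the whole group toward zero, or drop it entirely.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = coefs[i] * scale
    return out

# Made-up coefficients: only the middle window carries signal.
coefs = [0.05, -0.02, 0.9, 1.1, 0.01]
groups = [[0, 1], [2, 3], [4]]
shrunk = group_soft_threshold(coefs, groups, lam=0.2)
```

Because each window is kept or dropped as a unit, the surviving group indicates both where an association sits in time and how long it lasts.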

Keywords

Longitudinal data

Gut microbiome

Group penalization

Time-lagged associations

Biostatistics

Disease modeling 

View Abstract 3149

Co-Author(s)

Thomas Sharpton, Oregon State University
Yuan Jiang, Oregon State University

First Author

Emily Palmer, Oregon State University

Presenting Author

Emily Palmer, Oregon State University

A Joint Model of Longitudinal and Interval-censored Post-Diagnosis Time to Event Data

This work presents a novel joint model of a longitudinal biomarker and interval-censored post-diagnosis time to event outcome in the presence of interval-censored covariates due to the unknown initial event diagnosis time. By treating interval-censored initial event as missing data, we develop an expectation-maximization algorithm for semi-parametric maximum likelihood estimation, where the distribution of interval-censored initial event is modeled. A simulation framework is constructed to demonstrate the performance of our novel approach across a variety of scenarios. We applied this joint model to large-scale UK-Biobank data and found that (1) the age at diagnosis of diabetes was positively associated with the systolic blood pressure; (2) a smoker had a significantly increased risk of cardiovascular disease (CVD) event, but midpoint analysis detected no significance in these two covariates. Lastly, using Brier Score as a calibration measure for dynamic prediction, our proposed model yielded a higher accuracy of CVD event prediction than midpoint analysis. In summary, both our simulation and application results showed that our proposed model outperformed midpoint analysis. 
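For readers unfamiliar with the Brier score, the sketch below computes its simplest form at a single prediction horizon. It is an assumption-laden illustration: it ignores censoring, which a Brier score for dynamic prediction would normally have to account for, and the function name and example values are hypothetical.

```python
def brier_score(pred_probs, outcomes):
    """Mean squared difference between predicted event probabilities and
    observed 0/1 outcomes at a fixed horizon; lower is better."""
    if len(pred_probs) != len(outcomes):
        raise ValueError("inputs must have the same length")
    return sum((p - y) ** 2 for p, y in zip(pred_probs, outcomes)) / len(outcomes)

# Made-up predictions for four subjects and whether the event occurred.
score = brier_score([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 0])
```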

Keywords

Joint model

Unknown initial diagnosis

Interval-censored data

Longitudinal biomarker

Time to event outcome 

View Abstract 2012

Co-Author

Gang Li, University of California-Los Angeles

First Author

Shanpeng Li, City of Hope

Presenting Author

Shanpeng Li, City of Hope

Evaluating Performance of Unsupervised Machine Learning Methods for Time Series Clustering

Unsupervised clustering is widely used to discover patterns in data without pre-defined labels. Clustering methods for time series data have been less studied and still present challenges. In this study, we use simulated data to showcase the performance of clustering algorithms on time series data and provide new insights into methodological choices. We selected a range of clustering algorithms (hierarchical, k-means, k-medoids, Gaussian mixture, self-organizing maps, and density-based clustering) and distance metrics, including Euclidean, correlation-based distances, dynamic time warping (DTW), and variants such as weighted DTW. Results were evaluated using the adjusted Rand index and validated with known cluster labels. Preliminary findings in simulated univariate time series data showed that data transformation (i.e., standardization) was the leading determinant of clustering performance. In benchmark multivariate time series data, clustering performance was weaker. Next steps include investigations using simulated multivariate data. Results inform a project to identify distinct diurnal patterns of multiple air pollutants. 
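Two of the choices named above, standardization and dynamic time warping, can be sketched in a few lines. This is a minimal illustration on made-up series, assuming an absolute-difference local cost and no warping window; it is not the study's pipeline.

```python
import math

def zscore(x):
    """Standardize a series to mean 0 and (population) standard deviation 1."""
    m = sum(x) / len(x)
    s = math.sqrt(sum((v - m) ** 2 for v in x) / len(x))
    return [(v - m) / s for v in x]

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping, absolute-difference cost."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def euclidean(a, b):
    """Pointwise Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

base = [0, 0, 1, 3, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 3, 1, 0, 0]   # same shape, one step later
scaled = [10 * v + 5 for v in base]  # same shape, different units
```

In this toy case DTW treats `base` and `shifted` as identical while the Euclidean distance does not, and `zscore` maps `scaled` back onto `base`, which is one way to see why standardization can dominate clustering results.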

Keywords

time series data

clustering

unsupervised learning 

View Abstract 3223

Co-Author(s)

Yue Zhang, University of Utah
Kenan Li, Saint Louis University
Erika Garcia, University of Southern California
Sandrah Eckel, University of Southern California

First Author

Brittney Marian

Presenting Author

Brittney Marian

An examination of the effectiveness of group control charts for monitoring COVID-19 data in Michigan

Incidence data for COVID-19 was collected by county across the state of Michigan. This data can be standardized such that statistics computed for each county can be assumed to follow the same distribution. One can then plot the largest and smallest statistics at a given timepoint on a group control chart. The authors will present the development of two group control charts that could be used to monitor county (or geospatial) data for the state of Michigan. The authors will then demonstrate the usage of these group control charts using incidence data for COVID-19. The authors will also examine the impact of policy decisions (enacted by the Governor and the legislature) on the spread of COVID-19. 
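The plotting statistic of a group control chart can be sketched as follows: standardize each county against its own baseline, then keep only the largest and smallest standardized value at each time point. The county names, counts, baseline length, and control limit below are all made up for illustration and are not taken from the presentation.

```python
import statistics

def group_chart_points(series_by_county, baseline_len):
    """Return, per time point after the baseline, the largest and smallest
    standardized value across counties and the county that produced each."""
    z = {}
    for county, values in series_by_county.items():
        base = values[:baseline_len]
        mu = statistics.mean(base)
        sd = statistics.stdev(base)
        z[county] = [(v - mu) / sd for v in values[baseline_len:]]
    n_times = min(len(v) for v in z.values())
    points = []
    for t in range(n_times):
        hi = max(z, key=lambda c: z[c][t])
        lo = min(z, key=lambda c: z[c][t])
        points.append({"t": t, "max": (hi, z[hi][t]), "min": (lo, z[lo][t])})
    return points

# Made-up weekly incidence counts for three hypothetical counties.
data = {
    "A": [10, 12, 11, 9, 10, 11, 30],
    "B": [50, 52, 49, 51, 50, 50, 51],
    "C": [5, 6, 5, 4, 5, 5, 4],
}
points = group_chart_points(data, baseline_len=5)
LIMIT = 3.0  # assumed control limit, not taken from the abstract
signals = [p for p in points if p["max"][1] > LIMIT or p["min"][1] < -LIMIT]
```

A single chart of these two extremes stands in for one chart per county, and the recorded county label shows where a signal originated.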

Keywords

Statistical Process Control

COVID-19 incidence data

Group Control Charts 

View Abstract 3837

Co-Author

Ashley Grace Davies, Grand Valley State University

First Author

Paul Stephenson, Grand Valley State University

Presenting Author

Ashley Grace Davies, Grand Valley State University

Analysis of Social Trends using the General Social Survey

As times change, so do societal opinions and attitudes regarding topics as diverse as financial well-being, women's rights, and national policies. Opinions on these topics are often not random, but rather can be related to demographic characteristics such as race, age, and gender, among others. The NORC General Social Survey (GSS) has collected data about Americans' social attitudes since 1972. However, it is not always clear how best to analyze this type of data. In this analysis, we provide several different approaches to analyze the wide-ranging data found in the GSS, including time series analysis of financial well-being, sentiment analysis regarding national policies over time, and a meta-analysis of the distribution of survey questions related to women and women's rights over time. By understanding how these topics are impacted by respondents' demographic characteristics and how these opinions change over time, we can better understand what people value and prioritize and gain insight into the ways social sentiment can influence, or be influenced by, current events. 

Keywords

Social Survey

Sentiment Analysis

Public Opinion

Current Events 

View Abstract 3612

Co-Author(s)

Damon Leach, Pacific Northwest National Laboratory
Logan Lewis, PNNL
Sydney Schwartz, PNNL
Beata Meluch, PNNL
Samantha Obermiller, PNNL
David Degnan, Pacific Northwest National Laboratory
Lisa Bramer, Pacific Northwest National Laboratory

First Author

Natalie Winans, Pacific Northwest National Laboratory

Presenting Author

Damon Leach, Pacific Northwest National Laboratory

BulkLMM: Real-time genome scans for multiple quantitative traits using linear mixed models

Genetic studies often collect data using high-throughput phenotyping. That has led to the need for fast genomewide scans for a large number of traits using linear mixed models (LMMs). Computing the scans one by one on each trait is time consuming. We have developed new algorithms for performing genome scans on a large number of quantitative traits using LMMs. Our method, BulkLMM, speeds up the computation by orders of magnitude compared to one-trait-at-a-time scans. On a mouse BXD liver proteome dataset with more than 35,000 traits and 7,000 markers, BulkLMM completed in a few seconds. We use vectorized, multi-threaded operations and regularization to improve optimization, and numerical approximations to speed up the computations. Our software implementation in the Julia programming language also provides permutation testing for LMMs and is available at https://github.com/senresearch/BulkLMM.jl. 

Keywords

GWAS

eQTL

LMM

Julia 

View Abstract 3829

Co-Author(s)

Gregory Farage
Robert Williams
Karl Broman, University of Wisconsin-Madison
Saunak Sen, University of Tennessee Health Science Center

First Author

Zifan Yu

Presenting Author

Saunak Sen, University of Tennessee Health Science Center

Data Challenge Expo Entry

Placeholder for data challenge expo sponsored by the Statistical Computing Section 

Keywords

statistical computing 

View Abstract 3660

Co-Author(s)

Cristina Anton, SAS Institute
Tom Grant, SAS Institute
Lincoln Groves, SAS Institute
Linda Jordan
Rachel McLawhon, SAS
Bryan Mehi, SAS

First Author

Jacqueline Johnson, SAS Institute

Presenting Author

Jacqueline Johnson, SAS Institute

Do Positive Mass Communications Help SIPP Interviewers Succeed?

Research has shown that survey interviewers need more morale boosts and opportunities to improve their competencies. The Survey of Income and Program Participation (SIPP) emailed a series of informative and positive messages to its interviewers with the intent of keeping them engaged in the survey data collection. The SIPP is a longitudinal, annual, household, government computer-assisted personal interview (CAPI) survey conducted by the U.S. Census Bureau, where SIPP interviewers interview sample primarily in person and self-manage their workloads.

The SIPP is a challenging survey to field. While these emails do not make the survey any less challenging, they are designed to establish more intrinsic value in the work interviewers do, in the hope that this will put them in a better position to succeed. Specifically, this research explores whether the emails lead to more interviewers meeting their intermittent progress goals and uses time series methods to analyze whether the emails led to any significant changes in how interviewers work their cases. 

Keywords

data collection

email

interviewer morale

CAPI

longitudinal survey 

View Abstract 1944

First Author

Kevin Tolliver, US Census Bureau

Presenting Author

Kevin Tolliver, US Census Bureau

Functional Data Analysis for Rodent Sleep Data

As statistical methods for continuous data progress, there remains a need for applying sophisticated statistical techniques to complex behavioral neuroscience datasets. In an experiment studying the impact of Vitamin K deficiency on sleep following changes in dietary Vitamin K, rodent electroencephalography (EEG) and electromyography (EMG) data were collected using implanted wireless physiological telemetry devices and rodent sleep state scoring was performed. While the data collected are continuous, current analysis approaches typically model averages of responses over time using an analysis of variance (ANOVA) or repeated measures ANOVA model. One approach that leverages the original complexity of the data is functional data analysis (FDA). In this talk, we discuss functional data analysis and its fitness for analyzing a longitudinal dataset, as well as its limitations or when traditional models may remain the preferred approach. We will fit a functional model to our neurological dataset and demonstrate the process for selecting appropriate functional mean and variance structures. 

Keywords

Functional Data Analysis

Neuroscience

Longitudinal Data

Rodent Studies 

View Abstract 2671

Co-Author(s)

Katherine Allen-Moyer, Social and Scientific Systems, Inc., a DLH Holdings Corp Company, Durham, North Carolina.
Leslie Wilson, Neurobehavioral Core Laboratory, NIEHS, NIH, Department of Health and Human Services, RTP, NC.
Wei Fan, Metabolism, Genes, and Environment Group, Signal Transduction Laboratory, NIEHS, NIH, RTP, NC.
Jesse Cushman, Neurobehavioral Core Laboratory, NIEHS, NIH, RTP, NC.
Xiaoling Li, Metabolism, Genes, and Environment Group, Signal Transduction Laboratory, NIEHS, NIH, RTP, NC.
Leping Li, Biostatistics Branch/NIEHS
Helen Cunny, Division of Translational Toxicology/NIEHS
Keith Shockley, National Institute of Health

First Author

Kathryn Konrad, DLH

Presenting Author

Kathryn Konrad, DLH

Identifying Voter Behavior

What can sway a voter to alter their voting pattern and vote for someone that they would not normally vote for in an election? This project aims to analyze political trends by comparing national data from the General Social Survey (GSS) with precinct-level voting patterns in Florida, focusing on shifts towards independent or Democratic preferences. By examining historical voter turnout and opinions on political issues, we seek to identify correlations between national ideologies and local voting behaviors in Florida. The analysis will highlight precincts with high voter turnout and their political inclinations, providing insights into the electorate's evolving dynamics. The findings will inform strategies for a potential independent candidate vying for a seat in Florida's State House of Representatives, emphasizing areas with potential for significant impact. Through statistical analysis and data visualization, we aim to bridge the gap between national trends and local realities, offering a data-driven foundation for targeted campaign initiatives. This comparative study not only contributes to understanding political shifts but also supports democratic engagement by aligning campaign efforts with voter sentiments. 

Keywords

Voting

Elections 

View Abstract 3736

Co-Author

Joshua Cook

First Author

Dane Korver

Presenting Author

Joshua Cook

Individual Level Variances as a Predictor of Health Outcomes

Longitudinal biomarker data and cross-sectional outcomes are routinely collected in modern epidemiology studies, often with the goal of informing tailored early intervention decisions. For example, hormones such as oestradiol (E2) and follicle-stimulating hormone (FSH) may predict changes in women's health during the midlife. Most existing methods focus on constructing predictors from mean marker trajectories. However, subject-level biomarker variability as a predictor may provide critical information about disease risks and health outcomes. In this paper, we develop a joint model that estimates subject-level means and variances of longitudinal predictors to predict a cross-sectional health outcome. Simulations demonstrate excellent recovery of true model parameters. The proposed method provides less biased and more efficient estimates, relative to alternative approaches that either ignore subject-level differences in the variances or perform two-stage estimation where estimated marker variances are treated as observed. Analyses of women's health data reveal that a larger variability of E2 is associated with slower rates of waist circumference gains over the menopausal transition. 

Keywords

Bayesian methods

Joint models

Subject-level variability

Variance component priors

midlife aging

women's health 

View Abstract 2913

Co-Author(s)

Zhenke Wu, University of Michigan
Sioban Harlow, University of Michigan
Carrie Karvonen-Gutierrez, University of Michigan
Michael Elliott, University of Michigan
Michelle Hood, University of Michigan

First Author

Irena Chen

Presenting Author

Irena Chen

Investigating American Optimism: Evidence from 2022 General Social Survey (GSS) Data

In this cross-sectional study, we delve into how a person's characteristics contribute to their degree of optimism, in terms of happiness, how exciting their life is, and how fair they perceive the world around them, drawing on data from the most recent wave of the General Social Survey conducted in 2022. Our exploration is motivated by the findings of Smith (2005), who examined various dimensions of "troubles in America," including health, work, finances, material hardships, family/personal, law and crime, housing, and miscellaneous domains. Our underlying hypothesis posits that specific characteristics of a person (or group of people) may be able to predict their degree of optimism. We screen potential predictors using preliminary tests for association to see if there is any significance among the specific factors prior to including them in the model. Utilizing ordinal logistic regression and leveraging a substantial volume of survey responses, we scrutinize the intricate relationships among these domains. The model incorporates demographic variables and various personal characteristics to estimate their effects on degree of optimism, and we provide bootstrap confidence intervals for the estimates. We studied the interaction effects of different characteristics on the response variables and found significant interactions. In response to the large amounts of missing-at-random data, this study investigates the extent to which data imputation techniques can be utilized for further analysis. 

Keywords

Ordinal Logistic Regression

Data-Imputation

Optimism

General Social Survey (GSS)

Categorical Data

Cross-sectional 

Abstracts


Co-Author(s)

Wooyoung Kim, Washington State University
Jacqueline Carlton
David Rice, Western Washington University
Daryl DeFord, Washington State University

First Author

Md Mahedi Hasan, Washington State University

Presenting Author

Jacqueline Carlton

Jointly Modeling Means and Variances for Nonlinear Mixed Models with Measurement Errors and Outliers

In longitudinal data analysis, the within-individual repeated measurements often exhibit large variations, and these variations appear to change over time. A good understanding of the nature of the within-individual systematic and random variations allows us to conduct more efficient statistical inferences and make better predictions. Motivated by HIV viral dynamic studies, we considered a nonlinear mixed effects (NLME) model for modeling the longitudinal means, together with a model for the within-individual variances which also allows us to address outliers in the repeated measurements. Statistical inference was then based on a joint model for the mean and variance, implemented by a computationally efficient approximate method. Extensive simulations evaluated the proposed method. We found that the proposed method produces more efficient estimates than the corresponding method without modeling the variances. Moreover, the proposed method provides robust inference against outliers. The proposed method was applied to a recent HIV-related dataset, with interesting new findings. 

Keywords

h-likelihood

joint model

measurement error

robust 

View Abstract 1642

Co-Author(s)

Lang Wu, University of British Columbia
Viviane Dias Lima, Department of Medicine, University of British Columbia

First Author

Qian Ye

Presenting Author

Qian Ye

Modeling of Ordinal Longitudinal Accelerometer Data

Improvements in accelerometer technology have led to new types of data on which more powerful predictive models can be built to assess physical activity. This paper implements an ordinal random forest model with recursive forecasting to take into account the ordinal longitudinal nature of responses. The data comes from 28 adults performing activities of daily living in two visits, while wearing accelerometers on the ankle, hip, right and left wrist. The first visit provided training data and the second testing data so that an independent sample, cross-validation approach could be used. For this data, prior responses are not available at the testing stage or in practice. However, recursive forecasts can be made with prior predictions in place of lagging responses on models which were built to use lagging responses as explanatory variables. Models are fit to account for multiple time series, with different time series for each participant in the study. We found that ordinal random forest, when the time series is taken into account, produces better accuracy rates and better linearly weighted kappa values than both ordinary ordinal forest and random forest. On the testing 

Keywords

Longitudinal Data

Multiple Time Series

Accelerometers

Ordinal Models 

View Abstract 3550

First Author

Drew M Lazar, Ball State University

Presenting Author

Drew M Lazar, Ball State University

Predicting Sleep-Wake Times from Wearable Sensor Data using Change Point Detection

In the era of ubiquitous wearable technology, the prediction of sleep and wake times based on activity and biometric data has emerged as an important area of research. Previous studies have predominantly relied on supervised learning algorithms, trained using polysomnography (PSG) data. However, the collection and labeling of PSG data are prohibitively expensive and logistically challenging. Moreover, the alternative use of self-recorded sleep reports as labels for PSG data is fraught with issues of subjectivity and inaccuracy. In response to these challenges, this study introduces an unsupervised algorithm for sleep/wake time prediction using biometric data obtained from wearable devices. This algorithm is grounded in the change point detection methodology, a technique well-suited for identifying pattern changes in time-series data. We estimate common parameters based on general patient data, which enhances the algorithm's adaptability across diverse patient profiles. The algorithm was tested on a cohort of 590 patients. Our results not only validate the effectiveness of the proposed method but also open new avenues for leveraging wearable device data in sleep research. 
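As a rough illustration of change point detection on wearable data, the sketch below finds a single shift in mean by least squares. It is a deliberately simple stand-in, not the algorithm from the abstract, which is multivariate and estimates common parameters across patients; the activity values and the `min_seg` parameter are hypothetical.

```python
def best_single_change_point(x, min_seg=2):
    """Return the split index k (x[:k] vs x[k:]) minimizing the total
    within-segment sum of squared deviations from each segment mean."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)

    best_k, best_cost = None, float("inf")
    for k in range(min_seg, len(x) - min_seg + 1):
        cost = sse(x[:k]) + sse(x[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Hypothetical activity counts per epoch: active, then at rest.
activity = [52, 48, 55, 50, 47, 53, 4, 6, 3, 5, 4, 7]
k = best_single_change_point(activity)  # index of the first "at rest" epoch
```

No labels are needed: the estimated split is read directly as a candidate sleep onset, which is what makes the approach unsupervised.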

Keywords

Change point detection

Sleep-wake times prediction

Unsupervised learning

Multivariate time-series analysis

Wearable sensor data 

View Abstract 2695

Co-Author(s)

Hyonho Chun
Myung Hee Lee, Weill Cornell Medicine
Jeongyoun Ahn, Korea Advanced Institute of Science and Technology

First Author

Jiyu Moon, Korea Advanced Institute of Science and Technology

Presenting Author

Jiyu Moon, Korea Advanced Institute of Science and Technology

Simulating cancer progression events: state-based model transitions with flexible correlation

In advanced cancers, patients undergo multiple different lines of therapies, switching treatments when their disease progresses. Using longitudinal data sources, such as those obtained via Electronic Health Records (EHRs), researchers can study effects of common therapy sequences. New models for studying therapy sequence are needed to inform clinical decisions when prospective trial data is absent; however, to evaluate the performance of such models, a method for simulating EHR-like longitudinal data is required. Here, we develop a method for simulating paths through a state-based model. This method allows transition times to depend on treatments and observed covariates and incorporates within-patient correlation. This is important as patients' outcomes across states may be dependent due to difficult-to-quantify factors (e.g., disease aggressiveness, response to prior therapy, evolution of the mutational landscape). We propose to introduce within-patient correlation using a copula. This flexible class of multivariate models allows researchers to generate outcomes with a range of within-patient correlation structures: Gaussian, t, and Clayton copulas will be considered. 
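The copula idea can be sketched for two consecutive transition times. The example below uses a bivariate Gaussian copula with exponential marginals; the marginal distribution, the rates, the correlation value, and the function names are assumptions made for illustration, and covariate or treatment effects are left out entirely.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_transition_times(n, rho, rate1, rate2, seed=0):
    """Draw n pairs of transition times whose dependence comes from a
    bivariate Gaussian copula with correlation rho and whose marginals
    are exponential with the given rates."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        u1, u2 = normal_cdf(z1), normal_cdf(z2)
        # Inverse exponential CDF; the max() guards against log(0).
        t1 = -math.log(max(1.0 - u1, 1e-300)) / rate1
        t2 = -math.log(max(1.0 - u2, 1e-300)) / rate2
        pairs.append((t1, t2))
    return pairs

# Hypothetical settings: strong within-patient dependence across two states.
pairs = simulate_transition_times(n=2000, rho=0.8, rate1=0.5, rate2=1.0)
```

Swapping the Gaussian step for a t or Clayton construction would change the dependence structure while leaving the marginal time-to-transition distributions untouched, which is the flexibility the abstract refers to.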

Keywords

simulation methods

copula

cancer applications

multi-state models

survival

EHR 

View Abstract 2313

Co-Author(s)

J. Robert Beck, Fox Chase Cancer Center
Daniel Geynisman, Fox Chase Cancer Center

First Author

Elizabeth Handorf, Rutgers University, Rutgers Cancer Institute of New Jersey

Presenting Author

Elizabeth Handorf, Rutgers University, Rutgers Cancer Institute of New Jersey

Testing the Homogeneity of Differences between Two Proportions for Stratified Bi-Unilateral Data

Medical comparative studies often involve collecting data from paired organs, which can produce either bilateral or unilateral data. While many testing procedures are available that account for the intra-class correlation between paired organs for bilateral data, more research needs to be conducted to determine how to analyze combined correlated bilateral and unilateral data. In practice, stratification is often used in analysis to ensure participants are allocated equally to each experimental condition. In this paper, we propose three Maximum Likelihood Estimation (MLE)-based methods for testing the homogeneity of differences between two proportions for stratified bilateral and unilateral data across strata using Donner's model. We compare the performance of these methods with a model-based method based on Generalized Estimating Equations using Monte Carlo simulations. We also provide a real example to illustrate the proposed methodologies. Our findings suggest that the Score test performs well and offers a valuable alternative to the exact tests in future studies. 

Keywords

stratified bilateral and unilateral data

risk difference

MLE-based test procedures

Donner’s model 

View Abstract 2558

Co-Author

Changxing Ma, The State University of New York at Buffalo

First Author

Xueqing Zhang

Presenting Author

Xueqing Zhang

Time Series Analysis of Consumer Price Index Weights

Weight is one of the main components in the Consumer Price Index (CPI) formula. The CPI program currently revises fixed quantity weights on an annual basis. Using graphical displays and correlation analysis, we explore the relationship of CPI series whose weights are from different reference periods, and the relationship among the weights. Our focus is to analyze the seasonal components of CPI series. The initial investigation was through frequency analyses using the Discrete Fourier Transform. We further investigate various aspects of CPI series including trend, changes in rates, jump discontinuities, and outliers. 
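A frequency analysis of this kind can be sketched with a direct Discrete Fourier Transform: the frequency with the largest magnitude points to the dominant seasonal period. The series below is synthetic and trend-free by construction, so it only illustrates the mechanics and says nothing about actual CPI data.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes of the discrete Fourier transform of a real series,
    for frequencies k = 0 .. len(x) // 2 (cycles per series length)."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s))
    return mags

# Synthetic monthly series over 10 years: annual plus weaker semiannual cycle.
n_months = 120
series = [math.sin(2 * math.pi * t / 12) + 0.3 * math.sin(2 * math.pi * t / 6)
          for t in range(n_months)]
mags = dft_magnitudes(series)
k_peak = max(range(1, len(mags)), key=lambda k: mags[k])  # skip k = 0 (the mean)
period_months = n_months / k_peak
```

With real index series, a trend would normally be removed first, since it otherwise concentrates magnitude at the lowest frequencies and can mask the seasonal peaks.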

Keywords

Autocorrelation

Convolution

Discrete Fourier Transform

Periodicity

Trend

Wavelets 

View Abstract 2562

First Author

MoonJung Cho, US Bureau of Labor Statistics

Presenting Author

MoonJung Cho, US Bureau of Labor Statistics

Understanding Confidence in American Institutions Over Time: Comparing Two National Surveys

This research explores historical trends in how much confidence Americans have in political, government, and social institutions as reported by two national surveys. Using responses to comparably worded questions in the General Social Survey (GSS) and the Gallup World Poll about respondents' confidence in various American institutions, we compare trends in public confidence over five decades spanning 1973 to 2022. While the surveys provide conceptually equivalent measures, they reflect different sampling designs and provide a unique opportunity to reconcile trends between public opinion and social survey sources. Normalizing each source to reflect the share of respondents with a great deal of confidence in each type of institution, we show long-run declines in confidence across institutions through a series of visualizations. Considering both surveys as independent estimates of public confidence, we apply cointegration tests to assess whether the difference between the two survey measures is stable over time. Our findings provide a unique comparison between two separate surveys measuring public confidence, noting how each source captures common trends in institutional trust. 

Keywords

public confidence

institutions

governance 

View Abstract 3760

Co-Author

Andrew C. Forrester, University of Maryland

First Author

Ujjayini Das, University of Maryland

Presenting Author

Ujjayini Das, University of Maryland