Sunday, Aug 4: 2:00 PM - 3:50 PM
5002
Contributed Speed
Oregon Convention Center
Room: CC-E141
Presentations
We present a framework to identify time-lagged associations between abundances of longitudinally sampled microbiota and a stationary response (final health outcome, disease status, etc.). We introduce a definition of the time lag by imposing a particular grouping structure on the association pattern of longitudinal microbial measurements. Using group regularization methods, we identify these time-lagged associations including their strengths, signs, and timespans. Simulation results demonstrate accurate identification of time lags and estimation of signal strengths by our approach. We apply this framework to find specific gut microbial taxa and their lagged effects associated with increased parasite worm burden in zebrafish.
Keywords
Longitudinal data
Gut microbiome
Group penalization
Time-lagged associations
Biostatistics
Disease modeling
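The abstract above identifies time-lagged associations with group regularization methods. The core operation behind a group-lasso-style penalty is block soft-thresholding, which zeroes out an entire group of coefficients (here, a block of time lags) when its joint magnitude is small. The sketch below is illustrative only; the function and standalone setting are not the authors' implementation.

```python
import math

def block_soft_threshold(beta_group, lam):
    """Proximal step of the group-lasso penalty: shrink a whole block of
    coefficients toward zero, and set the entire block to zero when its
    Euclidean norm falls below the penalty level lam."""
    norm = math.sqrt(sum(b * b for b in beta_group))
    if norm <= lam:
        return [0.0] * len(beta_group)   # the whole time-lag group drops out
    scale = 1.0 - lam / norm
    return [scale * b for b in beta_group]
```

Because the penalty acts on the group norm rather than on individual coefficients, either the whole block of lags survives (with a shared shrinkage factor) or the whole block is removed, which is what produces an estimated timespan for each association.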
This work presents a novel joint model of a longitudinal biomarker and an interval-censored post-diagnosis time-to-event outcome in the presence of interval-censored covariates due to the unknown initial event diagnosis time. Treating the interval-censored initial event time as missing data, we develop an expectation-maximization algorithm for semiparametric maximum likelihood estimation in which the distribution of the interval-censored initial event is modeled. A simulation framework demonstrates the performance of our approach across a variety of scenarios. Applying this joint model to large-scale UK Biobank data, we found that (1) the age at diagnosis of diabetes was positively associated with systolic blood pressure, and (2) smokers had a significantly increased risk of a cardiovascular disease (CVD) event, whereas midpoint analysis found neither covariate significant. Lastly, using the Brier score as a calibration measure for dynamic prediction, our proposed model yielded higher accuracy of CVD event prediction than midpoint analysis. In summary, both our simulation and application results show that the proposed model outperforms midpoint analysis.
Keywords
Joint model
Unknown initial diagnosis
Interval-censored data
Longitudinal biomarker
Time to event outcome
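The Brier score used in the abstract above as a calibration measure is the mean squared difference between predicted event probabilities and observed 0/1 outcomes. A minimal sketch (ignoring the censoring weights a full time-to-event Brier score would include):

```python
def brier_score(pred_probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    Lower is better; an uninformative 0.5 prediction scores 0.25."""
    assert len(pred_probs) == len(outcomes)
    n = len(outcomes)
    return sum((p - y) ** 2 for p, y in zip(pred_probs, outcomes)) / n
```

A model with a lower Brier score than midpoint analysis is both better calibrated and more accurate on average, which is the sense in which the proposed model "yielded a higher accuracy" above.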
Unsupervised clustering is widely used to discover patterns in data without predefined labels. Clustering methods for time series data have been less studied and still present challenges. In this study, we use simulated data to showcase the performance of clustering algorithms on time series data and provide new insights into methodological choices. We selected a range of clustering algorithms (hierarchical, k-means, k-medoids, Gaussian mixture, self-organizing maps, and density-based clustering) and distance metrics (Euclidean, correlation-based distances, dynamic time warping (DTW), and variants such as weighted DTW). Results were evaluated using the adjusted Rand index and validated with known cluster labels. Preliminary findings in simulated univariate time series data showed that data transformation (i.e., standardization) was the leading determinant of clustering performance. In benchmark multivariate time series data, clustering performance was weaker. Next steps include investigations using simulated multivariate data. Results inform a project to identify distinct diurnal patterns of multiple air pollutants.
Keywords
time series data
clustering
unsupervised learning
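Dynamic time warping, one of the distance metrics compared above, aligns two series before measuring their distance. A pure-Python sketch of the classic dynamic-programming recursion (real studies would use an optimized library such as tslearn or dtaidistance):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two univariate series,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: stretch a, stretch b, or advance both
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

A time-shifted copy of a series can have DTW distance 0 even though its pointwise distance is positive, which is why DTW is attractive for clustering series whose shapes match but whose timing differs.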
Incidence data for COVID-19 were collected by county across the state of Michigan. These data can be standardized so that statistics computed for each county can be assumed to follow the same distribution. One can then plot the largest and smallest statistics at a given time point on a group control chart. The authors will present the development of two group control charts that could be used to monitor county (or geospatial) data for the state of Michigan, and will demonstrate their usage with COVID-19 incidence data. The authors will also examine the impact of policy decisions (enacted by the Governor and the legislature) on the spread of COVID-19.
Keywords
Statistical Process Control
COVID-19 incidence data
Group Control Charts
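The group-chart logic described above plots only the extreme standardized statistics at each time point. A hypothetical sketch of the signaling rule follows; the function name and the fixed control limits are illustrative, not the authors' chart design:

```python
def group_chart_signals(stats_by_time, lcl, ucl):
    """For each time point, keep only the largest and smallest standardized
    county statistics, and flag the time point if either extreme falls
    outside the control limits (lcl, ucl)."""
    signals = []
    for t, stats in enumerate(stats_by_time):
        hi, lo = max(stats), min(stats)
        if hi > ucl or lo < lcl:
            signals.append((t, hi, lo))
    return signals
```

Monitoring only the extremes lets one chart cover all counties at once: if the largest and smallest standardized statistics stay inside the limits, so does every county in between.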
As times change, so do societal opinions and attitudes regarding topics as diverse as financial well-being, women's rights, and national policies. Opinions on these topics are often not random, but rather can be related to demographic characteristics such as race, age, and gender, among others. The NORC General Social Survey (GSS) has collected data about Americans' social attitudes since 1972. However, it is not always clear how best to analyze this type of data. In this analysis, we present several approaches to analyzing the wide-ranging data found in the GSS: time series analysis of financial well-being, sentiment analysis of opinions regarding national policies over time, and a meta-analysis of the distribution of survey questions related to women and women's rights over time. By understanding how these topics are impacted by respondents' demographic characteristics and how these opinions change over time, we can better understand what people value and prioritize and gain insight into the ways social sentiment can influence, or be influenced by, current events.
Keywords
Social Survey
Sentiment Analysis
Public Opinion
Current Events
Genetic studies often collect data using high-throughput phenotyping, which has led to the need for fast genome-wide scans of a large number of traits using linear mixed models (LMMs). Computing the scans one trait at a time is time consuming. We have developed new algorithms for performing genome scans on a large number of quantitative traits using LMMs. Our method, BulkLMM, speeds up the computation by orders of magnitude compared to one-trait-at-a-time scans. On a mouse BXD liver proteome dataset with more than 35,000 traits and 7,000 markers, BulkLMM completed in a few seconds. We use vectorized, multi-threaded operations and regularization to improve optimization, and numerical approximations to speed up the computations. Our software implementation in the Julia programming language also provides permutation testing for LMMs and is available at https://github.com/senresearch/BulkLMM.jl.
Keywords
GWAS
eQTL
LMM
Julia
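The vectorization idea behind scanning many traits at once can be illustrated with plain linear regression: all marker-trait LOD scores fall out of a single matrix product. This Python fragment is only an illustration of that idea, not BulkLMM's algorithm, which fits linear mixed models (accounting for kinship) in Julia:

```python
import numpy as np

def bulk_lod_scan(Y, G):
    """LOD scores for every (marker, trait) pair at once via matrix algebra.
    Y: n x p matrix of traits; G: n x m matrix of marker genotypes.
    Uses the simple-regression identity LOD = -(n/2) * log10(1 - r^2)."""
    n = Y.shape[0]
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)   # standardize traits
    Gs = (G - G.mean(axis=0)) / G.std(axis=0)   # standardize markers
    R = (Gs.T @ Ys) / n                          # m x p correlations in one product
    return -(n / 2.0) * np.log10(1.0 - R ** 2)
```

One matrix multiplication replaces m x p separate regressions, which is the sense in which a bulk scan can be orders of magnitude faster than looping over traits.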
Placeholder for data challenge expo sponsored by the Statistical Computing Section
Keywords
statistical computing
Research has shown that survey interviewers need more morale boosts and opportunities to improve their competencies. The Survey of Income and Program Participation (SIPP) emailed a series of informative and positive messages to its interviewers with the intent of keeping them engaged in the survey data collection. The SIPP is an annual longitudinal household survey conducted by the U.S. Census Bureau using computer-assisted personal interviewing (CAPI), in which interviewers work their sample primarily in person and self-manage their workloads.
The SIPP is a challenging survey to field. While these emails do not make the survey any less challenging, they are designed to build more intrinsic value in the interviewers' work, with the hope of putting them in a better position to succeed. Specifically, this research explores whether the emails led to more interviewers meeting their intermittent progress goals, and uses time series methods to analyze whether the emails led to any significant changes in how interviewers work their cases.
Keywords
data collection
email
interviewer morale
CAPI
longitudinal survey
As statistical methods for continuous data progress, there remains a need for applying sophisticated statistical techniques to complex behavioral neuroscience datasets. In an experiment studying the impact of Vitamin K deficiency on sleep following changes in dietary Vitamin K, rodent electroencephalography (EEG) and electromyography (EMG) data were collected using implanted wireless physiological telemetry devices and rodent sleep state scoring was performed. While the data collected are continuous, current analysis approaches typically model averages of responses over time using an analysis of variance (ANOVA) or repeated measures ANOVA model. One approach that leverages the original complexity of the data is functional data analysis (FDA). In this talk, we discuss functional data analysis and its fitness for analyzing a longitudinal dataset, as well as its limitations or when traditional models may remain the preferred approach. We will fit a functional model to our neurological dataset and demonstrate the process for selecting appropriate functional mean and variance structures.
Keywords
Functional Data Analysis
Neuroscience
Longitudinal Data
Rodent Studies
Co-Author(s)
Katherine Allen-Moyer, Social and Scientific Systems, Inc., a DLH Holdings Corp Company, Durham, North Carolina.
Leslie Wilson, Neurobehavioral Core Laboratory, NIEHS, NIH, Department of Health and Human Services, RTP, NC.
Wei Fan, Metabolism, Genes, and Environment Group, Signal Transduction Laboratory, NIEHS, NIH, RTP, NC.
Jesse Cushman, Neurobehavioral Core Laboratory, NIEHS, NIH, RTP, NC.
Xiaoling Li, Metabolism, Genes, and Environment Group, Signal Transduction Laboratory, NIEHS, NIH, RTP, NC.
Leping Li, Biostatistics Branch/NIEHS
Helen Cunny, Division of Translational Toxicology/NIEHS
Keith Shockley, National Institutes of Health
First Author
Kathryn Konrad, DLH
Presenting Author
Kathryn Konrad, DLH
What can sway a voter to alter their voting pattern and vote for someone that they would not normally vote for in an election? This project aims to analyze political trends by comparing national data from the General Social Survey (GSS) with precinct-level voting patterns in Florida, focusing on shifts towards independent or Democratic preferences. By examining historical voter turnout and opinions on political issues, we seek to identify correlations between national ideologies and local voting behaviors in Florida. The analysis will highlight precincts with high voter turnout and their political inclinations, providing insights into the electorate's evolving dynamics. The findings will inform strategies for a potential independent candidate vying for a seat in Florida's State House of Representatives, emphasizing areas with potential for significant impact. Through statistical analysis and data visualization, we aim to bridge the gap between national trends and local realities, offering a data-driven foundation for targeted campaign initiatives. This comparative study not only contributes to understanding political shifts but also supports democratic engagement by aligning campaign efforts with voter sentiments.
Keywords
Voting
Elections
Longitudinal biomarker data and cross-sectional outcomes are routinely collected in modern epidemiology studies, often with the goal of informing tailored early intervention decisions. For example, hormones such as oestradiol (E2) and follicle-stimulating hormone (FSH) may predict changes in women's health during midlife. Most existing methods focus on constructing predictors from mean marker trajectories. However, subject-level biomarker variability as a predictor may provide critical information about disease risks and health outcomes. In this paper, we develop a joint model that estimates subject-level means and variances of longitudinal predictors to predict a cross-sectional health outcome. Simulations demonstrate excellent recovery of true model parameters. The proposed method provides less biased and more efficient estimates relative to alternative approaches that either ignore subject-level differences in the variances or perform two-stage estimation in which estimated marker variances are treated as observed. Analyses of women's health data reveal that larger variability of E2 is associated with slower rates of waist circumference gain over the menopausal transition.
Keywords
Bayesian methods
Joint models
Subject-level variability
Variance component priors
midlife aging
women's health
In this cross-sectional study, we examine how a person's characteristics contribute to their degree of optimism, in terms of happiness, how exciting their life is, and how fair they perceive the world around them, drawing on data from the most recent wave of the General Social Survey, conducted in 2022. Our exploration is motivated by the findings of Smith (2005), who examined various dimensions of "troubles in America," including health, work, finances, material hardships, family/personal, law and crime, housing, and miscellaneous domains. Our underlying hypothesis posits that specific characteristics of a person (or group of people) may predict their degree of optimism. We screen potential predictors using preliminary tests of association before including them in the model. Using ordinal logistic regression on a substantial volume of survey responses, we examine the relationships among these domains. The model incorporates demographic variables and various personal characteristics to estimate their effects on degree of optimism, and provides bootstrap confidence intervals for the estimates. We studied the interaction effects of different characteristics on the response variables and found significant interactions. In response to the large amount of missing-at-random data, this study also investigates the extent to which data imputation techniques can be used for further analysis.
Keywords
Ordinal Logistic Regression
Data-Imputation
Optimism
General Social Survey (GSS)
Categorical Data
Cross-sectional
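The ordinal logistic (proportional-odds) model used above turns a linear predictor into probabilities over ordered categories. A minimal sketch, with illustrative threshold values and linear predictor (not the study's fitted model):

```python
import math

def ordinal_probs(thresholds, eta):
    """Proportional-odds model: P(Y <= k) = logistic(theta_k - eta).
    thresholds must be increasing; eta is the linear predictor x'beta.
    Returns one probability per ordered category (len(thresholds) + 1)."""
    logistic = lambda z: 1.0 / (1.0 + math.exp(-z))
    cum = [logistic(t - eta) for t in thresholds] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]
```

Raising eta shifts probability mass toward the higher (more optimistic) categories while leaving the thresholds fixed, which is the proportional-odds assumption in action.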
In longitudinal data analysis, within-individual repeated measurements often exhibit large variations, and these variations appear to change over time. A good understanding of the nature of the within-individual systematic and random variations allows us to conduct more efficient statistical inference and make better predictions. Motivated by HIV viral dynamic studies, we considered a nonlinear mixed-effects (NLME) model for the longitudinal means, together with a model for the within-individual variances that also allows us to address outliers in the repeated measurements. Statistical inference was then based on a joint model for the mean and variance, implemented by a computationally efficient approximate method. Extensive simulations evaluated the proposed method. We found that the proposed method produces more efficient estimates than the corresponding method without modeling the variances. Moreover, the proposed method provides robust inference against outliers. The proposed method was applied to a recent HIV-related dataset, with interesting new findings.
Keywords
h-likelihood
joint model
measurement error
robust
Improvements in accelerometer technology have led to new types of data on which more powerful predictive models can be built to assess physical activity. This paper implements an ordinal random forest model with recursive forecasting to take into account the ordinal longitudinal nature of responses. The data come from 28 adults performing activities of daily living in two visits while wearing accelerometers on the ankle, hip, and right and left wrists. The first visit provided training data and the second testing data, so that an independent-sample cross-validation approach could be used. For these data, prior responses are not available at the testing stage or in practice. However, recursive forecasts can be made with prior predictions in place of lagged responses in models built to use lagged responses as explanatory variables. Models are fit to account for multiple time series, with a different time series for each participant in the study. We found that ordinal random forest, when the time series is taken into account, produces better accuracy rates and better linearly weighted kappa values than both ordinary ordinal forest and random forest. On the testing
Keywords
Longitudinal Data
Multiple Time Series
Accelerometers
Ordinal Models
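Recursive forecasting, as described above, feeds the model's own earlier predictions into the lag features because true lagged responses are unavailable at test time. A hypothetical sketch, with a toy scoring function standing in for the fitted ordinal forest:

```python
def recursive_forecast(predict, feature_rows, n_lags, fill=0):
    """Walk forward through a test series: at each step, build the lag
    features from prior *predictions* (responses are unseen at test time),
    then call the fitted model on features + lags."""
    preds = []
    for t, x in enumerate(feature_rows):
        lags = [preds[t - k] if t - k >= 0 else fill for k in range(1, n_lags + 1)]
        preds.append(predict(x + lags))
    return preds

# Toy stand-in for a fitted ordinal classifier: predicts the rounded mean
# of its inputs, clipped to ordinal activity levels 0-2.
def toy_predict(features):
    return max(0, min(2, round(sum(features) / len(features))))
```

The same fitted model that used true lagged responses during training can be reused at test time unchanged; only the source of the lag features differs.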
In the era of ubiquitous wearable technology, the prediction of sleep and wake times from activity and biometric data has emerged as an important area of research. Previous studies have predominantly relied on supervised learning algorithms trained using polysomnography (PSG) data. However, the collection and labeling of PSG data are prohibitively expensive and logistically challenging, and the alternative use of self-recorded sleep reports as labels is fraught with subjectivity and inaccuracy. In response to these challenges, this study introduces an unsupervised algorithm for sleep/wake time prediction using biometric data obtained from wearable devices. The algorithm is grounded in change point detection, a methodology well suited to identifying pattern changes in time series data. We estimate common parameters based on general patient data, which enhances the algorithm's adaptability across diverse patient profiles. The algorithm was tested on a cohort of 590 patients. Our results not only validate the effectiveness of the proposed method but also open new avenues for leveraging wearable device data in sleep research.
Keywords
Change point detection
Sleep-wake times prediction
Unsupervised learning
Multivariate time-series analysis
Wearable sensor data
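A minimal sketch of the change-point idea behind sleep/wake segmentation: split a series where a two-segment mean fit has the smallest total squared error. Real sleep/wake detection would search for multiple change points over multivariate biometrics; this single-change-point, univariate version only illustrates the principle:

```python
def single_change_point(x):
    """Return the split index t that minimizes within-segment squared error,
    i.e. the best single mean-shift change point in a 1-D series."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)
    best_t, best_cost = None, float("inf")
    for t in range(1, len(x)):          # candidate segments x[:t] and x[t:]
        cost = sse(x[:t]) + sse(x[t:])
        if cost < best_cost:
            best_cost, best_t = cost, t
    return best_t
```

Applied to an activity series, the detected split marks where the mean activity level shifts, which is the kind of pattern change a sleep-onset or wake transition produces.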
Co-Author(s)
Hyonho Chun
Myung Hee Lee, Weill Cornell Medicine
Jeongyoun Ahn, Korea Advanced Institute of Science and Technology
First Author
Jiyu Moon, Korea Advanced Institute of Science and Technology
Presenting Author
Jiyu Moon, Korea Advanced Institute of Science and Technology
In advanced cancers, patients undergo multiple lines of therapy, switching treatments when their disease progresses. Using longitudinal data sources, such as those obtained via Electronic Health Records (EHRs), researchers can study the effects of common therapy sequences. New models for studying therapy sequences are needed to inform clinical decisions when prospective trial data are absent; to evaluate the performance of such models, however, a method for simulating EHR-like longitudinal data is required. Here, we develop a method for simulating paths through a state-based model. This method allows transition times to depend on treatments and observed covariates and incorporates within-patient correlation. This is important because patients' outcomes across states may be dependent due to difficult-to-quantify factors (e.g., disease aggressiveness, response to prior therapy, evolution of the mutational landscape). We propose to introduce within-patient correlation using a copula. This flexible class of multivariate models allows researchers to generate outcomes with a range of within-patient correlation structures; the Gaussian, t, and Clayton copulas will be considered.
Keywords
simulation methods
copula
cancer applications
multi-state models
survival
EHR
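A sketch of the Gaussian-copula construction described above: draw correlated standard normals, map them to uniforms with the normal CDF, then invert each marginal. The exponential marginals and rate parameters here are hypothetical placeholders for the transition-time distributions of an actual multi-state model:

```python
import math
import random

def correlated_transition_times(rho, rate1, rate2, rng):
    """Gaussian copula with correlation rho; exponential marginal
    transition times with the given rates."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    # standard normal CDF via erf -> uniforms on (0, 1)
    u1 = 0.5 * (1.0 + math.erf(z1 / math.sqrt(2.0)))
    u2 = 0.5 * (1.0 + math.erf(z2 / math.sqrt(2.0)))
    # inverse exponential CDF gives the two correlated transition times
    t1 = -math.log(1.0 - u1) / rate1
    t2 = -math.log(1.0 - u2) / rate2
    return t1, t2
```

The copula separates dependence from the marginals: swapping the exponential inverse CDF for any other marginal (e.g., Weibull) changes the transition-time distributions without touching the within-patient correlation structure.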
Medical comparative studies often involve collecting data from paired organs, which can produce either bilateral or unilateral data. While many testing procedures are available that account for the intra-class correlation between paired organs for bilateral data, more research needs to be conducted to determine how to analyze combined correlated bilateral and unilateral data. In practice, stratification is often used in analysis to ensure participants are allocated equally to each experimental condition. In this paper, we propose three Maximum Likelihood Estimation (MLE)-based methods for testing the homogeneity of differences between two proportions for stratified bilateral and unilateral data across strata using Donner's model. We compare the performance of these methods with a model-based method based on Generalized Estimating Equations using Monte Carlo simulations. We also provide a real example to illustrate the proposed methodologies. Our findings suggest that the Score test performs well and offers a valuable alternative to the exact tests in future studies.
Keywords
stratified bilateral and unilateral data
risk difference
MLE-based test procedures
Donner’s model
Weight is one of the main components of the Consumer Price Index (CPI) formula. The CPI program currently revises fixed quantity weights on an annual basis. Using graphical displays and correlation analysis, we explore the relationship among CPI series whose weights come from different reference periods, and the relationship among the weights themselves. Our focus is the seasonal components of CPI series. The initial investigation was through frequency analyses using the Discrete Fourier Transform. We further investigate various aspects of CPI series, including trend, changes in rates, jump discontinuities, and outliers.
Keywords
Autocorrelation
Convolution
Discrete Fourier Transform
Periodicity
Trend
Wavelets
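The frequency analysis mentioned above can be sketched in a few lines: the Discrete Fourier Transform bin with the largest magnitude identifies the dominant cycle in a series. This pure-Python DFT is for illustration; production work on CPI series would use an FFT routine:

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitude of the DFT at frequencies k = 0 .. N//2. A peak at bin k
    corresponds to a cycle repeating k times over the series (period N/k)."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2 + 1)]
```

For example, a sinusoid of period 4 sampled at 8 points peaks at bin k = 2 (period 8/2 = 4); applied to a monthly CPI series, a peak at the 12-month period would indicate annual seasonality.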
This research explores historical trends in how much confidence Americans have in political, government, and social institutions as reported by two national surveys. Using responses to comparably worded questions in the General Social Survey (GSS) and the Gallup World Poll about respondents' confidence in various American institutions, we compare trends in public confidence over five decades spanning 1973 to 2022. While each survey provides conceptually equivalent measures, the two reflect different sampling designs, providing a unique opportunity to reconcile trends between public opinion and social survey sources. Normalizing each source to the share of respondents with a great deal of confidence in each type of institution, we show long-run declines in confidence across institutions through a series of visualizations. Treating the two surveys as independent estimates of public confidence, we apply cointegration tests to assess whether the difference between the two survey measures is stable over time. Our findings provide a unique comparison between two separate surveys measuring public confidence, noting how each source captures common trends in institutional trust.
Keywords
public confidence
institutions
governance