02/27/2024: 5:30 PM - 7:00 PM CST
Posters
Room: Salon III
Presentations
Sex trafficking is a pervasive worldwide issue that poses significant challenges for law enforcement agencies, non-profits, and researchers alike. In this study, we present a novel method to monitor counts of sexual service advertisements in over 600 cities across the United States, collected over the past two to three years. Ad volume is believed to be linked to the prevalence of sex trafficking. The data structure poses several difficulties, namely non-stationarity, autocorrelation, and over- or under-dispersion. Our approach models the daily and weekly absolute data differences using a flexible parametric time series model based on the zero-inflated Conway-Maxwell-Poisson distribution with change-points. By effectively capturing the nuanced temporal dynamics within the data, our method overcomes limitations in existing monitoring approaches that may assume independence or equi-dispersion. Leveraging the flexibility of this parametric time series model, we can better understand key patterns, trends, and anomalies in this data set. Our findings, illustrated with the data, demonstrate the efficacy of the proposed method as a powerful tool for retrospective monitoring of sexual service advertisement data in the fight against sex trafficking.
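A minimal R sketch of the zero-inflated Conway-Maxwell-Poisson likelihood for a single regime between change-points; the function names, starting values, and toy counts below are illustrative assumptions, not the authors' code.

```r
# Zero-inflated Conway-Maxwell-Poisson (ZICMP) pmf and a crude one-regime fit.
# lambda = rate-like parameter, nu = dispersion (nu < 1 over-dispersed,
# nu > 1 under-dispersed), pi = zero-inflation probability.
dcmp <- function(y, lambda, nu, max_y = 200) {
  j <- 0:max_y                                   # truncated normalizing constant
  logZ <- log(sum(exp(j * log(lambda) - nu * lfactorial(j))))
  exp(y * log(lambda) - nu * lfactorial(y) - logZ)
}

dzicmp <- function(y, lambda, nu, pi) {
  ifelse(y == 0,
         pi + (1 - pi) * dcmp(0, lambda, nu),
         (1 - pi) * dcmp(y, lambda, nu))
}

# Negative log-likelihood for one regime; in the full model lambda, nu, and pi
# would shift at the estimated change-points.
nll <- function(par, y) {
  lambda <- exp(par[1]); nu <- exp(par[2]); pi <- plogis(par[3])
  -sum(log(sapply(y, dzicmp, lambda = lambda, nu = nu, pi = pi)))
}

set.seed(1)
y <- rpois(100, 3) * rbinom(100, 1, 0.9)   # toy counts with extra zeros
fit <- optim(c(1, 0, -2), nll, y = y)      # crude maximum likelihood fit
exp(fit$par[1:2]); plogis(fit$par[3])      # estimated lambda, nu, and pi
```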
Presenting Author
Chase Holcombe, University of Alabama
First Author
Chase Holcombe, University of Alabama
CoAuthor(s)
Subhabrata Chakraborti, University of Alabama
Jason Parton, University of Alabama
Nickolas Freeman, University of Alabama
Gregory Bott, University of Alabama
Clinical trials and cohort studies often collect clinical data paired with stored biospecimens. An increasing focus of biomedical research is aimed at leveraging these existing specimens to address new and important research questions. When hypotheses of interest propose to utilize costly, limited, or difficult-to-obtain samples (e.g., peripheral blood mononuclear cells), informed sampling strategies (ISS) can be used to minimize costs and preserve biospecimens by providing methods to select more informative samples of subjects. The resulting data can be assayed and analyzed in concert with an analytical correction. Dropout is common in longitudinal studies, but existing ISS methods assume complete follow-up on all individuals. Ignoring settings where poor outcomes may influence the propensity to drop out, as may occur among persons with HIV (PWH), can bias study results. We propose an expansion to current ISS frameworks to include dropout. Mixture models, commonly used to adjust for informative dropout, are modified to accommodate analysis of data from our design. A software package, developed in R, is used to facilitate analyses.
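One common mixture-model formulation for informative dropout is a pattern-mixture model in which the dropout pattern enters the mean model. The sketch below illustrates that general idea on simulated data; it is not the authors' R package, and all variable names and coefficients are made up.

```r
# Pattern-mixture sketch: early dropouts get their own trajectory, and the
# overall slope is averaged over the observed dropout-pattern distribution.
library(lme4)

set.seed(2)
n <- 200; times <- 0:3
dat <- expand.grid(id = 1:n, time = times)
dropout_early <- rbinom(n, 1, 0.3)              # dropout pattern per subject
dat$early <- dropout_early[dat$id]
b0 <- rnorm(n)
dat$y <- 10 + b0[dat$id] - 0.5 * dat$time - 1.0 * dat$early * dat$time +
  rnorm(nrow(dat))
dat <- dat[!(dat$early == 1 & dat$time > 1), ]  # early dropouts unobserved after t = 1

fit <- lmer(y ~ time * early + (1 | id), data = dat)
slopes <- fixef(fit)["time"] + c(0, fixef(fit)["time:early"])
# pattern-averaged time effect, weighted by the dropout-pattern proportions
weighted.mean(slopes, c(mean(1 - dropout_early), mean(dropout_early)))
```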
Presenting Author
Carter Sevick
First Author
Carter Sevick
CoAuthor(s)
Camille Moore, National Jewish Health
Samantha MaWhinney, Colorado School of Public Health
Prediction performance, in practical settings, is mostly achieved by a single (complex) model which can nevertheless lack interpretability. However, a recent idea has developed pushing to find a set of equally performing models instead of a single one, called the Rashomon set. In this direction, the Sparse Wrapper Algorithm (SWAG) is a recently proposed multi-model selection method consisting of a greedy algorithm that combines screening and wrapper approaches to create a set of low-dimensional models with good predictive power using a learning mechanism of choice. As a result of its modeling flexibility, practitioners can pick the model that best reflects their needs and/or domain expertise without losing accuracy. In addition, the SWAG can deal with many problematic features in the data such as missing values, outliers, collinearity and others. Finally, the set of SWAG models can be used to construct a network that highlights the intensity and the direction of attribute interaction from a broader and more insightful perspective. We highlight how this method delivers important results for decision-makers in fields such as genomics, engineering and neurology.
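A rough R illustration of the screening-plus-greedy-wrapper idea, retaining every low-dimensional model whose cross-validated error is within a tolerance of the best model at each dimension. This is a sketch under simplifying assumptions (logistic learners, simulated data), not the SWAG package itself.

```r
set.seed(3)
n <- 150; p <- 12
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))
dat <- data.frame(y, X)

cv_err <- function(vars, K = 5) {               # cross-validated misclassification rate
  folds <- sample(rep(1:K, length.out = n))
  mean(sapply(1:K, function(k) {
    fit <- glm(reformulate(vars, "y"), binomial, data = dat[folds != k, ])
    pred <- predict(fit, dat[folds == k, ], type = "response") > 0.5
    mean(pred != dat$y[folds == k])
  }))
}

err1 <- sapply(colnames(X), cv_err)             # screening: single-variable models
screened <- names(sort(err1))[1:6]              # keep the strongest attributes
keep <- list(as.list(screened))

for (d in 2:3) {                                # greedy wrapper steps
  cand <- list()
  for (m in keep[[d - 1]])
    for (v in setdiff(screened, m))
      cand[[length(cand) + 1]] <- sort(c(m, v))
  cand <- unique(cand)
  errs <- sapply(cand, cv_err)
  keep[[d]] <- cand[errs <= min(errs) + 0.02]   # the "equally good" model set
}
lengths(keep)                                   # retained models per dimension
```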
Presenting Author
Yagmur Yavuz Ozdemir, Auburn University
First Author
Yagmur Yavuz Ozdemir, Auburn University
CoAuthor(s)
Cesare Miglioli, University of Geneva
Samuel Orso
Nabil Mili, University of Lausanne
Gaetan Bakalli, Emlyon Business School
Stephane Guerrier
Roberto Molinari, Auburn University
Cherry Blossom Prediction: LSTM vs. Traditional Regression
The enchanting phenomenon of cherry blossoms has captivated cultures across the globe for generations. In this project, we embark on a journey to predict cherry blossom timings using advanced machine learning techniques, specifically focusing on comparing the effectiveness of Long Short-Term Memory (LSTM) networks with traditional Regression models.
Our investigation involves an in-depth analysis of historical cherry blossom timing data, encompassing various locations and timeframes. Additionally, we integrate essential meteorological variables such as temperature, humidity, and sunlight duration to enhance prediction accuracy.
A dual approach is adopted:
LSTM Model: Leveraging the power of LSTM, renowned for its ability to capture temporal relationships, we construct a predictive model. Python and R serve as our tools for data preprocessing, feature engineering, and LSTM model development. Through meticulous training and parameter tuning, we harness LSTM's sequence learning capabilities to forecast cherry blossom timings.
Traditional Regression Model: In parallel, we implement a traditional Regression model, leveraging established statistical techniques. This model employs historical cherry blossom timings and meteorological variables as features, predicting cherry blossom timings based on linear relationships between the variables.
The models are rigorously evaluated, comparing their predictive performance using MSE and MAPE metrics with cross-validation and a training/testing split. Beyond predictive accuracy, we delve into interpretability, identifying key features driving each model's predictions. This understanding aids in unraveling the complex relationship between meteorological conditions and cherry blossom timings.
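A hedged R sketch of the traditional-regression arm and its cross-validated MSE/MAPE evaluation; the data, predictor names, and coefficients below are simulated for illustration, and the LSTM arm is omitted.

```r
set.seed(4)
n <- 120
dat <- data.frame(
  gdd      = rnorm(n, 300, 40),   # growing-degree-day style temperature summary
  humidity = rnorm(n, 60, 10),
  sunlight = rnorm(n, 6, 1)
)
dat$bloom_doy <- 95 - 0.05 * dat$gdd + 0.02 * dat$humidity -
  0.8 * dat$sunlight + rnorm(n, 0, 2)             # day of year of peak bloom

# 5-fold cross-validated MSE and MAPE for the linear baseline; the LSTM model
# would be scored on the same folds for a fair comparison.
folds <- sample(rep(1:5, length.out = n))
metrics <- sapply(1:5, function(k) {
  fit <- lm(bloom_doy ~ gdd + humidity + sunlight, data = dat[folds != k, ])
  pred <- predict(fit, dat[folds == k, ])
  obs <- dat$bloom_doy[folds == k]
  c(MSE = mean((obs - pred)^2), MAPE = mean(abs((obs - pred) / obs)) * 100)
})
rowMeans(metrics)
```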
Presenting Author
NITUL SINGHA
First Author
NITUL SINGHA
CoAuthor
Achraf Cohen, University of West Florida
Clearly defined roles within an organization can improve efficiency, providing clear understanding and communication while minimizing duplicated effort and lack of accountability. Assigning 'R' (responsible), 'A' (accountable), 'C' (consulted) and 'I' (informed) parties can provide clarity, facilitate meaningful discussion and leverage previously unknown synergies. We discuss the process for developing a RACI matrix to provide a blueprint for organizational efficiency using a focus group within a Medical and Clinical Affairs environment. For statisticians and statistical programmers, cross-functional communication is key when relaying complex data and analyses to nonstatisticians. Although our tasks spanned topics such as case report form design, data monitoring, clinical study reports, and scientific communications, the RACI can be applied to any discipline. We will discuss the scope, goals, methodology, implementation, pitfalls to avoid, and the design choices that make RACI matrix creation efficient and purposeful for all stakeholders, regardless of functional group.
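For illustration only, a small RACI matrix laid out as an R data frame; the tasks, functional groups, and assignments are hypothetical, not the authors' actual matrix.

```r
raci <- data.frame(
  Task           = c("CRF design", "Data monitoring", "Clinical study report",
                     "Scientific communication"),
  Biostatistics  = c("C", "R", "A", "R"),
  Programming    = c("I", "R", "R", "I"),
  ClinicalOps    = c("R", "A", "C", "C"),
  MedicalAffairs = c("A", "I", "C", "A")
)
raci   # exactly one accountable (A) party per task keeps ownership explicit
```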
Presenting Author
Jennifer Mares
First Author
Jennifer Mares
CoAuthor(s)
Anna Liza Antonio, Edwards Lifesciences
Jami Maccombs, Edwards Lifesciences
Latent Class Analysis (LCA) is a statistical method used to identify distinct subgroups and uncover patterns within observed categorical data. It has broad and practical applications in the social, behavioral, and health sciences. However, explicit guidelines for determining the optimal sample size remain limited, despite prior simulation research underscoring the importance of a sufficient sample size for reliable class identification in LCA. This study aims to enhance the utilization of LCA by offering insights into sample size determination and performance assessment through a simulation study involving over 500 scenarios encompassing different class counts, numbers of observed indicators, and sample sizes. The study examines the probability of correctly identifying the latent classes. The results reveal that, among the considered models, the two-class model consistently performs well, particularly at a sample size of 100.
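A sketch of one simulation scenario of the kind described above, assuming a two-class model with six binary indicators; the class sizes and response probabilities are illustrative, and the LCA fit itself is shown only as a commented call.

```r
set.seed(6)
n <- 100                                   # one of the sample sizes studied
class <- rbinom(n, 1, 0.5) + 1             # true latent class (1 or 2)
probs <- rbind(rep(0.8, 6), rep(0.2, 6))   # P(indicator endorsed) by class
Y <- sapply(1:6, function(j) rbinom(n, 1, probs[class, j])) + 1  # coded 1/2
colnames(Y) <- paste0("item", 1:6)
dat <- as.data.frame(Y)

# An LCA fit (e.g., with the poLCA package) would then be repeated over many
# such replicates to estimate how often the two-class solution is recovered:
# fit <- poLCA::poLCA(cbind(item1, item2, item3, item4, item5, item6) ~ 1,
#                     data = dat, nclass = 2)
```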
This presentation is targeted at an audience with either little exposure to the topic (introductory) or a base level of knowledge (intermediate).
Presenting Author
Gail Han
First Author
Gail Han
CoAuthor
Achraf Cohen, University of West Florida
Clinical investigations are required to provide oversight to ensure adequate protection of the rights, welfare, and safety of human subjects. The quality of clinical trial data is critical to the protection of human subjects and the conduct of clinical studies under ISO and FDA guidance. Data cleaning is crucial to ensuring data integrity; successful data analysis and interpretation cannot be achieved without it. A well-established data cleaning process usually involves collaboration between multiple functional teams. To enable more efficient cross-functional collaboration on the data query report, we developed a graphical user interface with writeback functionality that enables end users to provide their feedback directly in the report, with data refreshed automatically on a regular basis. This visual interface provides the latest insights on data cleaning progress, and the writeback function facilitates efficient communication between Clinical Monitoring and Data Management. This process shortens study timelines by allowing teams to proactively clean data prior to database locks while ensuring data integrity.
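As a rough illustration of the writeback idea only (the authors' implementation platform is not specified here), a minimal R Shiny sketch in which end users record feedback directly against a query report; the query fields and file path are assumptions.

```r
library(shiny)

queries <- data.frame(QueryID = 1:3,
                      Issue = c("Missing visit date", "Out-of-range lab value",
                                "Unsigned CRF"),
                      Feedback = "")

ui <- fluidPage(
  titlePanel("Data query report with writeback (illustrative)"),
  tableOutput("tbl"),
  selectInput("id", "Query", queries$QueryID),
  textInput("note", "Feedback"),
  actionButton("save", "Write back")
)

server <- function(input, output, session) {
  rv <- reactiveVal(queries)
  output$tbl <- renderTable(rv())
  observeEvent(input$save, {
    d <- rv()
    d$Feedback[d$QueryID == input$id] <- input$note   # write feedback into the report
    rv(d)
    # write.csv(d, "query_feedback.csv", row.names = FALSE)  # persist for Data Management
  })
}

# shinyApp(ui, server)   # run interactively
```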
Presenting Author
Kay Liu, Edwards Lifesciences
First Author
Kay Liu, Edwards Lifesciences
CoAuthor(s)
Ana Legaspi, Edwards Lifesciences
Gregory Botwin, Edwards Lifesciences
Anna Liza Antonio, Edwards Lifesciences
Kevin Ngov, Edwards Lifesciences
Kelly Hendrickson, Edwards Lifesciences
Terri Johnson, Edwards Lifesciences
We propose a novel class of dynamic factor models for spatiotemporal areal data. This class of models assumes that the spatiotemporal process may be represented by a few latent factors that evolve through time according to dynamic linear models. As the dimension of the vector of latent factors is typically much smaller than the number of subregions, our proposed class of models may achieve substantial dimension reduction. At each time point, the vector of observations is linearly related to the vector of latent factors through a matrix of factor loadings. Each column of this matrix may be seen as a vectorized map of factor loadings relating one latent factor to the vector of observations. Thus, to account for spatial dependence, we assume that each column of the matrix of factor loadings follows an intrinsic conditional autoregressive (ICAR) process. Hence, we call our class of models the Dynamic ICAR Spatiotemporal Factor Models (DIFM). We develop a Gibbs sampler for exploration of the posterior distribution. In addition, we develop model selection through a Laplace-Metropolis estimator of the predictive density. We present two case studies. The first case study, based on simulated data, demonstrates that our DIFMs are identifiable and that our proposed inferential procedure works well at recovering the underlying data generating process. The second case study demonstrates the utility and flexibility of our DIFM framework with an application to the drug overdose epidemic in the United States from 2015 to 2021.
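A small R simulation in the spirit of the DIFM, assuming a 4 x 4 lattice of regions, a single latent factor following a random-walk dynamic linear model, and a proper CAR approximation to the ICAR prior for the loading column; all settings are illustrative rather than taken from the paper.

```r
set.seed(8)
S <- 16; Tt <- 50
adj <- matrix(0, S, S)                       # adjacency of a 4 x 4 lattice
for (i in 1:4) for (j in 1:4) {
  k <- (i - 1) * 4 + j
  if (j < 4) adj[k, k + 1] <- adj[k + 1, k] <- 1
  if (i < 4) adj[k, k + 4] <- adj[k + 4, k] <- 1
}
Q  <- diag(rowSums(adj)) - adj               # ICAR precision (singular)
Qp <- Q + diag(1e-3, S)                      # small ridge -> proper CAR approximation
lambda <- drop(solve(chol(Qp), rnorm(S)))    # one spatially smooth loading column

f <- cumsum(rnorm(Tt, sd = 0.3))             # latent factor: random-walk evolution
Y <- outer(lambda, f) + matrix(rnorm(S * Tt, sd = 0.2), S, Tt)
dim(Y)   # 16 regions observed at 50 time points, driven by one latent factor
```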
Presenting Author
Hwasoo Shin, Virginia Tech
First Author
Hwasoo Shin, Virginia Tech
CoAuthor
Marco Ferreira, Virginia Tech
The transition from academia to industry for statistics students can be daunting, especially for those with little or no experience in the industry they are entering. This poster presentation highlights the challenges faced, transformative factors, key learnings, and recommendations drawn from the experiences of a Data Science Intern at GTI Energy during the summer of 2023. Challenges included adapting to the industry's pace, analyzing real-world data, and applying theoretical knowledge practically. Mentorship was crucial, guiding beyond technical aspects to understand workplace dynamics and customer needs. Internships benefit both parties, bringing fresh perspectives to an organization while allowing students to apply what they have learned in the classroom to real-world problems. We provide tips and advice for educators preparing students for industry internships, students considering these internships, and internship administrators and supervisors.
Presenting Author
Ella Martinez, GTI Energy
First Author
Ella Martinez, GTI Energy
CoAuthor
Zachary Weller, Pacific Northwest National Lab
The categorical Gini correlation, ρg, is a recently proposed measure of dependence between a categorical variable Y and a numerical random vector X. It has been shown that ρg has more appealing properties than existing dependence measures. In this study, we develop the jackknife empirical likelihood (JEL) method for ρg. Confidence intervals for the Gini correlation are constructed without estimating the asymptotic variance. Adjusted and weighted JEL are explored to improve the performance of the standard JEL. Simulation studies show that our methods are competitive with existing methods in terms of coverage accuracy and confidence interval length. The proposed methods are illustrated with applications to two real datasets.
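A sketch of the jackknife empirical likelihood recipe for the categorical Gini correlation with a univariate X, using the pairwise-distance form of the statistic; the emplik call mentioned in the comments is an assumption about one available EL implementation and is not necessarily what the authors use.

```r
gini_cor <- function(x, y) {                      # y categorical, x numeric
  gmd <- function(v) mean(abs(outer(v, v, "-"))) # Gini mean difference
  p <- table(y) / length(y)
  within <- sum(p * sapply(split(x, y), gmd))
  (gmd(x) - within) / gmd(x)
}

set.seed(9)
y <- factor(rep(1:2, each = 50))
x <- rnorm(100, mean = c(0, 1)[y])

n <- length(x)
theta_hat <- gini_cor(x, y)
theta_loo <- sapply(1:n, function(i) gini_cor(x[-i], y[-i]))
V <- n * theta_hat - (n - 1) * theta_loo          # jackknife pseudo-values

# JEL treats the pseudo-values as approximately i.i.d. and profiles the
# empirical likelihood for their mean, e.g. emplik::el.test(V, mu = rho0),
# inverting the test over rho0 to get a confidence interval without
# estimating the asymptotic variance.
```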
Presenting Author
Sameera Hewage, University of Louisiana at Lafayette
First Author
Sameera Hewage, University of Louisiana at Lafayette
CoAuthor
Yongli Sang, University of Louisiana at Lafayette
Introduction
Data sharing by scientists, research organizations, and governments is on the rise, which enhances opportunities for secondary data analysis (Tenopir et al., 2011). However, pooling data from different sources often requires harmonizing measures prior to statistical analysis. Harmonization consists of placing measures of the same variable collected using different questionnaires on a shared metric. As will be illustrated in this project, latent variable modeling is a promising method for harmonizing data from different sources.
Data Analysis
To illustrate how data harmonization using latent variable modeling can be performed prior to the statistical analysis, data from the United Nations Demographic and Health Surveys (DHS) Program collected in 70 countries were used to evaluate whether education level, gender, and marital status predicted acceptance of domestic violence. Acceptance of domestic violence was modeled as a latent factor capturing the common variance of the percentages of participants who endorsed six statements about domestic violence by husbands against wives (e.g., "A husband is justified in hitting or beating his wife if she burns the food").
The complexity of the statistical model required a multi-step process. The data had a nested structure in which multiple data points from the same country caused dependence of observations, which therefore needed to be modeled explicitly using a hierarchical model. Additionally, the presence of a latent factor in conjunction with the nested structure prevented the complete statistical model from converging for the available data. The analysis was therefore divided into two steps: (1) a factor analysis estimating the scores of each demographic group across the 70 countries on the latent variable acceptance of domestic violence, excluding predictors, and (2) estimation of the full hierarchical model in which demographic variables predict the latent variable acceptance of domestic violence.
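A highly simplified two-step sketch of this workflow using simulated data; the variable names, the six indicators, and the model syntax below are hypothetical stand-ins, and the actual DHS analysis is far richer.

```r
library(lavaan)
library(lme4)

set.seed(10)
n <- 700
country <- factor(sample(1:70, n, replace = TRUE))
eta <- rnorm(n) + rnorm(70)[country]                 # latent acceptance score
ind <- sapply(1:6, function(j) 0.7 * eta + rnorm(n, sd = 0.6))
colnames(ind) <- paste0("dv", 1:6)
dat <- data.frame(ind, country,
                  female  = rbinom(n, 1, 0.5),
                  edu     = sample(0:3, n, replace = TRUE),
                  married = rbinom(n, 1, 0.6))

# Step 1: one-factor model for the six domestic-violence items; save scores
cfa_fit <- cfa("accept =~ dv1 + dv2 + dv3 + dv4 + dv5 + dv6", data = dat)
dat$accept <- as.numeric(lavPredict(cfa_fit))

# Step 2: hierarchical model with a random intercept for country
lmer_fit <- lmer(accept ~ female + edu + married + (1 | country), data = dat)
summary(lmer_fit)$coefficients
```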
Results
In a model predicting acceptance of domestic violence using education level, gender, and marital status, findings indicated that only gender predicted attitudes toward domestic violence. Specifically, holding education level and marital status constant, being a woman instead of a man led to significantly less acceptance of domestic violence.
Implications
Latent variable modeling was used to create a single index of acceptance of domestic violence. Using the index as the dependent variable in the statistical model and demographic variables as predictors, results suggest that holding education level and marital status constant, men are more accepting of domestic violence against women, and could thus benefit more from a prevention campaign designed to shift attitudes toward intimate partner violence.
Presenting Author
Milica Miocevic
First Author
Milica Miocevic
T cells are essential to adaptive immune responses, particularly in antitumor immunity. This abstract presents an updated version of the Network Analysis of Immune Repertoire (NAIR) software for comprehensive T cell receptor (TCR) sequence analysis. The enhanced NAIR software constructs networks within the TCR repertoire based on TCR sequence similarity, enabled by tailored search algorithms. These algorithms effectively identify disease-associated TCR clusters and public TCR clusters shared across multiple samples, facilitating the discovery of potential disease-specific TCR signatures. Another feature of the NAIR software is the quantification of the TCR network through network properties. To manage the complexity of network properties and their correlation with clinical outcomes, we employ group lasso regularization. This approach highlights network properties significantly associated with clinical outcomes, thus identifying crucial TCR features. The updated NAIR software can now process single-cell TCR sequencing data. In addition, we have broadened the pipeline to incorporate TCR sequences with single-cell gene expression data, using a graph deep learning model. This allows for a detailed analysis of TCR diversity and gene expression profiles at the single-cell level, providing deeper insights into T cell functionality. The updated software also introduces an innovative technique to predict binding peptides by integrating TCR sequence vectorization, TCR sequence similarity networks, and V/J genes in a deep learning framework; it is refined and validated on TCR datasets with known binding antigens. By merging network analysis, advanced statistical methods, and deep learning, this enhanced NAIR software provides a powerful platform for TCR repertoire data analysis. This tool helps unravel the complex relationship between TCRs, disease progression, and clinical outcomes, fostering improved understanding of immune system dynamics and paving the way for advances in immunotherapy and precision medicine.
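A generic R illustration of the network-construction step, connecting sequences within a small edit distance and reading off simple network properties; this is not NAIR's own interface, and the sequences and cutoff are made up.

```r
library(igraph)

tcrs <- c("CASSLGQAYEQYF", "CASSLGQSYEQYF", "CASSPDRGGYEQYF",
          "CASSLGQAYEQYM", "CASRTGESNQPQHF", "CASSPDRGAYEQYF")
d <- adist(tcrs)                              # pairwise Levenshtein distances
A <- (d <= 2) * 1; diag(A) <- 0               # connect sequences within 2 edits
g <- graph_from_adjacency_matrix(A, mode = "undirected")

components(g)$no                              # number of TCR clusters
degree(g); transitivity(g)                    # simple network properties that could
                                              # then be related to clinical outcomes
```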
Presenting Author
Li Zhang, University of California
First Author
Li Zhang, University of California
CoAuthor(s)
Hai Yang, UCSF
Phi Le, UCSF
Brian Neal, San Francisco State University
Leah Ung, UCSF
Shilpika Banerjee, San Francisco State University
Tao He, San Francisco State University
Statistical consultants play a crucial role in scientific research and data science. In this study, we surveyed the perspectives of statistical consultants who work in private practice. The study aims to offer insight into statistical collaboration and consulting and to enhance consultants' expertise in both technical aspects and collaborative abilities. This improvement will enable statistical consultants to effectively assist and educate their collaborators and clients. The information collected will be valuable not only for someone entering the field but also for those already in the field who wish to normalize their experiences by comparing them with the experiences of their colleagues. According to the American Statistical Association (ASA) Section on Statistical Consulting, the directory contains 950 individuals who work in this field. In our study, the survey was emailed to 850 statistical consultants working within the United States, and we received 187 responses, a 22% response rate. In our poster, we summarize the responses to the survey questions to provide insight into the use of statistical techniques, communication skills, and important characteristics necessary for collaborative statistical consultants.
Presenting Author
Weiwei Xie, Washington State University
First Author
Weiwei Xie, Washington State University
CoAuthor
Harry Johnson, Washington State University
Addressing the limitations of current global optimization algorithms, especially when tackling non-convex functions and when gradient information is computationally intensive or absent, we introduce a novel approach. Our proposed Probabilistic Global Optimizer (ProGO) is based on a sequence of multidimensional integrations that converge to global optima under specific regularity conditions. This gradient-free method benefits from a robust convergence framework built on the properties of emerging optima distributions. We've also created a latent slice sampler with a geometric convergence rate for sampling from these distributions to approximate global optima effectively. ProGO is designed as a versatile framework that scales to approximate global optima for continuous functions across any dimension. Our empirical tests on well-known non-convex functions demonstrate ProGO's superior performance over many established algorithms in terms of regret value and convergence speed.
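A toy, gradient-free illustration of the underlying idea: samples from a density proportional to exp(-k f(x)) concentrate near the global minimizer as k grows. Here a plain Metropolis sampler stands in for ProGO's latent slice sampler, and the objective, tuning constants, and starting point are illustrative assumptions, not the ProGO algorithm itself.

```r
f <- function(x) sum(x^2) + sum(sin(3 * x)^2)   # non-convex toy objective; global min at (0, 0)

set.seed(12)
k <- 50; d <- 2; n_iter <- 10000
x <- rep(2, d)
draws <- matrix(NA, n_iter, d)
for (i in 1:n_iter) {
  prop <- x + rnorm(d, sd = 0.4)
  if (log(runif(1)) < -k * (f(prop) - f(x))) x <- prop   # target density prop. to exp(-k * f)
  draws[i, ] <- x
}
x_hat <- colMeans(draws[-(1:5000), ])   # estimate the optimum as a mean under the concentrated law
x_hat; f(x_hat)                         # near (0, 0) and near the global minimum value 0
```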
Presenting Author
Xinyu Zhang, North Carolina State University
First Author
Xinyu Zhang, North Carolina State University
CoAuthor
Sujit Ghosh, North Carolina State University
Ecological momentary assessment and other modern data collection technologies facilitate research on both within-person and between-person variability of people's health outcomes and behaviors. For such intensively measured longitudinal data, regular mixed-effects models were extended to mixed-effects location scale (MELS) models to accommodate random subject effects on both mean and variability of the outcome. However, standardized effect sizes for the MELS model are lacking. To address this gap, we extend an existing framework of R-squared measures for regular mixed-effects models, which are based on model-implied variances, to MELS models. Our proposed framework applies to two specifications of the random location effects: random intercepts with covariate-influenced variances and random intercepts combined with random slopes of observation-level covariates. We also provide an R package, R2MELS, that generates summary tables and visualization for values of our R-squared measures. We validated our framework through a simulation study. These R-squared measures can help researchers who are using MELS models interpret their findings more effectively.
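A back-of-the-envelope R sketch of the variance-decomposition idea behind such R-squared measures, computed here from the true generating values of a simulated MELS data set; this is not the R2MELS package itself, and all parameter values are illustrative.

```r
set.seed(13)
n_sub <- 200; n_obs <- 10
id <- rep(1:n_sub, each = n_obs)
x  <- rnorm(n_sub * n_obs)                 # observation-level covariate
u  <- rnorm(n_sub, sd = 1)                 # random location (intercept) effect
w  <- rbinom(n_sub, 1, 0.5)                # subject covariate influencing variability
sigma_i <- exp(0.1 + 0.6 * w)              # covariate-influenced within-subject SD
y <- 2 + 0.5 * x + u[id] + rnorm(n_sub * n_obs, sd = sigma_i[id])

# Model-implied variance components (true generating values used here;
# R2MELS would compute them from the fitted MELS model instead).
var_fixed   <- var(0.5 * x)                # explained by fixed effects
var_between <- 1^2                         # random-intercept variance
var_within  <- mean(sigma_i^2)             # average error variance across subjects
var_fixed / (var_fixed + var_between + var_within)   # an R-squared-type proportion
```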
Presenting Author
Xingruo Zhang, The University of Chicago
First Author
Xingruo Zhang, The University of Chicago
CoAuthor
Donald Hedeker, The University of Chicago
Because many statisticians work with specialists in other fields, communicating with people who do not know as much about statistics can be vital. This presentation will discuss the challenges of communicating with non-statisticians and give tips for conveying important information effectively. Some non-statisticians will know more about statistics than others, and the presenter will discuss how to adjust the explanation based on this level of understanding. The importance of understanding the field of study will also be emphasized, with tips for developing enough understanding of other topics to communicate and collaborate more effectively with experts in other fields.
Presenting Author
Geoffrey Shaw
First Author
Geoffrey Shaw
Practicing reproducible research is important, but increasingly complex as studies involve more data and code, and larger teams. Tools like Jupyter Notebook and R Markdown support reproducibility, but are not designed to collect information such as: Who worked on the analyses, and what decisions did they make? Where did the data come from? What are the code file dependencies and code libraries? We developed StatWrap, a free and open-source software program, as an assistive, non-invasive discovery and inventory tool to document this information in a research project. StatWrap combines automatically collected metadata (e.g., statistical packages, code file dependencies), investigator-supplied documentation (e.g., analysis notes, personnel), and source control. StatWrap creates interactive "workflow graphs" illustrating relationships between code, data, and libraries. It helps team members document workflow and analysis decisions. StatWrap also creates a searchable project log of user actions in a project – for example, notes associated with data. By "wrapping" information together, StatWrap promotes reproducibility by documenting data, code, collaborators, and their changes over time.
Presenting Author
Leah Welty, Northwestern University, Feinberg School of Medicine
First Author
Luke Rasmussen, Northwestern University
CoAuthor(s)
Eric Whitley, Northwestern University
Abigail Baldridge, Northwestern University
Leah Welty, Northwestern University, Feinberg School of Medicine