CS3e: Speed Session 3

Conference: Women in Statistics and Data Science 2024
10/17/2024: 11:45 AM - 1:15 PM EDT
Speed 
Room: Cypress 

Presentations

01. A Classification Technique for Survival Data using Bayesian AFT Model with Frailty Effect

We propose a Bayesian classifier for cancer biomarkers. Before classification, efficient biomarkers must be identified from high-dimensional survival data, an emerging problem in oncology. We also introduce a three-step feature selection method to select the most efficient markers from microarray data. An accelerated failure time (AFT) model with a frailty effect is used to classify and analyze the data in the Bayesian framework. The cutoff value for each selected gene expression is obtained through classification using the minimum deviance criterion in the AFT model with frailty. The frailty effect accounts for unobserved heterogeneity in subjects' expression values when investigating risk effects on the cancer dataset. A simulation study is conducted to validate the methodology, and the Brier score is used to assess the effectiveness of the proposed classification procedure. The proposed method demonstrates its efficacy in gauging the risk impact on diverse patients by utilizing biomarkers, enabling swift estimation and prompt action for disease treatment. This approach finds practical application in the analysis of two high-dimensional, real-world lung cancer datasets, offering valuable insights for effective healthcare interventions.
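
As a rough illustration of the minimum-deviance cutoff idea, the Python sketch below dichotomizes a single hypothetical gene's expression at candidate cutoffs, fits a log-normal AFT model at each cutoff (without the frailty term used in the actual work), and keeps the cutoff with the smallest deviance; all data and names are simulated stand-ins:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200
expr = rng.normal(0, 1, n)                       # hypothetical gene expression
group = (expr > 0.3).astype(float)               # latent high-risk group
time = np.exp(2.0 - 1.0 * group + rng.normal(0, 0.5, n))
event = rng.random(n) < 0.8                      # ~20% right censoring

def neg_loglik(params, x, t, d):
    # log-normal AFT: log T = b0 + b1*x + sigma*eps, eps ~ N(0, 1)
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)
    z = (np.log(t) - b0 - b1 * x) / sigma
    # density contribution for events, survival contribution for censored
    ll = np.where(d, norm.logpdf(z) - np.log(sigma * t), norm.logsf(z))
    return -ll.sum()

def deviance_at(c):
    x = (expr > c).astype(float)
    fit = minimize(neg_loglik, x0=[1.0, 0.0, 0.0], args=(x, time, event))
    return 2.0 * fit.fun                         # deviance = -2 * max log-likelihood

cutoffs = np.quantile(expr, np.linspace(0.2, 0.8, 25))
best = min(cutoffs, key=deviance_at)
print(f"minimum-deviance cutoff: {best:.3f}")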

Presenting Author

Pragya Kumari

First Author

Pragya Kumari

02. A hierarchical Bayesian model for the identification and removal of technical length isomiRs in miRNA sequencing data

MicroRNAs (miRNAs) are small, single-stranded non-coding RNA molecules with important gene regulatory functions. MiRNA biogenesis is a multi-step process, and certain steps of the pathway, such as cleavage by Drosha and Dicer, can result in miRNA isoforms that differ from the canonical miRNA sequence in nucleotide sequence and/or length. These isoforms, called isomiRs, which may differ from the canonical sequence by as few as one or two nucleotides, can have different mRNA targets and stability than the corresponding canonical miRNA. As the body of research demonstrating the role of isomiRs in disease grows, so does the need for differential expression analysis of miRNA data at a scale finer than the miRNA level. Unfortunately, errors during the amplification and sequencing processes can produce technical isomiRs identical to biological isomiRs, making it challenging to resolve variation at this scale. We present a novel algorithm for the identification and correction of technical miRNA length variants in miRNA sequencing data. The algorithm assumes that the transformed degradation rate of canonical miRNA sequences in a sample follows a hierarchical normal Bayesian model. It then draws from the posterior predictive distribution and constructs 95% posterior predictive intervals to determine whether the observed counts of degraded sequences are consistent with the error model. We present the theory underlying the model and assess its performance using an experimental benchmark data set.
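
A minimal sketch of the flagging step, collapsing the paper's hierarchy to a one-level normal model with known variance: model transformed degradation rates as exchangeable normals, sample the posterior predictive, and flag a length variant whose observed rate falls outside the 95% interval (all numbers below are hypothetical):

import numpy as np

rng = np.random.default_rng(7)
# hypothetical transformed degradation rates for canonical miRNAs
rates = rng.normal(loc=-2.0, scale=0.4, size=300)
sigma2 = rates.var(ddof=1)                   # observation variance, treated as known
post_mean = rates.mean()                     # posterior mean under a flat prior
post_var = sigma2 / len(rates)
# posterior predictive draws for a new canonical sequence's rate
draws = rng.normal(post_mean, np.sqrt(post_var + sigma2), size=10_000)
lo, hi = np.percentile(draws, [2.5, 97.5])   # 95% posterior predictive interval
observed = -0.4                              # hypothetical rate for a candidate variant
consistent = lo <= observed <= hi
print(f"95% PPI: ({lo:.2f}, {hi:.2f}); consistent with the error model: {consistent}")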

Presenting Author

Hannah Swan, University of Rochester School of Medicine and Dentistry

First Author

Hannah Swan, University of Rochester School of Medicine and Dentistry

03. Adjusting for covariate misclassification to quantify the relationship between diabetes and local access to healthy food

Without access to healthy food, it may be difficult to maintain a healthy lifestyle free from preventable illness. This access can be quantified for residents of a given area by measuring their distance to the nearest grocery store, but there is a trade-off. We can either consider (i) the more accurate but cost-prohibitive distance measurement that only uses passable roads or (ii) the error-prone but easy-to-obtain straight-line distance calculation that ignores the location of infrastructure and potential natural barriers. Trying to fit a standard regression model to the relationship between disease prevalence and the error-prone, straight-line food access measures would introduce bias to the parameter estimates. Fully observing the more accurate, route-based food access measure is often impossible, and thus, if it can only be partially observed, a missing data problem arises. We address this bias and the missing data by deriving a new maximum likelihood estimator for Poisson regression with a binary, error-prone explanatory variable (representing access to healthy food based on distance to the nearest grocery store), where the errors may depend on additional error-free covariates. With simulation studies, we show the consequences of ignoring the error and how the proposed estimator corrects for that bias while preserving more statistical efficiency than the complete case analysis (i.e., deleting any neighborhoods with missing data). Finally, we apply our estimator to data from the Piedmont Triad region of North Carolina, where we model the relationship between diabetes prevalence and access to healthy food at various distance thresholds.
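
For readers who want the flavor of the likelihood, here is a deliberately simplified Python sketch of joint maximum likelihood with an error-prone binary covariate: Poisson outcomes, a surrogate exposure observed for everyone, and the true exposure observed only in a validation subset. The error model here omits the additional error-free covariates from the abstract, and all data, names, and parameter values are hypothetical:

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

rng = np.random.default_rng(3)
n = 1000
x = rng.binomial(1, 0.4, n)                          # true exposure (partially observed)
xstar = np.where(x == 1, rng.binomial(1, 0.85, n),   # surrogate: sensitivity 0.85,
                 rng.binomial(1, 0.10, n))           # specificity 0.90
y = rng.poisson(np.exp(0.5 + 0.8 * x))               # Poisson outcome, true log RR = 0.8
validated = rng.random(n) < 0.25                     # x observed for 25% of units

def log_pois(y, eta):
    lam = np.exp(eta)
    return y * eta - lam - gammaln(y + 1)

def neg_loglik(theta):
    b0, b1, a_sens, a_spec, a_prev = theta
    sens, spec, prev = expit(a_sens), expit(a_spec), expit(a_prev)
    def joint(xval):                                 # log P(y, xstar, X = xval)
        p_star = np.where(xstar == 1,
                          sens if xval else 1 - spec,
                          1 - sens if xval else spec)
        p_x = prev if xval else 1 - prev
        return log_pois(y, b0 + b1 * xval) + np.log(p_star) + np.log(p_x)
    l1, l0 = joint(1), joint(0)
    full = np.where(x == 1, l1, l0)                  # validated: observed-X contribution
    marg = np.logaddexp(l1, l0)                      # unvalidated: X marginalized out
    return -np.where(validated, full, marg).sum()

fit = minimize(neg_loglik, x0=np.zeros(5), method="BFGS")
print(f"log rate ratio estimate: {fit.x[1]:.3f} (truth 0.8)")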

Presenting Author

Ashley Mullan, Vanderbilt University

First Author

Ashley Mullan, Vanderbilt University

CoAuthor

Sarah Lotspeich, Wake Forest University

04. An ensemble ordinal outcome classifier for high-dimensional data

Several classification techniques for ordinal outcomes in high-dimensional data have been developed over the years. However, the performance of these techniques depends heavily on the evaluation criteria used, and it is usually not known a priori which technique will perform best in any given classification application. In this project, we propose an ensemble classifier, constructed by combining bagging and rank aggregation techniques, that can provide optimal classification of ordinal outcomes in high-dimensional data. Our classifier internally uses several existing ordinal classification algorithms and combines them in a flexible way to adaptively produce results. Our approach optimizes the classification outcomes across multiple performance measures, such as the Hamming score, mean absolute error, Kendall's τ_b, and weighted kappa, among others. Through various simulation studies, we will compare the performance of our proposed ensemble classifier with the individual algorithms included in the ensemble and illustrate that our more intricate approach achieves enhanced predictive performance. We will also show the utility of our ensemble classifier with applications to real high-dimensional genomics data. We will highlight the fact that when dealing with the complexity of ordinal outcomes in high-dimensional datasets, it may be preferable to consider an ensemble classification algorithm combining several classifiers rather than relying on a single classifier.
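
The rank-aggregation step can be illustrated with a minimal Python sketch: score a few generic base classifiers (stand-ins, not the ordinal algorithms in the talk) on several ordinal-aware metrics, rank them within each metric, and aggregate by mean rank, Borda style; the data are simulated:

import numpy as np
from scipy.stats import kendalltau
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "logistic": LogisticRegression(max_iter=2000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "knn": KNeighborsClassifier(15),
}
scores = {}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    scores[name] = [
        accuracy_score(y_te, pred),                          # Hamming score
        -mean_absolute_error(y_te, pred),                    # negated: higher is better
        kendalltau(y_te, pred)[0],                           # Kendall's tau
        cohen_kappa_score(y_te, pred, weights="quadratic"),  # weighted kappa
    ]
names = list(scores)
mat = np.array([scores[n] for n in names])                   # models x metrics
ranks = np.argsort(np.argsort(-mat, axis=0), axis=0)         # rank 0 = best per metric
best = names[int(ranks.mean(axis=1).argmin())]               # Borda-style aggregation
print(f"rank-aggregated choice: {best}")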

Presenting Author

Heranga Rathnasekara

First Author

Heranga Rathnasekara

CoAuthor

Sinjini Sikdar, Old Dominion University

05. An MCMC-based method for dynamic causal modeling of effective connectivity in functional MRI (fMRI)

Effective connectivity analysis in functional magnetic resonance imaging (fMRI) studies the directional interactions among brain regions and experimental stimuli. A widely used method to estimate effective connectivity is dynamic causal modeling (DCM), which uses a state space representation consisting of a latent neural signal model and an observation model that transforms this signal into the observed blood-oxygen-level-dependent (BOLD) signal in fMRI data. A standard DCM involves a complex neural-hemodynamic model system with a variational Bayes method for parameter estimation. While physically sound, this approach can lead to practical challenges such as inexact solutions and underestimated uncertainty in parameter estimates. In our work, we introduce a Markov chain Monte Carlo (MCMC)-based DCM method that adopts a simpler observation model and uses the No-U-Turn Sampler to sample the posterior distribution of network parameters. Preliminary results indicate that this approach maintains robustness against misspecification, allows accurate uncertainty quantification of inferred parameters, and yields consistent estimation of parameters related to the experimental inputs for both simulated and real data.
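
As a toy illustration of the sampling strategy only (PyMC's default NUTS sampler applied to a drastically simplified linear dynamical model, not the neural-hemodynamic DCM of the talk), one might estimate a 2-region connectivity matrix as follows; dimensions, priors, and data are all hypothetical:

import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
A_true = np.array([[0.6, 0.2], [-0.3, 0.5]])
T = 300
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = y[t - 1] @ A_true.T + rng.normal(0, 0.1, 2)

with pm.Model():
    A = pm.Normal("A", 0.0, 1.0, shape=(2, 2))      # connectivity parameters
    sigma = pm.HalfNormal("sigma", 1.0)             # observation noise scale
    mu = pm.math.dot(y[:-1], A.T)                   # one-step-ahead predictions
    pm.Normal("obs", mu=mu, sigma=sigma, observed=y[1:])
    trace = pm.sample(1000, tune=1000, chains=2)    # NUTS is the default sampler

print(trace.posterior["A"].mean(dim=("chain", "draw")).values)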

Presenting Author

Kaitlyn Fales, Pennsylvania State University

First Author

Kaitlyn Fales, Pennsylvania State University

CoAuthor(s)

Hyebin Song, Penn State
Nicole Lazar, Pennsylvania State University

06. Anomaly Detection using a scaled Bregman Divergence

Anomaly detection identifies the specific moments when a system exhibits behavior significantly different from its usual patterns. Density ratio estimation methods, such as the Kullback-Leibler (KL) importance estimation procedure (KLIEP), unconstrained least-squares importance fitting (uLSIF), and relative uLSIF (RuLSIF), have been widely used for anomaly detection because estimating the ratio of two probability densities is easier than estimating each density separately. However, these methods have notable limitations, including unboundedness and instability. In this work, we propose a novel approach based on a scaled Bregman divergence using a mixture measure, together with kernel regression, for anomaly detection in multivariate time series data. Finally, we apply the proposed method to detect anomalies in simulated and real-world data.
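
A minimal sketch of the divergence computation, assuming generator f(t) = t log t, a mixture reference measure m = (p + q)/2, kernel density estimates in place of the talk's kernel regression, and simple grid integration (the time-series windowing is omitted):

import numpy as np
from scipy.stats import gaussian_kde

def scaled_bregman(sample_p, sample_q, grid):
    p = gaussian_kde(sample_p)(grid)
    q = gaussian_kde(sample_q)(grid)
    m = 0.5 * (p + q)                       # mixture reference measure
    u, v = p / m, q / m                     # scaled ratios, bounded in (0, 2)
    f = lambda t: t * np.log(t)             # divergence generator
    fprime = lambda t: np.log(t) + 1.0
    integrand = m * (f(u) - f(v) - fprime(v) * (u - v))
    return integrand.sum() * (grid[1] - grid[0])   # simple grid integration

rng = np.random.default_rng(5)
reference = rng.normal(0, 1, 500)           # "usual" behavior
anomalous = rng.normal(2, 1, 500)           # shifted segment
grid = np.linspace(-6, 8, 400)
print(f"ref vs ref:  {scaled_bregman(reference, rng.normal(0, 1, 500), grid):.4f}")
print(f"ref vs anom: {scaled_bregman(reference, anomalous, grid):.4f}")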

Presenting Author

Yunge Wang, Saint Louis University

First Author

Yunge Wang, Saint Louis University

CoAuthor

Haijun Gong, Saint Louis University

07. Bayesian Simulation-Guided Designs for Adaptive Clinical Trials: Potential Synergies of Open-Source Code and Statistical Software Tools

Multi-arm multi-stage (MAMS) trials represent an efficient approach to clinical trial design. This design type allows multiple treatment arms to be tested simultaneously within one protocol, assigning patients to the most promising arms in an adaptive manner, all while controlling the type I error rate. Key to this approach is the choice of multiplicity comparison procedures (MCPs) and of treatment selection rules. In this case study, we focused on assessing multiple treatment selection rules, including posterior probabilities and other Bayesian approaches, using custom R code integrated into commercial statistical software to optimize a MAMS study design. Leveraging the computing capabilities of commercial software alongside the flexibility of R allowed us to assess a variety of treatment selection rules efficiently and comprehensively. Software-native selection algorithms furthered our optimization aims by offering optimized design candidates for comparison. Our simulation-based approach enhanced the probability of success by comparing different novel treatment selection rules side by side and choosing the best-fit rule for the study at hand. We believe that combining custom code with statistical software offers a comprehensive approach for complex study designs.
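
As one hypothetical example of a Bayesian treatment selection rule in this design space, the Python sketch below keeps, at an interim look, the experimental arm with the highest posterior probability of being best under a Beta-Binomial model; the arm count, response rates, and interim sample size are invented for illustration:

import numpy as np

rng = np.random.default_rng(8)
true_rates = [0.25, 0.35, 0.45]             # control plus two experimental arms
n_interim, n_draws = 40, 10_000
responses = [rng.binomial(n_interim, p) for p in true_rates]
post = np.column_stack([rng.beta(1 + r, 1 + n_interim - r, n_draws)
                        for r in responses])            # Beta(1, 1) posteriors
p_best = (post.argmax(axis=1)[:, None] == np.arange(3)).mean(axis=0)
selected = int(p_best[1:].argmax()) + 1                 # best experimental arm
print(f"P(best) per arm: {np.round(p_best, 3)}; select arm {selected}")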

Presenting Author

Haripria Ramesh Babu

First Author

Haripria Ramesh Babu

08. Characterizing racial and economic disparities and predictors of Gestational Diabetes Mellitus in AAPI Populations: Secondary Analysis of PRAMS data, 2016-2022

Gestational diabetes mellitus (GDM) affects between 2% and 10% of pregnancies in the United States, with increasing prevalence and significant variability attributable to racial/ethnic factors, maternal age, and insurance at the individual level, as well as state-level factors. Asian American and Pacific Islander (AAPI) populations have been documented to have a higher prevalence and risk of developing GDM compared to non-Hispanic White populations. We aim to explore racial and economic disparities in GDM and conduct within-group analyses focusing on AAPI populations to identify risk factors and predictors of GDM within this vulnerable racial group. This is a secondary analysis of the Pregnancy Risk Assessment Monitoring System (PRAMS) 2016-2022 dataset. PRAMS consists of state-specific and national data on current and emerging issues in reproductive and maternal-child health. Subset analyses were based on aggregated race groups: AAPI ethnic subgroups and non-AAPI populations. Bivariate analyses were performed to explore the relationships between potential risk factors for GDM among the subsets, and multivariable logistic regression was used to investigate potential predictors of GDM.
In both the overall dataset and the AAPI subset, the odds of GDM diagnosis consistently increased with maternal age and pre-pregnancy BMI. However, while significant risk factors for GDM in the overall population included a combination of demographic, BMI, psychosocial, and structural/socioeconomic factors, only demographic factors (ethnicity, maternal age, pre-pregnancy BMI) were significant predictors of GDM diagnosis in the AAPI population. This study seeks to inform policy and clinical practice in obstetric care concerning racial/ethnic minorities, low-income women, and at-risk AAPI individuals. The findings presented here contribute new insights on potential predictors of GDM diagnosis and may inform targeted or earlier GDM screening for at-risk individuals.
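
A schematic of the multivariable modeling step, using simulated stand-ins for the PRAMS variables (statsmodels' logistic regression with odds ratios per predictor):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 8000
age = rng.normal(30, 5, n)                  # maternal age (years)
bmi = rng.normal(27, 5, n)                  # pre-pregnancy BMI
eta = -8 + 0.08 * age + 0.10 * bmi
gdm = rng.binomial(1, 1 / (1 + np.exp(-eta)))
X = sm.add_constant(np.column_stack([age, bmi]))
fit = sm.Logit(gdm, X).fit(disp=0)
or_age, or_bmi = np.exp(fit.params[1:])     # odds ratios per one-unit increase
print(f"OR(age): {or_age:.3f}, OR(BMI): {or_bmi:.3f}")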

Presenting Author

Mallory Go

First Author

Mallory Go

09. Developing, Validating, and Analyzing an Assessment that Includes Interactions Among Learning Objectives Related to Confidence Intervals

Hypothesis testing is typically presented as a rote multi-step procedure. This training has cultivated the mindset, now held across many disciplines, that only statistically significant results are meaningful. With the recent push to shift away from the dichotomous formal decision framework, interval estimation is of increasing importance. Confidence intervals are also susceptible to being treated dichotomously if their presentation focuses solely on whether the null value falls in the interval. As such, it is insufficient for just the presentation of statistical findings to shift from a p-value to a confidence interval; the mindset must shift as well, from desiring statistical significance to desiring statistical transparency of results. If students are to be encouraged to communicate their findings using interval estimates, then it is imperative that statistics instructors have the means to assess students' conceptual understanding of confidence intervals.

This study serves as a prototype for how to develop, validate, and analyze an instrument that includes interaction effects among learning outcomes. We take an innovative approach to assessment development by employing a fractional factorial design to highlight the interactions among key learning objectives related to confidence intervals. We use qualitative think-aloud interviews to validate the instrument and to identify students' epistemic understandings about confidence intervals. We collect data from students in participating large-enrollment introductory statistics courses at Penn State to measure students' statistical literacy surrounding confidence intervals at the conclusion of an introductory statistics course using simulation-based inference methods. This research could extend the findings of previous studies on statistical literacy that include confidence intervals as one of many topics assessed.
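
To make the design idea concrete, here is a minimal Python sketch of a 2^(4-1) fractional factorial layout; the four binary factors standing in for learning objectives, and the defining relation D = ABC, are illustrative choices, not the study's actual design:

import itertools
import numpy as np

full = np.array(list(itertools.product([-1, 1], repeat=3)))  # factors A, B, C
D = full.prod(axis=1, keepdims=True)                         # generator: D = ABC
design = np.hstack([full, D])                                # 8 runs instead of 16
print(design)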

Presenting Author

Susan Lloyd, Penn State University

First Author

Susan Lloyd, Penn State University

CoAuthor

Matthew Beckman, Penn State University

10. Does International Migration Impact the School Enrollment and Child Labor Outcomes of Left-Behind Children? Evidence from Rural Bangladesh

This paper examines how the international migration of household members affects the school enrollment of and weekly hours worked by children living in rural Bangladesh. International migrants play an important role in developing countries, with remittances comprising 6% of the GDP of Bangladesh. Yet there is a lack of comprehensive research on how this influences child outcomes. I analyze nationally representative surveys and address potential endogeneity by using historic migration rates as an instrument for a household's migrant status. I find that boys aged 6-17 from migrant households are less likely to be enrolled in school than boys in non-migrant households, though there is no impact on girls. Boys aged 6-17 from migrant households also work more hours per week on average, with boys aged 15-17 years working 21.25 more hours than male peers in non-migrant households. Girls in migrant households work slightly fewer hours per week than girls in non-migrant households. These results suggest that rural boys living in migrant households are less likely to complete schooling, which may limit their long-run human capital formation and earning potential. 
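
The identification strategy can be sketched in a few lines of simulated two-stage least squares, with a historic migration rate instrumenting current migrant status; the data, effect sizes, and omitted controls are all hypothetical:

import numpy as np

rng = np.random.default_rng(6)
n = 5000
z = rng.normal(size=n)                      # instrument: historic migration rate
u = rng.normal(size=n)                      # unobserved confounder
migrant = (0.8 * z + u + rng.normal(size=n) > 0).astype(float)
hours = 10 + 4.0 * migrant + 3.0 * u + rng.normal(size=n)   # true effect = 4

def ols(y, X):
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

naive = ols(hours, migrant)[1]              # biased by the confounder u
stage1 = ols(migrant, z)
fitted = stage1[0] + stage1[1] * z          # first stage: instrumented status
tsls = ols(hours, fitted)[1]                # second stage
print(f"naive OLS: {naive:.2f}, 2SLS: {tsls:.2f} (truth 4.0)")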

Presenting Author

Alaka Halder

First Author

Alaka Halder

11. Functional Data Analysis and Stochastic Differential Equations: Recent Advances and Applications in Diffusion-Driven Models

The talk presents my recent research at the nexus of Functional Data Analysis (FDA) and Stochastic Differential Equations (SDEs). After a brief introduction to FDA and SDEs, I present novel contributions that enhance the FDA toolbox for better adaptability to dynamic and stochastic systems. Key applications arise in finance and economics. The talk briefly covers the application of the methods I developed to estimation and change-point detection, demonstrating their practical impact in important scenarios.
The talk is based on the recent manuscripts https://arxiv.org/abs/2305.04112 and https://arxiv.org/abs/2404.11813.

Presenting Author

Neda Mohammadi, NC A&T State University

First Author

Neda Mohammadi, NC A&T State University

12. Impact of Text Preprocessing Techniques on Fake News Detection

The journey from raw text to actionable insights in natural language processing involves several critical preprocessing stages. These stages prepare textual data for further analysis through strategies such as eliminating infrequently occurring words, removing stopwords, removing numerical entities, and standardizing text to lowercase. Following these initial steps, the processed text undergoes word embedding using algorithms such as Word2Vec and BERT. This study examines how various text preprocessing and word embedding techniques influence the effectiveness of fake news detection systems. Specifically, it examines the roles that the choice of classification, embedding, and preprocessing techniques play in optimizing key metrics such as accuracy, precision, sensitivity, and specificity in the context of fake news identification. Our findings highlight that the strategic inclusion of stopwords, particularly in conjunction with BERT embeddings, enhances the performance of fake news detection models, as does careful selection of word-frequency threshold criteria.
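
The stopword comparison can be mimicked in a few lines: the same classifier trained with stopwords removed versus retained, plus a word-frequency threshold (min_df). The toy corpus and labels are invented stand-ins for a fake news dataset, and simple TF-IDF features replace the talk's Word2Vec/BERT embeddings:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "scientists confirm the study was peer reviewed and replicated",
    "officials report the figures match independent audits",
    "the new budget was approved after a public hearing",
    "reporters verified the claims with three primary sources",
    "shocking secret cure they do not want you to know about",
    "you will not believe what this one weird trick reveals",
    "anonymous insider exposes massive hidden conspiracy",
    "miracle device banned because it works too well",
] * 10                                  # repeat to allow cross-validation
labels = ([0] * 4 + [1] * 4) * 10       # 0 = real, 1 = fake (toy labels)

for stop in (None, "english"):
    pipe = make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words=stop, min_df=2),
        LogisticRegression(max_iter=1000),
    )
    acc = cross_val_score(pipe, texts, labels, cv=5).mean()
    kept = "removed" if stop else "retained"
    print(f"stopwords {kept}: CV accuracy = {acc:.3f}")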

Presenting Author

Jessica Hauschild, United States Air Force Academy

First Author

Jessica Hauschild, United States Air Force Academy

CoAuthor

Kent Eskridge, University of Nebraska, Statistics Department

13. Machine Learning Model Robustness and Performance Stability in Future Years when Predicting Adverse Events in a Veteran Population and a Diabetic Subpopulation

We developed machine learning models to predict adverse events after Veterans received non-steroidal anti-inflammatory drugs (NSAIDs) during acute care encounters, and we evaluated model robustness in subsequent years as well as within a subpopulation of patients with diabetes mellitus. We collected electronic health record data from a national U.S. Veteran population aged ≥18 years who presented to an emergency department or urgent care center, were prescribed NSAIDs from 1/1/2017-12/31/2023, and survived longer than 1 day post-encounter. The outcome of interest was any adverse event within 30 days of the visit (acute kidney injury stage 2-3, gastroesophageal reflux disease, gastrointestinal bleed, or allergic reaction). Using 85 clinical patient variables for care delivered in 2017, we built a logistic regression model using LASSO regularization and an extreme gradient boosting (xgboost) model. We tested the 2017 models on data from each subsequent year, starting with 2018 encounter data and ending in 2023. We assessed model performance using the calibration slope and the area under the receiver operating characteristic curve (AUC). We were also interested in model performance when applied to a subgroup of patients with a history of diabetes. The incidence rates of any adverse event were 4.9% for the entire cohort and 6.3% in the diabetic subgroup. For the 2017 models evaluated on 2023 encounters, LASSO had a calibration slope of 1.020 compared to 1.040 for xgboost, and AUC was similar (0.790 for xgboost, 0.789 for LASSO). For the same models and test data in patients with diabetes, xgboost had a calibration slope of 1.020 vs 0.975 for LASSO, while AUC was similar (0.783 for LASSO, 0.782 for xgboost). Model performance for years 2018-2022 was similar. The models also performed moderately well over time in the diabetic subgroup, but performance should be reassessed in a non-Veteran population before making broad generalizations about the model predictions.
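
A schematic of the temporal validation loop, on synthetic data with a mild yearly covariate drift (the drift mechanism, feature count, and event rate are invented): train on one year, then report AUC and calibration slope on each later year:

import numpy as np
from scipy.special import expit, logit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(11)
p_features, beta = 20, rng.normal(0, 0.4, 20)

def make_year(n, drift):
    X = rng.normal(0, 1, (n, p_features)) + drift      # covariate shift over time
    y = rng.binomial(1, expit(X @ beta - 2.5))         # uncommon adverse events
    return X, y

X17, y17 = make_year(5000, drift=0.0)
model = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X17, y17)

for year in range(2018, 2024):
    Xy, yy = make_year(5000, drift=0.02 * (year - 2017))
    p = model.predict_proba(Xy)[:, 1]
    auc = roc_auc_score(yy, p)
    # calibration slope: coefficient from refitting the outcome on logit(p)
    slope = LogisticRegression().fit(logit(p).reshape(-1, 1), yy).coef_[0, 0]
    print(f"{year}: AUC = {auc:.3f}, calibration slope = {slope:.3f}")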

Presenting Author

Amy Perkins, Vanderbilt University Medical Center

First Author

Amy Perkins, Vanderbilt University Medical Center

CoAuthor(s)

Michael J Ward, Vanderbilt University Medical Center
Jesse Wrenn, Vanderbilt University Medical Center
Robert Winter, Vanderbilt University Medical Center
Chad Dorn, Vanderbilt University Medical Center
Amber Hackstadt, Vanderbilt University Medical Center
Michael E Matheny, Vanderbilt University Medical Center

14. Maximum likelihood estimation and the EM algorithm in a COVID-19 Markov jump stochastic epidemic model

As of April 2024, the COVID-19 epidemic has produced the following statistics: over 14 billion vaccine doses have been distributed, 775 million individuals have been infected, and over 7 million deaths have been recorded. This presentation introduces a new theoretical discrete-time Markov chain model for COVID-19 epidemic dynamics, including asymptomatic and symptomatic disease transition modes, exposure, vaccination, hospitalization, recovery, and death. Epidemiological parameters such as the basic reproduction number are derived. Statistical inference is conducted by applying the EM algorithm to account for both missing and hidden states in the observed data. Numerical simulation results are given.
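
A stripped-down chain-binomial simulation conveys the discrete-time Markov chain flavor, with asymptomatic and symptomatic branches; compartment names and rates below are hypothetical simplifications of the talk's model, and the EM step for hidden states is omitted:

import numpy as np

rng = np.random.default_rng(42)
N, T = 10_000, 120
beta, kappa, p_sym, gamma, mu = 0.35, 0.2, 0.6, 0.1, 0.002  # here R0 ~ beta/gamma

S, E, Ia, Is, R, D = N - 10, 10, 0, 0, 0, 0
for t in range(T):
    lam = beta * (Ia + Is) / N                      # force of infection
    new_E = rng.binomial(S, 1 - np.exp(-lam))       # S -> E
    new_I = rng.binomial(E, kappa)                  # E -> infectious
    new_Is = rng.binomial(new_I, p_sym)             # symptomatic branch
    new_Ia = new_I - new_Is                         # asymptomatic branch
    rec_a = rng.binomial(Ia, gamma)
    rec_s = rng.binomial(Is, gamma)
    deaths = rng.binomial(Is - rec_s, mu)           # deaths among symptomatic
    S -= new_E
    E += new_E - new_I
    Ia += new_Ia - rec_a
    Is += new_Is - rec_s - deaths
    R += rec_a + rec_s
    D += deaths

print(f"final size: {N - S}, deaths: {D}")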

Presenting Author

Ivy Collins, University of Georgia

First Author

Ivy Collins, University of Georgia

CoAuthor

Divine Wanduku

15. Video Chat, Social Presence, and Idealization in Long-Distance Romantic Relationships

With the rapid pace of change in communication technologies, it is pertinent to stay updated on how such technologies are utilized for relational purposes. Although video chat is over a decade old and is the primary mode of communication for long-distance romantic partners (Kirk, 2013; Pinsker, 2019), little research has gone beyond how long-distance romantic partners use infrequent phone calls and emailing to maintain their relationships (Stafford & Merolla, 2007). This study attempted to close that gap by examining how the use of rich media such as video chat might reduce the higher levels of idealization that have been found among long-distance romantic partners (Stafford & Reske, 1990). Past research has understood idealization to be fueled by certain behaviors: limited face-to-face contact, conflict avoidance, and selective self-presentation. Despite noting these behaviors as potential reasons for higher idealization in a long-distance relationship, previous studies have neglected to measure idealistic behaviors as they relate to perceived idealization. Though it was predicted that those who engaged in more idealistic behaviors would perceive more idealization, it was found instead that participants who engaged in fewer such behaviors tended to score higher on their perceived level of idealization regarding their long-distance romantic partner. Additionally, women were significantly more likely than men to engage in all three idealization behaviors. Though social presence and the three idealization behaviors were tested as mediators, no significant mediation effects were found.

Presenting Author

Rebecca Johnson, Wake Forest University

First Author

Rebecca Johnson, Wake Forest University

16. A Spatial Analysis of the Indian Farmers' Protests

In 2020-2021, 250 million people protested three agricultural laws that threatened to suppress Indian farmers' autonomy, making it the world's largest protest to date. These three laws promote globalization of the agriculture sector and decrease reliance on the minimum support price offered by each state. Farmers, traders, and affiliated groups faced the absence of state protection against global corporations having free entry into the market and potential control of the produce supply. We conduct a spatial analysis of the farmers' protests from 2020 to 2021 in India, focusing on the types of actors involved in the protests to observe their spatial variability across the country. We connect socioeconomic features, using the PRIO-GRID dataset, to the spatial intensity of protests across India. Our study relates spatial socioeconomic features to the intensity of protests across different types of actors (such as religious or political organizations and different types of unions). We use this modeling framework to begin to understand the capacity and motivations of people protesting across a socioeconomically diverse nation. These findings provide insight into the spatial range and patterns of different groups within the farmers' protests. This research is valuable for furthering understanding of the people and groups involved in the farmers' protests and of the complex nature of continuing agricultural advancement in India.
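
The intensity modeling step can be sketched as a Poisson regression of grid-cell protest counts on socioeconomic covariates, in the spirit of a PRIO-GRID analysis; the grid, covariates, and coefficients below are simulated stand-ins:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2021)
n_cells = 500
covars = rng.normal(size=(n_cells, 3))      # e.g., population, nightlights, farm share
eta = 0.5 + covars @ np.array([0.8, 0.3, -0.4])
counts = rng.poisson(np.exp(eta))           # protest events per grid cell

X = sm.add_constant(covars)
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(fit.summary())                        # coefficients are log intensity ratios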

Presenting Author

Claire Kelling

First Author

Claire Kelling

CoAuthor

Manasvi Khanna, Wellesley College