CS1e: Speed Session 2

Conference: Women in Statistics and Data Science 2024
10/16/2024: 4:00 PM - 5:00 PM EDT
Speed 
Room: Cypress 

Presentations

01. A New D – K Class Estimator for the Poisson Regression Model: Simulation and Application

The Poisson regression model (PRM) is used to model count variables and is usually fitted by maximum likelihood estimation (MLE). However, the performance of MLE is unreliable in the presence of multicollinearity. We therefore propose a new estimator, the Dawoud – Kibria (DK) class estimator, for the PRM as a remedy for the problems caused by multicollinearity. To assess the proposed estimator, we present a theoretical comparison with the MLE, the traditional ridge estimator, and the Liu estimator based on the matrix mean squared error (MMSE) and scalar mean squared error (MSE) criteria. A Monte Carlo simulation study is conducted under different controlled conditions to show the efficacy of the proposed estimator. An empirical application is also considered to illustrate the performance of the proposed DK estimator for the PRM. The simulation study and the example show that the DK class estimator is the most effective and consistent estimation method compared with the MLE and the other competing estimators when multicollinearity is present.

Presenting Author

Karamelahi Chohan

First Author

Karamelahi Chohan

02. Advancing Biostatistical Approaches in Clinical Research: Lessons from Human Metabolism Assessment

Indirect calorimetry assesses metabolism using respiratory oxygen and carbon dioxide gas exchange. Whole room indirect calorimeters detect small, dynamic changes in metabolic parameters, such as carbohydrate oxidation. The area under the curve (AUC) is commonly used to evaluate these acute temporal changes by summarizing the time series into a single parameter, but this limits power and specificity in describing the temporal changes. Here, we evaluate the potential for Bayesian hierarchical modeling to describe individual trends in time series metabolic data and compare the resulting estimates of aggregated data to those from the commonly used AUC. Carbohydrate utilization was assessed over time after participants consumed drinks containing one of three sugars: sucrose, dextrose, or fructose. Carbohydrate oxidation profiles over time were summarized and compared across groups and individuals using cubic splines and an appropriate Bayesian hierarchical model. The resulting models demonstrated that the carbohydrate oxidation profile for dextrose differed statistically from sucrose and fructose, with the posterior probability for all spline coefficients for sucrose and fructose being less than 5%. Additionally, the Bayesian model estimates of the AUCs for dextrose, fructose, and sucrose were 4.43, 22.4, and 30.5, respectively. By comparison, the AUCs estimated from the raw data using the trapezoidal rule were 0.6, 19.9, and 24.6, respectively. Our model characterized the trajectory of carbohydrate oxidation following the consumption of one of three sugars more precisely than AUC alone. This innovation has relevant applications in the areas of human metabolism and physiology to better describe a human's response to other stimuli. Future work includes exploring parametric forms other than the cubic spline for measuring metabolism.
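The trapezoidal-rule AUC mentioned in this abstract can be sketched as follows. This is a minimal illustration only: the function name and the measurement values are hypothetical and are not taken from the study.

```python
def trapezoid_auc(times, values):
    """Area under the piecewise-linear curve through (times[i], values[i])."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two paired observations")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        # Each interval contributes a trapezoid: width times mean height.
        area += width * (values[i] + values[i - 1]) / 2.0
    return area


# Hypothetical data: minutes since the drink and change in carbohydrate oxidation.
minutes = [0, 30, 60, 90, 120]
oxidation_change = [0.0, 0.2, 0.5, 0.3, 0.1]
auc = trapezoid_auc(minutes, oxidation_change)
```

The single number `auc` discards the shape of the curve, which is the limitation the abstract's spline-based hierarchical model is meant to address.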

Presenting Author

Monica Ahrens

First Author

Monica Ahrens

CoAuthor(s)

Mary Elizabeth Baugh, Fralin Biomedical Research Institute, Virginia Tech, Roanoke, VA
Zach Hutelin, Fralin Biomedical Research Institute, Virginia Tech, Roanoke, VA
Alexandra Hanlon, Virginia Tech
Alex DiFeliceantonio, Virginia Tech FBRI

03. Authentic Data Explorations: Engaging Students with Statistical Investigations in Meaningful Contexts

Inquiry-based activities allow students to explore questions they find interesting and applicable, motivating deeper engagement in a given task. Authentic data situates inquiry-based statistical explorations in meaningful contexts and advances the development of students' data acumen. We have designed two inquiry-based activities that offer students rich experiences exploring questions about authentic data through R Shiny applets. Each activity aims to strengthen students' data exploration skills while furthering their statistical content knowledge. The two lessons are centered on understanding concepts related to the normal distribution and confidence intervals for proportions. The contexts include distributions of rent across the US and access to the Supplemental Nutrition Assistance Program (SNAP) based on regional demographics. The two lessons meet introductory statistics learning objectives and foster statistical literacy as students become aware of social phenomena that are modeled by statistics and are situated in a context of societal and personal importance. Students will gain experience taking the lead in statistically investigating a question that interests them and will practice communicating their results to others. We will discuss creating and implementing these activities and the R Shiny applets, as well as the importance of situating statistical tasks in real-world contexts with authentic data.

Presenting Author

Maria Cruciani, Michigan State University

First Author

Maria Cruciani, Michigan State University

CoAuthor(s)

Justin Post, North Carolina State University
Jennifer Green, Michigan State University
Sunghwan Byun, North Carolina State University

04. Challenges of Adolescence: A Cross-Group and Multidimensional Exploration on the Associations between High School Student Depression and Risk Behaviors

The current understanding of high school students' risk behaviors associated with depression is limited in scope. This research collects a large dataset and comprehensively examines nine risk aspects (50 individual components): unsafe driving, weapons, physical fighting, suicide, smoking, drinking alcohol, substance abuse, unhealthy eating, and physical inactivity. We explicitly contrast how depression in high school students differentially affects these risk aspects and their individual components. We map out how these effects differ across student segments by gender, grade, and ethnicity, contributing to a richer and clearer picture of the impact of depression. Beyond that, we make the first attempt in this research stream to use a higher-order PCA approach and demonstrate the nuance of the relationships between depression and risk factors across the sub-segments of students.

Presenting Author

Jessica Sun

First Author

Jessica Sun

CoAuthor

Daniel Zhang, University of Zurich

05. Clustering for Data with Categorical Outcomes Using a Generalized Linear Mixed Effects Model with Simultaneous Variable Selection

I propose a model-based clustering method for high-dimensional, longitudinal data with categorical outcomes via regularization. The development of this method was motivated in part by a study on 177 Thai mother-child dyads to identify risk factors for early childhood caries (ECC). Another considerable motivation was a dental visit study of 308 pregnant women to ascertain determinants of successful dental appointment attendance. There is no available method capable of clustering longitudinal categorical outcomes while also selecting relevant variables. Within each cluster, a generalized linear mixed-effects model is fit with a convex penalty function imposed on the fixed effect parameters. Through the expectation-maximization algorithm, model coefficients are estimated using the Laplace approximation within the coordinate descent algorithm, and the estimated values are then used to cluster subjects via k-means clustering for longitudinal data. The Bayesian information criterion can be used to determine the optimal number of clusters and the tuning parameters through a grid search. My simulation studies demonstrate that this method has satisfactory performance and is able to accommodate high-dimensional, multi-level effects as well as identify longitudinal patterns in categorical outcomes. 

Presenting Author

Samantha Manning, University of Rochester

First Author

Samantha Manning, University of Rochester

06. Gender, racial, and ethnic identity representation in National Institutes of Health study sections: Impact of the COVID-19 pandemic, 2019-2021

Disparities in biomedical research hinder the professional advancement of underrepresented researchers. Before the COVID-19 pandemic, women were underrepresented in National Institutes of Health (NIH) study sections, panels of scientists that influence allocation of grant funding. Little is known about representation of racial/ethnic or sexual/gender minorities. This retrospective study seeks to elucidate the effect of the COVID-19 pandemic on NIH study section representation. Data were extracted from public study section lists during the May to July 2019, 2020 and 2021 NIH review cycles. For all reviewers (N=16,980), gender was identified using pronouns or photos from reputable websites. To confirm gender identity and collect data on race/ethnicity and sexual orientation, an electronic REDCap survey was distributed to all reviewers in fall 2022.
We assessed the demographic composition of NIH study sections from 2019-2021 under different methods to address missing data. Complete case analysis was conducted with the 42% of reviewers who completed the survey (n=7189). To address missing race/ethnicity data, we propose random imputation with proportional constraints. Expected proportions for racial/ethnic groups were obtained from the NIH Data Book, which reports demographics for research grant principal investigators. Sensitivity analysis was conducted with 2020 and 2021 US Census data, which included reference proportions for race/ethnicity and sexual orientation. Lastly, for each missing data approach, we used generalized estimating equations to model the effects of gender, institute, race, ethnicity and review cycle on study section membership. This analysis leverages novel data and existing reference information to assess representation in NIH study sections. Understanding representation of the scientists who influence NIH grant decisions is an important first step to ensure biomedical workforce diversity and innovative science that addresses the needs of the US. 

Presenting Author

Alexandra Knitter, University of Chicago

First Author

Alexandra Knitter, University of Chicago

CoAuthor(s)

Monica Kowalczyk, University of Chicago
Lucy Alejandro, University of Chicago
Anna Volerman, University of Chicago

07. Generalized pairwise comparisons using pseudo-observations for time-to-event censored data in a randomized controlled trial setting

As an extension of the Mann-Whitney approach in the randomized controlled trial (RCT) setting, generalized pairwise comparisons (GPC) methods are based on assigning scores to pairs of subjects where all pairs of treatment and control subjects are evaluated: the outcome of every individual in the treatment group is compared with the outcome of every individual in the control group. The GPC test statistic can, therefore, be expressed as a treatment effect by such measures as the net benefit, win odds, win ratio (WR), or probability index for the therapeutic intervention. Taking the WR as an example for this study, it has an attractive interpretation as the inverse of the hazard ratio under proportional hazards. However, its estimate can be biased in the presence of substantial censoring, and cautious interpretation is needed. Considerable censoring increases the number of indeterminate treatment and control pairs, where the win or loss is undetermined due to the censored observation(s) and a definitive score cannot be assigned. Such indeterminate pairs are typically treated as "ties" and scored as 0. We propose a novel approach leveraging pseudo-observation values to address this issue of ties resulting from censoring for a single time-to-event outcome. We demonstrate and compare the performance measures of our method with existing GPC methods in Monte Carlo simulations under various equal drop-out, unequal drop-out, and administrative censoring scenarios. Moreover, we illustrate this new approach using two reconstructed datasets, one from an oncology RCT and one from a cardiomyopathy RCT.
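The conventional scoring described in this abstract, in which pairs left indeterminate by censoring are counted as ties, can be sketched as below. This is a Gehan-type illustration under stated assumptions, not the authors' pseudo-observation method: longer survival is treated as better, each subject is a hypothetical `(time, event)` tuple with `event` equal to 1 for an observed event and 0 for censoring, and the data are invented.

```python
def gehan_pairwise(treatment, control):
    """Score every treatment-control pair as a win, loss, or tie for treatment."""
    wins = losses = ties = 0
    for t_time, t_event in treatment:
        for c_time, c_event in control:
            if c_event and t_time > c_time:
                # Control had the event while the treated subject was still at risk.
                wins += 1
            elif t_event and c_time > t_time:
                # Treated subject had the event while the control was still at risk.
                losses += 1
            else:
                # Censoring (or equal times) leaves the pair indeterminate.
                ties += 1
    return wins, losses, ties


# Hypothetical data: (follow-up time, event indicator).
treatment = [(10, 1), (12, 0), (15, 1)]
control = [(8, 1), (11, 0), (9, 1)]

wins, losses, ties = gehan_pairwise(treatment, control)
n_pairs = len(treatment) * len(control)
win_ratio = wins / losses if losses else float("inf")
net_benefit = (wins - losses) / n_pairs
```

In this toy data two of the nine pairs are indeterminate, and both involve the censored control subject; heavier censoring increases that share, which is the source of the bias the abstract discusses.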

Presenting Author

Stephanie Pan, Boston University

First Author

Stephanie Pan, Boston University

CoAuthor(s)

Janice Weinberg, Boston Univ School of Public Health
Prasad Patil, Boston University
Sara Lodi, Boston University School of Public Health
Michael LaValley, Boston University

08. Going for Gold: Using Record Linkage and Bayesian Hierarchical Modeling to Select Winning Gymnasts at the 2024 Paris Olympics

Athletes who compete in the Olympic Games participate in many other competitions before this international event, and scores in these competitions are possible tools to use to predict Olympic performance. However, at each competition the name of a gymnast is not always perfectly recorded, as nicknames and other variants of names are often used. In this project, I use record linkage to identify athletes in data from 2022 and 2023 international competitions. I propose an adaptation to the Jaro-Winkler Similarity score based on the specific discrepancies in names in this data set. I then use the linked data to predict winning gymnasts in Women's Artistic Gymnastics using Bayesian Hierarchical Modeling. 
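For reference, the standard Jaro-Winkler similarity that this abstract takes as its starting point can be sketched as follows. This is the textbook form with the usual prefix scale of 0.1 and a maximum prefix of 4 characters; the author's adaptation for gymnast names is not described in the abstract and is not reproduced here.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


score = jaro_winkler("MARTHA", "MARHTA")
```

Because the prefix bonus rewards agreement at the start of a string, nickname variants that change the beginning of a name score lower than variants that change the end, which is one plausible reason a data-specific adaptation might be needed.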

Presenting Author

Zongyue Teng, Vanderbilt University

First Author

Zongyue Teng, Vanderbilt University

CoAuthor

Nicole Dalzell, Wake Forest University

09. Hotelling's T-Squared Statistic Application

This study applied Hotelling's T-squared statistic to the performance of statistics students in four courses (Statistical Theory, Applied General Statistics, Operational Research, and Statistical Inference) over two academic sessions (2021 and 2022). The main objective was to compare the students' average scores in these courses and to determine whether a significant difference exists between the average performances of students in these courses in the Statistics department. The data were analyzed using two software packages, Excel and the Statistical Package for the Social Sciences (SPSS). The analysis indicated a significant difference in the students' average performance between the two academic sessions.
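A two-sample Hotelling's T-squared statistic of the kind this abstract applies can be sketched in plain Python as follows. This is an illustrative sketch only: it assumes equal covariance matrices in the two groups (pooled covariance), the helper names are hypothetical, and the abstract does not state which form of the test was used in Excel or SPSS.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def hotelling_t2(sample1, sample2):
    """Return (T-squared, F statistic) for two samples of p-dimensional rows."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = mean_vector(sample1), mean_vector(sample2)
    s1, s2 = scatter_matrix(sample1, m1), scatter_matrix(sample2, m2)
    p = len(m1)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    weighted = solve(pooled, diff)  # pooled^{-1} (mean1 - mean2)
    t2 = (n1 * n2 / (n1 + n2)) * sum(d * w for d, w in zip(diff, weighted))
    # Under the null, this F statistic has (p, n1 + n2 - p - 1) degrees of freedom.
    f_stat = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    return t2, f_stat


# Hypothetical one-variable check: T-squared equals the squared pooled t statistic.
t2_demo, f_demo = hotelling_t2([[1.0], [2.0], [3.0]], [[2.0], [3.0], [4.0]])
```

For the study's setting, each row would hold one student's scores in the four courses and the two samples would be the two academic sessions.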

Presenting Author

Eunice Olushola Idowu, Yaba College of Technology

First Author

Eunice Olushola Idowu, Yaba College of Technology

CoAuthor(s)

Yemisi Olamide Ajiboye, Yaba College of Technology
Saidi Oyedele Amusa, Yaba College of Technology

10. Joint Compartmental-Survival Bayesian Model and Mediation Analysis with Application to a Canine Cohort Study

Compartmental models can be used to study the epidemiological characteristics of an infectious disease. For example, the susceptible-infectious-susceptible (SIS) model is appropriate for infections that yield no immunity upon recovery. In this study, we expand the SIS framework to account for multiple co-infections among samples from populations at different locations. We propose a fully probabilistic Bayesian compartmental model with random location effects and a survival component to estimate the effects of infection on long-term survival. In particular, we present a fully Bayesian approach to estimating the direct and indirect effects of treatment on survival. We evaluate these methods through simulation studies and apply them to a longitudinal study in dogs exposed to multiple tick-borne infections and a parasitic infection. We also discuss the benefits of a joint model that incorporates a compartmental model framework to inform survival and compare it to a more traditional survival approach.
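The basic SIS dynamics that this abstract extends can be sketched with a simple deterministic simulation. This is a single-population, single-infection illustration using Euler steps; the parameter values are hypothetical, and it is not the authors' Bayesian model with co-infections, location effects, or a survival component.

```python
def simulate_sis(beta, gamma, population, initial_infected, days, dt=0.1):
    """Euler simulation of dI/dt = beta*S*I/N - gamma*I with S + I = N."""
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    trajectory = [(0.0, susceptible, infected)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * susceptible * infected / population * dt
        # Recovered individuals return directly to the susceptible pool.
        recoveries = gamma * infected * dt
        susceptible += recoveries - new_infections
        infected += new_infections - recoveries
        trajectory.append((step * dt, susceptible, infected))
    return trajectory


# Hypothetical rates: transmission 0.5 per day, recovery 0.25 per day.
trajectory = simulate_sis(beta=0.5, gamma=0.25, population=1000,
                          initial_infected=1, days=200)
final_time, final_s, final_i = trajectory[-1]
```

When `beta` exceeds `gamma`, the infected count settles at the endemic level N(1 - gamma/beta), which is 500 for these values; no immunity accumulates because the model has no recovered compartment.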

Presenting Author

Marie Ozanne, Mount Holyoke College

First Author

Marie Ozanne, Mount Holyoke College

CoAuthor(s)

Grant Brown
Felix Pabon-Rodriguez, Indiana University School of Medicine

11. Large Language Model for Detecting Unreported Cases of Foodborne Illnesses

Foodborne outbreaks pose a serious yet preventable threat to public health, often leading to loss of worker productivity, fatalities, and significant economic impacts. Traditional detection methods typically face delays between the onset of initial infections and the public notification of an outbreak. Recently, the use of Twitter data to identify unreported foodborne illnesses has been explored with advanced models like BERTweet, which show promise yet still exhibit limitations in accuracy and cost efficiency. This study explores the potential of large language models to enhance the accuracy and efficiency of early detection of foodborne outbreaks. We developed and assessed a GPT-4 zero-shot model and a GPT-4 few-shot model to detect cases of unreported foodborne illnesses. The BERTweet model attained an accuracy of 0.88 and an F1-score of 0.85. The GPT-4 zero-shot model achieved an accuracy of 0.89 and an F1-score of 0.86. The GPT-4 few-shot model achieved an accuracy of 0.92 and an F1-score of 0.90. Our results indicate that the GPT-4 zero-shot model performs comparably to the BERTweet model, with marginal improvements. More notably, our GPT-4 few-shot model outperforms the BERTweet model. Additionally, it does not require extensive human labeling, saving time and money. The application of large language models like GPT-4 provides a more accurate and resource-efficient method for the early detection of foodborne outbreaks, underscoring the significant potential of these models for real-time, precise, and cost-efficient public health surveillance.
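The two metrics reported in this abstract, accuracy and F1-score, can be computed from labels and predictions as sketched below. The labels are hypothetical (1 could mean a post describing an unreported illness) and are not the study's data.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Hypothetical labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
```

F1 ignores true negatives, so it can sit below accuracy when the positive class is rare, which is consistent with the pattern in the reported scores.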

Presenting Author

Sophia Yuan, Parkview High School

First Author

Sophia Yuan, Parkview High School

CoAuthor(s)

Kevin Bui, Parkview High School
Alexis Solorzano, Parkview High School

12. Performance Testing and Comparative Benchmarking for Creating a Self-Sustaining Ecosystem for data.table

The data.table package in R is a powerful tool for data analysis, combining efficient C code with user-friendly R syntax. To ensure its long-term sustainability, the NSF POSE program has funded a project from 2023 to 2025 to build a self-sustaining ecosystem around data.table.

In this presentation, we will discuss the importance of performance testing in the development of data.table and present a general approach that can be applied to other R packages. By creating performance tests based on historical regressions, we can track the package's speed and memory usage over time, ensuring that code changes and new releases do not degrade its performance. We will demonstrate the use of the atime package to benchmark execution time and memory usage, giving developers confidence that performance and reliability are maintained. This approach not only benefits data.table but also serves as a model for other R package developers to enhance the performance and popularity of their own projects.

Presenting Author

Doris Amoakohene

First Author

Doris Amoakohene

13. Quantifying the impact of measurement error on health disparities models

Healthy eating is an important part of living a healthy life, and unequal access to healthy foods can perpetuate health disparities. When not everyone has the same level of access to healthy foods, it can result in disproportionately high rates of disease in communities with fewer healthy food options. Previous studies exploring this topic have used county- or census tract-level data on both disease and food access, which captures a broad range of diverse communities. However, counties and census tracts cannot provide specific details about the individuals and communities within them. In this project, we investigate the relationship between the distance from patients' homes to the nearest healthy food store (proximity) and the prevalence of diabetes. How proximity to healthy foods is measured poses an additional challenge, as distance measures are either computationally simple but inaccurate (straight-line distances) or accurate but computationally complex (map-based distances). To approach these questions, we extract patient information (including diabetes diagnoses) from the electronic health record (EHR), geocode patients' home addresses, and calculate both straight-line and map-based proximity to healthy foods. Using rate ratios, relative indices of inequality, and concentration curves, we quantify whether patients who live farther from healthy foods (indicating worse access) face a higher burden of prevalent diabetes. Finally, we discuss the impact of using inaccurate access measures to quantify health disparities.
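The computationally simple straight-line measure mentioned in this abstract is often computed with the haversine formula, sketched below. The abstract does not say which straight-line formula the authors use, so the haversine choice, the Earth radius of 6371 km, and the coordinates are all assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_store_km(home, stores):
    """Straight-line proximity: distance from a home to its closest store."""
    return min(haversine_km(home[0], home[1], s[0], s[1]) for s in stores)


# Hypothetical geocoded points (latitude, longitude).
home = (36.10, -80.25)
stores = [(36.12, -80.30), (36.05, -80.20), (36.20, -80.25)]
proximity = nearest_store_km(home, stores)
```

A map-based distance would instead follow the road network and is never shorter than this straight-line value, so the straight-line measure systematically understates travel distance.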

Presenting Author

Cassandra Hung, Wake Forest University

First Author

Cassandra Hung, Wake Forest University

CoAuthor

Sarah Lotspeich, Wake Forest University

14. Small Area Estimation: Uses in Agricultural Experiments

In agricultural, field-based experiments it is common to take multiple subsamples of the response variable within a given experimental unit. This method is commonly deployed due to variability in observations within experimental units. These subsamples are often averaged together prior to analysis so that data analysis can be performed at the experimental unit level. Given that averaging reduces the associated variance, we explore the impact of this practice on the probability of making a type I error in varied simulated settings. Small area estimation is a category of techniques that can be used to provide more accurate estimates within a small area or domain. Model-based small area estimates combine direct sample data with auxiliary data. These estimates generally have lower mean squared error than the direct sample data alone and avoid the need for researchers to average subsample observations within plots. Small area estimation has been widely used in survey applications, but little work has been done in the context of designed experiments. This poster will explore potential benefits of using small area estimation models in designed experiments where multiple subsamples are taken. Ultimately, guidance will help practitioners design experiments that leverage the benefits of small area models, and the techniques will be demonstrated in a real data analysis based on simulated settings.

Presenting Author

Victoria Stanton, University of Kentucky

First Author

Victoria Stanton, University of Kentucky

CoAuthor

Katherine Thompson, University of Kentucky

15. Statistical Analysis of the Awareness, Perception and Practice of Exclusive Breastfeeding among Mothers in Ikorodu Local Government Area of Lagos, Nigeria

This study examined mothers' knowledge, awareness, perception, and practice of exclusive breastfeeding (EBF) in order to measure these concepts and to predict the probability that mothers breastfeed their babies exclusively. A descriptive cross-sectional design was adopted, and systematic random sampling was used to select 380 mothers across 3 local council development areas in Ikorodu; data were collected with a well-structured, self-administered questionnaire. Data were analysed using descriptive statistics, the Chi-square test, and logistic regression analysis, with the aid of the Statistical Package for Social Sciences (SPSS) version 25. The results show that the majority of mothers fall within the age group of 30-34 years, and most mothers are married (76.8%) with a good educational background. Mothers' knowledge of EBF was very good (89.9%), and most mothers are aware of EBF (84.5%). Most mothers did not give infants water or pre-lacteal feeds (75.9%), 71.6% introduced breast milk to infants within 1 hour of delivery, and the prevalence of EBF is 76.8%. A binary logistic regression analysis revealed that rooming-in practice (OR 2.683; 95% CI 1.266, 5.687), not offering pre-lacteal feeds before breast milk (OR 2.246; 95% CI 1.16, 4.35), and not introducing water before breast milk (OR 2.156; 95% CI 1.153, 4.03) were associated with a higher likelihood of EBF practice, while housewife status (OR 0.314; 95% CI 0.106, 0.929) was associated with a lower likelihood of EBF practice. The study therefore concluded that the knowledge, awareness, perception, practice, and prevalence of exclusive breastfeeding among mothers are strongly driven by rooming-in practice, the absence of pre-lacteal feeding, and the housewife status of the mothers.
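The odds ratios and confidence intervals in this abstract relate to logistic regression coefficients as sketched below. The sketch assumes the intervals are Wald intervals, symmetric on the log-odds scale, which the abstract does not state; the function names are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta, se, z=Z_95):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def beta_se_from_ci(lower, upper, z=Z_95):
    """Recover the coefficient and standard error implied by a Wald interval."""
    beta = (math.log(lower) + math.log(upper)) / 2
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Reported interval for rooming-in practice: 95% CI 1.266, 5.687.
beta, se = beta_se_from_ci(1.266, 5.687)
or_hat, ci_lower, ci_upper = odds_ratio_ci(beta, se)
```

The odds ratio implied by the interval's log-scale midpoint agrees closely with the reported value of 2.683, which is consistent with the Wald assumption made here.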

Presenting Author

Yemisi Olamide Ajiboye, Yaba College of Technology

First Author

Yemisi Olamide Ajiboye, Yaba College of Technology

CoAuthor(s)

Eunice Idowu, Yaba College of Technology
Emmanuel Ikegwu, Yaba College of Technology

16. Understanding demographic difference in interpretation of information across data visualization types

Data visualizations provide viewers important insights about data and topics of interest. When designing a data graphic, it is important to convey the information in a concise and effective manner; a well-designed graphic provides a clear message and helps the viewer understand the most important information in the chart. However, different types of charts can convey different messages depending on context and purpose. To study how information is interpreted by viewers, we used NORC's AmeriSpeak panel, engaging a nationally representative sample of US adults to answer questions about the information presented in charts. In the study we tested participants' ability to interpret charts by asking them to estimate the value of specific elements in the chart and to assess whether true/false statements were supported by the data. We incorporated various types of questions to represent multiple use cases for the visualization. The survey captured responses to each question and data on several different respondent demographics. We studied the connection between correctness of participant responses and demographic variables. The types of data visualizations shown to participants were varied between rounds of the study, so we could measure the effect of chart types while still evaluating responses on the same questions. We found significant differences in the correctness of answers across levels of educational attainment as well as differences across chart types shown to participants. In this presentation we discuss findings on the performance of various chart types in supporting effective interpretation of the information conveyed. We also review implications for designing data visualizations for a general audience. 

Presenting Author

Sydney Bell, NORC at the University of Chicago

First Author

Sydney Bell, NORC at the University of Chicago

CoAuthor(s)

Taylor Wing, NORC
Kiegan Rice, NORC at The University of Chicago
Heike Hofmann, Iowa State University

17. Wavelet Analysis and the financial performance of ESG Strategies

In this paper we apply wavelet analysis to investigate whether, and for how long, ESG-related information influences the financial performance of asset managers' investment strategies. This includes retrieving economic and firm-specific data from Refinitiv, identifying the best risk factor and ARIMA models for expected returns and forecasting, choosing adequate estimation techniques, and evaluating their specific results. Various estimation techniques are applied (pooled OLS regression, fixed effect, and random effect panel data analysis) to identify the technique that is in line with capital market theories. Furthermore, this approach addresses the question of whether risk factors are priced by the market and therefore associated with risk premiums, or whether ESG-related factors can be considered strategies that generate opportunities for alpha. Robust estimation techniques are applied to the collected data to discriminate between the effects of risk factors. The literature suggests that an improved financial performance of firms implementing ESG strategies only manifests itself in the long term. With data science methods (wavelet analysis), this insight is investigated within an event study. A common problem with this approach, however, is that expected returns have to be modelled using factor models like CAPM or APT. We use wavelet analysis in two ways. First, wavelet analysis is applied to improve the estimation of expected returns, which is necessary to identify risk factors. Second, wavelet analysis is used to examine the time period over which ESG-related information might be useful for generating outperformance. We therefore filter the return data and analyze the performance on a scale-by-scale basis. This approach allows us to discriminate between various time periods. This topic is ideally taught within a COIL structure. Interest in forming this type of cooperation would be highly appreciated.

Presenting Author

Michaela Kiermeier, University of Applied Sciences Darmstadt

First Author

Michaela Kiermeier, University of Applied Sciences Darmstadt