Speed Session 4

Conference: Women in Statistics and Data Science 2022
10/07/2022: 2:30 PM - 4:00 PM CDT
Speed 
Room: Grand Ballroom Salon G 

Chair

Katherine Allen, Eli Lilly

Presentations

01 - Set-based Methods to Identify Genetic Variants Associated with Falls and Fractures Susceptibility

Combining multiple outcomes in genetic association tests can increase statistical power while identifying key biomarkers that are associated with multiple traits. In this work, we develop a set-based inference method for jointly testing the association between multiple interval-censored outcomes and a group of genetic mutations, such as those in a gene or pathway. This variance-components score test only requires fitting the null model once, so it is well-suited for genome-wide application. Our work shows that combining multiple interval-censored outcomes can detect causal variants with increased power compared with tests that consider only a single outcome. We further validate the value of jointly testing multiple correlated interval-censored outcomes by testing number-of-falls and bone-fracture data from the UK Biobank for genetic effects. Fall susceptibility and fracture risk, which have been shown to be heritable traits, are important to investigate because of their prevalence in healthcare and the high cost associated with these outcomes. This application of our method identified genes previously reported to be associated with muscle- and bone-related diseases, demonstrating its potential to identify further genes associated with these outcomes.
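
As context for the variance-components score test, a minimal Python sketch of a SKAT-style statistic is given below. It assumes a continuous outcome and a linear null model, a simplification of the interval-censored, multi-outcome test developed in this work, and all data and weights are hypothetical.

```python
import numpy as np

def vc_score_statistic(y, X, G, w=None):
    """SKAT-style variance-component score statistic Q = r' G W G' r, where r are
    residuals from the null model (covariates only). Simplified illustration with a
    continuous outcome, not the interval-censored multi-outcome test in the abstract."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit the null model y ~ X once
    r = y - X @ beta                               # null residuals are reused for every gene
    if w is None:
        w = np.ones(G.shape[1])                    # per-variant weights
    Gw = G * np.sqrt(w)                            # apply weights to genotype columns
    return r @ Gw @ Gw.T @ r                       # large Q suggests association for the set

# toy usage: 500 subjects, intercept plus 2 covariates, 20 variants in one gene
rng = np.random.default_rng(0)
n, m = 500, 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
G = rng.binomial(2, 0.2, size=(n, m)).astype(float)
y = X @ np.array([1.0, 0.5, -0.3]) + 0.4 * G[:, 0] + rng.normal(size=n)
print(round(vc_score_statistic(y, X, G), 1))
```

In practice the null distribution of Q is a mixture of chi-squared variables, so a p-value is obtained from that mixture rather than from a single chi-squared reference.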

Presenting Author

Jaihee Choi, Rice University

First Author

Jaihee Choi, Rice University

CoAuthor

Ryan Sun, University of Texas, MD Anderson Cancer Center

02 - A novel artificial neural network estimator for AR(1) time series parameters

The class of ARMA(p,q) models is the archetypal family of statistical models for stationary time series. ARMA model parameters are usually estimated by the classical methods of maximum likelihood, maximum entropy (the Burg method), ordinary least squares, or moments. We focus on the simplest member of this class, the AR(1) model, and propose a machine learning estimator for its primary parameter based on an artificial neural network (ANN). The architecture of this ANN estimator includes many weights and hyperparameters that must be tuned to the given time series data set. Tuning (or training) the ANN requires a training data set with many time series samples labelled by the model parameter(s) that created them. In practice, though, only the original time series data are available. We overcome this problem by sampling from the parameter's posterior distribution to artificially generate training data. This novel Bayesian data generation scheme can produce training data sets of any size. The performance (bias and standard error) of the ANN estimator is compared with that of several classical approaches.
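
A minimal Python sketch of the training idea, using scikit-learn's MLPRegressor and, for simplicity, drawing the AR(1) parameter from a uniform range; the abstract's scheme instead samples the parameter from its posterior given the observed series. All settings here are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

def simulate_ar1(phi, n=100):
    """Generate one AR(1) series x_t = phi * x_{t-1} + e_t with unit-variance noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

# Training set: many series labelled by the phi that generated them
# (illustrative uniform draws rather than posterior draws).
phis = rng.uniform(-0.9, 0.9, size=2000)
X_train = np.array([simulate_ar1(phi) for phi in phis])
ann = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
ann.fit(X_train, phis)

# Estimate phi for a new series and compare with the lag-1 sample autocorrelation.
x_new = simulate_ar1(0.6)
phi_ann = ann.predict(x_new.reshape(1, -1))[0]
phi_acf = np.corrcoef(x_new[:-1], x_new[1:])[0, 1]
print(f"ANN estimate: {phi_ann:.2f}, lag-1 autocorrelation: {phi_acf:.2f}")
```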

Presenting Author

Angela Folz, University of Colorado Boulder

First Author

Angela Folz, University of Colorado Boulder

CoAuthor(s)

Michael Frey, National Institute of Standards & Technology
Mary Gregg, National Institute of Standards and Technology
Lucas Koepke, National Institute of Standards and Technology

03 - Comprehensive Analysis of Elastographic Liver Disease Biomarkers and Volatile Organic Compounds (VOCs) with Significant Undetectable Levels in NHANES

Nonalcoholic fatty liver disease (NAFLD) is a clinicopathologic diagnosis based on the presence of fat, with or without inflammation and fibrosis, in the liver. NAFLD spans a spectrum of simple steatosis, steatohepatitis, fibrosis, and cirrhosis. Volatile organic compounds (VOCs) are mostly man-made chemicals with high vapor pressure and low water solubility; they are commonly found in paints, dry-cleaning agents, industrial solvents, and even pharmaceuticals, and they are known to contaminate groundwater. We proposed to study the association of hepatic steatosis, liver fibrosis, and high-risk nonalcoholic steatohepatitis (NASH) with VOCs using the National Health and Nutrition Examination Survey (NHANES) 2017-2018. VOCs can be detected in blood, urine, and other body fluids; however, the VOC data available in NHANES are based on detection levels in serum only. Hepatic steatosis was measured using the Controlled Attenuation Parameter (CAP) score from Vibration Controlled Transient Elastography (VCTE) using FibroScan®, liver fibrosis was measured using the Liver Stiffness Measurement (LSM), and high-risk NASH was determined using the FAST score (calculated from CAP, LSM, and AST).
The VOC data in NHANES pose several challenges for statistical analysis because of inherent limitations, including non-normality, the large number of VOC variables, and low detection rates in serum for the majority of values. These analytical issues and remedial measures are described below with the help of a case study.
The case study had three main objectives: testing associations between a) VCTE measurements and demographic covariates, b) VOCs and demographic covariates, and c) VCTE measurements and VOCs in the presence of demographic covariates. For the first set of tests, LSM, CAP, and FAST were used as dependent variables; for the second, VOC was used as the dependent variable. For both sets of tests, the independent variables were age, gender, race, body mass index (BMI), diabetes, and alanine aminotransferase (ALT). For the third set, we investigated the association of LSM, CAP, and FAST with VOCs, using all VOCs and covariates as independent variables. The analysis was divided into two phases: a) traditional and b) non-traditional.
For the traditional analysis, normality tests were performed; LSM and FAST had skewed distributions whereas CAP was normally distributed. Therefore, univariable and multivariable analyses were conducted on log-transformed LSM and FAST values. In this analysis, missing VOC values were imputed, which resulted in a high number of constant values for the VOCs. Log transformation thus could not solve the non-normality, so non-parametric methods were chosen to analyze associations between VOCs and the covariates. Additionally, bivariate analyses were performed based on specific cutoffs for LSM (cutoff = 8.6 kPa for clinically significant fibrosis (CSF)), CAP (cutoff = 286 dB/m for any steatosis), and FAST (cutoff = 0.35 for NASH with fibrosis). In these bivariate analyses, chi-square tests and t-tests were used to assess associations with categorical and continuous variables, respectively.
In the non-traditional analysis, principal components were identified from the 40 VOCs present in the NHANES dataset; principal component analysis was used because the high dimensionality of the VOC predictors makes a regression model unreliable. A Bayesian kernel regression was then fitted to evaluate associations between elastographic liver disease biomarkers and the VOCs.
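
A minimal Python sketch of the dimension-reduction step, assuming a standardized log-scale VOC matrix; the data below are simulated placeholders and the 80% variance cutoff is an illustrative choice, not the criterion used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the NHANES serum VOC matrix: 1000 participants x 40 VOCs.
rng = np.random.default_rng(2)
voc = np.exp(rng.normal(size=(1000, 40)))            # skewed, positive concentrations

# Standardize (after a log transform) so no single VOC dominates the components.
voc_z = StandardScaler().fit_transform(np.log(voc))

pca = PCA()
scores = pca.fit_transform(voc_z)

# Retain enough components to cover, say, 80% of the variance; the component
# scores then replace the 40 correlated VOCs as regression predictors.
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.80)) + 1
print(f"{k} components explain {cum_var[k - 1]:.0%} of the variance")
```
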
Based on all the analyses described above, an RShiny application is being created that can be used by researchers conducting similar analyses.
Lastly, it would be beneficial to have guidance from NHANES regarding the analysis of such variables with a low frequency of detection, as it would help in achieving consistent and generalizable results.

Presenting Author

Rachana Lele

First Author

Rachana Lele

CoAuthor(s)

Matthew Cave, University of Louisville
Manjiri Kulkarni, University of Louisville
Shesh Rai, University of Louisville
Niharika Samala, Indiana University School of Medicine

04 - Extending Mediation Analysis to Within-subjects Data With Dichotomous Outcomes

Linear regression is used to model the relationship between predictor variables and a continuous outcome, while logistic regression is used for a dichotomous outcome. Mediation analysis uses linear regression or logistic regression to explain how an independent variable indirectly affects a dependent variable via a mediator. We know how to use linear regression to conduct mediation analysis for both between- and within-subjects data, but when using logistic regression, we only know how to conduct mediation analysis for between-subjects data. Breen et al. (2013) developed a logistic method for mediation analysis with between-subjects data and showed that the standard deviation of the residuals of the continuous mediation model could be used to determine a scale parameter that transforms the linear regression coefficients to the logistic regression coefficients, and vice versa. Using the methods from Breen et al. (2013), we derived equations to conduct mediation analysis for within-subjects data with a dichotomous outcome. We used three methods to validate our derived equations, involving both simulated and real data. For the simulation, we simulated within-subjects data with continuous outcomes using the equations from Montoya & Hayes (2017) to obtain the linear coefficients and calculated the scale parameter to transform the linear coefficients into logistic coefficients. The difference in outcomes was dichotomized and used with the derived equations to obtain the logistic coefficients. Comparing the logistic coefficients to the transformed logistic coefficients confirmed that the scale parameter can be used to transform the logistic regression coefficients based on the population parameters set in the simulation. Next, we used data from Montoya et al. (2013) and both parametric and nonparametric tests to validate our equations. In this study, participants viewed two class syllabi (within-subjects factor): one syllabus about a course that encouraged independent work and one about a course that encouraged group work. After viewing each syllabus, participants (N = 51) rated their interest in each class on a continuous scale from 1 (Not at all) to 7 (Extremely). We dichotomized the difference-in-interest variable so that we had both continuous and dichotomous measurements of the outcome variable for mediation analysis using linear and logistic regression. The mediator in this study was a measurement of the participants' communal goals on a continuous scale. We examined how participants' communal goals explain their interest in taking a course based on the syllabus. The continuous difference in interest was used to find the transformed logistic coefficients by dividing the linear coefficients by the estimated scale parameter. The dichotomized difference in interest was used to find the logistic coefficients using the derived equations. Because the equations from Breen et al. (2013) convert between population parameters while we only have sample estimates, we compared estimates to confidence intervals. We confirmed that the logistic coefficients were similar to the transformed logistic coefficients and fell within the confidence intervals for the transformed logistic coefficients. We computed bootstrapped 95% confidence intervals (N = 10,000) to further check whether the logistic estimate lies in the transformed confidence interval.
The extension of mediation analysis with logistic regression to within-subjects data will enable research using within-subjects designs and remove the restriction that variables be measured solely on a continuous scale. For example, when measuring one's interest in a course, a question that asks for a "yes" or "no" answer, such as "Are you interested in taking this course?", might be more valid than a question like "Rate your interest in taking this course on a scale from 1 to 7." Future research will extend these methods to two dichotomous outcomes and to dichotomous mediators.
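
A minimal Python sketch of the coefficient-rescaling idea on simulated difference scores. It uses the standard latent-variable approximation that the logistic error has standard deviation π/√3, so that β_logit ≈ β_linear · (π/√3)/σ, with σ the residual SD of the continuous model; this is an illustration in the spirit of Breen et al. (2013), not the exact equations derived in this work, and the variable names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500

# Simulated within-subjects difference scores: mediator difference M and
# continuous outcome difference Y.
M = rng.normal(size=n)
Y = 0.8 * M + rng.normal(scale=1.5, size=n)
Y_bin = (Y > 0).astype(int)                      # dichotomized outcome difference

# Linear model on the continuous difference.
lin = sm.OLS(Y, sm.add_constant(M)).fit()
sigma = np.sqrt(lin.scale)                       # residual SD of the continuous model

# Approximate rescaling from linear to logistic coefficients: the standard
# logistic error has SD pi/sqrt(3), so beta_logit ~ beta_linear * (pi/sqrt(3)) / sigma.
beta_transformed = lin.params[1] * (np.pi / np.sqrt(3)) / sigma

# Logistic model fitted directly to the dichotomized difference, for comparison.
logit = sm.Logit(Y_bin, sm.add_constant(M)).fit(disp=0)
print(f"transformed: {beta_transformed:.2f}, direct logistic: {logit.params[1]:.2f}")
```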

Presenting Author

Nickie Yang

First Author

Nickie Yang

CoAuthor(s)

Jessica Fossum, University of California, Los Angeles
Amanda Montoya, UCLA

05 - Handling Limit of Detection Values for Sepsis Biomarkers in Neonates

Background:
C-reactive protein (CRP) and procalcitonin (PCT) are commonly used sepsis biomarkers, though few studies have compared these biomarkers in very-low-birthweight (VLBW) infants. Such comparisons are essential for determining which may be the better biomarker for sepsis diagnosis. Our study includes infant data from two neonatal intensive care units in Tennessee. One study site did not report measures of CRP and PCT below a limit of detection (LOD), or clinical cut-off. Concentrations below the LOD, termed left-censored observations, can have large effects on the distribution of the data. We examined the data for neonates at their first sample date, day 0, and found that roughly 24% of the day 0 CRP values were at or below the LOD of 5.0 mg/L, and 3% of the day 0 PCT values were at or below the LOD of 0.10 ng/mL.

Methods:
Multiple approaches were considered for handling left-censored values. Simple substitution methods, such as replacing values at or below the LOD with LOD/2 or LOD/√2, were examined. Regression on order statistics (ROS), maximum likelihood estimation (MLE), and Kaplan-Meier estimation (KM) were the computational methods investigated to estimate the mean and standard deviation (SD) of the CRP and PCT values. The ROS method was used for further analyses to determine the effect of the LOD on summary statistics, correlation, and regression. ROS was implemented with the NADA package in R and was conducted for the site that had censored values, further split by study year to adjust for potential measurement differences. Censored values were imputed with ROS estimates and combined with the detected values to obtain a set of observations with no censoring. Linear regression model performance was measured with Akaike's information criterion (AIC) and deviance. AIC evaluates how well the model fits the data while penalizing model complexity; deviance is another goodness-of-fit measure that quantifies how much variation in the data the model accounts for. Lower AIC and deviance indicate a better fit. All analyses were conducted with R, version 4.1.1.
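
A minimal Python sketch of ROS imputation for a single detection limit, assuming lognormally distributed concentrations; the study itself used the NADA package in R, so this is only an illustration of the idea on hypothetical data.

```python
import numpy as np
from scipy import stats

def ros_impute(values, censored):
    """Simplified regression-on-order-statistics imputation for one LOD.
    `values` holds observed concentrations (censored entries hold the LOD itself);
    `censored` is True where the value is below the LOD. Assumes lognormal data and
    a single detection limit, so every censored value ranks below every detected one."""
    n, c = len(values), int(censored.sum())
    pp = (np.arange(1, n + 1) - 0.375) / (n + 0.25)      # Blom plotting positions
    q = stats.norm.ppf(pp)                               # corresponding normal quantiles
    detected = np.sort(np.log(values[~censored]))        # log-scale detected values
    slope, intercept, *_ = stats.linregress(q[c:], detected)
    out = values.astype(float).copy()
    out[censored] = np.exp(intercept + slope * q[:c])    # impute the below-LOD values
    return out

# toy example: CRP-like values with an LOD of 5.0 mg/L
rng = np.random.default_rng(4)
crp = np.exp(rng.normal(np.log(10), 1.0, size=200))
cens = crp < 5.0
crp_obs = np.where(cens, 5.0, crp)
crp_ros = ros_impute(crp_obs, cens)
print(f"mean before: {crp_obs.mean():.1f}, after ROS: {crp_ros.mean():.1f}")
```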

Results:
The effect of imputation using the ROS method was compared with using the LOD and LOD/√2 to replace left-censored values. Before performing ROS on the censored data, the overall mean (SD) for pairwise complete observations was 16.7 (27.9) for CRP and 8.5 (16.4) for PCT. After performing ROS, the overall mean (SD) was 15.4 (26.9) for CRP and 8.5 (16.4) for PCT. As expected, shrinkage of the mean was observed. Linear regression was used to model the association between CRP and PCT. Values of CRP and PCT were log transformed, and a restricted cubic spline with 3 knots was used on CRP to model the non-linear relationship. An interaction term was included between CRP and Site. The ROS model had a smaller AIC and deviance than the LOD/√2 model: 784.2 and 430.0 compared to 788.3 and 438.1, respectively. Furthermore, diagnostic plots of the ROS imputation regression model compared to the LOD and LOD/√2 regression models indicated that the ROS model more closely followed the normality assumption and no longer showed residual patterns due to the LOD. The attenuation of the mean and SD of CRP and PCT and the improved fit of the linear regression model with ROS compared to LOD/√2 indicate that imputing left-censored observations by ROS recovers information that is otherwise lost to the LOD.

Conclusion:
Left censoring is common in biomarker data, as values below a threshold may be deemed negligible. Nonetheless, to understand how these biomarkers can be used to predict sepsis diagnosis, we need to estimate these values and their distribution; ignoring the censoring leads to biased estimates. ROS has provided insight into the true distributions of the CRP and PCT values. Similar results will be presented for the KM, MLE, and multiple imputation methods.

Presenting Author

Tess Stopczynski, Vanderbilt University Medical Center

First Author

Tess Stopczynski, Vanderbilt University Medical Center

CoAuthor(s)

Gregory Ayers, Vanderbilt University School of Medicine
Jörn-Hendrik Weitkamp, Vanderbilt University Medical Center

06 - Withdrawn - Common factors in large panels of option prices

We propose a new factor model for multivariate tensor-valued data suitable for describing the joint dynamics of a cross-section of option prices on over 200 equities in the S&P 500 Index. The factors explain the common variation of all the options in the cross-section, and we model their dynamics in a standard time series context. In contrast, the factor loadings express the heterogeneous response to a common shock and are two-dimensional arrays (i.e., tensors). We propose an inference framework to test the significance of the loadings. Furthermore, we implement a tensor counterpart of the multivariate principal component model to deal with the heavy parametrization of the factor loadings, which enables us to extract trading signals that we use to design a dynamic trading strategy. Our results show that this strategy yields significantly higher profits than a mean-variance investment strategy, even when controlling for transaction costs.

Presenting Author

Maria Grith, Erasmus University Rotterdam

First Author

Maria Grith, Erasmus University Rotterdam

07 - The Spatial Clustering of Cardiovascular Mortality and Associated Risk Factors.

Background: Cardiovascular diseases are the leading cause of death both globally and in the US. Yet the prevalence of heart disease varies among regions and racial groups, prompting questions about which populations and regions are at higher risk. This study therefore explored the racial and regional distribution of heart disease and associated risk factors using new prototypes.
Methods: We created new prototypes using U.S. county data to show that rurality and plurality (the majority racial group) matter. For location (rurality), the U.S. Department of Agriculture Rural–Urban Commuting Areas (RUCCA) data were used and regrouped into (1) metropolitan; (2) non-metropolitan, adjacent to a metropolitan region; and (3) non-metropolitan, non-adjacent to a metropolitan region. For plurality, U.S. Census data were used. For heart disease and other health risk factors, the 2021 County Health Rankings data were used. Univariate statistics were computed to show the disparities in the prevalence of heart disease and associated risk factors across rurality and plurality.
Results: Health disparities exist along racial and ethnic lines as well as by location. Non-metropolitan counties, both adjacent and non-adjacent to a metropolitan region, disproportionately show higher heart disease prevalence. Counties with a majority American Indian population also showed the highest prevalence of smoking, obesity, uninsured residents, and other disadvantageous socioeconomic factors compared with other racial groups.
Conclusion: This work provides new prototypes showing that location and plurality matter, and it continues to highlight variations in heart disease and associated risk factors among racial groups in the U.S. Health care leaders and policy makers should be proactive in developing prevention strategies and response plans to manage and control poor health outcomes in vulnerable rural populations.

Presenting Author

Ruaa Al Juboori, Saint Louis University

First Author

Ruaa Al Juboori, Saint Louis University

CoAuthor(s)

Dipti Subramaniam, Saint Louis University
Divya Subramaniam, Saint Louis University School of Medicine
Ness Sandoval, Saint Louis University

08 - The Unequal Burden: Mapping New Prototypes related to Mortality and Vaccination of COVID-19

Background: Understanding how the burden of coronavirus disease 2019 (COVID-19) mortality varies within regions and communities will help inform and address public health challenges. Therefore, this research aimed to explore new prototypes to explain the association between location and COVID-19 health outcomes.
Methods: This study used publicly available data, including county-level COVID-19 case and death counts and vaccination data (through January 1, 2022), 2021 County Health Rankings data, United States Department of Agriculture Rural–Urban Commuting Areas (RUCCA) data, and 2020 presidential voting tallies. Spatial regression models were used to examine whether RUCC, racial diversity, health, behavioral, social, economic, political opinion, and public health policy factors are associated with COVID-19 mortality rates (across four waves of the pandemic and the overall period). The spatial weight matrix was constructed using queen contiguity. All analyses were performed in R, with significance defined as p < 0.05.
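
The spatial regression models themselves are not reproduced here, but the sketch below shows how a queen-contiguity weight matrix enters a standard spatial statistic, Moran's I, on a hypothetical grid of counties; it illustrates the weighting scheme only, not the models fitted in this study.

```python
import numpy as np

def morans_i(y, W):
    """Moran's I spatial autocorrelation statistic for values y and a binary
    contiguity matrix W (queen contiguity: 1 when two counties share an edge
    or a corner, 0 otherwise)."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

# toy example: a 3x3 grid of counties with queen contiguity (Chebyshev distance 1)
coords = np.array([(i, j) for i in range(3) for j in range(3)])
cheb = np.max(np.abs(coords[:, None, :] - coords[None, :, :]), axis=2)
W = (cheb == 1).astype(float)

mortality = coords.sum(axis=1).astype(float) * 50 + 300   # smooth spatial trend, per 100k
print(f"Moran's I: {morans_i(mortality, W):.2f}")          # positive: neighbors are similar
```
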
Results: A total of 3,107 counties were included in this study. County-level COVID-19 health disparities exist along racial and ethnic lines as well as by location. Although the general trend of mortality declined, counties with majority minority populations showed higher mortality than counties with a majority White population across RUCCA categories. Non-metropolitan counties not adjacent to a metropolitan region with a majority American Indian population showed the highest mortality in 2020 compared with other counties. Counties with a majority Black population achieved smaller shares of vaccinations relative to their shares of mortality, whereas counties with a majority American Indian population achieved higher vaccination rates than other counties. Political standing also played a significant role in COVID-19 mortality.
Conclusion: Geography continues to highlight variations in health outcomes among racial groups in the U.S., as minorities in rural areas had higher mortality. Health care leaders and policymakers should be proactive in developing targeted prevention strategies and response plans to manage poor health outcomes in vulnerable rural populations.

Presenting Author

Ruaa Al Juboori, Saint Louis University

First Author

Ruaa Al Juboori, Saint Louis University

CoAuthor(s)

Ness Sandoval, Saint Louis University
Divya Subramaniam, Saint Louis University School of Medicine

09 - Tie-breakers for Sign Test

The goal of this project is to investigate tie-breaking methods for the sign test, a one-sample non-parametric test. Non-parametric tests are widely used in statistical analysis because they make minimal assumptions about the distribution from which the data come. One non-parametric test for a one-sample median is the sign test. In practice, there may be observations in the sample equal to the hypothesized median; these are called "tied" observations. Several methods have been proposed to break such ties so that the sign test can be applied. In this project, we started by studying some popular tie-breaking methods. We investigated and compared how these methods perform for different sample sizes and different population distributions (known and unknown) by comparing coverage probabilities and power. The project is further extended to the Wilcoxon signed-rank test, another non-parametric one-sample test. The optimal tie-breaking method under the sign test and the Wilcoxon signed-rank test will be determined for each case.
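
A minimal Python sketch of the sign test with two illustrative tie-handling rules, discarding tied observations versus randomly assigning each tie a sign; the specific tie-breaking methods compared in the project may differ.

```python
import numpy as np
from scipy.stats import binomtest

def sign_test(x, median0, ties="discard", rng=None):
    """Two-sided sign test of H0: median = median0 with two tie-handling rules:
    'discard' drops observations equal to median0, while 'random' assigns each
    tie to the positive or negative side with probability 1/2."""
    signs = np.sign(x - median0)
    if ties == "discard":
        signs = signs[signs != 0]
    elif ties == "random":
        rng = rng or np.random.default_rng()
        signs[signs == 0] = rng.choice([-1.0, 1.0], size=int((signs == 0).sum()))
    n_pos = int((signs > 0).sum())
    return binomtest(n_pos, len(signs), p=0.5).pvalue

# example with ties: several observations equal the hypothesized median of 10
x = np.array([7, 9, 10, 10, 10, 11, 12, 13, 14, 15], dtype=float)
print(sign_test(x, 10, ties="discard"))
print(sign_test(x, 10, ties="random", rng=np.random.default_rng(5)))
```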

Presenting Author

Rachael Goodwin, Western Washington University

First Author

Rachael Goodwin, Western Washington University

CoAuthor(s)

Alex Hutchinson, Western Washington University
Ramadha Piyadi Gamage, Western Washington University

10 - Batch effect in microbiome data

Microbiome studies have been gaining enormous popularity among scientists for characterizing human health and disease. While many statistical tools that work well for other high-dimensional data, such as gene-expression data, can be applied, attention must be paid to the compositionality of microbiome data, i.e., relative abundances derived from taxon counts. With such data, reproducibility is difficult to achieve, so we aim to examine batch effects, i.e., systematic bias from datasets collected at different sites or times. In microbiome experiments, combining several data sets is often considered for the sake of statistical power, in the hope of discovering reliable biomarkers and establishing more robust prognostic models. The unique challenge in microbiome data, however, is the sum-to-one constraint: relative abundances are vulnerable to a different set of microbiota being measured in a different experiment. For example, certain transformations into Euclidean space are not robust to sub-compositionality. Therefore, simply adding samples measured on a different subset of features risks being misleading rather than gaining power. In this talk, we aim to provide helpful advice on the use of statistical methods in multi-batch situations, covering sub-compositionality, false-discovery rates, and dependency among features.
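
A minimal Python sketch of the sub-compositionality issue using the centered log-ratio (CLR) transform, one common way of mapping compositions into Euclidean space; the taxon counts are hypothetical.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform: convert counts to relative abundances, then
    subtract each sample's mean log abundance (its log geometric mean)."""
    comp = (counts + pseudocount) / (counts + pseudocount).sum(axis=1, keepdims=True)
    logc = np.log(comp)
    return logc - logc.mean(axis=1, keepdims=True)

# one sample measured on 6 taxa, and the same sample in a batch that only
# profiles the first 4 taxa (a sub-composition)
full = np.array([[120, 30, 15, 60, 200, 75]])
sub = full[:, :4]

# The CLR values of the shared taxa differ between the two versions, which is
# why naively pooling batches with different taxon sets can mislead.
print(np.round(clr(full)[0, :4], 2))
print(np.round(clr(sub)[0], 2))
```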

Presenting Author

Jung Ae Lee, University of Massachusetts Chan Medical School

First Author

Jung Ae Lee, University of Massachusetts Chan Medical School

11 - A robust outlier detection approach for scrubbing artifacts in fMRI

Functional magnetic resonance imaging (fMRI) data can be artificially contaminated for both participant- and hardware-related reasons. In fMRI-based studies, it is therefore necessary to identify artifactual volumes. These are often excluded from analysis, a procedure known as "scrubbing" or "censoring". Such volumes contain abnormal signal intensities and can be thought of as multivariate outliers in statistical terminology. Many outlier-detection approaches exist for multivariate data and for fMRI data specifically. However, these methods either are non-robust or do not use a statistically principled approach to thresholding. Robust distance (RD) approaches adapted from the Mahalanobis distance are promising but depend on assumptions of Gaussianity and independence, which we observe to be clearly violated in the fMRI context. When these assumptions are violated, the distribution of the RDs is unknown, preventing us from obtaining a quantile-based threshold for outliers. In this work, we develop a robust nonparametric bootstrap procedure to estimate an upper quantile of the distribution of RDs, which serves as the threshold for outliers. We compare the performance of our RD-based approach with existing scrubbing approaches for fMRI data using five resting-state fMRI sessions with high levels of artifacts from the Human Connectome Project.
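
A simplified Python sketch of the idea, pairing robust Mahalanobis distances (scikit-learn's minimum covariance determinant) with a nonparametric bootstrap estimate of an upper quantile as the scrubbing threshold; the simulated data, quantile level, and bootstrap details are illustrative and do not reproduce the procedure developed in this work.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(6)

# hypothetical stand-in for an fMRI session: 200 volumes x 10 summary components,
# with two artifactual volumes injected as large signal shifts
X = rng.normal(size=(200, 10))
X[[20, 150]] += 6.0

# robust (squared) Mahalanobis distances from a minimum covariance determinant fit
rd = MinCovDet(random_state=0).fit(X).mahalanobis(X)

# nonparametric bootstrap estimate of an upper quantile of the RD distribution,
# used as the threshold in place of a chi-squared cutoff
B, q = 2000, 0.99
boot_q = np.array([np.quantile(rng.choice(rd, size=len(rd), replace=True), q)
                   for _ in range(B)])
threshold = boot_q.mean()

print("volumes flagged for scrubbing:", np.flatnonzero(rd > threshold))
```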

Presenting Author

Fatma Parlak

First Author

Fatma Parlak

12 - Adversarial contamination of network data

As graph data becomes more ubiquitous, the need for robust inferential graph algorithms to operate in these complex data domains is crucial. In many cases of interest, inference is further complicated by the presence of adversarial data contamination. The effect of the adversary is frequently to change the data distribution in ways that negatively affect statistical inference and algorithmic performance. We study this phenomenon in the context of vertex nomination, a semi-supervised information retrieval task for network data. Here, a common suite of methods relies on spectral graph embeddings, which have been shown to provide both good algorithmic performance and flexible settings in which regularization techniques can be implemented to help mitigate the effect of an adversary. Many current regularization methods rely on direct network trimming to effectively excise the adversarial contamination, although this direct trimming often gives rise to complicated dependency structures in the resulting graph. We propose a new trimming method that operates in model space, which is more amenable to theoretical analysis and demonstrates superior performance in a number of simulations. We then extend this method to a more general setting, where the network is contaminated through both block structure contamination and white noise contamination (contamination whose distribution is unknown).
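
A minimal Python sketch of adjacency spectral embedding, the kind of spectral graph embedding referenced above, on a toy two-block stochastic block model; the model-space trimming proposed in this work is not shown.

```python
import numpy as np

def adjacency_spectral_embedding(A, d=2):
    """Embed each vertex of an undirected graph as a d-dimensional vector using the
    top-d eigenpairs (by magnitude) of the adjacency matrix: U_d |S_d|^{1/2}."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

# toy two-block stochastic block model; vertices in the same block should land
# close together in the embedding
rng = np.random.default_rng(7)
n = 100
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], 0.30, 0.05)
A = np.triu(rng.binomial(1, P), 1)
A = A + A.T                                             # symmetric, no self-loops
Xhat = adjacency_spectral_embedding(A, d=2)
print("block centroids:")
print(np.round([Xhat[labels == k].mean(axis=0) for k in (0, 1)], 2))
```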

Speaker

Sheyda Peyman

13 - Fast resampling methods for massive Generalized Linear Models

The residual bootstrap is a widely used method in the context of linear regression for assessing the quality of relevant estimators. Moulton & Zeger (1990) extended the idea of the residual bootstrap to the class of generalized linear models (GLMs), a wider class that includes the linear regression model along with other commonly used models such as logistic, Poisson, and probit regression. However, with massive datasets becoming more and more common, ordinary residual bootstrap techniques are turning out to be computationally demanding and hence less feasible. Some computationally efficient alternatives to the bootstrap exist in the literature, such as the 'm out of n bootstrap' by Bickel et al. (2012), the 'Bag of Little Bootstraps' by Kleiner et al. (2014), and the 'Subsampled Double Bootstrap' by Sengupta et al. (2016). However, the residual bootstrap is not yet known to have direct extensions to these methods. In our work, we introduce a Subsampled Residual Bootstrap (SRB) strategy applicable to GLMs, which is much more computationally efficient than the residual bootstrap and hence more feasible under a stringent time budget. We establish the consistency of SRB estimators under mild assumptions. Finally, we demonstrate the computational advantages of our method through numerical simulations.
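
A minimal Python sketch of the classical residual bootstrap for a linear model, the baseline idea that SRB accelerates; the subsampled GLM version introduced in this work is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(8)

# simulated linear-model data (the linear model is the simplest GLM member)
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# fit once, keep the fitted values and the centered residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted
resid -= resid.mean()

# classical residual bootstrap: resample residuals, rebuild responses, refit.
# (SRB instead draws residuals from a small subsample in each replicate to cut
# the per-replicate cost; that variant is not shown here.)
B = 500
boot_betas = np.empty((B, p + 1))
for b in range(B):
    y_star = fitted + rng.choice(resid, size=n, replace=True)
    boot_betas[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)

print("bootstrap standard errors:", np.round(boot_betas.std(axis=0), 3))
```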

Presenting Author

Indrila Ganguly, North Carolina State University

First Author

Indrila Ganguly, North Carolina State University

CoAuthor

Srijan Sengupta, North Carolina State University