Sunday, Aug 4: 8:30 PM - 9:25 PM
6004
Invited Posters
Oregon Convention Center
Room: CC-Hall CD
Presentations
New alternative methods for rapid toxicity screening of chemicals require new statistical methodologies that appropriately synthesize the large amount of data collected. Transcriptomic assays can be used to assess the impact of a chemical on thousands of genes, but current approaches to analyzing these data treat each gene separately and do not allow information to be shared among genes within pathways. Furthermore, the methods employed are fully parametric and do not account for changes in distribution shape that may occur at high exposure levels. To address these limitations, we propose Constrained Logistic Density Regression (COLDER) to model expression data from different genes simultaneously. Under COLDER, the dose-response function for each gene is assigned a prior via a discrete logistic stick-breaking process whose weights depend on gene-level characteristics and whose atoms are dose-response functions subject to a shape constraint that ensures biological plausibility. The posterior distribution of the benchmark dose for genes within the same pathway can be estimated directly from the model, which is another advantage over current methods.
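As a rough illustration of the kind of prior described above (not the COLDER implementation; all settings and names below are hypothetical), the following sketch draws monotone dose-response curves from a truncated logistic stick-breaking mixture whose weights depend on a gene-level covariate:

```python
# Illustrative sketch of a logistic stick-breaking prior over monotone
# dose-response curves; hypothetical settings, not the COLDER implementation.
import numpy as np

rng = np.random.default_rng(1)
doses = np.linspace(0, 10, 21)   # dose grid on which curves are evaluated
K = 10                           # truncation level of the stick-breaking process

# Atoms: monotone nondecreasing curves built from nonnegative increments,
# standing in for the biological shape constraint mentioned in the abstract.
atoms = np.cumsum(rng.gamma(shape=0.5, scale=0.2, size=(K, doses.size)), axis=1)

def stick_breaking_weights(x, alpha, beta):
    """Logistic stick-breaking weights that depend on a gene-level covariate x."""
    v = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))   # logistic "breaks"
    v[-1] = 1.0                                     # close the sticks at truncation
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                            # weights sum to one

alpha, beta = rng.normal(size=K), rng.normal(size=K)

def draw_gene_curve(x):
    """Draw one gene's dose-response curve given its covariate x."""
    w = stick_breaking_weights(x, alpha, beta)
    return atoms[rng.choice(K, p=w)]

curve = draw_gene_curve(x=0.3)
assert np.all(np.diff(curve) >= 0)   # monotone by construction
```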
In causal inference, data may be collected from multiple sources, such as experimental and observational studies. Experimental studies often lack external validity due to the limitations of the studies, while observational studies are usually broad enough to be representative of the target populations but often lack internal validity because of inevitable uncontrolled confounders. Recently, there has been much discussion of integrating experimental and observational studies to make causal inference more efficient. In this poster, we introduce a semiparametric approach based on the density ratio model (DRM) to exploit the complementary features of the two types of studies. The DRM is known for its ability to efficiently account for latent structures among multiple interconnected populations. If the related studies share common measurements for the same causal effect, the collected datasets can naturally be expected to come from similar, connected populations, so it is advantageous to analyze them jointly. We also study several estimators of the causal effect, not only from the mean but also from distributional perspectives.
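For context, a generic textbook form of the density ratio model (not necessarily the exact specification used in this work) links the density of each study's sample to a common baseline density through a log-linear tilt:

$$
f_k(x) = \exp\{\alpha_k + \boldsymbol{\beta}_k^{\top} q(x)\}\, f_0(x), \qquad k = 1, \dots, K,
$$

where $f_0$ is an unspecified baseline density, $q(x)$ is an analyst-chosen basis function, and the tilt parameters $(\alpha_k, \boldsymbol{\beta}_k)$ are typically estimated jointly by empirical likelihood, which is how information is pooled across the connected samples.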
Precision medicine endeavors to tailor therapeutic interventions to the individuals being treated and needs to account for the heterogeneity of treatment benefit among patients and patient subpopulations. In oncology, basket trials have emerged as a popular design to better address the goals of precision medicine by testing the effectiveness of a therapeutic strategy among patients defined by the presence of a particular biomarker target rather than by cancer type, with treatment effectiveness evaluated within "baskets" that represent a partition of the targeted patient population. These trials have unique statistical and design considerations. In this poster, we highlight how these considerations impact trial operating characteristics, how they can be leveraged to account for uncertainty at the design stage, and how to further improve efficiency by incorporating interim monitoring strategies.
Valid intermediate endpoints may serve as surrogate markers for a clinical outcome, allowing a randomized trial to be conducted more efficiently. This work aims to develop causal inference methods that can determine whether repeated measures of biomarkers throughout a type 1 diabetes trial could be used in place of the current primary endpoints, which are also collected longitudinally. The proposed methods use observed and counterfactual outcomes to capture the trajectories of individuals and evaluate such endpoints. The framework relies on potential outcomes and principal stratification, implemented via mixed models. Ultimately, this allows us to assess the validity of the endpoint by calculating a causal effect predictiveness curve from the distribution of random effects for both the surrogate and clinical endpoints.
Second-generation p-values (SGPVs) were proposed to address the imperfections of classical p-values. They maintain the favorable properties of classical p-values while emphasizing scientific relevance to expand their utility, functionality, and applicability. They report evidence in favor of the alternative, in favor of the null hypothesis, or neither (inconclusive); they automatically incorporate an adjustment for multiple comparisons; they have lower false discovery rates; and they are easier to interpret. The most crucial component of an SGPV analysis is choosing the indifference zone. In practice, this is not easily done, as statisticians and collaborators do not always agree. We explore how choosing different indifference zones affects the SGPVs' statistical properties, and we propose allowing the indifference zone to 'shrink' in cases of collaborator uncertainty. We demonstrate that when a wide but uncertain indifference zone is identified in small samples, shrinking it balances the errors between the behaviors of a fixed zone and a point null. This trade-off leads to improved communication between statisticians and collaborators when planning an SGPV analysis.
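For readers unfamiliar with the quantity being tuned here, the sketch below computes an SGPV from an interval estimate and an indifference zone using the published SGPV formula; the numerical values are made up and do not come from the poster.

```python
# Minimal sketch of a second-generation p-value for an interval estimate
# (lo, hi) and an interval null / indifference zone (null_lo, null_hi),
# following the published SGPV formula; the numbers below are made up.
def sgpv(lo, hi, null_lo, null_hi):
    interval_len = hi - lo
    null_len = null_hi - null_lo
    overlap = max(0.0, min(hi, null_hi) - max(lo, null_lo))
    # Correction so that very wide, uninformative intervals give values near 1/2.
    correction = max(interval_len / (2.0 * null_len), 1.0)
    return (overlap / interval_len) * correction if interval_len > 0 else None

# 0 favors the alternative, 1 favors the null, intermediate values are
# inconclusive; shrinking (null_lo, null_hi) moves behavior toward a point null.
print(sgpv(lo=0.8, hi=2.4, null_lo=-1.0, null_hi=1.0))   # 0.125
```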
Healthy foods are essential for a healthy life, but not everyone has the same access to healthy foods, leading to disproportionate rates of disease in low-access communities. Current methods to quantify food access rely on distance measures that are either computationally simple (the shortest straight-line route) or accurate (the shortest map-based route), but not both. We combine these food access measures through a multiple imputation for measurement error framework, leveraging information from the less accurate straight-line distances to impute (i.e., compute informative placeholders for) the more accurate food access measure in neighborhoods without map-based distances. Thus, computationally expensive map-based distances are only needed for a subset of neighborhoods. Using simulations and data for Forsyth County, North Carolina, we quantify and compare the associations between the prevalence of various health outcomes and neighborhood-level food access. Through imputation, predicting the full landscape of food access for all neighborhoods in an area is also possible without requiring map-based measurements for all neighborhoods.
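To make the workflow concrete, here is a hedged sketch of the measurement-error imputation idea on simulated data: the accurate distance is observed only on a subset, a simple linear imputation model fills in the rest, and estimates are pooled with Rubin's rules. Variable names, the imputation model, and the data are illustrative, and a full implementation would also draw the imputation-model parameters rather than fixing them.

```python
# Hedged sketch of multiple imputation for an error-prone access measure:
# map-based distance is measured on a subset, straight-line distance everywhere.
import numpy as np

rng = np.random.default_rng(0)
n = 500
straight = rng.gamma(2.0, 1.5, n)                  # cheap straight-line distances (all tracts)
map_dist = 1.2 * straight + rng.normal(0, 0.5, n)  # costly map-based distances (gold standard)
observed = rng.random(n) < 0.3                     # map-based distance queried for ~30%
y = 10 - 0.8 * map_dist + rng.normal(0, 1, n)      # neighborhood health outcome measure

# Fit the imputation model on the validated subset.
X_obs = np.column_stack([np.ones(observed.sum()), straight[observed]])
coef, *_ = np.linalg.lstsq(X_obs, map_dist[observed], rcond=None)
resid_sd = np.std(map_dist[observed] - X_obs @ coef)

M, betas, variances = 20, [], []
for _ in range(M):
    imputed = map_dist.copy()
    miss = ~observed
    imputed[miss] = coef[0] + coef[1] * straight[miss] + rng.normal(0, resid_sd, miss.sum())
    X = np.column_stack([np.ones(n), imputed])     # outcome model on the completed data
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = (y - X @ b) @ (y - X @ b) / (n - 2)
    betas.append(b[1])
    variances.append(np.linalg.inv(X.T @ X)[1, 1] * sigma2)

# Rubin's rules: pooled slope and total variance across the M imputations.
qbar, ubar, bvar = np.mean(betas), np.mean(variances), np.var(betas, ddof=1)
print(qbar, np.sqrt(ubar + (1 + 1 / M) * bvar))
```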
Drug safety data present many challenges regarding data curation, analysis, interpretation, and reporting. Visual analytics offers an alternative to traditional tabular outputs for exploring, assessing, and reporting safety data and presents an opportunity to enhance and facilitate the evaluation of drug safety. Graphical depictions of safety data can facilitate better communication of drug safety findings by blending data visualization, statistical, and data mining techniques to create visualization modalities that help users make sense of safety data. Readily available tools for visual analytics of drug safety data are desirable; such tools should support structured assessment driven by safety questions of interest and pay careful attention to the user interface. This poster will highlight an R package, CVARS, for generating interactive forest and volcano plots for adverse events and FDA Medical Queries (FMQs) analysis outputs for inclusion in submissions to the FDA. This work is based on ongoing collaboration among the ASA, PHUSE, and the FDA.
Instacart is a small but complicated company, with a four-sided marketplace (customers, shoppers, retailers, and advertisers), making for interesting data science problems. We run many experiments to test improvements, and we want to run them faster. We reduce variance using covariate adjustment to correct for differences in covariates, such as pre-period metrics, between control and treatment groups. However, there are difficulties in practice. We cannot run a regression on the full dataset, so we combine regression on a subset with updates on the full data, with minimal loss of variance reduction. Ratio metrics (e.g., sum w_i y_i / sum w_i) present additional difficulties: weighted regression is inconsistent, while performing separate linear regressions for the numerator and denominator is consistent but gives poor variance reduction. A multiplicative+additive model is efficient and consistent, but has no closed-form solution and is ill-conditioned, requiring a creative implementation.
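The sketch below shows the basic pre-period covariate-adjustment (CUPED-style) idea on simulated data. It is only an illustration of why adjustment speeds up experiments; it is not Instacart's production estimator and does not address the subset-regression or ratio-metric complications discussed above.

```python
# Minimal sketch of pre-period covariate adjustment (CUPED-style) on simulated
# data; illustrates the variance-reduction idea only.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
pre = rng.gamma(2.0, 25.0, n)                        # pre-period metric per user
treat = rng.random(n) < 0.5
y = 0.6 * pre + rng.normal(0, 20, n) + 1.0 * treat   # experiment metric, true lift = 1.0

diff = y[treat].mean() - y[~treat].mean()            # unadjusted difference in means

# Adjust the metric by its regression on the pre-period covariate.
cov = np.cov(pre, y)
theta = cov[0, 1] / cov[0, 0]
y_adj = y - theta * (pre - pre.mean())
diff_adj = y_adj[treat].mean() - y_adj[~treat].mean()

print(f"unadjusted: {diff:.2f}, adjusted: {diff_adj:.2f}")
print(f"variance ratio: {y_adj.var() / y.var():.2f}")  # well below 1 => faster experiments
```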
Motivated by the important need for computationally tractable statistical methods in high dimensional spatial settings, we develop a distributed and integrated framework for estimation and inference of Gaussian model parameters with ultra-high-dimensional likelihoods. We propose a paradigm shift from whole to local data perspectives that is rooted in distributed model building and integrated estimation and inference. The framework's backbone is a computationally and statistically efficient integration procedure that simultaneously incorporates dependence within and between spatial resolutions in a recursively partitioned spatial domain. Statistical and computational properties of our distributed approach are investigated theoretically and in simulations. The proposed approach is used to extract new insights on autism spectrum disorder from the Autism Brain Imaging Data Exchange.
Nonignorable missing data occur when missing values are associated with an outcome of interest. For example, in electronic health record data, a laboratory variable may be missing because a patient was too sick for it to be measured. A simple method for handling nonignorable missing data is to include indicator variables for whether a value is missing, known as the missing indicator method. To date, there is little guidance about using the missing indicator method for longitudinal data with nonignorable missing values. We conduct a simulation study to investigate whether the missing indicator method is beneficial for imputing and modeling longitudinal data with nonignorable missingness. Using simulated data that mimic electronic health record data for repeated measures of falls in older adults, we found that including missing indicators in imputation or modeling did not substantially impact the accuracy of imputations; however, use of missing indicators resulted in a slightly higher area under the receiver operating characteristic curve (0.921) compared to models without missing indicators (0.886) when averaged across the simulation runs.
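For intuition, here is a toy sketch of the missing indicator method for a longitudinal laboratory predictor: missing values are filled in and a 0/1 indicator of missingness is retained as an extra covariate. The variable names, the mean imputation, and the missingness mechanism are invented for illustration and are not the study's simulation design.

```python
# Toy sketch of the missing indicator method for a longitudinal lab value.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
long = pd.DataFrame({
    "id": np.repeat(np.arange(200), 4),
    "visit": np.tile(np.arange(4), 200),
    "lab": rng.normal(50, 10, 800),
})
# Nonignorable mechanism: sicker patients (low lab values) are more often unmeasured.
p_miss = 1 / (1 + np.exp((long["lab"] - 45) / 3))
long.loc[rng.random(800) < p_miss, "lab"] = np.nan

long["lab_missing"] = long["lab"].isna().astype(int)           # the missing indicator
long["lab_imputed"] = long["lab"].fillna(long["lab"].mean())   # simple mean imputation

# Both columns would then enter the imputation and/or outcome model (e.g., a
# mixed-effects or GEE model of falls); that modeling step is omitted here.
print(long.head())
```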
The LISA 2020 Global Network is a structured community of statistics and data science collaboration laboratories ("stat labs") and individuals in developing countries. The 35 stat labs in the network work together to train the next generation of collaborative statisticians and data scientists; collaborate at the intersections of data-driven development with researchers, data producers, and decision-makers to make a positive impact on society; and teach short courses and workshops to improve statistical skills and data literacy widely. In this invited poster we present lessons learned from Brazil, Africa, South Asia, and Indonesia about creating and sustaining stat labs to transform evidence into action for the benefit of society.
Speaker
Eric Vance, LISA, University of Colorado-Boulder
Climate change detection and attribution have played a central role in establishing the influence of human activities on climate. Optimal fingerprinting has been widely used in detection and attribution analyses of climate change. The reliability of the method depends critically on proper point and interval estimation of the regression coefficients. The confidence intervals constructed from the prevailing total least squares (TLS) method have been reported to be too narrow to match their nominal confidence levels. We propose a novel framework to estimate the regression coefficients based on an efficient, bias-corrected estimating equations approach. The confidence intervals are constructed with a pseudo residual bootstrap variance estimator that takes advantage of the available control runs. Our regression coefficient estimator is unbiased, with a smaller variance than the TLS estimator. Our estimate of the sampling variability of the estimator has a low bias compared to that from TLS. The resulting confidence intervals for the regression coefficients have coverage rates close to the nominal level, which ensures valid inferences in detection and attribution analyses.
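For context, the errors-in-variables regression that underlies optimal fingerprinting is commonly written as

$$
Y = \sum_{i=1}^{m} (X_i - \nu_i)\,\beta_i + \varepsilon,
$$

where $Y$ is the observed change pattern, $X_i$ is the model-simulated fingerprint of the $i$th external forcing contaminated by internal variability $\nu_i$, $\varepsilon$ is internal variability in the observations, and the scaling factors $\beta_i$ are the regression coefficients whose point and interval estimates drive detection and attribution conclusions. This is the standard formulation from the literature, not necessarily the exact notation used in this work.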
Speaker
Jun Yan, University of Connecticut
Part of the mission of the University of Minnesota's Biostat Community Outreach and Engagement (BCOE) committee is to provide outreach to the Twin Cities' K-12 public schools. In this work, BCOE has adopted a community-based participatory framework in which we build relationships with local K-12 public school teachers, collaborate to understand the teachers' needs, and develop biostatistics resources that can be integrated into the existing educational structure. We share two examples of this process in practice: the development of an air quality app for high school students and the integration of biostatistics consultants into a high school biomedical research course. In the former, BCOE partnered with high school Earth science teachers to develop an R Shiny app to empower students to easily engage with Minnesota's publicly available air pollution data. In the latter, BCOE members served as biostatistical consultants for student projects in a high school biomedical research course. Drawing from BCOE's varied collaborations, we also share suggested practices for successful community-based participatory K-12 biostatistics education.
A causal decomposition analysis allows researchers to determine whether the difference in a health outcome between two groups can be attributed to a difference in each group's distribution of modifiable mediator variables. With this knowledge, researchers and policymakers can focus on designing interventions that target these mediator variables. Existing methods either focus on one mediator variable or assume that each is conditionally independent given the group label and the mediator-outcome confounders. In this work, we propose a flexible method that can accommodate multiple correlated and interacting mediator variables, which are frequently seen in studies of health behaviors and environmental pollutants. Further, we state the causal assumptions needed to identify both joint and path-specific decomposition effects through each mediator variable. To illustrate the reduction in bias and confidence interval width of the decomposition effects, we perform a simulation study and apply our approach to examine whether differences in smoking status and dietary inflammation score explain any of the Black-White differences in incident diabetes.
Few analytic results exist in the current literature for the power of interaction tests in the context of clinical trial design. The power for the interaction between the treatment and a binary subgroup indicator is computed for trials with continuous, time-to-event, or binary outcomes. Quantitative interactions, in which the treatment effect is heterogeneous but positive in both subgroups, are assumed. Conditional on parameters of the overall design, for normal outcomes the power for a subgroup interaction can be expressed as a function of the effect and sample size in the more positive subgroup. For quantitative interactions, when the more positive subgroup has a larger effect size, power for the interaction is low. When the more positive subgroup has a smaller sample size but a large treatment effect, power can be equal to or greater than the power for the overall effect, though this situation is probably not common. Better appreciation of the power for interaction tests in clinical trials may lead to a clearer understanding of related design issues and better trial design.
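As a back-of-the-envelope companion to the normal-outcome case, the sketch below computes power for a treatment-by-subgroup interaction with the usual two-sided Wald approximation; the effect sizes and per-arm sample sizes are illustrative choices, not figures from the poster.

```python
# Approximate power for a treatment-by-subgroup interaction, normal outcome,
# equal per-arm allocation within each subgroup; illustrative numbers only.
from math import sqrt
from scipy.stats import norm

def interaction_power(delta1, delta2, n1, n2, sigma=1.0, alpha=0.05):
    """Power to detect delta1 - delta2 with n_g patients per arm in subgroup g."""
    se = sigma * sqrt(2.0 / n1 + 2.0 / n2)   # SE of the difference in treatment effects
    z = abs(delta1 - delta2) / se
    crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z - crit) + norm.cdf(-z - crit)

# Quantitative interaction: both subgroup effects positive but unequal.
print(interaction_power(delta1=0.5, delta2=0.2, n1=150, n2=150))  # modest power
print(interaction_power(delta1=0.8, delta2=0.2, n1=60, n2=240))   # small, responsive subgroup
```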
Marijuana is now legal for recreational or medical use in 41 states. Due to long-standing federal restrictions on cannabis-related research, the implications of cannabis legalization on traffic and occupational safety are understudied. There is a need for objective and validated measures of acute cannabis impairment that may be applied in public safety and occupational settings, such as post-crash or accident investigations. Identifying a reliable, objective biomarker of recent cannabis use has proven challenging, but pupillary response to light may offer an avenue for detection that outperforms typical sobriety tests. We developed a video processing and functional data analysis pipeline examining the pupillary response to a light stimulus test administered with goggles utilizing infrared videography. We then developed functional data models to make inference on pupil size in response to light after cannabis use. Our results suggest that functional regression models of pupil light response trajectories are more sensitive to differences across marijuana use groups than scalar feature extraction approaches.
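To illustrate the general flavor of a function-on-scalar analysis of pupil trajectories (a toy reconstruction on simulated curves, not the authors' pipeline; all shapes and names are invented), one can regress pupil size on a use-group indicator pointwise across the time grid and smooth the resulting coefficient curve:

```python
# Toy function-on-scalar regression for simulated pupil-size trajectories.
import numpy as np

rng = np.random.default_rng(11)
t = np.linspace(0, 10, 200)                        # seconds after light stimulus
n = 60
group = rng.integers(0, 2, n)                      # 0 = control, 1 = recent cannabis use
constrict = 5 - 2 * np.exp(-((t - 2) ** 2))        # shared constriction/rebound shape
effect = 0.3 * (t > 2) * (1 - np.exp(-(t - 2)))    # slower redilation after use
Y = constrict + np.outer(group, effect) + rng.normal(0, 0.3, (n, t.size))

# Pointwise regression of pupil size on the group indicator at each time;
# the fitted coefficients trace out the group-effect curve beta(t).
X = np.column_stack([np.ones(n), group])
beta_t = np.linalg.lstsq(X, Y, rcond=None)[0][1]

# Light smoothing of the coefficient curve (a spline basis would be typical).
beta_smooth = np.convolve(beta_t, np.ones(9) / 9, mode="same")
print(beta_smooth[::40])
```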
Even if a model is not globally sparse, it is possible for decisions made from that model to be accurately and faithfully described by a small number of features. We introduce the Sparse Explanation Value (SEV), a new way of measuring sparsity in machine learning models. SEV is a measure of decision sparsity rather than overall model sparsity. We introduce algorithms that reduce SEV without sacrificing accuracy, providing sparse and completely faithful explanations, even without globally sparse models.
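The brute-force sketch below conveys the spirit of decision sparsity for a positively classified query: the smallest number of the query's features that must be copied onto a reference point for the model to still predict positive. It is a simplified reconstruction for intuition, not the authors' definition or algorithms, and the toy classifier and reference point are assumptions.

```python
# Brute-force illustration of a decision-sparsity measure in the spirit of SEV.
from itertools import combinations
import numpy as np

def sev_plus(predict_pos, x, reference):
    d = len(x)
    for k in range(d + 1):
        for subset in combinations(range(d), k):
            z = reference.copy()
            z[list(subset)] = x[list(subset)]
            if predict_pos(z):
                return k          # k features suffice to explain the positive decision
    return d

# Toy linear classifier: predict positive when w.x + b > 0.
w, b = np.array([2.0, 0.1, 0.1, 1.5]), -1.0
predict_pos = lambda z: w @ z + b > 0
x = np.array([1.0, 1.0, 1.0, 1.0])          # query, predicted positive
reference = np.zeros(4)                      # e.g., a population reference point
print(sev_plus(predict_pos, x, reference))   # -> 1: one feature explains the decision
```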
Supporting collaborative biostatistics units in universities and academic medical centers involves many challenges, both logistical and financial. A commonality among the various funding models is the difficulty of affording researchers time to pursue activities not tied to or funded by specific research projects, including professional development, mentorship, and administrative tasks. This presentation illustrates the results of one such "flexible funding" model, employed by the Biostatistics Consulting and Interdisciplinary Research Collaboration Lab (Biostat CIRCL) at the University of Kentucky, that supports biostatisticians in completing these tasks. We conducted a qualitative study involving six staff collaborative biostatisticians in the Biostat CIRCL to determine the various activities, changes in workflow, impact on work-life balance, and effects on team operations and interpersonal dynamics allowed by this flexible funding model. This presentation showcases the benefits and challenges of such a funding model for the daily operations of collaborative biostatisticians and provides recommendations for how best to leverage this model in a university or academic medical center context.
Speaker
Anthony Mangino, University of Kentucky, Department of Biostatistics
The interdisciplinary nature of scientific research at academic institutions and medical centers requires collaborative biostatisticians to possess expertise in statistical methodology as well as team-based and soft skills. They play a crucial role in medical research through their contributions to study design, power estimation, rigorous and reproducible data analysis, quality assurance, and communicating results to both scientific communities and the public. Retaining them ensures consistency, efficiency, and effective long-term collaboration in the scientific community. While staff collaborative biostatisticians are often used to meet the growing demand for skilled collaborative biostatisticians, their effective retention and integration as team scientists into the academic landscape is challenging. Limited opportunities for growth in the form of a career ladder are often a serious impediment to team science in academia. This poster will describe challenges and strategies associated with successfully training, retaining, promoting, and integrating staff collaborative biostatisticians into a team-based academic research environment.
Collaborative academic statisticians are often involved in many different aspects of research and training and are interested in a variety of areas. This poster will show an evolving cycle of ideas, starting with how an idea for developing videos to help train and mentor the next generation of applied statisticians came to fruition and led to the development of a short course and several new projects and collaborations. That work, along with new statistical consulting and collaboration literature, led to a restructuring of the Statistical Practice class at NC State and to new ideas being incorporated into the joint data science consulting and collaboration program run by the NC State Libraries and Data Science Academy. The energy around that cooperative program has led to a workshop, more additions to the body of statistical consulting literature, and a grant-funded book-writing project. While it is often clear how methodological developments lead to further research, that path in the collaboration and consulting space may be much less clear. This poster will illuminate one path for building a research program in this space.