02/03/2023: 7:30 AM - 8:45 AM PST
Posters
Room: Cyril Magnin Foyer
Presentations
This poster discusses the automation of customized reports using R Markdown to efficiently provide accurate, visually appealing data summaries to stakeholders. As statistics and data science skills become more desirable in the workforce, an impactful college-level statistics education becomes more crucial. Studies show that student attitudes toward statistics are an important factor related to student learning in statistics courses. The NSF-funded MASDER research team developed the S-SOMAS survey to measure such attitudes in 5166 students nationally during the 2021-22 academic year. Each participating instructor receives a customized report with student demographic and attitudinal summaries for their particular class, along with a comparison to the national sample. While generating statistical summaries via R Markdown is straightforward, automation is a more complex process that allows reports to be generated as soon as a class completes the survey, rather than waiting on a statistician to manually clean data and create a report. The poster will discuss the report development process, as well as the particular design choices that make the automation successful and the reports visually appealing.
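A minimal sketch of the kind of scripted, parameterized rendering described above, assuming a hypothetical parameterized template report.Rmd with a class_id parameter and a hypothetical file listing completed classes:

```r
# Render one customized report per participating class as soon as its survey
# data are available (class IDs and file paths below are hypothetical).
library(rmarkdown)

completed <- read.csv("data/completed_classes.csv")

for (id in completed$class_id) {
  render(
    input       = "report.Rmd",                      # parameterized R Markdown template
    params      = list(class_id = id),               # passed to the template's params
    output_file = paste0("reports/report_", id, ".html")
  )
}
```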
Presenting Author
Cody Leporini
First Author
Cody Leporini
CoAuthor(s)
Alana Unfried
Michael Posner, Villanova University
Pneumonia is an infection of the lungs. Severe diseases such as COVID-19, SARS, and ARDS can lead to pneumonia, resulting in lung injury and death (Xu et al., 2020). Chest X-ray films are one of the most widely used tools for detecting lung infection. Automatically diagnosing pneumonia at an early stage can significantly curb the rapid spread of respiratory diseases and ease the workload of laboratories, which is especially important during the COVID-19 pandemic. Machine learning applications coupled with imaging techniques can be very useful for the automatic detection of infected patients. Convolutional neural networks (CNNs), a subfield of machine learning, have remarkable performance in end-to-end machine learning for images: they require minimal feature engineering and achieve near-human performance on various benchmark tasks. In this project, we built a pipeline to preprocess image data and explored various neural networks to predict the label of chest X-rays. Model performance was evaluated based on classification accuracy, defined as the percentage of correctly predicted labels (healthy or pneumonia). Additionally, to better understand how the networks made their classification decisions, saliency maps were used to diagnose decision bias.
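The abstract does not specify the architecture; purely as an illustrative sketch, a small binary-classification CNN could be specified with the keras package in R (the input size, layer sizes, and class coding below are assumptions):

```r
library(keras)

# Small illustrative CNN; the input shape and layer sizes are placeholders.
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(128, 128, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")    # healthy vs. pneumonia

model %>% compile(
  optimizer = "adam",
  loss      = "binary_crossentropy",
  metrics   = "accuracy"    # the evaluation metric named in the abstract
)
# Fitting would then use the preprocessed X-ray arrays and their labels.
```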
Presenting Author
Jingying Zeng
First Author
Jingying Zeng
Independent component analysis (ICA) is one of the most commonly used blind source separation (BSS) techniques for signal pre-processing. The performance of ICA depends on the preset number of independent components (ICs): too few ICs leads to under-decomposition of the mixed signals, whereas too many ICs results in overfitting of the source signals. In this study, we propose a novel multivariate method to determine the optimal number of ICs, named column-wise independent component analysis (CW_ICA). It measures the relationship between ICs from two different blocks by the smallest of the column-wise maximum values of the off-diagonal, rank-based correlation matrix, and uses this criterion to automatically identify the optimal number of ICs. Using simulated data and raw scalp EEG signals as validation sets, we compare the proposed CW_ICA to several existing methods combined with different ICA algorithms. Results show that CW_ICA is a reliable and robust method for determining the optimal number of components in ICA. The method has broad applicability (e.g., EEG, LC-MS) and can be used in conjunction with a variety of ICA methods (e.g., FastICA, Infomax).
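A rough sketch of one reading of the CW_ICA criterion described above: split the channels into two blocks, extract k ICs from each block with fastICA, and take the smallest column-wise maximum of the cross-block rank correlation matrix. The block construction and the scan over k are assumptions, not the authors' exact algorithm:

```r
library(fastICA)

# X is a (time x channel) matrix of mixed signals; splitting the channels
# into two random blocks is an illustrative assumption.
cw_criterion <- function(X, k) {
  cols <- sample(ncol(X))
  A <- X[, cols[1:(ncol(X) %/% 2)]]
  B <- X[, cols[(ncol(X) %/% 2 + 1):ncol(X)]]
  S1 <- fastICA(A, n.comp = k)$S                 # ICs from block 1 (time x k)
  S2 <- fastICA(B, n.comp = k)$S                 # ICs from block 2 (time x k)
  R  <- abs(cor(S1, S2, method = "spearman"))    # cross-block rank correlations
  min(apply(R, 2, max))                          # smallest column-wise maximum
}

# Scan candidate numbers of components and inspect where the criterion drops:
# sapply(2:15, function(k) cw_criterion(eeg_matrix, k))
```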
Presenting Author
Yuyan Yi, Auburn University
First Author
Yuyan Yi, Auburn University
CoAuthor(s)
Jingyi Zheng, Auburn University
Nedret Billor, Auburn University
Seagrass meadows support complex species assemblages and provide ecosystem services with a multitude of socio-economic benefits. However, they are sensitive to anthropogenic pressures such as coastal development, agricultural runoff, and overfishing. The increasing prevalence of marine heatwaves associated with climate change poses an additional and growing threat. Given the ecological importance of seagrass for maintaining high biodiversity and a range of other ecosystem services, and with extreme climate events such as marine heatwaves predicted to become more frequent and intense, understanding marine heatwave impacts on marine ecosystems is critical for assessing species' adaptive capacity under future climate change scenarios. There is a demand for tools and strategies to understand trends in the continued decline of seagrass, explore alternative hypotheses to mitigate marine heatwave events, and implement risk-based responses. We develop a general seagrass ecosystem Dynamic Bayesian Network model to assess the impact of marine heatwaves on the resilience of seagrass. To achieve this, we incorporated heat stress caused by marine heatwaves into a Dynamic Bayesian Network previously developed for seagrass and evaluated the model's predictions of climate change impacts under various scenarios via a marine heatwave case study. Although the frequency of heat events appeared to be a significant factor in the potential damage to seagrass meadows, the impacts of heat stress were predicted to be more severe as the duration of heat events increased. Furthermore, the longer the interval between heatwaves at temperatures that do not induce heat stress, the quicker H. ovalis might recover before the next heatwave. This improved understanding of how seagrass responds to varying heat scenarios may facilitate global efforts to enhance seagrass protection, monitoring, management, and restoration. Research should be broadened to better understand the impacts of climate change on seagrass ecosystems, improve the foundation for informing climate change policy debates, and develop adaptive management responses that build resilience in marine ecosystems.
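The abstract does not specify the network structure; purely as an illustrative sketch, a single time-slice of a heat-stress-augmented seagrass network could be written down with the bnlearn R package (the node names and arcs below are assumptions, not the authors' model):

```r
library(bnlearn)

# Hypothetical slice of a heat-stress-augmented seagrass network; in a dynamic
# Bayesian network the ShootDensity node would also receive an arc from its
# own value at the previous time step.
dag <- model2network(paste0(
  "[MHWDuration][MHWFrequency]",
  "[HeatStress|MHWDuration:MHWFrequency]",
  "[ShootDensity|HeatStress]",
  "[Resilience|ShootDensity]"
))
arcs(dag)   # inspect the assumed structure before attaching conditional probability tables
```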
Presenting Author
Paula Hatum, Queensland University of Technology
First Author
Paula Hatum, Queensland University of Technology
CoAuthor(s)
Paul Wu, School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
Kerrie Mengersen, Queensland University of Technology
Kathryn McMahon, School of Science and Centre for Marine Ecosystems Research, Edith Cowan University
The fundamental challenge of drawing causal inference is that counterfactual outcomes are not fully observed for any unit. Furthermore, in observational studies, treatment assignment is likely to be confounded. Many statistical methods have emerged for causal inference under unconfoundedness conditions given pre-treatment covariates, including propensity score-based methods, prognostic score-based methods, and doubly robust methods. Unfortunately for applied researchers, there is no 'one-size-fits-all' causal method that performs optimally in every setting. In practice, causal methods are primarily evaluated quantitatively on handcrafted simulated data. Such data-generating procedures can be of limited value because they are typically stylized models of reality: they are simplified for tractability and lack the complexities of real-world data. For applied researchers, it is critical to understand how well a method performs on the data at hand. Our work introduces a deep generative model-based framework, Credence, to validate causal inference methods. The framework's novelty stems from its ability to generate synthetic data anchored at the empirical distribution of the observed sample, and therefore virtually indistinguishable from it. The approach allows the user to specify ground truth for the form and magnitude of causal effects and confounding bias as functions of covariates. The simulated data sets are then used to evaluate the potential performance of various causal estimation methods when applied to data similar to the observed sample. We demonstrate Credence's ability to accurately assess the relative performance of causal estimation techniques in an extensive simulation study and in two real-world data applications from the LaLonde and Project STAR studies.
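Credence itself is a deep generative model; the core idea of specifying ground truth for the treatment effect and the confounding as functions of covariates can be illustrated with a toy, non-deep generator (all functional forms below are arbitrary choices for illustration):

```r
# Toy generator with a user-specified treatment effect tau(x) and confounding
# in the assignment mechanism e(x); every functional form here is illustrative.
set.seed(1)
n   <- 2000
x   <- rnorm(n)
tau <- function(x) 1 + 0.5 * x          # chosen ground-truth treatment effect
e   <- function(x) plogis(0.8 * x)      # chosen confounded propensity score

z  <- rbinom(n, 1, e(x))                # treatment assignment
y0 <- 2 * x + rnorm(n)                  # untreated potential outcome
y1 <- y0 + tau(x)                       # treated potential outcome
y  <- ifelse(z == 1, y1, y0)            # observed outcome

# Any candidate estimator can now be scored against the known truth, e.g. a
# naive difference in means versus the true average treatment effect:
c(naive = mean(y[z == 1]) - mean(y[z == 0]), truth = mean(tau(x)))
```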
Presenting Author
Harsh Parikh, Duke University
First Author
Harsh Parikh, Duke University
CoAuthor(s)
Carlos Varjao, Amazon
Louise Xu, Amazon
Eric Tchetgen Tchetgen, The Wharton School, University of Pennsylvania
Different phases of flight have different engine operating conditions, temperatures, and performance requirements. Identifying the flight phase is critical for health monitoring, fault detection, and deterioration calculations. To be robust to measurement fidelity, signal variation, and different engine types, the objective is to identify the flight phase using only the rotational shaft speeds. The current study uses two shaft speeds recorded from historic engine development tests and classifies flight phase into four main modes (Taxi, Cruise, Idle, and Other) using several machine learning methods: LSTM, KNN, and SVM. By comparing evaluation criteria such as accuracy, precision, recall, F1-score, and AUC among the three methods, KNN was selected as the most successful method in this study. This paper also investigates confidence scores for the classification results using a recently proposed confidence modeling technique, MACEst [1]. Moreover, a sensitivity analysis is conducted to determine specific thresholds for the low, medium, and high levels of confidence based on the estimated confidence scores obtained from MACEst. This phase identification process is beneficial for engine fault, performance, and maintenance analytics.
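A minimal sketch of the KNN step on the two shaft speeds, assuming a hypothetical labeled data frame flights with shaft-speed columns n1 and n2 and a phase label; the choice k = 15 is arbitrary:

```r
library(class)

# flights is a hypothetical labeled data frame with the two shaft speeds
# (n1, n2) and the flight-phase label (Taxi, Idle, Cruise, Other).
idx   <- sample(nrow(flights), 0.7 * nrow(flights))
train <- flights[idx, ]
test  <- flights[-idx, ]

# Standardize both sets with the training means and standard deviations.
mu  <- colMeans(train[, c("n1", "n2")])
sds <- apply(train[, c("n1", "n2")], 2, sd)
tr  <- scale(train[, c("n1", "n2")], center = mu, scale = sds)
te  <- scale(test[,  c("n1", "n2")], center = mu, scale = sds)

pred <- knn(train = tr, test = te, cl = train$phase, k = 15)  # k is a tuning choice
table(predicted = pred, actual = test$phase)                  # confusion matrix
mean(pred == test$phase)                                      # overall accuracy
```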
Presenting Author(s)
Nedret Billor, Auburn University
Mohammad Maydanchi, Auburn University
First Author
Parisa Asadi, Auburn University
CoAuthor(s)
Mohammad Maydanchi, Auburn University
Ayomide Afolabi, Auburn University
Mark Izuchukwu Uzochukwu, Auburn University
Michael Brown, Auburn University
Nedret Billor, Auburn University
Chad Foster
In this work we evaluated the predictive performance of the autoregressive integrated moving average (ARIMA) model on time-series data imputed using Kalman filtering with ARIMA models, Kalman filtering with structural time series models, exponentially weighted moving average, simple moving average, mean imputation, linear interpolation, Stine interpolation, and KNN imputation under a missing completely at random (MCAR) mechanism. Missing values were generated artificially at 10%, 15%, 25%, and 35% rates using complete data of 24-hour ambulatory blood pressure readings. The performance of the ARIMA models was compared on the imputed and original data using mean absolute percentage error (MAPE) and root mean square error (RMSE). Based on the results, mean imputation was the best technique, resulting in the smallest MAPE and RMSE at the 10% rate of missingness. At the 15% rate of missingness, the exponentially weighted moving average outperformed the other techniques in terms of RMSE, and Stine interpolation was the best imputation method based on MAPE. At the 25% rate of missingness, Kalman filtering with structural time series models performed better than the other techniques based on both RMSE and MAPE. At the 35% rate of missingness, Kalman filtering with structural time series models was the best in terms of RMSE, and Kalman filtering with ARIMA models was the best technique based on MAPE.
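Most of the imputation techniques listed above have counterparts in the imputeTS R package; a hedged sketch of the comparison for a single missingness rate, assuming bp is the complete series of 24-hour ambulatory blood pressure readings (KNN imputation is omitted here):

```r
library(imputeTS)
library(forecast)

# bp: complete numeric time series of ambulatory blood pressure readings.
# Introduce 10% missingness completely at random (MCAR).
set.seed(42)
bp_mis <- bp
bp_mis[sample(length(bp), size = 0.10 * length(bp))] <- NA

imputations <- list(
  kalman_arima   = na_kalman(bp_mis, model = "auto.arima"),
  kalman_structs = na_kalman(bp_mis, model = "StructTS"),
  ewma           = na_ma(bp_mis, weighting = "exponential"),
  sma            = na_ma(bp_mis, weighting = "simple"),
  mean           = na_mean(bp_mis),
  linear         = na_interpolation(bp_mis, option = "linear"),
  stine          = na_interpolation(bp_mis, option = "stine")
)

# Fit an ARIMA model to each imputed series and record in-sample RMSE and MAPE;
# the same summary for the original complete series serves as the reference.
sapply(imputations, function(x) accuracy(auto.arima(x))[, c("RMSE", "MAPE")])
```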
Presenting Author(s)
Nicholas Niako, University of Texas Rio Grande Valley
Kristina Vatcheva, University of Texas Rio Grande Valley
First Author
Nicholas Niako, University of Texas Rio Grande Valley
CoAuthor(s)
Kristina Vatcheva, University of Texas Rio Grande Valley
Jesus Melgarejo, Studies Coordinating Centre, Research Unit Hypertension and Cardiovascular Epidemiology, KU Leuven
Gladys Maestre, Rio Grande Valley Alzheimer’s Disease Resource Center for Minority Aging Research (RGV AD-RCMAR)
Predictive modeling to aid decision making has been widely used in a number of fields such as chemistry, computer science, physics, economics, finance, and statistics. Many models have been proposed to make accurate predictions, and yet no model consistently outperforms the rest. Additional challenges in practice include training datasets that are relatively small (n < 100), often due to rare diseases or expensive data collection, and the question of how best to handle missing data. We use a real dataset as an example to present a general framework for advanced machine learning (ML) modeling when the training dataset is small and contains missing data. Specifically, multiple imputation is used to create imputed datasets that eliminate missing data, and repeated K-fold cross-validation is used to robustly evaluate the predictive performance of the final predictor. Popular machine learning methods for predicting a binary outcome, such as penalized logistic regression, random forests, gradient boosted decision trees, support vector machines, XGBoost, and neural networks, are first applied to the training data from the cross-validation step as base models. Each model has associated hyper-parameters that are tuned according to how well different sets of these parameters perform across all imputed datasets. For each imputed dataset, the out-of-fold predictions from each of the base machine learning methods are then stacked using a logistic regression model to create the final predictive model. Predictive performance measures such as balanced accuracy (the average of the accuracy within each of the two outcome categories), accuracy, area under the curve (AUC), sensitivity, and specificity, along with the corresponding standard errors (SE), are summarized using Rubin's rules across the final predictive models for each imputed dataset. Permutation importance values (which quantify how much each feature contributes to the prediction) can be obtained for any base machine learning model. The importance values of all features used in the final base ML models for each imputed dataset can then be averaged to examine which features are most important for predicting the binary outcome. A feature selection strategy based on this importance ranking could be used as a further step.
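A compressed sketch of the stacking step for one imputed dataset and two base learners (the framework above uses more base learners, hyper-parameter tuning, and repeated cross-validation); the data frame dat and its binary outcome y are hypothetical:

```r
library(mice)
library(randomForest)
library(glmnet)

# dat: small data frame with a binary outcome y and covariates containing NAs.
imp <- mice(dat, m = 5, printFlag = FALSE)   # multiple imputation
d1  <- complete(imp, 1)                      # one imputed dataset, for illustration
x   <- model.matrix(y ~ . - 1, data = d1)
y   <- d1$y

# Out-of-fold predictions from two base learners via 5-fold cross-validation.
folds <- sample(rep(1:5, length.out = nrow(d1)))
oof   <- matrix(NA, nrow(d1), 2, dimnames = list(NULL, c("rf", "lasso")))
for (k in 1:5) {
  tr <- folds != k
  rf <- randomForest(x[tr, ], factor(y[tr]))
  la <- cv.glmnet(x[tr, ], y[tr], family = "binomial")
  oof[!tr, "rf"]    <- predict(rf, x[!tr, ], type = "prob")[, 2]
  oof[!tr, "lasso"] <- predict(la, x[!tr, ], s = "lambda.min", type = "response")
}

# Stack the out-of-fold predictions with a logistic regression meta-model.
stack <- glm(y ~ ., data = data.frame(y = y, oof), family = binomial)
summary(stack)
```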
Presenting Author(s)
Junying Wang, Stony Brook University
David Wu, Stony Brook University
First Author
David Wu, Stony Brook University
CoAuthor(s)
Junying Wang, Stony Brook University
Christine DeLorenzo, Stony Brook University
Jie Yang, Stony Brook University
Traditional two-arm randomized controlled trials are the gold standard for evaluating new therapies, but as technology has advanced, the need for flexible and efficient trial designs has increased. Master protocols, and specifically platform designs, have been proposed as an option. Platform trials test multiple interventions against a common control arm and allow arms to enter and exit the study, which often increases the efficiency of evaluating efficacy for the experimental arms. However, this increased flexibility comes with challenges from both statistical and operational perspectives. Many of these challenges relate to the proper use of control data when assessing efficacy. Should an arm be compared only to controls enrolled during the period in which it was active (concurrent controls), or can we leverage information from controls enrolled during other time periods (non-concurrent controls)? Several analytic frameworks have been proposed to address this question, and these decisions have important implications for operational aspects of the platform, such as data sharing. This presentation will discuss considerations for the use of non-concurrent control data and related challenges.
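Among the analytic frameworks that have been proposed, one commonly discussed option is to retain non-concurrent controls while adjusting for calendar-time period in the outcome model; a minimal sketch contrasting that with a concurrent-controls-only analysis, assuming a hypothetical trial data frame with columns y, arm, and period:

```r
# trial: hypothetical platform-trial data with columns
#   y      - binary outcome
#   arm    - "control", "A", "B", ... (arms enter and leave over time)
#   period - calendar-time period defined by arms entering or leaving

# Concurrent-controls-only analysis for arm B: keep only the periods
# during which arm B was enrolling.
b_periods  <- unique(trial$period[trial$arm == "B"])
concurrent <- subset(trial, period %in% b_periods & arm %in% c("control", "B"))
fit_cc  <- glm(y ~ arm, data = concurrent, family = binomial)

# Non-concurrent controls retained, with adjustment for period.
fit_ncc <- glm(y ~ arm + factor(period),
               data = subset(trial, arm %in% c("control", "B")),
               family = binomial)
```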
Presenting Author
Megan McCabe, University of Alabama at Birmingham
First Author
Megan McCabe, University of Alabama at Birmingham
CoAuthor(s)
Emine Bayman
Christopher Coffey, University of Iowa
The advent of large-scale data (e.g., from industry or biotechnology) has made the development of suitable statistical analysis techniques a cornerstone of modern interdisciplinary research and data analysis. Often, these data sets contain many covariates but comparatively few samples (p >> n). In this data-scarce regime, standard statistical methods are no longer appropriate.
A common research question in many data-driven observational studies concerns estimating how individual covariates influence a readout of interest. In practice, it is unlikely that all measured covariates affect the readout independently of each other. Rather, it can be assumed that only a subset of covariates is relevant, and that they potentially co-operate in a concerted fashion. Thus, a major concern is to identify, from a large number of possible combinations of covariates, a small set of reliable effects that build hypotheses for further functional analyses. Possible questions in different application areas include, for example, combinatorial (e.g., synergistic or antagonistic) effects of different drugs on a biological readout, or the combinatorial behavior of different building energy efficiency measures on building energy consumption.
Studying the effects of all possible combinations of features is a notoriously hard problem, and statisticians often have to deal with noisy datasets from incomplete experimental designs, where not all combinations of covariates have been measured. State-of-the-art techniques such as sparse linear regression and its extensions to interaction models do not deliver sufficiently robust estimates of combinatorial effects between covariates. Thus, the development of robust methods is crucial to reduce the number of spurious interaction effects and to allow the communication of a reliable set of interaction effects to clients or collaborators in the application domain.
We propose a computational workflow that robustly recovers bi-order (pairwise) interactions in the data-scarce regime. As the baseline model we use a lasso model for hierarchical interactions. Compared to the classical lasso problem with interaction effects, the hierarchical model prefers main effects over interaction effects and only selects an interaction effect if its inclusion considerably improves predictive accuracy.
In order to perform robust model selection in the data-scarce regime, we combine the idea of stability selection and hierarchical interaction modeling. Based on synthetic data we show superior performance of stability selection over the commonly used cross-validation procedure in lasso models in terms of minimizing the number of spurious effects.
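A minimal sketch of the stability-selection idea on an interaction model; for brevity it uses a plain lasso (glmnet) on explicitly constructed pairwise interaction terms rather than the hierarchical interaction lasso used in the actual workflow, and the 0.8 selection-frequency threshold is an arbitrary illustration:

```r
library(glmnet)

# X: n x p covariate matrix, y: response.  Build main effects plus all
# pairwise interactions (the "bi-order" terms).
XX <- model.matrix(~ .^2 - 1, data = as.data.frame(X))

# Stability selection: refit the lasso on many random subsamples and record
# how often each coefficient is selected.
B   <- 100
sel <- matrix(0, ncol(XX), B)
for (b in 1:B) {
  idx <- sample(nrow(XX), floor(nrow(XX) / 2))       # subsample of size n/2
  fit <- cv.glmnet(XX[idx, ], y[idx])
  cf  <- as.vector(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
  sel[, b] <- as.numeric(cf != 0)
}
freq <- rowMeans(sel)
names(freq) <- colnames(XX)
sort(freq[freq >= 0.8], decreasing = TRUE)   # "stable" effects above a threshold
```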
To account for potential noise in the data, our workflow comes in combination with a model-based filter algorithm that ensures that the number of spurious interaction effects due to noisy data is minimized.
Our computational workflow is of independent interest whenever robust hierarchical statistical interactions among various types of binary, categorical, and continuous covariates need to be assessed in the data-scarce regime.
We demonstrate the generalizability of our workflow by applying it to various application fields, including the study of combinatorial effects of epigenetic modifications under the histone code hypothesis, the study of combinatorial drug effects on the microbial abundance of certain species in the human gut, and the study of combinatorial effects of building energy efficiency measures on building energy consumption.
Participants at the conference will learn about statistical techniques like the lasso, the lasso for hierarchical interactions, stability selection and synthetic data generation. They will receive an introduction on how to use our reproducible workflow to apply it in their own domains. Familiarity with the standard statistical regression model will be of advantage when attending the session.
Presenting Author
Mara Stadler, Helmholtz Center Munich
First Author
Mara Stadler, Helmholtz Center Munich
CoAuthor
Christian L. Müller, Helmholtz Center Munich
According to recent research, Americans have become more divided and polarized in recent years. In this project, we aim to characterize and quantify polarization trends throughout a historical record of US-based, primarily regional, newspapers. Newspapers were selected from a variety of US markets in an attempt to capture any regional differences that might exist in how issues and topics are discussed. Our modeling approach is based on a Structural Topic Model (STM) that identifies topics within a given corpus and then measures the tonal differences of articles discussing the same topic. Specifically, we utilize the STM to infer potentially correlated topics and a sentiment analyzer called VADER to identify topics that exhibit a high level of semantic disparity. Using this technique, we measure the polarization of developing and evolving topics, such as sports, politics, and entertainment, and compare how polarization between and within these topics has varied through time. Through this, we develop topic-specific distributions of sentiment that we refer to as polarization distributions. We conclude by demonstrating the utility of these distributions in identifying polarization and by showing how instances of high polarization coincide with significant social events.
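A minimal sketch of the topic-model step with the stm R package; the corpus object, the choice of K, and the prevalence covariates are assumptions, and the subsequent VADER sentiment scoring and the polarization distributions are not shown:

```r
library(stm)

# articles: hypothetical data frame with the article text and metadata such
# as newspaper and year.
processed <- textProcessor(articles$text, metadata = articles)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Fit a correlated structural topic model; K = 40 is a modeling choice.
fit <- stm(documents = out$documents, vocab = out$vocab, K = 40,
           prevalence = ~ newspaper + s(year), data = out$meta)

labelTopics(fit)      # inspect the inferred topics
theta <- fit$theta    # per-document topic proportions, later paired with
                      # per-document sentiment scores to build polarization
                      # distributions by topic
```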
Presenting Author
David Edwards, Virginia Tech
First Author
David Edwards, Virginia Tech
CoAuthor(s)
Scotland Leman, Virginia Tech
Shyam Ranganathan
James Hawdon, Virginia Tech
Cozette Comer
Since the outbreak of the COVID-19 pandemic, masks have been a central topic of public health research. Yet, much of the current literature only evaluates mask mandates retrospectively, allowing for potential bias from unobserved confounding factors. However, propensity score matching methods have been shown to reduce this bias by balancing observed covariates in the hopes of also balancing unobserved covariates.
The result is a synthetic imitation of an experimental study that produces a quasi-causal effect estimate. In our study, we employ propensity score matching at the county level to evaluate the effect of mask mandates in August 2020.
Although propensity score matching techniques are not novel to public health research, they have not yet been applied to evaluate mask mandates. Further, the matching algorithms common in the literature (e.g., pair matching or fixed k:1 matching) are often criticized for discarding valuable data due to their rigid structure. To overcome these challenges, we employ Hansen and Klopfer's optimal full matching algorithm with restrictions. The result is a more precise treatment effect estimate that overcomes both the non-experimental issues in observational studies and the drawbacks of commonly used matching algorithms.
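A sketch of the matching step via MatchIt's interface to optimal full matching (the optmatch implementation of Hansen and Klopfer's algorithm); the county-level variable names and the max.controls restriction are placeholders:

```r
library(MatchIt)

# counties: hypothetical county-level data with a mandate indicator,
# pre-treatment covariates, and an outcome (e.g., subsequent case growth).
m <- matchit(mandate ~ pop_density + median_income + pct_over_65 + prior_cases,
             data     = counties,
             method   = "full",     # optimal full matching (optmatch backend)
             distance = "glm",      # propensity score from logistic regression
             max.controls = 5)      # a restriction on the matching ratio

summary(m)                          # covariate balance before and after matching
md  <- match.data(m)                # matched data with weights and subclasses
fit <- lm(case_growth ~ mandate, data = md, weights = weights)
summary(fit)
```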
Presenting Author
Simon Nguyen
First Author
Simon Nguyen
T-cells are one of the key components of the adaptive immune system. T-cell receptors (TCRs) are a group of protein complexes found on the surface of T-cells. TCRs are responsible for recognizing and binding to certain antigens found on abnormal cells or potentially harmful pathogens. Once the TCRs bind to the pathogens, the T-cells attack these cells and help the body fight infection, cancer, or other diseases. TCR repertoires, which are continually shaped throughout the lifetime of an individual in response to pathogenic exposure, can serve as a fingerprint of an individual's current immunological profile. The similarity among TCR sequences directly influences the breadth of antigen recognition. Network analysis, which allows interrogation of sequence similarity, thereby adds an important layer of information. Due to the heterogeneous nature of TCR network properties, it is extremely difficult to perform statistical inference or machine learning directly across subjects. In this paper, we propose a novel method to prioritize the network properties that are associated with the outcome of interest, based on features extracted from heterogeneous global and local network properties. We also propose schemes to select the top associated features and simulate network properties based on real data. Extensive simulation studies and real data analysis were performed to demonstrate the proposed methods. Performance measures including F1 score, false discovery rate, sensitivity, power, and stability were calculated for each model and used for model comparison.
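An illustrative sketch of one way to build a per-subject sequence-similarity network and extract global/local network properties as candidate features; the Levenshtein distance threshold and the particular property set are assumptions:

```r
library(stringdist)
library(igraph)

# tcr_seqs: hypothetical character vector of TCR CDR3 amino-acid sequences
# for one subject.
d <- stringdistmatrix(tcr_seqs, tcr_seqs, method = "lv")   # Levenshtein distances
A <- (d <= 1) * 1                                          # connect sequences within distance 1
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, mode = "undirected")

# Per-subject network features that could feed the prioritization step.
features <- c(
  n_nodes      = vcount(g),
  n_edges      = ecount(g),
  mean_degree  = mean(degree(g)),
  transitivity = transitivity(g),            # global clustering coefficient
  n_components = components(g)$no,
  max_cluster  = max(components(g)$csize)
)
features
```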
Presenting Author
Shilpika Banerjee
First Author
Shilpika Banerjee
CoAuthor(s)
Li Zhang, University of California
Tao He, San Francisco State University
Many coastal urban areas are experiencing the impacts of accelerated chronic and episodic flooding on the built environment and on people's livelihoods and quality of life. These impacts sometimes exceed households' adaptive and coping capacities to deal with flooding, prompting residents to consider relocation. It is unclear how urban dwellers living in flood-prone locations perceive this adaptation strategy and under what flood-driven circumstances they would consider permanently moving. This paper provides empirical evidence on relocation preferences among urban residents along the U.S. East Coast. It further explores how this decision is influenced by socioeconomic determinants, experiences with flood exposure, comprehensive concerns with flooding, and preferences for relocation destinations. We administered an online survey to 1450 residents living in flood-prone urban areas across multiple states, from New York to Florida, and analyzed the results using descriptive and inferential statistics. Data visualization techniques were employed to explore the impact of different covariates, and correlation analysis was used in conjunction with variable selection techniques for dimension reduction. We fit a multinomial logistic regression model to understand the effect of significant predictor variables on an individual's willingness to relocate. Results show that almost half of respondents would consider relocating due to coastal flooding, with only 13 percent declining this option. They also show that age and race, several determinants of place attachment, problem-solving capacity, and flood-related household- and community-level concerns play a significant role in willingness to relocate.
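A minimal sketch of the multinomial logistic regression step, assuming a hypothetical respondent-level data frame survey with a three-level willingness response and the predictors retained after dimension reduction:

```r
library(nnet)

# survey: hypothetical respondent-level data; willingness has levels
# "no", "unsure", and "yes", with "no" as the reference category.
survey$willingness <- relevel(factor(survey$willingness), ref = "no")

fit <- multinom(willingness ~ age + race + place_attachment +
                  problem_solving + flood_concern,
                data = survey)

summary(fit)
exp(coef(fit))   # odds ratios relative to the "no" category
```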
Presenting Author
Steven Barnett, Virginia Tech
First Author
Steven Barnett, Virginia Tech
CoAuthor
Anamaria Bukvic, Virginia Tech
There is a fundamental tension between the calibration and boldness of probability predictions about forthcoming events. Predicted probabilities are considered well calibrated when they are consistent with the relative frequency of the events they aim to predict. However, well-calibrated predictions are not necessarily useful. Predicted probabilities are considered bolder when they are further from the base rate and closer to the extremes of 0 or 1. Predictions that are reasonably bold, while maintaining calibration, are more useful for decision making than those with only one property or the other. We develop Bayesian estimation and hypothesis-testing-based methodology with a likelihood suited to the probability calibration problem. Our approach effectively identifies and corrects miscalibration. Additionally, it allows users to maximize boldness while maintaining a user-specified level of calibration, providing an interpretable tradeoff between the two. While we demonstrate the practical capabilities of this methodology by comparing hockey pundit predictions, the approach is widely applicable across many fields.
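The abstract does not give the likelihood; one recalibration family often used for this kind of problem is the linear-in-log-odds (LLO) map, and a maximum-likelihood version of the recalibration step (a simple stand-in for the poster's Bayesian treatment) can be sketched as follows, with p and y hypothetical vectors of predictions and outcomes:

```r
# p: vector of predicted probabilities, y: 0/1 outcomes (both hypothetical).
# LLO recalibration: c(p; delta, gamma) = delta * p^gamma /
#                    (delta * p^gamma + (1 - p)^gamma).
llo <- function(p, delta, gamma) {
  delta * p^gamma / (delta * p^gamma + (1 - p)^gamma)
}

# Negative Bernoulli log-likelihood of the recalibrated probabilities.
nll <- function(par, p, y) {
  q <- llo(p, exp(par[1]), exp(par[2]))   # keep delta and gamma positive
  -sum(y * log(q) + (1 - y) * log(1 - q))
}

est <- optim(c(0, 0), nll, p = p, y = y)
c(delta = exp(est$par[1]), gamma = exp(est$par[2]))  # (1, 1) indicates calibration
```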
Presenting Author
Adeline Guthrie, Virginia Tech
First Author
Adeline Guthrie, Virginia Tech
CoAuthor
Christopher Franck, Virginia Tech
T cells and B cells are the guardians of our immune system against pathogens, foreign substances, and infections, including respiratory viruses such as SARS-CoV-2. Understanding the T and B cell repertoire provides better knowledge of the response mechanism of our immune system. Next-generation sequencing helps us to profile the T and B cell repertoire (Rep-seq). However, it also requires novel statistical approaches and machine learning techniques to analyze those new data types. We applied a customized pipeline for Network Analysis of Immune Repertoire (NAIR) with advanced statistical methods and cutting-edge machine learning techniques developed by our team to characterize and investigate changes in the landscape of Rep-seq for SARS-CoV-2 data from COVID-19 subjects. We first performed network analysis on the Rep-seq data based on sequence similarity. We then quantified the repertoire network by network properties and correlated it with clinical outcomes of interest. In addition, we identified COVID-19-specific/associated clusters based on our customized search algorithms and assessed their relationship with clinical outcomes such as active status and recovery from COVID-19 infection. Furthermore, to identify potential antigen-driven TCRs among disease-specific clusters we designed a new metric incorporating the clonal generation probability and the clonal abundance by using a modified Bayes factor to filter out the false positives. We also validated our findings by comparing our results with an external dataset. Our results demonstrate that our novel approach to analyzing the network architecture of the immune repertoire can reveal potential antigen-driven TCRs responsible for the immune response to the infection.
Presenting Author
Brian Neal, University of California Irvine
First Author
Brian Neal, University of California Irvine
CoAuthor(s)
Hai Yang, UCSF
Zenghua Fan, University of California San Francisco
Phi Le, University of Mississippi Medical Center
Tao He, San Francisco State University
Lawrence Fong, University of California San Francisco
Jason Cham, Scripps Green Hospital
Li Zhang, University of California
During the COVID-19 pandemic, Virginia Tech tried to be proactive instead of reactive in regard to outbreaks on campus. Using wastewater collected from dormitory outflow locations, models were fit with the hope that upticks in COVID-19 infection could be predicted early enough that resources such as extra tests could be allocated intelligently, instead of randomly. In order to accomplish this, various data streams such as dormitory swipe card data, wastewater test results, and university isolation and quarantine information, all needed to be kept up to date so that decision-makers had access to the most recent visualizations and predictions at any moment. To this end, the entire process of data acquisition, cleaning, merging of data streams, processing, modeling, and visualization was automated so that no human interaction was needed daily. In addition, this all needed to be done in a way that was HIPAA compliant, since student health records were an important part of the modeling process. This presentation focuses on the steps taken to achieve the goal of complete automation, from automatically collecting new data when streams were updated, to providing updated visualizations and model results to decision-makers in the form of an R Shiny app whenever they needed it.
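The pipeline itself is institution-specific; a stripped-down sketch of the general pattern (a scheduled R script that refreshes the merged data and model objects that the Shiny app reads) might look like the following, with all file paths and column names as placeholders:

```r
# refresh_pipeline.R -- run on a schedule (e.g., via cron) so the Shiny app
# always reads the latest results.  All paths and columns are hypothetical.
library(dplyr)

wastewater <- read.csv("data/wastewater_results.csv")
swipes     <- read.csv("data/dorm_swipes.csv")
cases      <- read.csv("data/isolation_quarantine.csv")

merged <- wastewater %>%
  left_join(swipes, by = c("dorm", "date")) %>%
  left_join(cases,  by = c("dorm", "date"))

# Simple placeholder model relating viral signal to subsequent case counts.
fit <- glm(new_cases ~ viral_load + occupancy, data = merged, family = poisson)

saveRDS(list(data = merged, model = fit, updated = Sys.time()),
        "app/latest_results.rds")   # the Shiny app reads this file on load
```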
Presenting Author
Christopher Grubb, Virginia Tech
First Author
Christopher Grubb, Virginia Tech
By now, statisticians and the broader research community are aware of the controversies surrounding traditional hypothesis testing and p-values. Many alternative viewpoints and methods have been proposed, as exemplified by The American Statistician's recent special issue themed "World beyond p<0.05." However, it seems clear that the broader scientific effort may benefit if alternatives to classical hypothesis testing are described in venues beyond the statistical literature. This poster addresses two relevant gaps in statistical practice. First, we describe three principles statisticians and their collaborators can use to publish about alternatives to classical hypothesis testing in the literature outside of statistics. Second, we describe an existing BIC-based approximation to Bayesian model selection as a complete alternative approach to classical hypothesis testing. This approach is easy to conduct and interpret, even for analysts who do not have fully Bayesian expertise in analyzing data. Perhaps surprisingly, it does not appear that the BIC approximation has yet been described in the context of "World beyond p<0.05." We address both gaps by describing a recent collaborative effort where we used the BIC-based techniques to publish a paper about hypothesis testing alternatives in a high-end biomechanics journal.
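The BIC-based approximation referred to here is the standard one, BF10 ≈ exp{(BIC0 − BIC1)/2}, which also yields approximate posterior model probabilities under equal prior model weights. A minimal sketch with two nested linear models on a hypothetical dataset dat:

```r
# Two competing models for a hypothetical dataset dat.
m0 <- lm(y ~ 1,     data = dat)   # "null" model
m1 <- lm(y ~ group, data = dat)   # model with the effect of interest

bic  <- c(BIC(m0), BIC(m1))
bf10 <- exp((bic[1] - bic[2]) / 2)        # approximate Bayes factor for m1 vs m0

# Approximate posterior model probabilities under equal prior model weights.
post <- exp(-0.5 * (bic - min(bic)))
post <- post / sum(post)
data.frame(model = c("m0", "m1"), post_prob = round(post, 3))
```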
Presenting Author
Christopher Franck, Virginia Tech
First Author
Christopher Franck, Virginia Tech
CoAuthor(s)
Michael L Madigan, Virginia Tech
Nicole Lazar, Pennsylvania State University
For most statisticians, creating high-quality graphs is both a frequent part of the job and a time-consuming task. Loading the data into R, cleaning it, adding titles, legend labels, and color, ensuring quality, and going back and forth with the investigator is a labor-intensive process. We developed a suite of online data visualization applications using R to aid in our consultations with researchers and increase efficiency. These apps feature simple upload capabilities, easy title and label creation, customizable formats and saving options, and, most importantly, zero programming. That means these apps are useful for the statistician, but can also be easily used for data visualization by individuals with no statistical background.
This poster will highlight three of our visualization apps. For survival data, we have the Kaplan-Meier app; to visualize individual events over time, you can use our swimmer plot; and for changes in value over time, we have the waterfall and spaghetti plots. The poster will include scannable QR codes so that attendees can load and play with the apps live. While these apps were designed for cancer clinical trials, they can be used with any type of data that lends itself to one of the plots we offer. We put a lot of effort into making these apps, which create user-friendly, customizable, publication-quality graphs. Our team of statisticians regularly uses these applications to save time and improve communication with investigators, and they can also be used in the setting of statistical consulting.
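Behind the point-and-click interface, the apps generate standard R graphics; for example, the Kaplan-Meier app produces the kind of figure one would otherwise code by hand roughly as follows (column names are hypothetical):

```r
library(survival)
library(survminer)

# trial: hypothetical data with follow-up time, event indicator, and arm.
km <- survfit(Surv(time, status) ~ arm, data = trial)

ggsurvplot(km,
           risk.table   = TRUE,    # numbers at risk under the curves
           conf.int     = TRUE,
           pval         = TRUE,    # log-rank test p-value
           xlab         = "Months since randomization",
           legend.title = "Treatment arm")
```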
Presenting Author
Julia Thompson, Columbia University
First Author
Julia Thompson, Columbia University
CoAuthor
Gonghao Liu, Columbia University
According to the CDC, antibiotics are medicines that fight infections caused by bacteria, either by killing the bacteria or by making it difficult for them to grow and multiply. Antibiotic use has improved health care delivery since the 20th century, but the emergence of resistance has rapidly followed the introduction of each new antibiotic, with major implications for society and the delivery of modern health care. Antimicrobial stewardship is an effort to measure and improve how antibiotics are prescribed by clinicians and used by patients, and it is critical to protect patients from the harms of inappropriate antibiotic use and to combat antibiotic resistance. A digital dashboard is an analysis tool that allows users to monitor and analyze their most important data sources in real time. In this work we develop a pilot interactive dashboard for inpatient antibiotic stewardship at a local hospital in the Rio Grande Valley using Python. The dashboard visualizes trends in the antibiotic stewardship metric days of therapy (DOT) by various categories, including indication and therapeutic class. Statistical testing and analysis for trend and seasonality in antibiotic usage across groups are conducted.
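The dashboard itself is built in Python; purely as an illustrative sketch of the trend and seasonality analysis (shown here in R rather than Python), one option is a Poisson regression of monthly DOT on a time trend and month-of-year terms, assuming a hypothetical monthly summary data frame:

```r
# monthly: hypothetical monthly summary with columns month (a Date), dot
# (days of therapy), and patient_days, for one therapeutic class.
monthly$t   <- seq_len(nrow(monthly))                 # time index
monthly$moy <- factor(format(monthly$month, "%m"))    # month of year

# Poisson regression of DOT on a linear trend and month-of-year seasonality,
# with patient-days as the exposure offset.
fit <- glm(dot ~ t + moy, offset = log(patient_days),
           family = poisson, data = monthly)
summary(fit)   # the coefficient on t tests for trend; the moy terms capture seasonality
```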
Presenting Author(s)
Saikou Jwla, The University of Texas Rio Grande Valley
Kristina Vatcheva, University of Texas Rio Grande Valley
First Author
Saikou Jwla, The University of Texas Rio Grande Valley
CoAuthor(s)
Kristina Vatcheva, University of Texas Rio Grande Valley
Jose Maldonado, The University of Texas Rio Grande Valley
Roy Evans, Valley Baptist Medical Center
Stephen Gore, Valley Baptist Medical Center