CS008 Practice and Applications: Data Education and Visualization, Part 1

Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/24/2023: 1:15 PM - 2:45 PM CDT
Lightning 
Room: Grand Ballroom C 

Description

This session will be followed by e-poster presentations on Wednesday, 5/24 at 3:20 PM.

Chair

Aaron Samson

Tracks

Data Visualization
Education
Symposium on Data Science and Statistics (SDSS) 2023

Presentations

An R-based Plotting Tool (gofplot_msw) to Support the Multistage Weibull (MSW) Time-to-tumor Model

Background: Carcinogenicity data are heavily censored, and events (Incidental and Fatal observations) may be sparse in some dose groups. The multistage Weibull (MSW) Time-to-tumor model describes the probability that a test subject exhibits a specific carcinogenic response by observation time t when exposed to a carcinogen at dosage rate d. Methods similar to the chi-square goodness-of-fit test applied to EPA's BMDS quantal models do not apply to EPA's MSW model with censored data. Developing a suitable goodness-of-fit test, especially for heavily censored and current-status data, is difficult.
Objective: To develop a graphical comparison of the parametric MSW model and a nonparametric model for judging goodness-of-fit.
Results: An R-based plotting tool was developed to support the MSW Time-to-tumor model, which has been externally peer reviewed and is posted on the EPA BMDS website. The tool generates diagnostic plots from MSW outputs and assists internal and external users in assessing goodness-of-fit for the MSW model. It includes several plot types found useful for evaluating goodness-of-fit of survival functions: Probability vs. Time plot, Dose-Response plot, Hazard plot, Quantile-Quantile plot, and Probability-Probability plot.
Conclusions: The tool can be used to assess the goodness-of-fit of the (parametric) MSW model by comparing it to a nonparametric model fitted to the same data. The nonparametric model imposes only the most necessary restrictions (especially monotonicity) on the relationship between time, dose, and probability of tumor onset or death, with no assumption about the specific distributional form of the data. By minimizing the restrictions on the structure of the model, the empirical nonparametric model fits the data as "closely" as possible. Comparisons between the parametric MSW and nonparametric models provide a subjective assessment of the goodness-of-fit of the MSW Time-to-tumor model to the data.
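
The abstract describes the MSW model only in words. As a rough illustration, the sketch below evaluates the form in which the multistage Weibull model is commonly written, P(t, d) = 1 - exp(-(q0 + q1*d + ... + qk*d^k) * (t - t0)^c). That formula, the function name msw_probability, and every parameter value are assumptions made for illustration; none of them is taken from the abstract or from gofplot_msw. Python is used only for the sketch, since the tool itself is R-based.

```python
import math


def msw_probability(t, d, q, c, t0=0.0):
    """Probability of response by time t at dose rate d (assumed MSW form).

    q  : polynomial dose coefficients [q0, q1, ..., qk], all non-negative
    c  : Weibull shape parameter for time
    t0 : latency; the probability is zero for t <= t0
    """
    if t <= t0:
        return 0.0
    # Dose polynomial q0 + q1*d + ... + qk*d^k
    dose_poly = sum(qi * d ** i for i, qi in enumerate(q))
    return 1.0 - math.exp(-dose_poly * (t - t0) ** c)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the curve's shape.
    q_example = [1e-7, 2e-7]
    for weeks in (26, 52, 78, 104):
        p = msw_probability(weeks, d=10.0, q=q_example, c=3.0)
        print(f"t = {weeks:3d} weeks, P = {p:.4f}")
```

A curve produced this way is the parametric side of the comparison; the nonparametric side would come from a monotone empirical estimate fitted to the same data.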

Presenting Author

Y. Christine Cai, US Environmental Protection Agency

First Author

Y. Christine Cai, US Environmental Protection Agency

CoAuthor

John Fox, EPA

Building an Online Multidisciplinary Masters Program in Data Science and Analytics

In today's world of constant streaming data and quick decisions, Clemson University recognized the need for individuals who are trained in both Business Management and Analytics with an emphasis in Applied Statistics. The goal of our program is to create graduates who can use statistical reasoning to help motivate business decisions. Our path to creating this program included research on other master's programs and conversations with industry professionals about the needs of the workforce. The resulting online Masters in Data Science and Analytics, which began in the summer of 2020, is now a top 5 nationally ranked program. In this talk we will discuss how we developed the program, what makes this program unique, and what we have learned along the way.

Presenting Author

Ellen Breazel, Clemson University

First Author

Ellen Breazel, Clemson University

CoAuthor

Russell Purvis, Clemson University

Developing Data Science Skills Using Call of Duty® Data

Interest in data science and data science education has exploded in recent years. As more schools offer introductory data science courses, there is an increased demand for real datasets that may be used for developing foundational skills such as data wrangling, data exploration, and modeling. We present a collection of datasets related to the popular Call of Duty® video game series and illustrative examples for teaching statistical thinking. The solutions employ data wrangling techniques such as creating new variables, filtering data, processing strings, and joining data from multiple sources. We emphasize exploratory data analysis and the importance of multivariate thinking through data visualization. The examples further extend insights gained from data visualization by introducing basic modeling and statistical thinking concepts. We conclude with a discussion of our experience incorporating these datasets into homework assignments and projects in a variety of undergraduate courses. 

Presenting Author

Matt Slifko, The Pennsylvania State University

First Author

Matt Slifko, The Pennsylvania State University

Examples and Data Repository Towards Increased Collaboration Amongst Statistics Educators

Even though educators have access to a variety of data and examples, educators in introductory statistics and quantitative analysis courses always need new, real examples. To address this, we created a centralized Google Drive repository that allows anyone to access classroom examples and data or to contribute their own. The repository will emphasize real applications, real data, and R code. It can have a major impact on educators and students. Having the opportunity to launch this project at a national symposium such as SDSS will further collaboration among educators.

Presenting Author

Pablo Baldivieso, Oregon State University - Cascades

First Author

Pablo Baldivieso, Oregon State University - Cascades

Five Datasets You Should Be Using in Your Introductory Statistics and Data Science Classes

The Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report recommends the use of real data in the statistics classroom as a way to engage students and help them understand the relevance of statistics in the real world. To achieve these goals, instructors should engage in a discussion of the context and purpose of the data. In this presentation, we suggest five data sets that instructors should consider using in their introductory statistics and data science classes. Using these data sets, we demonstrate how the data can be used at a variety of levels to illustrate the principles described in GAISE.

Presenting Author

Roger Woodard, University of Notre Dame

First Author

Roger Woodard, University of Notre Dame

From Awareness to Action: Engaging Introductory Statistics Students in Anti-Racist Data Art

How can we engage undergraduate statistics students in current conversations around race and racism? How can we help to promote a deeper understanding of how statistics and data science can be used to perpetuate or challenge social inequities? In this presentation, I'll describe the process I undertook with a small group of undergraduate students during an 8-week summer program to create a data art collection to raise awareness of racial disparities in obstetrics & gynecology health care. We developed a systematic protocol for the review, screened articles, reviewed eligible articles, and then created data art based on the data collected during the review. Along the way, we thought critically about our privilege hazards, examined the limits of objectivity in the statistical process, discussed how race and racism are commonly handled in statistical analyses in major medical journals, and examined many examples of data art to help inform and inspire our own data "visceralizations". We created a publicly available website with our data art – and the data collected – to raise awareness and make this research accessible to the community. 

Presenting Author

Katharine Correia, Amherst College

First Author

Katharine Correia, Amherst College

Incorporating ChatGPT for teaching and learning statistical concepts - a devil or an angel?

This presentation will share my experience of using ChatGPT for teaching and learning statistical concepts. AI is revitalizing education through personalized learning, automating repetitive and time-consuming tasks for teachers, and allowing 24/7 availability for learning. The recent hype around ChatGPT has generated great interest in, and concern about, what it can do in every aspect of human activity, including education. Among its unique features, ChatGPT's ability to hold human-like conversations draws particular attention. A major negative impact on education is that it makes assessment of student learning difficult. Statistical concepts differ from those in other disciplines in that many of them are subtle and easily confused. Some examples include independent vs. conditional events, correlation vs. causation, standard deviation vs. standard error, and the Central Limit Theorem. (1) "Is the current ChatGPT capable of explaining these unique statistical concepts?" Numerical computation is an integral part of learning statistical concepts. Hands-on activities and projects are critical for statistical thinking. (2) "Is ChatGPT capable of performing proper computations and suggesting hands-on activities and projects for teaching and learning statistics?" Misuse of statistics in practical analysis has been a critical concern. (3) "Can ChatGPT identify different misuses of statistical techniques?" ChatGPT uses a data-driven approach, not a causality-driven one. It provides the most likely answer based on the available data. However, it is not able to validate the answer. Some pros and cons will be discussed. AI will continue to evolve rapidly and become more and more sophisticated. Educators, including those in the statistics profession, will need to embrace and take advantage of AI technology in their teaching and find creative approaches to facilitate learning.

Presenting Author

Carl Lee, Central Michigan University

First Author

Carl Lee, Central Michigan University

Quantitative High Throughput Screening Data Quality Control Analysis R Shiny Application

Quantitative high throughput screening (qHTS) assays can be used to evaluate the bioactivity of thousands of chemicals in a single experiment. The Tox21 program utilizes qHTS to prioritize testing of chemicals and predict their effects on humans and the environment. Within Tox21, there are numerous datasets with assay results for thousands of chemicals at different concentration levels, where each chemical is represented by multiple response profiles. Cluster Analysis by Subgroups using ANOVA (CASANOVA) is an automated quality control procedure to identify compounds within an assay that have consistent response patterns. To make the method generally accessible, an R Shiny web app has been developed that provides a user-friendly interface for running the CASANOVA analysis and for visualizing the resulting concentration-response profiles. This app enables scientists to easily run CASANOVA on their experimental data, reload previously completed CASANOVA analysis results, and display Tox21 CASANOVA results, without any prior knowledge of R. Visualization of clustered concentration-response curves allows scientists to better understand the main sources of variation in qHTS studies by scrutinizing chemicals that produce inconsistent responses among multiple concentration-response profiles. Concentration at half-maximal response (AC50) estimates are also calculated for each cluster to provide a quantitative measure of chemical potency.
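
To make the AC50 idea concrete, here is a minimal sketch of a concentration-response curve and the point at which it reaches half of its maximum. The abstract does not say how CASANOVA or the app estimates AC50, so the Hill model, the function name hill_response, and all numeric values below are assumptions for illustration only.

```python
def hill_response(conc, top, ac50, slope):
    """Hill-type concentration-response curve (assumed model).

    conc  : concentration (same units as ac50)
    top   : maximal response
    ac50  : concentration at half-maximal response
    slope : Hill slope controlling steepness
    """
    if conc <= 0:
        return 0.0
    return top / (1.0 + (ac50 / conc) ** slope)


if __name__ == "__main__":
    # Hypothetical curve: maximal response 100, AC50 of 5 (arbitrary units).
    for conc in (0.5, 1.0, 5.0, 25.0, 50.0):
        response = hill_response(conc, top=100.0, ac50=5.0, slope=1.2)
        print(f"conc = {conc:5.1f}, response = {response:6.2f}")
```

Under this model a smaller AC50 means the chemical reaches half of its maximal effect at a lower concentration, which is why AC50 serves as a potency measure.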

Presenting Author

Guanhua Xie

First Author

Guanhua Xie

CoAuthor(s)

Shawn Harris, Social & Scientific Systems, Inc.
Keith Shockley, National Institutes of Health
Shyamal Peddada, Biostatistics and Computational Biology Branch, Division of Intramural Research, NIEHS

SDOH: An R Shiny application for predictive modeling of Social Determinants of Health survey responses

Background: Social Determinants of Health (SDOH) surveys are data sets that provide useful health-related information about individuals and communities at large. This study aims to develop a user-friendly web application that uses SDOH survey data to give clinicians predictive insight into the social needs of their patients prior to in-patient visits, so that they can provide improved and personalized service.
Method: The SDOH dataset used is a longitudinal survey that consists of 108,563 patient responses to 12 survey questions. It was collected from The University of Kansas Health System (TUKHS). The questions were designed to have a binary outcome as the response. The patients' most recent responses to each of these questions were then modeled independently, incorporating explanatory variables. Multiple classification and regression techniques were used, including logistic regression, Bayesian generalized linear model, extreme gradient boosting, gradient boosting, neural networks, and random forests. Finally, these models were packaged into an R Shiny application that allows users to predict and make comparisons among models.
Results: Area under the curve (AUC) values for 72 models were calculated. Based on AUC values, Gradient Boosting models provided the highest precision values. Models were packaged into an R Shiny application, a tool that can predict an individual's response to a survey question based on their gender, race, ethnicity, age, and zip code.
Conclusions: We propose a predictive tool that helps the health system address patients in need of assistance and, by extension, improve their communities. This tool is hosted online as a freely available website by the University of Kansas Medical Center's Department of Biostatistics & Data Science: https://biostats-shinyr.kumc.edu/Predicting_SDOH/. The R source code and supporting materials used to host the models have been made publicly available at github.com/CRISsupport/SDOH-Predictions-KS-WestMO.
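
Because the models in this abstract are compared by AUC, a short sketch of what that number measures may help. The version below uses the rank-based definition: the fraction of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as one half. The function name auc and the toy labels and scores are hypothetical; this is not the authors' code, which is in R.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels : iterable of 0/1 outcomes (1 = positive response)
    scores : predicted scores or probabilities, same length as labels
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A pair counts 1 if the positive outranks the negative, 0.5 if tied.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: four respondents with binary answers and model scores.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.10, 0.40, 0.35, 0.80]
    print(auc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to scores that rank positives no better than chance, and 1.0 to scores that separate the two groups perfectly.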

Presenting Author

Sam Pepper

First Author

Sam Pepper

CoAuthor(s)

Isuru Ratnayake, Kansas University Medical Center
Dinesh Pal Mudaranthakam

The Colgate University Data Science Collaboratory

The Data Science Collaboratory at Colgate University is focused on three interconnected goals:

1. Develop RShiny resources that lower barriers to and increase the quality of quantitative research.
2. Create instructional materials that make standard statistical procedures and best practices accessible to those new to quantitative research.
3. Cultivate a multi-institutional community that collaborates on data-driven solutions to emergent global issues (climate change, social justice) while developing the next generation of data scientists.

By incorporating students into these scientific communities, we ensure that the next generation of data scientists gains exposure to data science research and experience selecting, applying, and interpreting the results of appropriate techniques in various real-world contexts. 

Presenting Author

Joshua Finnell, Colgate University

First Author

Joshua Finnell, Colgate University

CoAuthor

William Cipolli

The Mastery Rubric for Statistics and Data Science for coherence and consistency in data science education

Recently published guidance for undergraduate data science may not result in consistency in learning outcomes across or within institutions. To promote this consistency, the Mastery Rubric for Statistics and Data Science (MR-SDS) was developed, prioritizing learning and the development of independence in the knowledge, skills, and abilities for professional practice in statistics and data science (SDS). An MR-SDS-driven curriculum can emphasize computation, statistics, or a third discipline in which the other(s) would be deployed, or all three. The MR-SDS promotes consistency with recommendations for SDS education, and allows "statistics", "data science", and "statistics and data science" curricula to reliably, but flexibly, educate with a focus on increasing learners' independence. The MR-SDS supports self-directed learning, training, and tertiary education, accommodating the interests of business, government, and academic workforce development. The MR-SDS can be used for development or revision of an evaluable curriculum for undergraduates, upskilling and training, and doctoral level learning.

Presenting Author

Rochelle Tractenberg, Georgetown University

First Author

Rochelle Tractenberg, Georgetown University

CoAuthor(s)

Donna LaLonde, American Statistical Association
Suzanne Thornton, Swarthmore College

Using Chat GPT to Help with Teaching Statistics

Chat GPT is an artificial intelligence chat bot developed by OpenAI. It was released in November of 2022 and has already transformed how we teach our students. Like the release of any new technology, it brings concerns about how it will impact the way that students learn, or how students will use it to avoid learning. Artificial intelligence chat bots are here to stay, and we propose that, as educators, we should embrace the technology and learn to utilize it in ways that will engage our students in the classroom and make our work easier. In our talk, we will walk through some ways that we can use chat bots to help with these goals.

Presenting Author

Victoria Woodard, University of Notre Dame

First Author

Victoria Woodard, University of Notre Dame

CoAuthor

Roger Woodard, University of Notre Dame

Utilizing R Shiny to Create a Statistical Dataview: University of Pennsylvania Multidisciplinary Approach to the Study of Chronic Pelvic Pain (MAPP)

To help better understand the underlying causes of the two most prominent chronic urological pain disorders, interstitial cystitis/bladder pain syndrome (IC/BPS) and chronic prostatitis/chronic pelvic pain syndrome (CP/CPPS), the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) of the National Institutes of Health (NIH) established the Multidisciplinary Approach to the Study of Chronic Pelvic Pain (MAPP) Research Network in 2008. The primary clinical research effort carried out during the MAPP Network's first 5-year project period (MAPP I) was a prospective cohort study. From December 14, 2009, through December 14, 2012, 1,039 men and women were enrolled in the study, including persons with urologic chronic pelvic pain syndromes (UCPPS; n = 424); persons with other comorbid illnesses, including fibromyalgia, irritable bowel syndrome, and chronic fatigue syndrome (n = 200 for all conditions); and healthy controls (n = 415). All study participants were extensively characterized (i.e., phenotyped) at baseline, and UCPPS participants were further assessed during an additional 12-month follow-up period.

To give researchers unlimited access to the raw data from the MAPP studies, a MAPP Dataview application was created using R Shiny. This applet allows users to query data and view or download the results in tabular or graphical form. The Dataview is optimized for user interaction, incorporating different graphical and statistical options for obtaining summary statistics. These options consist of a wide range of graphs and summary tables on either the full MAPP dataset or user-specified subsets. Users can additionally run basic regression analyses on the baseline data for both continuous and categorical data. Four datasets are available for use: (1) MAPP I baseline, (2) MAPP II baseline, (3) MAPP I longitudinal, and (4) MAPP II longitudinal. The Dataview database is updated regularly to incorporate newly obtained follow-up data.

Presenting Author

Flynn McMorrow

First Author

Flynn McMorrow

CoAuthor

J. Richard Landis, University of Pennsylvania

What is Data-Driven Leadership and how do you teach it?

The role of a Data-Driven Leader is to drive accountability and build a data culture in which the leader and the organization's staff implement decisions through a fact- and data-based process. This is achieved by steering teams away from opinions, cognitive bias, groupthink, and self-censorship. But how is this leadership developed individually and organizationally? This paper addresses this question specifically for master's students in Analytics and Data Science. While comprehensive reviews of what the market requires of a Data Scientist for employability consistently point to technical skills, soft skills such as the ability to communicate effectively and to be a team player are additional prerequisites from the employers' side. However, soft skills are at best indirectly developed by the students in most programs in the field. For instance, a dedicated curriculum in Data-Driven Leadership is rare, and the few existing ones are highly heterogeneous in content and learning goals. This paper offers a proposal for standardizing the curriculum of a Data-Driven Leadership course. In doing so, we argue for content consisting of behavioral experiments, cognitive bias and debiasing, game theory, investment under uncertainty, mechanism design, optimal stopping theory, and systematic review methodology. Moreover, we argue for team-based learning in which teams' self-learning on conceptual topics is combined with data analysis assignments. Further, the students' communication skills are developed by testing their presentations on their peers. During the course, the team members monitor the leaders and themselves using a self-reflective protocol. The findings of this paper are based on the analysis of data on 50 enrolled students, where students' perceptions, assessment of learning goals, pre/post personality tests, and team performance data consistently indicate the enhancement of the students' soft skills and data-driven leadership ability.

Presenting Author

Kenneth Carling, Dalarna University

First Author

Kenneth Carling, Dalarna University

CoAuthor(s)

Arend Hintze, Dalarna University
Asif M Huq, Dalarna University
Ilias Thomas, Dalarna University