05/24/2023: 1:15 PM - 2:45 PM CDT
Lightning
Room: Grand Ballroom C
This session will be followed by e-poster presentations on Wednesday, 5/24 at 3:20 PM.
Chair
Aaron Samson
Tracks
Data Visualization
Education
Symposium on Data Science and Statistics (SDSS) 2023
Presentations
Background: Carcinogenicity data are heavily censored, and events (Incidental and Fatal observations) may be sparse in some dose groups. The multistage Weibull (MSW) Time-to-tumor model describes the probability of a test subject exhibiting a specific carcinogenic response by observation time t, when the subject is exposed to a carcinogen at dosage rate d. Methods similar to the chi-square goodness-of-fit test that is applied to EPA's BMDS quantal models do not apply to EPA's MSW model with censored data. Developing a suitable goodness-of-fit test, especially for heavily censored and current-status data, is difficult.
Objective: To develop a graphical comparison of the MSW parametric model and a nonparametric model for judging goodness-of-fit.
Results: An R-based plotting tool was developed to support the externally peer-reviewed MSW Time-to-tumor model posted on the EPA BMDS website. It generates diagnostic plots from MSW outputs and assists internal and external users in assessing goodness-of-fit for the MSW model. The tool includes several plot types found useful for evaluating goodness-of-fit of survival functions: Probability vs. Time plot, Dose-Response plot, Hazard plot, Quantile-Quantile plot, and Probability-Probability plot.
Conclusions: The tool can be used to assess the goodness-of-fit of the (parametric) MSW model by comparing it to a nonparametric model fitted to the same data. The nonparametric model imposes only the most necessary restrictions (especially monotonicity) on the relationship between time, dose, and probability of tumor onset or death, with no assumption about the specific distributional form of the data. By minimizing the restrictions on the structure of the model, the empirical nonparametric model fits the data as "closely" as possible. Comparisons between the parametric MSW and nonparametric models provide a subjective assessment of the goodness-of-fit of the MSW Time-to-tumor model to the data.
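As a rough illustration of the parametric side of a Probability vs. Time plot, the sketch below evaluates a multistage Weibull curve in Python. The abstract does not state the model formula; the form used here, P(t, d) = 1 - exp(-(q0 + q1*d + ... + qk*d^k) * (t - t0)^c), is the commonly cited one and is an assumption, as are the function name and all parameter values. It is not the authors' R tool.

```python
import math


def msw_probability(t, d, q, c, t0=0.0):
    """Probability of response by time t at dose rate d.

    Assumed multistage Weibull form (not taken from the abstract):
        P(t, d) = 1 - exp(-(q[0] + q[1]*d + ... + q[k]*d**k) * (t - t0)**c)
    q  : non-negative dose polynomial coefficients
    c  : Weibull shape parameter for time
    t0 : latency; the probability is zero up to t0
    """
    if t <= t0:
        return 0.0
    dose_poly = sum(qi * d ** i for i, qi in enumerate(q))
    return 1.0 - math.exp(-dose_poly * (t - t0) ** c)


# Hypothetical parameter values, for illustration only.
q_example = [1e-6, 2e-6]
c_example = 3.0

# One model curve per dose group: the parametric half of a
# Probability vs. Time comparison.
times = [0, 26, 52, 78, 104]
curves = {
    dose: [msw_probability(t, dose, q_example, c_example) for t in times]
    for dose in (0.0, 1.0, 5.0)
}
```

Each curve in `curves` would be overlaid on a nonparametric estimate for the same dose group; how that estimate is built for censored data is outside this sketch.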
Presenting Author
Y. Christine Cai, US Environmental Protection Agency
First Author
Y. Christine Cai, US Environmental Protection Agency
CoAuthor
John Fox, EPA
In today's world of constant streaming data and quick decisions, Clemson University recognized the need for individuals who are trained in both Business Management and Analytics with an emphasis in Applied Statistics. The goal of our program is to create graduates who can use statistical reasoning to help motivate business decisions. Our path to creating this program included research on other master's programs and conversations with industry professionals about the needs of the workforce. The resulting online Masters in Data Science and Analytics, which began in the summer of 2020, is now a top-5 nationally ranked program. In this talk we will discuss how we developed the program, what makes it unique, and what we have learned along the way.
Presenting Author
Ellen Breazel, Clemson University
First Author
Ellen Breazel, Clemson University
CoAuthor
Russell Purvis, Clemson University
Interest in data science and data science education has exploded in recent years. As more schools offer introductory data science courses, there is an increased demand for real datasets that may be used for developing foundational skills such as data wrangling, data exploration, and modeling. We present a collection of datasets related to the popular Call of Duty® video game series and illustrative examples for teaching statistical thinking. The solutions employ data wrangling techniques such as creating new variables, filtering data, processing strings, and joining data from multiple sources. We emphasize exploratory data analysis and the importance of multivariate thinking through data visualization. The examples further extend insights gained from data visualization by introducing basic modeling and statistical thinking concepts. We conclude with a discussion of our experience incorporating these datasets into homework assignments and projects in a variety of undergraduate courses.
Presenting Author
Matt Slifko, The Pennsylvania State University
First Author
Matt Slifko, The Pennsylvania State University
Even though educators have access to a variety of data and examples, those teaching introductory statistics and quantitative analysis courses always need new, real examples. To address this, we created a centralized Google Drive repository that allows anyone to access classroom examples and data or to contribute their own. The repository will emphasize real applications, real data, and R code, and it can have a major impact on educators and students. Launching this project at a national symposium such as SDSS will further collaboration among educators.
Presenting Author
Pablo Baldivieso, Oregon State University - Cascades
First Author
Pablo Baldivieso, Oregon State University - Cascades
The Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report recommends the use of real data in the statistics classroom as a way to engage students and help them understand the relevance of statistics in the real world. To achieve these goals, instructors should engage in a discussion of the context and purpose of the data. In this presentation, we suggest five data sets that instructors should consider using in their introductory statistics and data science classes. Using these data sets, we demonstrate how the data can be used at a variety of levels to illustrate the principles described in GAISE.
Presenting Author
Roger Woodard, University of Notre Dame
First Author
Roger Woodard, University of Notre Dame
How can we engage undergraduate statistics students in current conversations around race and racism? How can we help to promote a deeper understanding of how statistics and data science can be used to perpetuate or challenge social inequities? In this presentation, I'll describe the process I undertook with a small group of undergraduate students during an 8-week summer program to create a data art collection to raise awareness of racial disparities in obstetrics & gynecology health care. We developed a systematic protocol for the review, screened articles, reviewed eligible articles, and then created data art based on the data collected during the review. Along the way, we thought critically about our privilege hazards, examined the limits of objectivity in the statistical process, discussed how race and racism are commonly handled in statistical analyses in major medical journals, and examined many examples of data art to help inform and inspire our own data "visceralizations". We created a publicly available website with our data art – and the data collected – to raise awareness and make this research accessible to the community.
Presenting Author
Katharine Correia, Amherst College
First Author
Katharine Correia, Amherst College
This presentation will share my experience of using ChatGPT for teaching and learning statistical concepts. AI is revitalizing education through personalized learning, automating repetitive and time-consuming tasks for teachers, and allowing 24/7 availability for learning. The recent hype around ChatGPT has generated great interest in, and concern about, what it can do in every aspect of human activity, including education. Among its distinctive features, ChatGPT's ability to hold human-like conversations attracts particular attention. A major negative impact on education is that it makes assessment of student learning difficult. Statistical concepts differ from those of other disciplines in that many of them are subtle and easily confused. Some examples include independent vs. conditional events, correlation vs. causation, standard deviation vs. standard error, and the Central Limit Theorem. (1) "Is the current ChatGPT capable of explaining these unique statistical concepts?" Numerical computation is an integral part of learning statistical concepts. Hands-on activities and projects are critical for statistical thinking. (2) "Is ChatGPT capable of performing proper computations and suggesting hands-on activities and projects for teaching and learning statistics?" Misuse of statistics in practical analysis has been a critical concern. (3) "Can ChatGPT identify different misuses of statistical techniques?" ChatGPT uses a data-driven approach, not a causality-driven one. It provides the most likely answer based on the available data; however, it is not able to validate that answer. Some pros and cons will be discussed. AI will continue to develop rapidly and become more and more sophisticated. Educators, including those in the statistics profession, will need to embrace and take advantage of AI technology in their teaching and find creative approaches to facilitate learning.
Presenting Author
Carl Lee, Central Michigan University
First Author
Carl Lee, Central Michigan University
Quantitative high-throughput screening (qHTS) assays can be used to evaluate the bioactivity of thousands of chemicals in a single experiment. The Tox21 program utilizes qHTS to prioritize testing of chemicals and predict their effects on humans and the environment. Within Tox21, there are numerous datasets with assay results for thousands of chemicals at different concentration levels, where each chemical is represented by multiple response profiles. Cluster Analysis by Subgroups using ANOVA (CASANOVA) is an automated quality control procedure to identify compounds within an assay that have consistent response patterns. To make the method generally accessible, an R Shiny web app has been developed that provides a user-friendly interface for running the CASANOVA analysis and for visualizing the resulting concentration-response profiles. This app enables scientists to easily run CASANOVA on their experimental data, reload previously completed CASANOVA analysis results, and display Tox21 CASANOVA results, without any prior knowledge of R. Visualization of clustered concentration-response curves allows scientists to better understand the main sources of variation in qHTS studies by scrutinizing chemicals that produce inconsistent responses among multiple concentration-response profiles. Concentration at half-maximal response (AC50) estimates are also calculated for each cluster to provide a quantitative measure of chemical potency.
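To make the AC50 idea concrete, here is a minimal Python sketch of a Hill-type concentration-response curve, in which AC50 is the concentration giving half of the maximal response. The abstract does not say which curve model the app fits; the Hill form with a zero baseline, the function name, and the numbers below are assumptions chosen for illustration.

```python
def hill_response(conc, top, ac50, slope=1.0):
    """Response at concentration `conc` under an assumed Hill model.

    top   : maximal response (baseline assumed to be zero)
    ac50  : concentration producing half of `top`
    slope : Hill coefficient controlling curve steepness
    """
    if conc <= 0:
        return 0.0
    return top / (1.0 + (ac50 / conc) ** slope)


# Hypothetical cluster-level parameters, for illustration only.
top, ac50, slope = 100.0, 2.5, 1.2

# Evaluating at conc == ac50 recovers half of the maximal response.
half_max = hill_response(ac50, top, ac50, slope)

# A simple concentration series, as might be plotted for one cluster.
profile = [hill_response(c, top, ac50, slope) for c in (0.1, 1.0, 2.5, 10.0, 100.0)]
```

A lower AC50 means the half-maximal response is reached at a lower concentration, which is why it serves as a potency measure.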
Presenting Author
Guanhua Xie
First Author
Guanhua Xie
CoAuthor(s)
Shawn Harris, Social & Scientific Systems, Inc.
Keith Shockley, National Institutes of Health
Shyamal Peddada, Biostatistics and Computational Biology Branch, Division of Intramural Research, NIEHS
Background: Social Determinants of Health (SDOH) surveys are data sets that provide useful health-related information about individuals and communities at large. This study aims to develop a user-friendly web application that uses SDOH survey data to give clinicians predictive insight into the social needs of their patients prior to their in-patient visits, so that they can provide improved and personalized service.
Method: The SDOH dataset used is a longitudinal survey that consists of 108,563 patient responses to 12 survey questions. It was collected from The University of Kansas Health System (TUKHS). The questions were designed to have a binary outcome as the response. Each patient's most recent response to each of these questions was then modeled independently by incorporating explanatory variables. Multiple classification and regression techniques were used, including logistic regression, Bayesian generalized linear model, extreme gradient boosting, gradient boosting, neural networks, and random forests. Finally, these models were packaged into an R Shiny application that allows users to predict and make comparisons among models.
Results: Area under the curve (AUC) values for 72 models were calculated. Based on AUC values, Gradient Boosting models provided the highest precision values. Models were packaged into an R Shiny application, a tool that can predict an individual's response to a survey question based on their gender, race, ethnicity, age, and zip code.
Conclusions: We propose a predictive tool that helps the health system address patients in need of assistance and, by extension, improve their communities. This tool is hosted online as a freely available website by the University of Kansas Medical Center's Department of Biostatistics & Data Science: https://biostats-shinyr.kumc.edu/Predicting_SDOH/. The R source code and supporting materials used to host the models have been made publicly available at: github.com/CRISsupport/SDOH-Predictions-KS-WestMO.
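For readers unfamiliar with the AUC metric used to compare the models, the following Python sketch computes it directly from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are hypothetical; this is not the authors' R code.

```python
def auc(labels, scores):
    """Area under the ROC curve from binary labels (0/1) and model scores.

    Uses the pairwise definition: the fraction of (positive, negative)
    pairs in which the positive case has the higher score; ties count 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy binary survey responses and predicted probabilities (made up).
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(example_labels, example_scores)
```

The double loop is quadratic in the number of cases, which is fine for a demonstration; a rank-based formula gives the same value more efficiently on large data.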
Presenting Author
Sam Pepper
First Author
Sam Pepper
CoAuthor(s)
Isuru Ratnayake, Kansas University Medical Center
Dinesh Pal Mudaranthakam
The Data Science Collaboratory at Colgate University is focused on three interconnected goals:
1. Develop R Shiny resources that lower barriers to and increase the quality of quantitative research.
2. Create instructional materials that make standard statistical procedures and best practices accessible to those new to quantitative research.
3. Cultivate a multi-institutional community that collaborates on data-driven solutions to emergent global issues (climate change, social justice) while developing the next generation of data scientists.
By incorporating students into these scientific communities, we ensure that the next generation of data scientists gains exposure to data science research and experience selecting, applying, and interpreting the results of appropriate techniques in various real-world contexts.
Presenting Author
Joshua Finnell, Colgate University
First Author
Joshua Finnell, Colgate University
CoAuthor
William Cipolli
Recently published guidance for undergraduate data science may not result in consistency in learning outcomes across or within institutions. To promote this consistency, the Mastery Rubric for Statistics and Data Science (MR-SDS) was developed, prioritizing learning and the development of independence in the knowledge, skills, and abilities for professional practice in statistics and data science (SDS). An MR-SDS-driven curriculum can emphasize computation, statistics, or a third discipline in which the other(s) would be deployed, or all three. The MR-SDS promotes consistency with recommendations for SDS education, and allows "statistics", "data science", and "statistics and data science" curricula to reliably, but flexibly, educate with a focus on increasing learners' independence. The MR-SDS supports self-directed learning, training, and tertiary education, accommodating the interests of business, government, and academic workforce development. The MR-SDS can be used for development or revision of an evaluable curriculum for undergraduates, upskilling and training, and doctoral-level learning.
Presenting Author
Rochelle Tractenberg, Georgetown University
First Author
Rochelle Tractenberg, Georgetown University
CoAuthor(s)
Donna LaLonde, American Statistical Association
Suzanne Thornton, Swarthmore College
ChatGPT is an artificial intelligence chatbot developed by OpenAI. It was released in November of 2022 and has already transformed how we teach our students. Like the release of any new technology, it brings concerns about how it will affect the way that students learn, or how students will use it to avoid learning. Artificial intelligence chatbots are here to stay, and we propose that, as educators, we should embrace the technology and learn to use it in ways that will engage our students in the classroom and make our work easier. In our talk, we will walk through some ways that we can use chatbots to help with these goals.
Presenting Author
Victoria Woodard, University of Notre Dame
First Author
Victoria Woodard, University of Notre Dame
CoAuthor
Roger Woodard, University of Notre Dame
To better understand the underlying causes of the two most prominent chronic urological pain disorders, interstitial cystitis/bladder pain syndrome (IC/BPS) and chronic prostatitis/chronic pelvic pain syndrome (CP/CPPS), the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) of the National Institutes of Health (NIH) established the Multidisciplinary Approach to the Study of Chronic Pelvic Pain (MAPP) Research Network in 2008. The primary clinical research effort carried out during the MAPP Network's first 5-year project period (MAPP I) was a prospective cohort study. From December 14, 2009, through December 14, 2012, 1,039 men and women were enrolled in the study, including persons with urologic chronic pelvic pain syndrome (UCPPS; n = 424); persons with other comorbid illnesses, including fibromyalgia, irritable bowel syndrome, and chronic fatigue syndrome (n = 200 for all conditions); and healthy controls (n = 415). All study participants were extensively characterized (i.e., phenotyped) at baseline, and UCPPS participants were further assessed during an additional 12-month follow-up period.
To give researchers unlimited access to the raw data from the MAPP studies, a MAPP Dataview application was created using R Shiny. This applet allows users to query data and view or download the results in tabular or graphical form. The Dataview is optimized for user interaction, incorporating different graphical and statistical options for obtaining summary statistics. These options consist of a wide range of graphs and summary tables on either the full MAPP dataset or user-specified subsets. Users can additionally run basic regression analyses on the baseline data for both continuous and categorical data. Four datasets are available for use: (1) MAPP I baseline, (2) MAPP II baseline, (3) MAPP I longitudinal, and (4) MAPP II longitudinal. The Dataview database is updated regularly to incorporate newly obtained follow-up data.
Presenting Author
Flynn McMorrow
First Author
Flynn McMorrow
CoAuthor
J. Richard Landis, University of Pennsylvania
The role of a Data-Driven Leader is to drive accountability and build a data culture in which the leader and the organization's staff make decisions through a fact- and data-based process. This is achieved by steering teams away from opinions, cognitive bias, groupthink, and self-censorship. But how is this leadership developed individually and organizationally? This paper addresses this question specifically for master's students in Analytics and Data Science. While comprehensive reviews of what the market requires of a Data Scientist for employability consistently point to technical skills, soft skills such as the ability to communicate effectively and to be a team player are additional prerequisites from the employers' side. However, soft skills are at best indirectly developed by the students in most programs in the field. For instance, a dedicated curriculum in Data-Driven Leadership is rare, and the few existing ones are highly heterogeneous in content and learning goals. This paper offers a proposal for the standardization of a curriculum for a Data-Driven Leadership course. In doing so, we argue for content consisting of behavioral experiments, cognitive bias and debiasing, game theory, investment under uncertainty, mechanism design, optimal stopping theory, and systematic review methodology. Moreover, we argue for team-based learning in which teams' self-learning on conceptual topics is combined with data analysis assignments. Further, students' communication skills are developed by testing their presentations on their peers. During the course, the team members monitor the leaders and themselves using a self-reflective protocol. The findings of this paper are based on the analysis of data on 50 enrolled students, where students' perceptions, assessment of learning goals, pre/post personality tests, and team performance data consistently indicate the enhancement of the students' soft skills and data-driven leadership ability.
Presenting Author
Kenneth Carling, Dalarna University
First Author
Kenneth Carling, Dalarna University
CoAuthor(s)
Arend Hintze, Dalarna University
Asif M Huq, Dalarna University
Ilias Thomas, Dalarna University