Contributed Poster Presentations: Section on Statistics and Data Science Education

Shirin Golchi, Chair
McGill University
 
Wednesday, Aug 6: 10:30 AM - 12:20 PM
4174 
Contributed Posters 
Music City Center 
Room: CC-Hall B 

Main Sponsor

Section on Statistics and Data Science Education

Presentations

Central Limit Theorems and Approximation Theory

Central limit theorems (CLTs) have a long history in probability and statistics. They play a fundamental role in constructing valid statistical inference procedures. Over the last century, various techniques have been developed in probability and statistics to prove CLTs under a variety of assumptions on random variables. Quantitative versions of CLTs (e.g., Berry–Esseen bounds) have been developed in parallel. In this article, we propose to use approximation theory from functional analysis to derive explicit bounds on the difference between expectations of functions. We provide bounds on the difference between functions of random variables using level sets of functions and classical uniform and non-uniform Berry–Esseen bounds for univariate random variables. The resulting bounds can be applied to single-layer neural networks and functions on [-1,1]^d with finite weighted norm integrable Fourier transform. These functions belong to Barron space. Unlike the classical bounds that depend on the oscillation function of f, our bounds do not have an explicit dimension dependence.
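For reference, the classical univariate Berry–Esseen bounds mentioned in the abstract are usually stated as below. This is the standard textbook form, not the authors' new result, and the notation (X_i, \sigma, \rho, F_n, \Phi, C, C') is introduced here only for illustration.

```latex
% Assumed setting: X_1, ..., X_n i.i.d. with E[X_i] = 0, E[X_i^2] = \sigma^2 > 0,
% and \rho = E|X_i|^3 < \infty.
% F_n is the CDF of (X_1 + ... + X_n) / (\sigma \sqrt{n}); \Phi is the standard normal CDF.
% C and C' are absolute constants.

% Uniform bound:
\sup_{x \in \mathbb{R}} \left| F_n(x) - \Phi(x) \right| \le \frac{C \rho}{\sigma^3 \sqrt{n}}

% Non-uniform bound:
\left| F_n(x) - \Phi(x) \right| \le \frac{C' \rho}{\sigma^3 \sqrt{n} \, (1 + |x|^3)}
\quad \text{for all } x \in \mathbb{R}
```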

Keywords

multidimensional

central limit theorem

Berry-Esseen bound

dependence on dimension

dependence on function 

First Author

ARISINA BANERJEE, Cornell University

Presenting Author

ARISINA BANERJEE, Cornell University

Data Science Modules for K-12 Afterschool Clubs

K-12 teachers from Oregon are encouraged to make use of a set of science education tools and model lessons created by the research team of the Language, Culture, and Knowledge-building through Science (LaCuKnoS) project, an NSF-funded project at Oregon State University. We developed engaging data visualization and AI application class modules, utilizing LaCuKnoS tools such as language booster, concept cards, and other interactive learning aids. These modules aim to provide students with an understanding of essential concepts in data science, making complex topics more accessible and enjoyable. To assess the impact of these activities on students' learning outcomes, surveys are administered twice each academic year, measuring the improvement in both the students' understanding of STEM concepts and their interest in pursuing STEM-related fields. The analysis focuses on the development of students' STEM understanding and how their participation in the program influences their career preferences. With various statistical tools, we implemented a system for evaluating the conceptual understanding of STEM materials of K-12 students in the LaCuKnoS project.

Keywords

data science

K-12

STEM

education

AI 

First Author

Jingtian Yu, Oregon State University

Presenting Author

Yanming Di, Oregon State University

Enhancing Statistics Classroom Review Through A Cooperative Fantasy-Themed Board Game

Games can offer an engaging and low-stakes method for students to review and reinforce their learning. I present the use of a cooperative, fantasy-themed board game designed to help students solidify concepts covered in a second course in statistics. Using the creative writing assistance of a generative AI model, I developed a compelling narrative and game mechanisms to immerse students in a fun classroom experience.

I will detail the process of generating the game's story, designing its mechanisms, and creating the accompanying graphics. Additionally, feedback from students who participated in the game will be shared, highlighting the effectiveness and enjoyment of this educational approach. The feedback will also include suggestions for alternative game mechanisms, improvements to the game, and ideas for different themes and settings. 

Keywords

Statistics Education

Game-Based Learning 

First Author

Will Boyles

Presenting Author

Will Boyles

Free LLMs versus Paid LLMs: Do They Widen the Gap in Education?

As Large Language Models (LLMs) become integral to education, disparities in access to high-quality AI tools raise concerns about their impact on the education gap. This study examines the differences between free and paid LLMs in terms of accessibility, performance, and effectiveness in educational settings. By analyzing model capabilities, resource availability, and student outcomes, we assess whether free models provide equitable learning opportunities or whether paid versions create an advantage for those with financial means. Our findings offer insights into the role of LLMs in shaping the future of education and the potential need for policy interventions to ensure fair access.

Keywords

AI in education

Large Language Models

digital divide 

Co-Author(s)

Sophie Yang, Bucknell University
Dulguun Soyol-Erdene, Bucknell University
Keegan Kang, Bucknell University

First Author

Sophie Yang, Bucknell University

Presenting Author

Sophie Yang, Bucknell University

Measuring Curiosity in Introductory Statistics Students

Exploring how students learn statistics and how instructors can most effectively help them has been a focal point in statistics education research over the past few decades (Carver et al., 2016). While earlier studies focused on different teaching approaches (e.g., Simon et al., 1976; Federer, 1978), cognitive challenges and misconceptions (e.g., Brewer, 1985; Garfield and Ahlgren, 1988), and students' attitudes (e.g., Pavlick, 1975; Gal and Ginsburg, 1994), recent research has shifted toward understanding the motivational aspects of learning statistics, such as interest (e.g., Sproesser, 2016), self-efficacy (e.g., Finney and Schraw, 2003), and intrinsic motivation (e.g., Dun, 2014). We aim to explore curiosity as part of intrinsic motivation, recognizing its potential to enhance students' learning (Pluck and Johnson, 2011).

Curiosity, the desire to acquire knowledge, is integral to learning environments that actively engage students when teachers can use specific techniques to evoke curiosity, enriching the learning atmosphere (Schmitt and Lahroodi, 2008). One of the initial focuses of this cross-institutional collaboration is to see whether we can measure curiosity 

Keywords

Curiosity

Statistics Education

Intrinsic Motivation

Learning

Student Engagement

Teaching Environment 

Co-Author(s)

Amy Truong, Cal Poly - California Polytechnic State University
Ella Smith, Cal Poly - California Polytechnic State University
Beth Chance, California Polytechnic State University

First Author

Visruth Srimath Kandali, California Polytechnic State University

Presenting Author

Visruth Srimath Kandali, California Polytechnic State University

Multivariate Time Series Analysis of Lung and Colon Cancer Mortality in Jamaica and the U.S.

Lung and colon cancers are leading causes of mortality worldwide, with variations across healthcare systems. This study uses multivariate time series modeling to analyze lung and colon cancer mortality trends in Jamaica and the U.S. from 1960 to 2014, applying Vector Autoregressive Moving Average (VARMA) models to assess interdependence. Country-specific multivariate forecasts extend 12 years beyond 2014, identifying disparities, similarities, and influencing factors. Model selection and validation use statistical metrics such as MAPE, RMSE, and AIC to ensure accuracy. Monte Carlo simulations enhance predictive robustness by accounting for future variability. This research provides data-driven insights into cancer mortality trends, contributing to the development of advanced statistical models for understanding and forecasting cancer outcomes. Findings will support public health planning and policy development in both regions.
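As a rough illustration of two of the forecast-accuracy metrics named in the abstract, the following minimal sketch computes MAPE and RMSE for a pair of series. The function names and the example numbers are hypothetical and are not taken from the study; the authors' actual validation procedure is not described here.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is nonzero (MAPE is undefined otherwise).
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def rmse(actual, forecast):
    """Root mean squared error, in the units of the series."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return math.sqrt(squared / len(actual))


# Made-up held-out values and forecasts, for illustration only.
observed = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mape(observed, predicted), rmse(observed, predicted))
```

MAPE is scale-free, which makes it convenient for comparing series of very different magnitudes, while RMSE stays in the original units and penalizes large errors more heavily.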

Keywords

Cancer Mortality

Time Series Analysis, VARMA, Multivariate Forecasting

Monte Carlo Simulation, Predictive Analytics

Public Health

Geographic Analysis: Jamaica, United States 

Co-Author

Mostafa Zahed, East Tennessee State University

First Author

Shanice Douglas, East Tennessee State University

Presenting Author

Shanice Douglas, East Tennessee State University

Order Restricted Cluster Randomized Block Design

This research introduces a novel two-stage cluster randomized design, the order restricted cluster randomized block design (ORCRBD). The ORCRBD builds upon the cluster randomized block design by incorporating a second layer of blocking, achieved through ranking cluster units that are randomly sampled from the population. This approach creates a two-way layout, with blocks and ranking groups, and employs restricted randomization to enhance the accuracy of treatment contrast estimation. We calculate the expected mean square for each source of variation in the ORCRBD under a suitable linear model, develop an approximate F-test for the treatment effect, assess ranking quality, calculate optimal sample sizes for a given cost model, formulate multiple comparison procedures, and apply the design to an educational setting. 

Keywords

order restricted randomization

ranked set sampling

intracluster correlation coefficient

Latin square

optimal design 

Co-Author(s)

Omer Ozturk, The Ohio State University
Olena Kravchuk, The University of Adelaide

First Author

Gregory Hopper, Centre College

Presenting Author

Gregory Hopper, Centre College

Removing barriers to better data practices through Capability, Opportunity, and Motivation

Transparent, trustworthy research depends on sharing data and code and having results verified by others, yet education tends to focus on best practices or knowledge-deficit models that are often insufficient for behavior change. We adopt the Capability, Opportunity, and Motivation for Behavior change (COM-B) model, using levers in the Behavior Change Wheel, to create educational materials that treat improving data practices as a behavior change problem. The work is a collaboration among Arkansas Children's Research Institute, UAMS's Institute for Digital Health & Innovation, and Indiana University School of Public Health-Bloomington's Biostatistics Consulting Center. Module 1 covers capabilities, opportunities, and motivations for data and code sharing and verification, acknowledging investigator barriers (e.g., being scooped, attacks, and lack of time, know-how, and resources). Module 2 provides background on capabilities and opportunities to share to enhance reproducibility, while Module 3 covers processes and practices for sharing and verification. Self-paced materials were created using the Rise Articulate platform and are Sharable Content Object Reference Model (SCORM) and Section 508 compliant.

Keywords

Education

Reproducibility

Data sharing

Verification

Behavior change 

Co-Author(s)

Stephanie Dickinson, Indiana University, Department of Epidemiology and Biostatistics
CJ Fortune, Institute for Digital Health & Innovation, University of Arkansas for Medical Sciences
Sydney Howk, Institute for Digital Health & Innovation, University of Arkansas for Medical Sciences
Kimberly Lamb, Institute for Digital Health & Innovation, University of Arkansas for Medical Sciences
Anna Macagno, Indiana University School of Public Health-Bloomington
Erik Parker, Indiana University

First Author

Andrew Brown, University of Arkansas for Medical Sciences

Presenting Author

Andrew Brown, University of Arkansas for Medical Sciences

Risk Factors and Individualized Prediction of Student Retention

Student attrition is an important issue for higher education because it imposes substantial costs on both students and institutions. In this project, we study two-year persistence of students enrolled at a large four-year public institution in California as First-time Freshmen from Fall 2016 to Fall 2020. Predictors considered in the study include student demographic information, socioeconomic variables, academic preparation, and academic performance at the institution. Two analytical approaches are used: discrete-time survival analysis and random forest. The results from both models indicate that academic performance variables after enrollment are most strongly associated with two-year persistence, including term units earned, term GPA, whether a student is on probation, and whether a student earned units in the first summer after enrollment. Further, monitoring and providing help promptly to students with earned units below 6 or GPA below 2.0 in the first term may prevent them from dropping out. We also illustrate how the random forest model may be used to provide individualized prediction of two-year persistence.
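The first-term monitoring rule described in the abstract (earned units below 6 or GPA below 2.0) could be applied as a simple screening step. The sketch below is only an illustration of that rule: the record layout, names, and sample data are hypothetical, and it does not reproduce the authors' survival or random forest models.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract; everything else here is an assumption.
MIN_UNITS = 6
MIN_GPA = 2.0


@dataclass
class FirstTermRecord:
    """Hypothetical first-term summary for one student."""
    student_id: str
    units_earned: float
    term_gpa: float


def needs_early_support(record):
    """True if the student earned fewer than 6 units or has a GPA below 2.0."""
    return record.units_earned < MIN_UNITS or record.term_gpa < MIN_GPA


def flag_students(records):
    """Return the IDs of students who meet either first-term warning criterion."""
    return [r.student_id for r in records if needs_early_support(r)]


# Made-up records for illustration only.
sample = [
    FirstTermRecord("a", 12, 3.1),
    FirstTermRecord("b", 5, 3.5),
    FirstTermRecord("c", 15, 1.9),
    FirstTermRecord("d", 6, 2.0),
]
print(flag_students(sample))
```

Both comparisons are strict, so a student with exactly 6 units and a 2.0 GPA is not flagged, matching the "below" wording in the abstract.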

Keywords

student retention

discrete-time survival analysis

random forest

variable importance

individualized prediction 

Co-Author(s)

Erin Jacobs, San Diego State University
Richard Levine, San Diego State University
Jeanne Stronach, San Diego State University

First Author

Xi Yan, San Diego State University

Presenting Author

Juanjuan Fan, San Diego State University