Statistics and Data Science Educational Explorations

Ayyuce Begum Bektas, Chair
Memorial Sloan Kettering Cancer Center
 
Thursday, Aug 7: 10:30 AM - 12:20 PM
4230 
Contributed Papers 
Music City Center 
Room: CC-205C 

Main Sponsor

Section on Statistics and Data Science Education

Presentations

A New Hyperbolic Tangent Family of Distributions: Properties and Applications

This paper introduces a new family of distributions called the hyperbolic tangent (HT) family, whose cumulative distribution function is defined using the standard hyperbolic tangent function. The fundamental properties of the family are examined and presented. Additionally, an inverse exponential distribution is employed as a sub-model within the HT family, and its properties are also derived. The parameters of the HT family are estimated using the maximum likelihood method, and the performance of these estimators is assessed through simulation. Two real data sets are used as practical examples to demonstrate the significance, flexibility, and real-world applicability of the HT family. Together, these results contribute to the advancement of statistical modeling and distribution theory.

Keywords

Goodness-of-fit

Hyperbolic tangent function

Inverse exponential distribution

Maximum likelihood estimation

Moments

Simulation 

First Author

Shahid Mohammad, Department of Mathematics, UW-Oshkosh

Presenting Author

Shahid Mohammad, Department of Mathematics, UW-Oshkosh

Analyzing STEM Faculty Academic Productivity: Exploring with NbClust Package and Logistic Regression

This study investigates the nexus between research and teaching productivity among STEM faculty at a public research-intensive university, analyzing data from 553 faculty members across four STEM disciplines: Information and Computer Sciences, Biological Sciences, Engineering, and Physical Sciences. By applying cluster analysis with the NbClust package and logistic regression, this research explores correlations between academic productivity metrics and faculty demographics such as position type, rank, gender, and discipline. The analysis identifies distinct productivity clusters characterized by varying levels of research and teaching outcomes across demographic groups, revealing significant disparities. These findings highlight the need for institutional policies that comprehensively support both teaching and research, thereby fostering STEM faculty success. This study provides a nuanced understanding of STEM faculty productivity profiles, informing strategies for equitable institutional resource allocation, faculty development, and evaluation, ultimately contributing to the advancement of STEM education and fulfilling institutional missions.

Keywords

NbClust Package

Cluster Analysis

Logistic Regression

STEM Education

Academic Productivity (Teaching and Research)

STEM Faculty Characteristics 

Co-Author(s)

Brian Sato, University of California Irvine
Kameryn Denaro, University of California Irvine

First Author

Anna Kye

Presenting Author

Anna Kye

Bayesian Hierarchical Modeling of Large-Scale Math Tutoring Dialogues

We propose a Bayesian hierarchical framework for analyzing large-scale mathematics tutoring dialogues that models cognitive load as latent variables inferred from observable behavioral patterns in educational conversations. Our approach treats response timing patterns and communication modality choices (i.e., sending text vs. images) as observable indicators of underlying cognitive states, with a two-phase experimental design comparing behavioral-only models against content-enhanced models that incorporate LLM-based understanding classification. Applied to MathMentorDB (5.4 million messages across 200,332 tutoring conversations), our method reveals bidirectional cognitive dependencies where student confusion systematically increases tutor cognitive load, and vice versa. We demonstrate that temporal and modality patterns can reliably indicate latent cognitive states in educational dialogues, with cross-role dependencies providing new insights into collaborative learning dynamics. This work bridges research from education, Bayesian statistics, and natural language processing, providing both methodological innovations for modeling cognitive load in online learning conversations and actionable insights for designing adaptive tutoring systems.

Keywords

Large Language Models

Educational Data Mining

Bayesian Hierarchical Modeling

Artificial Intelligence (AI)

Data Science

Natural Language Processing 

Co-Author(s)

Michael Light, University of Michigan
Sumit Asthana, University of Michigan
Kevyn Collins-Thompson, University of Michigan

First Author

Michael Ion

Presenting Author

Michael Ion

Designing Interoperable Public Data Spaces: From Best Practices to Graduate Education

Public data spaces have become a cornerstone of modern data governance, enabling secure, transparent, and interoperable sharing among public and private sectors. As data-driven decision-making expands, robust design principles are essential to ensure efficiency, trust, and ethical use. This paper reviews literature from European, U.S., and international frameworks, outlining best practices and challenges in implementing public data spaces.
Building on this review, we propose a set of core guidelines for the design of good public data spaces that emphasize interoperability, privacy, governance, and stakeholder collaboration. We also offer a structured proposal for incorporating these principles into graduate curricula in statistics and data science, ensuring future professionals develop skills in data interoperability standards, privacy-preserving sharing, legal and ethical frameworks, and data stewardship.
A case study on designing a National Quality Infrastructure (NQI) data space demonstrates how well-governed public data ecosystems can improve standardization, accreditation, metrology, and quality control, ultimately enhancing economic performance and regulatory efficiency. 

Keywords

Graduate Curriculum

Data Governance

Public Data Spaces

Open Data

Data Interoperability

National Quality Infrastructure 

First Author

Monika Rozkrut, University of Szczecin

Presenting Author

Monika Rozkrut, University of Szczecin

From Classroom to Corporate: The Data Mine’s Impact on Data Science and Industry Collaboration

The Data Mine, currently in its seventh year, provides more than 2000 interdisciplinary graduate and undergraduate students with hands-on experience in data science. Based on the principles of learning by doing, teamwork, and real-world data science, this model is used by learners at more than 60 colleges annually. Additionally, The Data Mine enables approximately 100 projects with Corporate Partners across many domains, including aerospace, agriculture, manufacturing, and pharmaceutical science. This model has proven to be a very effective method for colleges and companies to quickly and easily build relationships that create genuine value for partners and students alike. The newest Data Mine location in Indianapolis is a successful example of this model's ability to scale rapidly and deliver a strong return on institutional investment. This session will briefly explain why The Data Mine has become pervasive as a model for data science research across institutions of varying profiles. A case study of the past year's launch of the Indianapolis location will be included.

Keywords

industry-university partnerships

experiential learning

data science

mentoring

industry-student collaboration

student development 

Co-Author(s)

Mark Ward, Purdue University
Fulya Gokalp Yavuz, Purdue University

First Author

Margaret Betz, Purdue University - The Data Mine

Presenting Author

Margaret Betz, Purdue University - The Data Mine

Teaching Abroad: The History of Statistics in the UK and Ireland

Participants will embark on a journey through the development and modern practice of statistics and data science in the British Isles, exploring a new course, "The History of Statistics in the UK and Ireland." Pictures from the course will help participants feel like they were there. Site selection, course logistics, unique challenges, pedagogies, and student and faculty outcomes will be discussed. No prior knowledge is required, and instructors interested in traveling abroad will receive helpful advice. Course materials, including the course website, will be shared, providing a comprehensive model for developing similar courses. Individuals interested in the course topics are also encouraged to attend. 

Keywords

study abroad

history of statistics

traveling course

course design 

First Author

Tyler George, Cornell College

Presenting Author

Tyler George, Cornell College

The Data Mine Model for Secondary School Level

How can we effectively integrate data science into early education? Should it be woven into the formal curriculum or offered as part of extracurricular activities? Current literature supports both strategies, yet many mathematics teachers feel unprepared due to a lack of training in data science and programming languages such as R and Python. We propose a slightly modified version of The Data Mine (TDM) model to tackle these hurdles. The proposed model fosters collaboration among undergraduate students, researchers, and industry professionals, creating a vibrant learning community that also engages secondary school students. By promoting experiential learning in addressing real-world data science challenges outside the traditional curriculum, this initiative equips students with essential skills and cultivates a culture of innovation. Ultimately, early engagement in data science will prepare students for the complexities of tomorrow's world and inspire them to become proactive contributors to society.

Keywords

Data science education

secondary school

TDM

experiential learning

learning community 

Co-Author(s)

Mark Ward, Purdue University
Fulya Gokalp Yavuz, Purdue University

First Author

Tugba Kapucu, Purdue University - The Data Mine

Presenting Author

Tugba Kapucu, Purdue University - The Data Mine