Using New Technology to Enhance Statistics and Data Science Education

Monika Rozkrut Chair
University of Szczecin

Wednesday, Aug 6: 10:30 AM - 12:20 PM
4188
Contributed Papers

Music City Center

Room: CC-101D

Main Sponsor

Section on Statistics and Data Science Education

Presentations

AI in Biostatistics: a project-based approach

The term "AI" was once used loosely to describe computer bots. Now that computers can perform complex tasks, we must find ways to make the use of Artificial Intelligence in education a positive one. Through a project-based approach, students experience real-world applications of data analysis in biology using three technologies: AI, the R program, and Adobe Portfolio. Students are guided to use free AI tools (e.g., ChatGPT) to generate biological research questions, simulate random data, and analyze the data using the techniques and methods taught in class. With the help of AI tools, students learn to ask ChatGPT how to code their analysis in the (free) R program and present their results by creating a website with Adobe Portfolio. With measurable learning outcomes for the projects at the end of the semester and formative feedback at "check points" throughout each project, students gain a better understanding of the process. This paper discusses the project-based approach with student examples, along with the results of a survey designed to gather student feedback for enhancements in future semesters.

Keywords

AI

Adobe Portfolio

Statistical Education

First Author

Kimberly Massaro, UTSA

Presenting Author

Kimberly Massaro, UTSA

AI-Generated Text Detection in the Context of Domain- and Prompt-Specific Essays

The widespread adoption of Large Language Models has made distinguishing between human- and AI-generated essays more challenging. This study explores AI detection methods for domain- and prompt-specific essays within the Diagnostic Assessment and Achievement of College Skills (DAACS) framework, applying both random forest and fine-tuned ModernBERT classifiers. Our approach incorporates pre-ChatGPT essays, which are likely human-generated, alongside synthetic datasets of essays generated and modified by AI. The random forest classifier was trained with open-source embeddings such as miniLM and RoBERTa, as well as a low-cost OpenAI model, using a one-versus-one strategy. The ModernBERT method employed a novel two-level fine-tuning strategy, incorporating essay-level and sentence-pair classifications that combine global text features with detailed sentence transitions through coherence scoring and style consistency detection. Together, these methods effectively identify whether essays have been altered by AI. Our approach provides a cost-effective solution for specific domains and serves as a robust alternative to generic AI detection tools, all while enabling local execution on consumer-grade hardware.
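
As a rough illustration of the one-versus-one strategy mentioned in the abstract, the sketch below trains one binary model per pair of classes and lets the pairwise winners vote. It is not the authors' implementation: the three class labels, the two-dimensional toy "embeddings", and all function names are assumptions made for this example, and a simple nearest-centroid rule stands in for the random forest so that the block runs with the Python standard library alone.

```python
from collections import Counter
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_one_vs_one(X, y):
    """Build one binary model per unordered pair of classes.

    Each "model" is just the pair of class centroids (a stand-in for
    the random forest described in the abstract).
    """
    by_class = {}
    for vector, label in zip(X, y):
        by_class.setdefault(label, []).append(vector)
    centroids = {label: centroid(vs) for label, vs in by_class.items()}
    return {
        (a, b): (centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


def predict_one_vs_one(models, vector):
    """Let every pairwise model vote and return the most-voted class."""
    votes = Counter()
    for (a, b), (centroid_a, centroid_b) in models.items():
        closer_to_a = squared_distance(vector, centroid_a) <= squared_distance(vector, centroid_b)
        votes[a if closer_to_a else b] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 2-D vectors standing in for essay embeddings.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5], [0.6, 0.4]]
y = ["human", "human", "ai_generated", "ai_generated", "ai_modified", "ai_modified"]

models = train_one_vs_one(X, y)
print(len(models), "pairwise models")
print(predict_one_vs_one(models, [0.88, 0.12]))
```

With three classes the strategy needs three pairwise models; in general, k classes require k(k-1)/2 of them.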

Keywords

artificial intelligence, machine learning, large language models, text classification, AI detection, ModernBERT, random forest, embedding models, academic integrity

AI ethics, chatGPT, text analysis, coherence detection, style consistency, AI-generated content, synthetic data, DAACS, self-regulated learning, deep learning

essay assessment, diagnostic assessment, educational technology, sentence-pair analysis, miniLM, RoBERTa, OpenAI

Co-Author(s)

Angela Lui, City University of New York
Jason Bryer, City University of New York

First Author

Bruno de Melo

Presenting Author

Bruno de Melo

Integrating AI for Seamless R to Python Code Conversion with Quarto in Data Science Courses

For instructors trained as statisticians using R, the shift to teaching Introduction to Data Science courses with Python presents a challenge. AI has proven to be a valuable tool for automatically converting R code into Python, which has allowed me to integrate both languages seamlessly into my lectures. In this paper, I will demonstrate how to use AI for R-to-Python code conversion and show how to incorporate the side-by-side code in Quarto documents alongside Beamer presentations. This approach not only helps to teach both languages effectively but also enhances the learning experience for students by exposing them to multiple programming paradigms in real time.
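
To give a sense of what a side-by-side conversion can look like, here is a minimal sketch in which the original R lines appear as comments above their Python equivalents. The data values and variable names are made up for this example and are not taken from the paper. Python's standard `statistics` module is used because `statistics.stdev` computes the sample standard deviation (n - 1 denominator), which matches R's `sd()`.

```python
import statistics

# R: x <- c(2, 4, 4, 4, 5, 5, 7, 9)
x = [2, 4, 4, 4, 5, 5, 7, 9]

# R: mean(x)
x_mean = statistics.mean(x)

# R: sd(x)   (sample standard deviation, n - 1 denominator)
x_sd = statistics.stdev(x)

# R: summary(x)[c("Min.", "Median", "Max.")]
x_summary = {"min": min(x), "median": statistics.median(x), "max": max(x)}

print(x_mean, round(x_sd, 4), x_summary)
```

A check like this is also a useful habit when reviewing AI-converted code: functions that look equivalent across languages can differ in their defaults, such as population versus sample standard deviation.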

Keywords

AI in Education

R to Python Conversion

Data Science Education

Quarto for Teaching

AI-Assisted Code Translation

First Author

Cathy Poliak, University of Houston

Presenting Author

Cathy Poliak, University of Houston

Teaching and Learning Python, R, SQL, and SAS with One Programming Interface

Students trained in data analytics programs are often judged by future employers on the depth and breadth of their analytical programming knowledge. Exposure to multiple languages, in particular Python, R, SQL, and SAS, will increase the attractiveness of students on the job market. Some faculty may wish to focus on teaching one primary language and augment this learning with examples of how to achieve the same result in other languages. Others may have enough instruction time to fully commit to teaching multiple languages. SAS Viya Workbench for Learners (www.sas.com/wfl), free cloud software for academics, provides a Visual Studio interface for writing Python, R, SQL, and SAS code, so that these languages can be taught and learned at the same time within one environment. Code from multiple languages can be run within one notebook or organized across many notebooks. Comparison of syntax across languages is easy, setup is minimal, and no installation is required. Git integration is also included. Examples of Python, R, SQL, and SAS code for data cleaning and statistical modeling will be shown.

Keywords

software

notebook

First Author

Jacqueline Johnson, SAS Institute

Presenting Author

Jacqueline Johnson, SAS Institute

Teaching LLM literacy for more effective AI-aided learning in a statistics and data science course

Large Language Model (LLM) tools (e.g., ChatGPT) are increasingly helping statistics/data science (DS) courses foster self-efficacy, personalize learning, and make data science accessible to students with less coding training. However, students with inadequate understanding of how LLM tools work may use them counterproductively, thus hindering their learning and problem-solving abilities. To address this, we developed an interactive LLM Literacy curriculum to help students (1) learn LLM fundamentals and then (2) develop best practices for using LLM tools as statistics/DS aids. The modules focus on debugging and statistical design, integrating literature on best practices in these fields with best practices for the use of LLMs as learning aids. The curriculum is tool-agnostic and adaptable to evolving LLM tools. We incorporated the curriculum into a graduate statistics/DS course for biomedical students and found significant improvements in students' LLM prompt-writing practices, ability to solve statistics/DS problems, and confidence in their skills. These findings underscore the importance of LLM literacy training as a necessary part of modern statistics/DS education.

Keywords

large language model

statistics education

data science education

generative AI

statistical literacy

computing

Co-Author

Nils Gehlenborg, Harvard Medical School

First Author

Aparna Nathan, Harvard Medical School

Presenting Author

Aparna Nathan, Harvard Medical School

Understanding of Metabolic Protein Expression Changes in Patient Cancer Datasets

Metabolic alterations in cancer cells are a fundamental characteristic of tumorigenesis. However, limited research has been performed to identify metabolic expression adaptation signatures at the protein level in cancer datasets. In this study, metabolic gene expression datasets from the National Cancer Institute (NCI) Proteomic Data Commons were analyzed to evaluate metabolic protein abundance changes across cancers. In addition, sub-system level metabolic pathway alterations and how they correlate with cancer progression were investigated. Patient metadata, including cancer subtypes, pathological stage, and race/ethnicity, were used to identify features of metabolic protein adaptations driving classifications across tumor subtypes, during cancer progression, and across different patient populations. Gene set enrichment analysis (GSEA) and machine learning approaches were applied to examine protein alterations associated with sub-system metabolic pathways. Understanding metabolic gene expression changes in cancer as the result of metabolic adaptations will enhance our knowledge of cancer biology and highlight functionally important metabolic processes.
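
For readers unfamiliar with GSEA, the core statistic can be sketched as a running sum over a ranked list: the sum steps up at each member of the pathway set and down otherwise, and the enrichment score is the largest deviation from zero. The block below is a simplified, unweighted version of that statistic only; it omits the correlation weighting and permutation-based significance testing of full GSEA, and the protein names and pathway sets are hypothetical placeholders rather than data from the study.

```python
def enrichment_score(ranked_items, item_set):
    """Unweighted running-sum enrichment score.

    ranked_items: identifiers ordered from most up-regulated to most
                  down-regulated (the ranking metric is an assumption).
    item_set:     identifiers belonging to the pathway of interest.
    Returns the running-sum value with the largest absolute deviation.
    """
    hits = [item in item_set for item in ranked_items]
    n_hits = sum(hits)
    n_misses = len(ranked_items) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("the set must contain some, but not all, ranked items")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up for set members, down for everything else.
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list of proteins and two toy pathway sets.
ranked = ["P1", "P2", "P3", "P4", "P5", "P6"]
top_pathway = {"P1", "P2"}      # members clustered at the top of the ranking
bottom_pathway = {"P5", "P6"}   # members clustered at the bottom

print(enrichment_score(ranked, top_pathway))
print(enrichment_score(ranked, bottom_pathway))
```

A score near +1 indicates that the set's members concentrate at the top of the ranking, and a score near -1 that they concentrate at the bottom.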

Keywords

Metabolic Adaptations

Cancer Metabolism

Proteomic Data Analysis

GSEA

Machine Learning

Co-Author(s)

Agnes Duah
Farzaneh Karimi
Stephanie Reinert, New Mexico State University
Lucas Sullivan, Fred Hutchinson Cancer Center

First Author

Soyoung Jeon, New Mexico State University

Presenting Author

Soyoung Jeon, New Mexico State University