CS028 Advancing Data-Driven Community Well-Being in Health and Education

Conference: Symposium on Data Science and Statistics (SDSS) 2025
05/01/2025: 3:45 PM - 5:15 PM MDT
Refereed 
Room: Alpine East 

Chair

Sunghwan Byun, North Carolina State University

Target Audience

Mid-Level

Tracks

Practice and Applications
Software & Data Science Technologies
Statistical Data Science

Presentations

AI-Driven Tools and Automated Pipelines for Enhancing School Improvement Efforts

Advances in AI and automation are reshaping educational workflows, making processes more efficient, accurate, and equitable. In collaboration with the Illinois State Board of Education (ISBE), the American Institutes for Research (AIR) has modernized the traditional school needs assessment process. These assessments, which once required months of manual effort, now benefit from streamlined workflows that reduce timelines to weeks while improving the level of customization and the actionable insights available to schools.

Central to this transformation is the AI Findings Pipeline, which leverages WhisperX and GPT to automate transcription, speaker identification, and tagging of focus group audio. This tool rapidly produces AI-generated insights, turning hours of discussion into customized findings within minutes. Alongside this innovation is the Report Running Data Pipeline, a system that integrates Airtable automations with AWS technologies to produce tailored, data-rich reports on demand. These reports combine AI-generated findings with critical school metrics and survey data, offering a holistic view of school performance and critical needs.
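
The tagging step described above can be sketched roughly as follows. This is a minimal illustration, not the AIR pipeline: the segment records, the tag vocabulary, and the function names are all hypothetical, and a simple keyword lookup stands in for the GPT tagging call so that the example runs without any external service.

```python
from collections import defaultdict

# Hypothetical diarized transcript segments (speaker label plus text).
# In the pipeline described above these would come from the transcription
# and speaker-identification step; the values here are invented.
SEGMENTS = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 6.2,
     "text": "Our teachers need more planning time for the new curriculum."},
    {"speaker": "SPEAKER_01", "start": 6.2, "end": 11.8,
     "text": "Attendance dropped after the bus routes changed."},
    {"speaker": "SPEAKER_00", "start": 11.8, "end": 17.0,
     "text": "Families want clearer communication about attendance policies."},
]

# Hypothetical tag vocabulary. The keyword lists are an assumption made
# for illustration; they replace the LLM tagging call.
TAG_KEYWORDS = {
    "curriculum": ["curriculum", "planning time"],
    "attendance": ["attendance", "bus routes"],
    "family_engagement": ["families", "communication"],
}


def tag_segment(text):
    """Return the sorted list of tags whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        tag
        for tag, keywords in TAG_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    )


def group_by_tag(segments):
    """Collect (speaker, quote) pairs under every tag assigned to a segment."""
    grouped = defaultdict(list)
    for segment in segments:
        for tag in tag_segment(segment["text"]):
            grouped[tag].append((segment["speaker"], segment["text"]))
    return dict(grouped)


findings = group_by_tag(SEGMENTS)
for tag, quotes in sorted(findings.items()):
    print(f"{tag}: {len(quotes)} quote(s)")
```

Grouping quotes by tag while keeping the speaker label is one plausible way to turn a long discussion into findings that can later be merged with school metrics in a report.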

Together, these tools provide researchers and school leaders with timely, evidence-based recommendations, supporting ISBE's Equity Journey Continuum and broader equity goals. By integrating structured data with qualitative expertise, this approach highlights the potential of AI to simplify complex processes, minimize logistical burdens, and drive impactful improvements in educational systems. This methodology underscores the growing importance of leveraging technology to meet the evolving challenges of education while maintaining a focus on equity and inclusivity. 

Presenting Author

Graham Chickering, American Institutes for Research

First Author

Graham Chickering, American Institutes for Research

CoAuthor(s)

Christina Jones, American Institutes for Research
Collin Heckman, American Institutes for Research

Enhancing Public Health Surveillance: Fine-Tuning Large Language Models for Adverse Drug Event Classification

The increasing adoption of artificial intelligence (AI) across regulatory and healthcare domains highlights its transformative potential in addressing critical public health challenges. The U.S. Food and Drug Administration (FDA) has identified adverse drug event (ADE) detection as a priority area for innovation, as outlined in its strategic initiatives. Timely and accurate identification of ADEs is critical for ensuring patient safety and informing regulatory decisions. However, leveraging the FDA Adverse Event Reporting System (FAERS) for ADE detection remains fraught with challenges, including data heterogeneity, reporting inconsistencies, and scalability issues.

Recent advances in generative AI, machine learning (ML), and large language models (LLMs) offer a promising path forward. A recent study demonstrated the efficacy of fine-tuned LLMs, such as GPT-3.5, in analyzing detailed vaccine adverse event reports in the Vaccine Adverse Event Reporting System (VAERS) (Li et al., 2024). Using 91 annotated reports, the authors developed AE-GPT, a tool for automatically extracting and categorizing adverse events, setting a new benchmark in ADE detection.

Our research builds on this precedent, aiming to enhance ADE detection by fine-tuning LLMs on FAERS data. FAERS contains millions of masked case reports spanning 2004 to 2024, with data fields including demographic, administrative, drug, reaction, and patient outcome information. We use embeddings from LLMs to classify case severity and identify features predictive of severity, providing a multi-strata classification scheme for ADE detection. We use logistic regression as a baseline and compare the results to standard ML models, including a Random Forest classifier, DBSCAN, and XGBoost. Our framework achieved notable results, demonstrating the potential of LLMs for processing complex medical data and highlighting their ability to enhance early ADE detection.
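
As a rough illustration of the baseline idea (embedding vectors fed to a logistic regression classifier), the sketch below trains a small logistic regression by gradient descent using only the Python standard library. Everything in it is an assumption made for illustration: the "embeddings" are synthetic 4-dimensional vectors rather than LLM outputs, severity is reduced to two classes instead of the multi-strata scheme, and the function names and hyperparameters are hypothetical. It is not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(x, w, b):
    """Linear score w.x + b for one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(xi, w, b)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return [1 if sigmoid(score(xi, w, b)) >= 0.5 else 0 for xi in X]


def make_points(center, n, rng):
    """Draw n synthetic vectors around a class center (stand-in embeddings)."""
    return [[rng.gauss(c, 0.3) for c in center] for _ in range(n)]


rng = random.Random(0)
# Class 0 = "non-serious", class 1 = "serious" (labels are hypothetical).
X = make_points([-1.0] * 4, 40, rng) + make_points([1.0] * 4, 40, rng)
y = [0] * 40 + [1] * 40

w, b = fit_logistic(X, y)
accuracy = sum(p == t for p, t in zip(predict(X, w, b), y)) / len(y)
print(f"training accuracy: {accuracy:.2f}")
```

In practice a comparison like the one described would more likely rely on established libraries and held-out evaluation data; the hand-written version here only makes the baseline's mechanics visible.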

Presenting Author

John Riddles, Westat

First Author

Joshua Turner, Westat

CoAuthor(s)

John Riddles, Westat
Julianna Lee, Westat
Jeremy Corry, Westat
Rashi Saluja
Sean Chickery, Westat
Gizem Korkmaz, Westat
Marcelo Simas, Westat
Kevin Wilson, Westat

Environmental Risk Factors and Racial Inequities in TNBC Diagnosis Stages: A Bayesian Mediation Study in Louisiana

Triple-negative breast cancer (TNBC) has a higher recurrence rate and higher overall mortality than other molecular subtypes in the U.S. Studies have shown that African American (AA) women are genetically more likely to develop advanced TNBC than Caucasian American (CA) women. In Louisiana (LA), there were 3,790 TNBC cases from 2010 to 2017, of which 1,861 (49.1%) were among AA women and 1,900 (50.1%) were among CA women. However, 32.8% of the LA population was AA and 62.8% was CA. Notably, 43.5% of AA patients were diagnosed with regional or distant metastasis, compared with 36.6% of CA patients. Thus, stage at TNBC diagnosis is a significant source of racial health disparity in LA.

Our research is based on data collected by the Louisiana Tumor Registry (LTR) from 2010 to 2017. In addition to the routinely collected standard data, LTR linked related variables to U.S. census tract-level environmental factors from the National-Scale Air Toxics Assessment (NATA), along with the environmental justice indices (EJI). A total of 3,225 adult female TNBC patients were included in the dataset. Among them, 1,675 (51.9%) were AA and 1,550 (48.1%) were CA. We used Bayesian mediation analysis to identify environmental risk factors that explain the racial disparities in stage at diagnosis among TNBC patients in Louisiana and to quantify their effects.

There was a significant association between race and stage at diagnosis (p-value < 0.001). The disparity was partially explained by the collected mediators. The significant mediators were the patient's age at diagnosis (25.89%), insurance (4.71%), poverty index (26.16%), and the environmental chemical naphthalene (8.38%).
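
To put the mediator percentages in context, the short calculation below adds them up. It assumes, and the abstract does not state this explicitly, that each percentage is an additive share of the same total racial disparity in stage at diagnosis; under that assumption the remainder is the share not explained by the significant mediators.

```python
# Relative effects as reported in the abstract (percent of the disparity).
mediator_share = {
    "age at diagnosis": 25.89,
    "insurance": 4.71,
    "poverty index": 26.16,
    "naphthalene": 8.38,
}

# Assumption: the shares are additive and refer to the same total effect.
explained = round(sum(mediator_share.values()), 2)
unexplained = round(100.0 - explained, 2)

print(f"explained by significant mediators: {explained}%")
print(f"not explained by these mediators:   {unexplained}%")
```

Under this reading, roughly two thirds of the disparity is accounted for by the four mediators, with age at diagnosis and the poverty index contributing the most.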

In LA, a high proportion of Black residents live in Cancer Alley, which exposes them to high levels of toxic emissions containing carcinogens such as naphthalene. Encouraging early diagnosis, improving access to health insurance, reducing poverty-related barriers, and reducing exposure to naphthalene can help with early detection of TNBC.

Presenting Author

Nubaira Rizvi, LSU-Health New Orleans

First Author

Nubaira Rizvi, LSU-Health New Orleans

CoAuthor(s)

Xiao-Cheng Wu, Louisiana Tumor Registry
Bin Li, Louisiana State University
Qingzhao Yu