CS025 Software & Data Science Technologies, Part 1

Conference: Symposium on Data Science and Statistics (SDSS) 2025
05/01/2025: 1:15 PM - 2:45 PM MDT
Lightning 
Room: Wasatch 

Description

This session will be followed by an e-poster session on May 1 from 3:20 - 3:45 PM.

Chair

Praveen Gupta Sanka

Presentations

Adventures in Data Dissemination: The DHL Global Connectedness Tracker

In November 2024, the DHL Initiative on Globalization at NYU Stern added a new offering to its series of publications: the DHL Global Connectedness Tracker. This new publication, a shorter but more frequently updated report on global connectedness at the worldwide level, comes in a hybrid online and print format. The online version features new, interactive charts that allow readers to dig deeper into each of the analyses, while the print version is available for people on the go, as well as for archival and citation purposes.

In order to do this, it was necessary to develop a workflow of data analysis and processing to make updating as efficient as possible. Key priorities were:

• Develop a framework that makes updating two formats as simple as possible and avoids errors or discrepancies
• To the extent possible, avoid reliance on people outside the team to do updates (graphic designer, web developer, etc.)
• Automate updates by making all calculations directly in R

The presentation will explore the process used to develop this hybrid version, from working with a JavaScript developer to program an interactive web app to designing the print version as a Word template that can be updated using Quarto and ggplot2. Now that the tracker has been designed, implemented, released, and updated, we are ready to explore what went well, what didn't go well, and what could be done better under different constraints.

Presenting Author

Caroline Bastian, New York University

First Author

Caroline Bastian, New York University

WITHDRAWN Building Trust in AI: The Crucial Role of Human Centered Machine Learning in Safety Critical Domains

Abstract: The integration of machine learning (ML) models in safety-critical domains such as aviation and healthcare not only presents significant opportunities but also poses serious challenges. ML/AI models deployed in safety-critical domains are known to improve performance and decision-making, but they can also pose unacceptable dangers in various applications.

First, the presentation focuses on the need for interpretable ML models, emphasizing reliability, safety, and trustworthiness.

Second, the presentation walks through the success of human-in-the-loop interpretable machine learning models and their role in bridging the gap between domain experts and ML. To achieve model transparency, the presentation will cover techniques such as feature importance analysis, visualization tools, and model-agnostic methods, with LIME and SHAP as examples. The talk will also discuss tactics for integrating human feedback loops and overcoming scalability challenges.
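As a rough illustration of what a model-agnostic feature importance method does, the sketch below uses permutation importance, a simpler technique than the LIME and SHAP methods named in the abstract and not taken from the talk itself. The `toy_model` function and the random data are hypothetical; the only assumption about the model is that it can be called as a black box on one row at a time.

```python
import random


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Return, per feature, the mean drop in accuracy after shuffling it."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == target for row, target in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, breaking its link with the labels.
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical black-box classifier that only looks at feature 0."""
    return int(row[0] > 0.5)


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y)
print(scores)
```

Because the method only calls `predict`, it works for any classifier. A feature the model ignores gets a score of zero, while a feature the model depends on shows a clear drop in accuracy when shuffled.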

Overall, the objective of the Lightning Talk is to deliver a thorough understanding of the impact and real-time application of human-in-the-loop interpretable ML models in safety-critical environments.

Presenting Author

Akshata Moharir

First Author

Akshata Moharir

Cava-Solutions: Smart Data, Simple Solutions, Powerful Growth

Micro businesses are often overlooked in a market dominated by solutions tailored to medium and large companies, despite their critical need for data-driven tools to grow in today's competitive environment. For instance, Alas Services, a Utah-based micro business, has struggled for years to consolidate critical information, effectively track costs, define competitive pricing for its services, and evaluate overall business performance. There is a gap separating the needs of small businesses like Alas Services from the apps produced by data scientists. To bridge this gap, we propose an accessible, flexible, and low-cost application specifically designed for micro businesses. This solution leverages proven technologies, including a structured database to centralize information, dynamic dashboards for real-time insights, and demand prediction tools, to provide entrepreneurs with practical, actionable intelligence. Rather than introducing complex innovations, this app focuses on making data-driven decision-making accessible and affordable. The app was initially developed for Alas Services, and the talk will demonstrate how its features can empower other micro and small businesses to track performance, streamline operations, and make informed decisions, all without requiring extensive technical expertise or a significant financial commitment.

Presenting Author

Silvia Cativo Alas, Utah Valley University

First Author

Silvia Cativo Alas, Utah Valley University

Economics of Digital Markets: an Introductory Course for Data Scientists

Newly formed digital markets create a multiplicity of jobs for data scientists and other professionals who work with data. All of their work revolves around data collected by businesses operating in digital markets: social media platforms, search engines, streaming services, instant messaging services, online gaming platforms and gaming consoles, credit card markets, and so on. All these markets have a common feature: they bring together different sides of a market to meet and interact. Most of them are two-sided markets because they enable two groups of market participants to interact with each other: players and developers of games, users of computer operating systems and application developers, holders of bank cards and merchants that accept cards as a method of payment. This course offers students an opportunity to learn how digital markets work, why businesses in these markets collect data and how they use that data in their business models, and what data scientists can do to ensure proper data processing.

Presenting Author

Tetyana Beregovska, Truman State University

First Author

Tetyana Beregovska, Truman State University

Simulation-Based Software for Sample Size Calculations in Linear and Logistic Regression

This study develops statistical software to perform simulation-based sample size calculations in multivariate linear and logistic regression settings. In these settings, the required sample size may have to be estimated without analytic calculations. Simulation studies offer one way to estimate statistical power across repeated experiments. The software implements a search algorithm that considers a range of sample sizes. Users can specify the data model that generates the study's variables, the regression model to fit, and the simulation's parameters. This greatly reduces the coding required to develop the simulation and to search for the minimally sufficient sample size. We demonstrate the software on an example with multivariate regression.
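The general idea of a simulation-based sample size search can be sketched as follows. This is not the authors' software: it is a minimal illustration under several assumptions, including a single-predictor linear model, a hypothetical slope of 0.5, a simple linear scan over candidate sample sizes, and a normal approximation to the t reference distribution (slightly anti-conservative for small samples).

```python
import random
from statistics import NormalDist


def slope_p_value(x, y):
    """Two-sided p-value for the OLS slope (normal approximation to t)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = (rss / (n - 2) / sxx) ** 0.5
    z = slope / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def estimated_power(n, beta, n_sims, rng, alpha=0.05):
    """Share of simulated studies of size n that reject H0: slope = 0."""
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [beta * xi + rng.gauss(0, 1) for xi in x]  # assumed data model
        if slope_p_value(x, y) < alpha:
            rejections += 1
    return rejections / n_sims


def smallest_sufficient_n(candidates, beta, target=0.8, n_sims=500, seed=0):
    """Scan candidate sizes in order; return the first reaching the target."""
    rng = random.Random(seed)
    for n in candidates:
        power = estimated_power(n, beta, n_sims, rng)
        if power >= target:
            return n, power
    return None, None


n_required, power = smallest_sufficient_n(range(10, 101, 10), beta=0.5)
print(n_required, power)
```

Swapping in a different data model or regression model only requires changing `estimated_power`, which is the part of the workflow that software of this kind lets users specify.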

Presenting Author

David Shilane, Columbia University

First Author

David Shilane, Columbia University

WITHDRAWN Fine-Tuning Large Language Models: Practical Optimization with LoRA and QLoRA

Fine-tuning Large Language Models (LLMs) has become an essential technique for adapting pre-trained models to domain-specific tasks while balancing efficiency and performance. However, full fine-tuning can be computationally expensive and resource-intensive, making it impractical for many real-world applications.

In this talk, we will explore efficient fine-tuning strategies, focusing on LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA), two powerful methods that enable parameter-efficient adaptation without the need for extensive computational resources. We will break down:

• Why fine-tuning matters in real-world applications
• The challenges of full fine-tuning and large-scale model adaptation
• How LoRA enables efficient fine-tuning by modifying only a small subset of parameters
• QLoRA: Pushing efficiency further with quantization while maintaining performance
• Practical use cases, trade-offs, and implementation insights

This session is designed for ML practitioners, researchers, and engineers looking to maximize model performance while optimizing for cost and scalability. Whether you're working on LLM deployment, customization, or domain adaptation, this talk will provide actionable insights to help you navigate the landscape of efficient fine-tuning.
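To make the efficiency argument concrete, here is a minimal pure-Python sketch of a LoRA-style layer. In the LoRA formulation, the pretrained weight matrix W stays frozen and only two small matrices, A and B, are trained, so the effective weight is W + (alpha / r) * B * A. The tiny matrices, the `alpha` value, and the 4096 x 4096 layer size below are illustrative assumptions, not figures from the talk. QLoRA additionally stores the frozen base weights in quantized form, which this sketch does not model.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    rank = len(A)  # A has shape (rank, d_in); B has shape (d_out, rank)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Trainable parameter counts: full fine-tuning versus LoRA."""
    return {"full": d_out * d_in, "lora": rank * (d_in + d_out)}


# Toy layer with d_out=2, d_in=3, rank=1 (all values hypothetical).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen pretrained weights
A = [[1.0, 1.0, 1.0]]                   # trainable
B_init = [[0.0], [0.0]]                 # trainable, zero at initialization
B_tuned = [[0.5], [-0.5]]               # stands in for values after training
x = [1.0, 2.0, 3.0]

before = lora_forward(W, A, B_init, x, alpha=2.0)
after = lora_forward(W, A, B_tuned, x, alpha=2.0)
counts = trainable_params(4096, 4096, rank=8)
print(before, after, counts)
```

With B initialized to zero, the adapted layer starts out identical to the pretrained one. For the assumed 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of 16,777,216.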

Presenting Author

Kailash Thiyagarajan, Apple

First Author

Kailash Thiyagarajan, Apple

Forecasting Unleashed: Driving Business Impact with Adaptive Models and MLOps Excellence

This lightning session's goal is to show the audience how powerful forecasting can be in enabling both topline revenue growth and bottom-line cost efficiencies. Advanced forecasting techniques, including adaptive algorithms powered by open-source tools like Prophet, not only predict key business metrics but also continuously adapt by comparing forecasted outcomes with actual performance. I use my 11+ years of industry experience and business insights to illustrate how adaptive forecasting can drive measurable business impact in real business scenarios, such as quarterly case volume forecasting to better plan support resourcing, customer retention prediction based on historical behavior, and financial forecasting that improves budget accuracy and timely spending.
I will also show how robust feedback loops, which compare predictions with actual results, allow models to self-correct and evolve. I also cover how real-time feedback is achieved by building solid data infrastructure and applying modern MLOps practices, drawing upon my own experience and other industry best practices.

The e-poster will delve deeper into how adaptive forecasting can drive business impact and be effectively implemented.
A dynamic retraining process enables algorithms to adjust for biases, seasonal fluctuations, and market volatility, yielding more precise forecasts that support proactive strategies. For example, a company could use these adaptive models to predict customer retention rates across segments or forecast revenue trends in specific geographies, allowing targeted strategies and resource allocation to maximize profitability.
With the right data pipelines and automation tools, businesses can implement continuous monitoring and self-correcting mechanisms that keep predictive analytics aligned with evolving market conditions. This session will show how modern forecasting solutions built on these technologies serve as powerful engines for sustainable business growth.
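The feedback loop described above, in which each forecast is compared with the actual outcome and the model is corrected by the error, can be illustrated with simple exponential smoothing. This is a deliberately small stand-in, not Prophet and not the presenter's pipeline; the case volumes and the `alpha` value are hypothetical.

```python
def adaptive_forecast(actuals, alpha=0.3):
    """One-step-ahead forecasts that self-correct by a share of each error."""
    forecast = actuals[0]  # assumption: start from the first observation
    forecasts, errors = [], []
    for actual in actuals:
        forecasts.append(forecast)
        error = actual - forecast            # compare prediction with outcome
        errors.append(error)
        forecast = forecast + alpha * error  # feed the error back in
    return forecasts, errors, forecast


# Hypothetical quarterly case volumes with a sudden shift in level.
case_volumes = [100.0] * 5 + [150.0] * 10
forecasts, errors, next_forecast = adaptive_forecast(case_volumes)
print(errors[5], errors[-1], next_forecast)
```

After the level shift, the forecast error starts large and shrinks with every period as the model adapts. A larger `alpha` reacts faster to real changes but also chases noise, which is the basic trade-off behind any retraining schedule.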

Presenting Author

Rajat Verma

First Author

Rajat Verma

Homework Help Desk: Tutors when you need them most.

Traditional tutoring services often prevent students from accessing help in a timely manner, particularly after hours when they may need it most, and impose a high barrier to entry on qualified individuals who want to offer their expertise. This project aims to address these issues through a new, innovative, on-demand tutoring marketplace app that matches students with tutors based on their subject needs, availability, and price constraints.
Our app enables anyone with proven skills, whether currently in school, post-graduation, or professional, to become tutors without the need for extensive onboarding, formal qualifications, or the time constraints of traditional tutoring centers. A dynamic pricing model ensures that tutors are fairly compensated and encourages availability during periods of high demand while offering students affordable options during slower hours.
Other platforms like Wyzant and Varsity Tutors have attempted to address this problem but often impose exorbitant prices and additional high fees for users. In addition to high costs, they regularly lack available tutoring sessions outside of normal tutoring hours. Our platform prioritizes flexibility and ease of access, reducing scheduling conflicts and avoiding the hefty overhead fees associated with existing alternatives.
This presentation highlights the innovative pricing and sorting algorithms behind our platform, ensuring optimal tutor-student matches based on availability, subject needs, and affordability. By focusing on flexible scheduling, dynamic pricing, and streamlined tutor onboarding, we enable students to get timely help and tutors to earn fair compensation. Our user-centric, data-driven approach addresses the limitations of traditional tutoring, creating a scalable model that improves accessibility, learning outcomes, and long-term growth in the tutoring industry.

Presenting Author

Gordon Poole

First Author

Gordon Poole

Longitudinal Effects of Bilingualism on Cognitive Resilience Across Clinical Syndromes in Alzheimer's Disease and Related Dementias (ADRD)

This longitudinal study examines the statistical modeling of bilingualism's role in enhancing cognitive resilience and its neuroprotective effects against Alzheimer's disease and related dementias (ADRD). We analyzed data from 453 participants (375 monolingual, 48 bilingual) clinically categorized into five groups: healthy controls and memory, language, behavioral, and motor-predominant syndromes, over a 10-year follow-up period. To address challenges associated with unbalanced data common in longitudinal clinical trials, we implemented a dual statistical approach: robust non-parametric methods (including Aligned Rank Transform ANOVA) for cross-sectional comparisons, and linear mixed-effects modeling with appropriate covariance structures to analyze repeated measures over time. Our longitudinal analyses revealed that bilingual speakers experienced significantly slower functional decline in specific domains, particularly language-related assessments, compared to monolingual speakers. These domain-specific protective effects were most pronounced in participants with predominant language syndromes, including non-fluent/agrammatic (N = 21), semantic (N = 18), and logopenic variant (N = 20) primary progressive aphasia. These findings underscore how bilingualism may serve as a protective factor against cognitive decline by enhancing resilience and slowing functional deterioration in ADRD. Additionally, they highlight the importance of advanced statistical modeling in addressing challenges such as unbalanced data and distinguishing biological variability from technical noise in a clinical research scenario.

Presenting Author

Luna Gao

First Author

Luna Gao

Mobile Apps for Teaching & Learning Statistics

I will present six mobile apps ("Explore Data," "Distributions," "Inference," "Resampling," "Regression," and "Concepts") available under Art of Stat on iOS and Android. These apps provide an intuitive, hands-on approach to teaching and learning statistics using real data, and illustrate important concepts such as probability distributions, bootstrapping, coverage probability, and the Central Limit Theorem. The apps cover exploratory data analysis (e.g., side-by-side boxplots), statistical inference (e.g., confidence intervals), and regression modeling and prediction (e.g., interactive scatterplots, logistic regression, and multiple linear regression). From a common interface, users select one of many pre-loaded datasets or upload their own CSV file, and then run the analysis. The apps provide a comprehensive set of tools that should enrich any learning environment and are straightforward to use and implement (no computer lab necessary).

In the lightning talk, I will demonstrate key features of the apps and showcase their potential in enhancing statistics education. During the poster session, I invite attendees to explore the apps firsthand and engage in live demonstrations on topics of their choice by projecting the apps on the TV. 

Presenting Author

Bernhard Klingenberg, New College Florida

First Author

Bernhard Klingenberg, New College Florida

WITHDRAWN New Mathematical Framework and Novel Machine-learning Based Computational Methodology to Determine the Influence of Variables in a Time Dependent System

This research project proposes a methodology for predicting rare, unforeseeable events, typically referred to as "black swan" events, by introducing a novel, quantitative "influence score" metric. Where traditional predictive models usually fall short, the proposed influence score provides an analytical, computationally feasible measure for identifying key influencing variables that shape an entire system. With the ability to quantify these key variables, this approach to determining influence also provides insights into the stability and interconnectedness of a system.

Our results show that in a major stock index, the Dow Jones Transportation Average, specific companies within the stock network have disproportionately more influence than other companies. While this project focuses primarily on applications in this setting, the influence score has implications in various fields beyond economics. Especially during times of uncertainty, the influence score can serve as a metric for determining significance, leading to improved decision-making through targeted responses. Because this influence measure provides insight into complicated, stochastic systems, policymakers, researchers, and industry leaders can use it to navigate and mitigate the impact of unpredictable events.

Presenting Author

Ryan Ma

First Author

Ryan Ma

Optimizing Software Performance with AI: Metrics, Models, and Best Practices

In today's fast-paced digital ecosystem, Site Reliability Engineers (SREs), Quality Engineers, and Performance Engineering teams face increasing challenges in ensuring software systems remain fast, scalable, and resilient under dynamic workloads. Traditional performance engineering techniques often fall short in proactively identifying and mitigating performance issues, leading to costly downtime and degraded user experiences. AI-driven performance engineering is transforming how organizations approach performance testing, monitoring, and optimization by leveraging machine learning models to predict bottlenecks, automate anomaly detection, and enhance system reliability.

This session will dive into AI-powered performance engineering strategies, highlighting key performance metrics such as latency, throughput, error rates, anomaly detection accuracy, and auto-remediation effectiveness. We will explore best practices for integrating AI into performance workflows, covering areas like intelligent workload modeling, self-healing systems, adaptive load balancing, and real-time observability. 
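As a small illustration of the metrics named above, the sketch below computes 95th-percentile latency, throughput, and error rate from a batch of request records, then flags latency outliers with a z-score rule. The record format, the synthetic data, and the threshold are assumptions, and the z-score rule is a simple statistical stand-in for the machine learning detectors the abstract refers to.

```python
from statistics import mean, quantiles, stdev


def summarize(requests, window_seconds):
    """Core performance metrics for requests observed in one time window."""
    latencies = [r["latency_ms"] for r in requests]
    server_errors = sum(r["status"] >= 500 for r in requests)
    return {
        # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
        "p95_latency_ms": quantiles(latencies, n=20)[-1],
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
    }


def zscore_anomalies(values, threshold=3.0):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical 10-second window: 30 normal requests and one slow failure.
requests = [{"latency_ms": 100 + (i % 5), "status": 200} for i in range(30)]
requests.append({"latency_ms": 900, "status": 503})

summary = summarize(requests, window_seconds=10)
anomalies = zscore_anomalies([r["latency_ms"] for r in requests])
print(summary, anomalies)
```

A fixed z-score threshold is easy to reason about but assumes a stable baseline. Learned detectors are typically motivated by workloads where the baseline itself shifts over time.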

Presenting Author

Srinivasa Rao Bittla, Adobe Inc

First Author

Srinivasa Rao Bittla, Adobe Inc

Predicting Term Subscription Using ML Models

This project focuses on analyzing the factors influencing customers' decisions to sign up for a term deposit at a bank, using various predictive models to identify patterns and trends. The dataset includes information on client demographics, financial indicators, and previous marketing interactions. The primary goal is to develop a model that can accurately predict whether a client is likely to subscribe to a term deposit, allowing the bank to optimize its marketing efforts and reduce unnecessary costs.
Several classification models were employed, including Logistic Regression, Decision Trees, Naive Bayes, and Random Forest. Data preprocessing involved transforming variables such as age, balance, and housing into factors, and creating dummy variables to enable accurate analysis. We also addressed data imbalances by focusing on variables that significantly influenced the likelihood of clients signing up for a term deposit.
Among the models tested, the Random Forest model proved to be the most effective, achieving an accuracy of 77.72% with a 95% confidence interval of (76.59%, 78.81%). This model's performance, as assessed through the confusion matrix, highlighted its strength in predicting clients less likely to sign up, thus enabling the bank to better target potential subscribers. The analysis demonstrated that key variables like age, balance, and housing status were pivotal in influencing a client's decision to sign up for the term deposit.
The findings of this project provide actionable insights for the bank, enabling it to focus resources on high-potential clients and improve the efficiency of marketing strategies. Future iterations could further enhance model accuracy by incorporating additional data and addressing class imbalances more comprehensively.
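For readers unfamiliar with how a confidence interval for classification accuracy is obtained, the sketch below computes a Wilson score interval for a binomial proportion. The abstract does not state the test set size or which interval method was used, so the counts here (777 correct out of 1,000) are hypothetical and the result is not expected to reproduce the reported interval.

```python
from statistics import NormalDist


def wilson_interval(correct, total, confidence=0.95):
    """Wilson score interval for a proportion such as test-set accuracy."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * (p * (1 - p) / total + z * z / (4 * total * total)) ** 0.5 / denom
    return center - half_width, center + half_width


# Hypothetical confusion-matrix totals: 777 correct predictions out of 1,000.
low, high = wilson_interval(777, 1000)
print(round(low, 4), round(high, 4))
```

The interval narrows roughly with the square root of the test set size, so the same accuracy measured on a larger holdout set yields a tighter range.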

Presenting Author

Neemias Moreira

First Author

Neemias Moreira