SPEED 1: Data Challenge and Prediction Modelling, Part 1

Chair: Ayyuce Begum Bektas, Memorial Sloan Kettering Cancer Center
 
Sunday, Aug 3: 2:00 PM - 3:50 PM
4014 
Contributed Speed 
Music City Center 
Room: CC-104A 
Poster presentations for this session will be on display in the JSM Expo Sunday, August 3, 4:00 - 4:45 p.m.

Presentations

Forecasting Monthly Net International Migration for U.S. States Using Airline Flight Data

Recent efforts across the federal statistical system aim to produce more accurate population estimates that incorporate international migration. Recent research by the U.S. Census Bureau relies on new administrative data to measure international migration into the U.S.; however, such data are often unavailable for subnational geographies, such as states. In this research, we leverage administrative data on inbound flights from the Bureau of Transportation Statistics, travel visa issuance from the Bureau of Consular Affairs, and advanced airline passenger statistics from U.S. Customs and Border Protection to produce novel monthly, state-level estimates and forecasts of immigrant admissions to the U.S. Our methodology utilizes structural time series models that directly model the trend and seasonal patterns of migration. These new estimates provide more accurate and timely measures of migration by state and can be easily incorporated into standard demographic models.
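
As a rough illustration of the structural time series approach described above, the sketch below fits a local linear trend plus monthly seasonal model with statsmodels to a synthetic admissions series and produces a 12-month forecast; the data, specification, and forecast horizon are illustrative assumptions, not the authors' model.

```python
# Minimal sketch of a structural time series model with trend and monthly
# seasonality, in the spirit of the abstract; the series, specification, and
# horizon are illustrative assumptions, not the authors' exact model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=96, freq="MS")
# Hypothetical monthly immigrant admissions for one state (synthetic data).
y = pd.Series(
    5000 + 20 * np.arange(96) + 800 * np.sin(2 * np.pi * np.arange(96) / 12)
    + rng.normal(0, 200, 96),
    index=dates,
)

# Local linear trend plus a period-12 seasonal component.
model = sm.tsa.UnobservedComponents(y, level="local linear trend", seasonal=12)
res = model.fit(disp=False)

print(res.summary().tables[1])
print(res.forecast(steps=12))  # 12-month-ahead forecast
```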

Keywords

Immigration

Demographic estimates

Airline traffic

State space modeling

Forecasting 

Co-Author

Srijeeta Mitra, University of Maryland College Park

First Author

Andrew C. Forrester, U.S. Bureau of Labor Statistics and University of Maryland

Presenting Author

Andrew C. Forrester, U.S. Bureau of Labor Statistics and University of Maryland

Demystify Flight Data

Flying can be stressful — but some airports make the experience a lot better than others. In this project, we set out to predict customer satisfaction scores (based on J.D. Power rankings) for major U.S. airports using a mix of airport operations data and local economic factors.

We gathered information on how many passengers airports serve, how often flights are delayed (both outbound and inbound), how often baggage gets lost, the average airfare, the local GDP, and even the region's average annual temperature. Using a blend of statistical modeling and machine learning tools, we explored how these factors connect to how travelers rate their airport experience. Additionally, we employ visualization tools to identify trends and patterns in travel behavior.

By combining exploratory and inferential approaches, this study gives airport managers and planners a clearer roadmap for making travel a little less stressful — and maybe even a little more enjoyable — for millions of passengers each year. 

Keywords

analyzing consumers' travel habits

identify trends and patterns in travel behavior

classical regression methods and neural network techniques 

Co-Author

Bao Anh Maddux, Winston-Salem State University

First Author

Melinda Combs, Winston-Salem State University

Presenting Author

Melinda Combs, Winston-Salem State University

The Effect of Delays on Airline Flight Patterns

Flight delays can be caused by events such as hazardous weather, crew availability and security issues. When purchasing flight tickets, many passengers hope to minimize delays to avoid spending too much time at the airport or missing a connecting flight. Flight delays often have a cascading effect, where one flight's delay may influence the next. Additionally, multiple flights may experience delays at the same time, when events such as bad weather occur. We hypothesize that airline "hubs" - defined here as an airport/airline pair containing a large percentage of passenger traffic for that airline - may be more equipped to respond to delay perturbations than non-hubs. Herein, a Fast Fourier Transform (FFT) is applied to scheduled arrival/departure times to estimate airport periodicity. The relationship between hub status, periodicity, and delays is explored. We also compare differences between traditional "hub and spoke" airlines such as Delta to "point to point" structured airlines such as Southwest. 
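
The FFT step described above can be illustrated with a short sketch: bin scheduled departure times into hourly counts and read the dominant period off the power spectrum. The input schedule and helper function below are hypothetical, not the paper's data or code.

```python
# Sketch of estimating an airport's schedule periodicity with an FFT; the
# input Series of scheduled times is a hypothetical stand-in.
import numpy as np
import pandas as pd

def dominant_period_hours(scheduled_times: pd.Series) -> float:
    """Return the dominant period (in hours) of hourly departure counts."""
    counts = (
        scheduled_times.dt.floor("h")
        .value_counts()
        .sort_index()
        .asfreq("h", fill_value=0)   # regular hourly grid
        .astype(float)
    )
    x = counts.to_numpy() - counts.mean()        # remove the mean (DC component)
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0)       # cycles per hour
    k = power[1:].argmax() + 1                   # skip the zero frequency
    return 1.0 / freqs[k]

# Example: a synthetic schedule with departures only between 06:00 and 21:00.
days = pd.date_range("2024-01-01", periods=14, freq="D")
times = pd.Series([d + pd.Timedelta(hours=h) for d in days for h in range(6, 22)])
print(dominant_period_hours(times))              # close to 24 (a daily cycle)
```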

Keywords

sports

exploratory analysis 

Co-Author(s)

Lydia Lucchesi, The University of Texas at Austin
Saptarshi Roy, The University of Texas at Austin

First Author

Sherry Zhang, The University of Texas at Austin

Presenting Author

Sarah Coleman

Flight Issues and Regional Partisan Dynamics: A Longitudinal Analysis

Our study presents a longitudinal analysis of the relationships between airline flight data from the Bureau of Transportation Statistics and regional partisan shifts from the Biographical Directory of the U.S. Congress. Our motivation is to understand and explain relationships between flight issues and the local political climates of regions containing airports, including constructing causal models for these relationships. We focus on the period from 1990 to 2024, which spans several changes in the national political environment and major historical events that influenced flight patterns, including 9/11, the 2008 recession, and the COVID-19 pandemic. Using a spatiotemporal autoregressive model, we identify significant connections between geographic and other factors. Our findings prompted further modeling to explore causal effects and the partisan consequences of air travel. Results suggest that while political climates shape flight issues, air travel disruptions can also influence regional partisan dynamics, forming a feedback loop between transportation infrastructure and political behavior.
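
To make the model class concrete, the following sketch simulates data from a simple spatiotemporal autoregressive specification with a spatial lag (through a row-normalized weights matrix) and a temporal lag; the weights matrix, coefficients, and covariate are assumptions for illustration, not the study's estimated model.

```python
# Illustrative simulation of a simple spatiotemporal autoregressive process:
#   y_t = rho * W @ y_t + phi * y_{t-1} + beta * x_t + eps_t
# The weights matrix, coefficients, and covariate are assumptions for the
# sketch, not the specification estimated in the study.
import numpy as np

rng = np.random.default_rng(1)
n_regions, n_years = 30, 35          # e.g., regions observed 1990-2024
rho, phi, beta = 0.3, 0.5, 0.8       # spatial lag, temporal lag, covariate effect

# Row-normalized random-neighbor spatial weights matrix W.
A = (rng.random((n_regions, n_regions)) < 0.15).astype(float)
np.fill_diagonal(A, 0.0)
W = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)

I = np.eye(n_regions)
S = np.linalg.inv(I - rho * W)       # solves the simultaneous spatial lag

x = rng.normal(size=(n_years, n_regions))   # covariate (e.g., flight issues)
y = np.zeros((n_years, n_regions))
for t in range(1, n_years):
    eps = rng.normal(scale=0.5, size=n_regions)
    y[t] = S @ (phi * y[t - 1] + beta * x[t] + eps)

print(y.shape)  # (35, 30): one outcome series per region
```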

Keywords

Longitudinal Analysis

Causal Effects

Spatiotemporal Autoregressive Model

Partisan Shifts

Flight Issues

Transportation-Politics 

Co-Author(s)

Weiwei Xie
Ching-Ni Tseng
Wooyoung Kim, Washington State University

First Author

Jackie Carlton-Wargo, Washington State University

Presenting Author

Jackie Carlton-Wargo, Washington State University

Detecting Anomalies in WTI Crude Oil Returns Using Statistical and Machine Learning Methods

Crude oil price fluctuations significantly impact global economies, financial markets, and energy policies. Detecting anomalies in West Texas Intermediate (WTI) crude oil returns is essential for identifying market shocks and enhancing risk management strategies. This study presents a hybrid anomaly detection framework that integrates statistical techniques (Z-score, Bollinger Bands, GARCH) with machine learning models (Isolation Forest, DBSCAN, Autoencoders). Using daily WTI returns from 2014 to 2024, the analysis identifies both extreme return spikes and complex nonlinear deviations.
The results show that Bollinger Bands and GARCH methods detect a higher number of anomalies, reflecting their sensitivity to volatility, while machine learning techniques such as Isolation Forest and Autoencoders identify subtler, nonlinear patterns. A total of 26 consensus anomalies, each detected by at least three methods, highlight major market disruptions, which were captured by all six models.
This research demonstrates that combining statistical and machine learning approaches enhances anomaly detection by leveraging their complementary strengths. The findings offer valuable insights for financial risk assessment, market surveillance, and economic policy-making, contributing to more robust decision-making in energy and financial markets. 
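
A minimal sketch of the consensus idea, using three of the six detectors named above (rolling z-score, Bollinger Bands, and Isolation Forest) with illustrative windows and thresholds on synthetic returns:

```python
# Consensus anomaly flag on daily returns using three of the methods named
# in the abstract; window lengths and thresholds are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
returns = pd.Series(rng.normal(0, 0.02, 2500))     # stand-in for WTI returns
returns.iloc[[500, 1500]] = [-0.30, 0.25]          # injected shocks

# 1) Rolling z-score on returns.
mu, sd = returns.rolling(60).mean(), returns.rolling(60).std()
z_flag = ((returns - mu).abs() / sd) > 3

# 2) Bollinger Bands on the cumulative price path.
price = (1 + returns).cumprod()
mid = price.rolling(20).mean()
band = 2 * price.rolling(20).std()
bb_flag = (price > mid + band) | (price < mid - band)

# 3) Isolation Forest on the return and its absolute value.
X = np.column_stack([returns, returns.abs()])
if_flag = pd.Series(
    IsolationForest(contamination=0.01, random_state=0).fit_predict(X) == -1,
    index=returns.index,
)

votes = z_flag.astype(int) + bb_flag.astype(int) + if_flag.astype(int)
consensus = votes[votes >= 2].index      # flagged by at least 2 of the 3 methods
print(list(consensus[:10]))
```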

Keywords

Anomaly Detection

WTI Crude Oil Returns

Machine Learning

Financial Risk Management

Volatility Clustering

Isolation Forest 

First Author

Gadir Alomair, King Faisal University

Presenting Author

Gadir Alomair, King Faisal University

Deviance-based approach to detect cancer fragments in plasma using methylated sequencing targets

Background: Differentially methylated regions (DMRs) that distinguish cancer patients from non-cancer controls have been identified in tissue. Detection of these cancer-specific DMRs in plasma is challenging due to low bioavailability, thus prompting investigation into identifying DNA fragments with a high likelihood of originating from tumor. Methods: We fit a generalized additive model (GAM) to the percent of methylated fragments in non-cancer controls to estimate an expected methylation profile for 432 DMRs. A centered and scaled deviance score based on the fitted model is calculated for each DMR and used to compare 144 cancer plasma samples representing 8 cancer subtypes versus 71 controls. Results: Of 432 DMRs tested, 49 had p-values < 0.005. Combining all DMRs within a random forest model achieved an out-of-bag prediction AUC of 0.74 for discriminating cases from controls. Conclusion: Future evaluations with training and test sets consisting of >5000 DMRs are underway with the expectation of improving the prediction accuracy for cancer detection and cancer subtype in plasma. This modeling approach may enhance multicancer detection efforts in cancer screening paradigms.
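
A stylized sketch of the deviance-score construction, assuming pygam for the GAM fit and a hypothetical covariate ("coverage") for a single DMR; the paper's actual GAM specification and scoring may differ.

```python
# Fit a GAM to control samples' percent methylation as a smooth function of a
# covariate, then score new samples by a centered and scaled deviation from
# the fitted profile. The covariate and the use of pygam are assumptions for
# illustration only.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(3)

# Hypothetical control data for one DMR: coverage and percent methylation.
coverage_ctrl = rng.uniform(10, 200, 71)
pct_meth_ctrl = 5 + 0.02 * coverage_ctrl + rng.normal(0, 1.0, 71)

gam = LinearGAM(s(0)).fit(coverage_ctrl.reshape(-1, 1), pct_meth_ctrl)

# Centered and scaled deviance-style score based on the control residuals.
resid_ctrl = pct_meth_ctrl - gam.predict(coverage_ctrl.reshape(-1, 1))
center, scale = resid_ctrl.mean(), resid_ctrl.std()

# Score hypothetical cancer samples (methylation shifted upward here).
coverage_case = rng.uniform(10, 200, 144)
pct_meth_case = 5 + 0.02 * coverage_case + rng.normal(0, 1.0, 144) + 3.0
score_case = (pct_meth_case - gam.predict(coverage_case.reshape(-1, 1)) - center) / scale
print(score_case.mean())   # elevated scores suggest departure from controls
```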

Keywords

methylation

deviance

generalized additive models

prediction 

Co-Author(s)

Seth Slettedahl
Douglas Mahoney, Mayo Clinic
Jeanette Eckel-Passow, Mayo Clinic
John Kisiel, Mayo Clinic
William Taylor, Mayo Clinic

First Author

Jason Sinnwell, Mayo Clinic

Presenting Author

Jason Sinnwell, Mayo Clinic

Enhancing Fraud Detection: A Comprehensive Analysis of Financial Transactions

This project focuses on the analysis and interpretation of a large financial transactions dataset created by Caixabank Tech for the 2024 AI Hackathon, available on Kaggle. The research involves developing interactive Tableau dashboards, including maps of financial transactions across the United States and time series plots to visualize trends over time. In addition to data visualization, I will identify suitable statistical techniques for analysis and apply statistical models and machine learning methods to predict fraudulent transactions within the dataset. By combining data visualization, statistical analysis, and machine learning, this research aims to uncover actionable insights and enhance the detection of fraudulent activities in financial transactions. 
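
As a sketch of the fraud-prediction step, the example below fits a class-weighted logistic regression to synthetic, imbalanced transactions and evaluates it with average precision; the feature names and data are hypothetical stand-ins for the Kaggle dataset.

```python
# Fraud-probability model with class imbalance handled via class weights;
# features and labels are synthetic stand-ins for the transactions dataset.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 20000
df = pd.DataFrame({
    "amount": rng.lognormal(3, 1, n),
    "hour": rng.integers(0, 24, n),
    "merchant_risk": rng.random(n),
})
# Rare fraud label, loosely tied to amount and merchant risk (synthetic).
p = 1 / (1 + np.exp(-(-6 + 0.3 * np.log(df["amount"]) + 3 * df["merchant_risk"])))
df["is_fraud"] = rng.random(n) < p

X_tr, X_te, y_tr, y_te = train_test_split(
    df.drop(columns="is_fraud"), df["is_fraud"], test_size=0.25, random_state=0
)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```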

Keywords

Fraud Detection

Data Visualization

Financial Transactions

Statistical Models

Machine Learning Techniques 

Co-Author

Wanchunzi Yu, Bridgewater State University

First Author

Juliana Patrone

Presenting Author

Juliana Patrone

WITHDRAWN: Entrepreneurial opportunity as dividend income: a positive appraisal of risk via the Erlang(n) model

Entrepreneurial opportunities can be viewed as marginal appraisals of upcoming dividend income. Whether an entrepreneurship's Interest Focus (E.I.F.) is social, economic, or institutional, its Acting Level (E.A.L.) is micro, meso, or macro, and its Dynamic Trend (E.D.T.) is innovation, impact, or problem solving, opportunities are risks to be undertaken. Hence, for a dividend strategy π, the risk undertaken can be represented by the controlled surplus process R^π(t) = u + ct - ∑_{i=0}^{N(t)} X_i - L^π(t) for time t ≥ 0, initial capital u ≥ 0, and premium income c ≥ 0. An admissible strategy π ∈ Π has predictable, non-decreasing, and left-continuous accumulated dividends L^π(t) ≤ u + ct - ∑_{i=0}^{N(t)} X_i. Furthermore, crossing features from the E.I.F., E.A.L., and E.D.T. via paired extended decision analysis yields a dynamic n-dimensional decision space. In addition, non-parametric density estimation at state i of the risk process allows computing and updating any m-th moment of the dividend D_u at that state. This finally provides upper bounds for optimal strategies under the Erlang(n) risk model.

Keywords

Entrepreneurs Opportunities

Dividends Strategy

Controlled Surplus Process

Entrepreneurships Interests Focus (E.I.F.)

Entrepreneurships Acting Level (E.A.L.)

Entrepreneurships Dynamic Trend (E.D.T.) 

First Author

Mfondoum Valery

WITHDRAWN: Evaluating radiomics-based predictors of survival under anti-PD-1 therapy

Radiomics, the extraction of quantitative features from medical images such as CT scans, may provide clinically relevant insights for cancer patient outcomes beyond the information provided by tumor size changes. Prior studies [Abbas et al 2023, Nardone et al 2024] have examined changes in radiomic features at different time points (termed delta radiomics) to explore its potential as a longitudinal biomarker of cancer response. Additionally, existing studies have shown that delta radiomics (not baseline) has predictive power, with delta tumor volume being the most important feature. However, few radiomics-based biomarkers have been externally validated. Here, we developed a CT-based radiomic signature score for triple-negative breast cancer (TNBC) and bladder cancer and evaluated its association with survival outcomes under pembrolizumab monotherapy. Using a penalized Cox regression model and size-change detrended radiomics features analysis, our findings suggest that CT-based delta radiomics is predictive of survival outcomes but does not add value beyond delta volume.

Keywords

radiomics

delta radiomics

biomarkers

oncology

survival 

Co-Author(s)

Richard Baumgartner, Merck Research Laboratories
Shubing Wang, Merck & Co., Inc.
Lingkang Huang
Yiqiao Liu, Merck & Co., Inc.
Gregory Goldmacher, Merck & Co., Inc.
Antong Chen, Merck & Co., Inc.
Jianda Yuan, Merck & Co., Inc.
Jared Lunceford, Merck & Co., Inc.

First Author

Michelle Ngo, Merck & Co., Inc.

Hybrid Neural Network Model for Predicting LVO in Ischemic Stroke Patients

This study aims to develop and validate a novel hybrid neural network (HNN) model that integrates classical statistical methods with ordinary neural networks, combining the strengths of statistical learning and machine learning in terms of structured framework, flexibility, regularization, and interpretability.
The proposed HNN model incorporates National Institutes of Health Stroke Scale (NIHSS) item scores, demographic information, medical history, and vascular risk factors to predict large vessel occlusion (LVO). Using both simulated and real-world stroke datasets, we evaluated the model's performance based on sensitivity, specificity, accuracy, and area under the curve (AUC). Comparisons were made against other methods, including logistic regression, Random Forest, Decision Tree, and ordinary neural networks. Results from the study demonstrate that the HNN model consistently outperforms traditional statistical and ML-based approaches; the accuracy of the HNN exceeds that of logistic regression and the ordinary neural network by at least 3%. By leveraging the complementary advantages of statistical and neural network methodologies, the HNN offers a robust and efficient tool for prehospital LVO detection.
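
The abstract does not specify the HNN architecture; one plausible reading of a statistics/neural-network hybrid is a logistic-regression-style linear branch added to a small MLP branch (a wide-and-deep style model). The PyTorch sketch below shows that combination purely as an illustration and should not be taken as the authors' HNN.

```python
# Hypothetical "hybrid" classifier: an interpretable linear (logistic) branch
# plus a small nonlinear MLP branch, summed before the sigmoid. This is an
# assumed architecture for illustration, not the authors' HNN.
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)          # interpretable linear part
        self.mlp = nn.Sequential(                       # flexible nonlinear part
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x) + self.mlp(x)).squeeze(-1)

# Tiny usage example with synthetic NIHSS-style inputs.
x = torch.randn(8, 20)                                  # 8 patients, 20 predictors
y = torch.randint(0, 2, (8,)).float()                   # LVO yes/no
model = HybridNet(n_features=20)
loss = nn.BCELoss()(model(x), y)
loss.backward()
print(float(loss))
```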

Keywords

Machine Learning

Deep Learning

Hybrid Neural Network

LVO

Stroke

predictive model 

Co-Author(s)

Samuel Glandon
Megan McCoy

First Author

Lan Gao, University of Tennessee at Chattanooga

Presenting Author

Megan McCoy

Predicting MLS Players' Market Value Using Machine Learning Algorithms

In a sector of continuous growth and development, a soccer player's market value has become a key element in the development of a player's career. Market value describes how much a player is worth on the transfer market and is important for soccer clubs in determining a player's financial standing.
Several significant factors can influence the market value of a soccer player, such as age, position, number of goals scored, and the number of games previously played. Data on soccer players from MLS (Major League Soccer) were gathered from MLSsoccer, Transfermarkt, and Opta Sports, then cleaned and prepared for the analysis.
This study aims to build and compare predictive models using machine learning algorithms to estimate the market value of MLS players based on several key factors, which will help clubs and agents objectively predict the worth of a player they would like to buy or sell. 
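
One candidate model among those compared could look like the sketch below: a random forest regressor on a few of the factors named above, with cross-validated R-squared and feature importances. Column names and data are hypothetical.

```python
# One candidate market-value model: a random forest on a few player features.
# The study compares several algorithms; this single model, and the synthetic
# data, are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
n = 600
players = pd.DataFrame({
    "age": rng.integers(17, 38, n),
    "goals": rng.poisson(4, n),
    "matches": rng.integers(0, 40, n),
    "position_fw": rng.integers(0, 2, n),
})
value = (
    2.0e5 * players["goals"] + 3.0e4 * players["matches"]
    - 5.0e4 * (players["age"] - 26).abs() + rng.normal(0, 2e5, n)
).clip(lower=5e4)

rf = RandomForestRegressor(n_estimators=300, random_state=0)
print(cross_val_score(rf, players, value, cv=5, scoring="r2").mean())
rf.fit(players, value)
print(dict(zip(players.columns, rf.feature_importances_.round(3))))
```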

Keywords

soccer

players' market values

MLS (Major League Soccer)

machine learning 

Co-Author

Abdelmonaem Jornaz, Park University

First Author

Pablo Cañamero

Presenting Author

Pablo Cañamero

WITHDRAWN: Prediction of 30-day Readmission for ICU Patients with Heart Failure

Intensive Care Unit (ICU) readmissions among patients with heart failure (HF) impose a substantial economic burden on both patients and healthcare systems. While previous studies have identified various predictors of readmission, consensus on their relative importance and optimal predictive models remains limited. This study aims to evaluate key predictors and assess the performance of different modeling approaches in forecasting 30-day ICU readmissions using the Medical Information Mart for Intensive Care IV (MIMIC-IV) database. This study applied logistic regression, classification trees, and random forest models to develop predictive frameworks. Although overall model performance did not surpass findings from prior studies, hemoglobin emerged as a significant predictor of 30-day readmission, reinforcing its clinical relevance in HF patient management. These findings highlight the challenges and potential of predictive modeling in ICU readmission risk assessment. 

Keywords

Heart Failure

ICU Readmission

Electronic Health Record

Machine Learning

Variable Importance 

First Author

Feiyi Sun

Predictive Modeling of Racial Disparities in U.S. Violent Deaths

Violent death rates in the United States exhibit pronounced racial disparities that challenge the healthcare, insurance, and public safety sectors. These disparities, shaped by demographics, mental health, substance abuse, and geography, complicate practical risk assessment and targeted interventions. Leveraging data from the National Violent Death Reporting System (NVDRS) for 2020–2021, this study examines racial differences in suicides, homicides, and other violent deaths. Logistic regression models assess the effects of race, age, sex, mental health, substance use, and state-level variability. The results are compared with several machine learning models to evaluate the trade-off between predictive performance and interpretability. Guided by the Social Determinants of Health and structured with the Design Science Framework, findings reveal that logistic regression delivers interpretable, actionable insights while achieving competitive accuracy and sensitivity. These insights enhance our understanding of violent death outcomes and support the development of refined risk profiles and targeted business solutions for high-risk groups. 
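
The interpretable half of the comparison can be illustrated with a formula-based logistic regression whose coefficients are read as odds ratios; the variable names and synthetic data below are illustrative, not NVDRS fields.

```python
# Formula-based logistic regression reported as odds ratios; variables and
# data are synthetic stand-ins, not NVDRS records.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(21)
n = 5000
d = pd.DataFrame({
    "suicide": rng.integers(0, 2, n),        # outcome indicator (synthetic)
    "age": rng.integers(15, 85, n),
    "male": rng.integers(0, 2, n),
    "mh_problem": rng.integers(0, 2, n),
})
fit = smf.logit("suicide ~ age + male + mh_problem", data=d).fit(disp=False)
print(np.exp(fit.params))            # odds ratios
print(np.exp(fit.conf_int()))        # 95% CIs on the odds-ratio scale
```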

Keywords

MORTALITY

RACE

LOGISTIC REGRESSION

MACHINE LEARNING

VIOLENT DEATHS

UNITED STATES 

Co-Author

Tatjana Miljkovic, Miami University

First Author

Ying-Ju Chen, University of Dayton

Presenting Author

Ying-Ju Chen, University of Dayton

Smooth Tensor Decomposition for Ambulatory Blood Pressure Monitoring Data

Ambulatory blood pressure monitoring (ABPM) is widely used to track blood pressure and heart rate over periods of 24 hours or more. Most existing studies rely on basic summary statistics of ABPM data, such as means or medians, which obscure temporal features like nocturnal dipping and individual chronotypes. To better characterize the temporal features of ABPM data, we propose a novel smooth tensor decomposition method. Built upon traditional low-rank tensor factorization techniques, our method incorporates a smoothing penalty to handle noise and employs an iterative algorithm to impute missing data. We also develop an automatic approach for the selection of optimal smoothing parameters and ranks. We apply our method to ABPM data from patients with concurrent obstructive sleep apnea and type II diabetes. Our method explains temporal components of data variation and outperforms the traditional approach of using summary statistics in capturing the associations between covariates and ABPM measurements. Notably, it distinguishes covariates that influence the overall levels of blood pressure and heart rate from those that affect the contrast between the two. 
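
For orientation, the sketch below runs a plain low-rank CP factorization of a subjects-by-time-by-measurement tensor with tensorly; the smoothing penalty, missing-data imputation, and automatic parameter selection that distinguish the proposed method are not shown.

```python
# Baseline low-rank CP factorization of a subjects x time x measurement
# tensor; shapes and data are synthetic stand-ins for ABPM recordings, and
# the smoothing and imputation of the proposed method are omitted.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(2)
n_subj, n_time, n_meas = 40, 48, 2          # 48 half-hour readings; BP and HR
t = np.linspace(0, 2 * np.pi, n_time)
circadian = np.cos(t)                        # shared synthetic circadian pattern
tensor = (
    rng.normal(1, 0.2, (n_subj, 1, 1)) * circadian[None, :, None]
    + rng.normal(0, 0.1, (n_subj, n_time, n_meas))
)

weights, factors = parafac(tl.tensor(tensor), rank=2)
subj_scores, time_profiles, meas_loadings = factors
print(subj_scores.shape, time_profiles.shape, meas_loadings.shape)
# Columns of time_profiles estimate the dominant temporal components.
```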

Keywords

Low-rank tensor factorization

Smoothing penalty

Missing data imputation 

Co-Author

Irina Gaynanova, University of Michigan

First Author

Leyuan Qian, University of Michigan

Presenting Author

Leyuan Qian, University of Michigan

Statistical Properties of a Subjective Value Exchange Model

The exchange process of commodities is the essence of market dynamics. Features such as value, price, and satisfaction are commonly interpreted in the field of economics. However, this has also caught the attention of statistical physicists, who view the market as a statistical ensemble. Concepts borrowed from statistical mechanics, such as temperature or entropy, now appear in the understanding of market dynamics. In this context, we developed a microscopic model for the exchange of a basket of commodities. We consider that people value each commodity in an "individual and subjective" manner and eventually decide to exchange them in the market. We ran the model with a large number of agents acting as traders. We recorded all the trading actions and computed the statistical distribution of exchange ratios and the flux of commodities. These simulations allowed us to make a connection between price and the thermodynamic concept of temperature. The corresponding entropy of the system was also compared to that expected for a microscopic thermodynamic system. 

Keywords

exchange value

valuation

econophysics

entropy

temperature 

Co-Author

Matías González, Departamento de Física - FCEN - UBA

First Author

Guillermo Frank

Presenting Author

Guillermo Frank

Stock Market Strategies Implied by a Stochastic Particle System Model Adopted as Econophysics

The Inequality Process (IP) (Angle, 1983-2022) is a stochastic particle system model of a process of competitive exclusion driving wealth production. The IP may be a natural law; it has been adopted as econophysics. Labor income statistics teem with invariant patterns implied by the IP. The IP implies a number of statistical patterns, "stylized facts", in the market capitalizations of exchange listed corporations. This paper identifies strategies that buyers/sellers of listed stocks use that are implied by the IP, i.e. putting those strategies on an econophysical footing. Recognized experts in quantitative finance claim nothing like the IP operates in stock markets. 

Keywords

competition

invariances

particle system

quantitative finance

stock market

trading strategies 

First Author

John Angle, The Inequality Process Institute LLC

Presenting Author

John Angle, The Inequality Process Institute LLC

Target then treat: analyzing sales impacts of two-level assignment

In sales operations, a customer is first assigned to a sales program (e.g., defined by market segmentation or customer prioritization) and then treated in doses by a sales team within that program (e.g., through meetings and pitches). We use these two levels of treatment assignment to deconfound the impact of a sales team's treatment dose on customer outcomes. First, for sales program assignment based on thresholding rules (e.g., customer spending), we apply regression discontinuity techniques to identify exogenous variation in that assignment process. Second, using this exogenous variation, we apply instrumental variables techniques to analyze the impact of sales treatments on a continuous scale. We present a case study and application on the intent-to-treat and as-treated impact of sales specialists who focus on key product areas in Google's advertising business.
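
A simplified sketch of the two-step idea: the above-threshold indicator from the program-assignment rule instruments the treatment dose, estimated here by manual two-stage least squares on observations near the cutoff. Variable names, the cutoff, and the bandwidth are illustrative assumptions, and the naive second-stage standard errors would need the usual 2SLS correction.

```python
# Manual two-stage least squares with a threshold-rule instrument near the
# cutoff; variable names, cutoff, bandwidth, and data are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 5000
spend = rng.normal(100, 30, n)                 # running variable
cutoff = 100
above = (spend >= cutoff).astype(float)        # instrument from the threshold rule
dose = 2 + 3 * above + 0.01 * spend + rng.normal(0, 1, n)    # sales touches
outcome = 5 + 0.7 * dose + 0.05 * spend + rng.normal(0, 2, n)

df = pd.DataFrame({"y": outcome, "dose": dose, "spend": spend, "above": above})
local = df[(df["spend"] > cutoff - 20) & (df["spend"] < cutoff + 20)].copy()
local["spend_c"] = local["spend"] - cutoff

# First stage: dose on the instrument and the centered running variable.
X1 = sm.add_constant(local[["above", "spend_c"]])
local["dose_hat"] = sm.OLS(local["dose"], X1).fit().fittedvalues

# Second stage: outcome on predicted dose (naive SEs; a 2SLS routine corrects them).
X2 = sm.add_constant(local[["dose_hat", "spend_c"]])
print(sm.OLS(local["y"], X2).fit().params["dose_hat"])   # close to the true 0.7
```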

Keywords

Causal inference

Two-stage treatment

Regression discontinuity

Instrumental variables

Customer sales

Impact analysis 

Co-Author

Frank Yoon, Google

First Author

Natalia Ordaz Reynoso, Google

Presenting Author

Natalia Ordaz Reynoso, Google

To Kick or Receive: A Deep Dive into the National Football League New Playoff Overtime Rules

The NFL has had a problem with overtime for decades; the team with the first possession in overtime has had a distinct advantage due to the sudden-death rules. In 2023, the NFL changed its playoff overtime rules with the aim of giving both teams an equal chance of winning regardless of which team gets the ball first. An example under the new rules: Team A wins the coin toss and can elect to kick or receive. Suppose Team A decides to receive the ball. After Team A's possession, Team B gets a possession regardless of the outcome. If the score is then tied, the first team to score wins, with Team A receiving. The first time this rule was implemented was in the 2024 Super Bowl, where the team that won the coin toss elected to receive and lost. This incited many to declare that receiving first is the wrong decision. We feel the decision is more complex. To investigate whether it is better to receive the ball first or second under these new rules, we constructed a series of discrete-time Markov chain models to estimate the probability of winning for each team across a range of scoring probabilities. In particular, the Markov models allow Team B to change strategies in reaction to the outcome of Team A's possession.
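
A Monte Carlo sketch of the receiving team's win probability under a simplified version of the new rule (each possession ends in a touchdown, a field goal, or no score; defensive scores, two-point decisions, and the clock are ignored). The probabilities are illustrative inputs, not estimates from the paper's Markov chain models.

```python
# Simplified simulation of the new playoff overtime rule: both teams get one
# possession; if still tied, sudden death with Team A receiving. Defensive
# scores, two-point tries, and the clock are ignored; the per-possession
# probabilities are illustrative inputs.
import numpy as np

def p_receiver_wins(p_td=0.35, p_fg=0.25, n_sims=200_000, seed=0):
    rng = np.random.default_rng(seed)
    outcomes = rng.choice([7, 3, 0], size=(n_sims, 2),
                          p=[p_td, p_fg, 1 - p_td - p_fg])
    a, b = outcomes[:, 0], outcomes[:, 1]
    wins = (a > b).astype(float)
    tied = a == b
    # Sudden death with alternating possessions, Team A first: if q is the
    # per-possession scoring probability, P(A wins | tied) = 1 / (2 - q).
    q = p_td + p_fg
    wins[tied] = 1.0 / (2.0 - q)
    return wins.mean()

print(p_receiver_wins())   # receiver's win probability under the toy model
```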

Keywords

Sports

National Football League

Markov Chain Models 

Co-Author

Daniel Hippe, Fred Hutchinson Cancer Center

First Author

Philip Stevenson, Fred Hutchinson Cancer Research Center

Presenting Author

Philip Stevenson, Fred Hutchinson Cancer Research Center

Transformer Models for Enhanced Time Series Forecasting

Time series forecasting is essential for various real-world applications, often requiring domain expertise and extensive feature engineering, which can be time-consuming and knowledge-intensive. Deep learning offers a compelling alternative, enabling data-driven approaches to efficiently capture temporal dynamics. This talk introduces a new class of Transformer-based models for time series forecasting, leveraging attention mechanisms while integrating principles from classical time series methods to enhance their ability to learn complex patterns. These models are highly versatile, effectively handling both univariate and multivariate time series data. Empirical evaluations demonstrate significant improvements over conventional benchmarks, showcasing the practical effectiveness of these models. 
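
A minimal encoder-only Transformer for one-step-ahead univariate forecasting in PyTorch, as a generic baseline sketch rather than the specific model class introduced in the talk (positional encoding omitted for brevity):

```python
# Generic encoder-only Transformer baseline for one-step-ahead forecasting;
# this is a simplified sketch, not the models introduced in the talk.
import torch
import torch.nn as nn

class TinyTransformerForecaster(nn.Module):
    def __init__(self, d_model: int = 32, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, 1) -> predict the next value from the last state.
        h = self.encoder(self.embed(x))
        return self.head(h[:, -1, :]).squeeze(-1)

# Usage on a toy sine-wave batch.
t = torch.linspace(0, 12.56, 101)
series = torch.sin(t)
x, y = series[:100].reshape(1, 100, 1), series[100:101]
model = TinyTransformerForecaster()
loss = nn.MSELoss()(model(x), y)
loss.backward()
print(float(loss))
```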

Keywords

Time Series Forecasting

Transformer Models

Deep Learning

Attention Mechanism

Temporal Dynamics

Univariate and Multivariate Forecasting 

First Author

Thu Nguyen, University of Maryland-Baltimore County

Presenting Author

Thu Nguyen, University of Maryland-Baltimore County