Statistics ⊗ Sports: Advancements in Basketball, Baseball, Football, and Soccer

Michael Schuckers, Chair
St. Lawrence University
 
Tuesday, Aug 6: 8:30 AM - 10:20 AM
5093 
Contributed Papers 
Oregon Convention Center 
Room: CC-E148 

Main Sponsor

Section on Statistics in Sports

Presentations

A Defensive Switch? A Compositional Data Approach for Understanding Modern NBA Player Archetypes

In the last decade, the offensive and defensive philosophies employed by teams in the National Basketball Association (NBA) have changed substantially. As a result, most players can no longer be classified into only one of the five traditional positions (PG, SG, SF, PF, C) and instead spend a percentage of their playing time at multiple positions, making positional data compositional. Further, given the desirability of versatile players, an argument can be made that traditional positions themselves are archaic. Using data from the 2016-17, 2017-18, and 2018-19 seasons, I explore how Bayesian hierarchical models can be used to estimate team defensive strength in three ways: first, by considering only players classified by their majority traditional position; second, by using compositional traditional positional data; and third, by using compositional data from modern positions (archetypes) defined by fuzzy k-means clustering. I find that the fuzzy k-means approach leads to a modest improvement in both the root mean squared error and median 95% posterior predictive interval width for the test data, and, more importantly, identifies 11 modern archetypes that, when combined, are correlated with 

Keywords

compositional data

Bayesian

hierarchical models

basketball

NBA 

View Abstract 2325

First Author

Charles South

Presenting Author

Charles South

Improving the Aggregation and Evaluation of NBA Mock Drafts

If professional teams could accurately predict the order of their league's draft, they would have a competitive advantage when using or trading their draft picks. Many experts and enthusiasts publish forecasts of the order in which players are drafted into professional sports leagues, known as mock drafts. Using a novel dataset of mock drafts for the National Basketball Association (NBA), we explore mock drafts' ability to forecast the actual draft. We analyze authors' mock draft accuracy over time and ask how we can reasonably aggregate information from multiple authors. For both of these tasks, mock drafts are usually analyzed as ranked lists, and in this paper we propose ways to improve on these methods. We propose that rank-biased distance is the appropriate error metric for measuring the accuracy of mock drafts as ranked lists. To best combine information from multiple mock drafts into a single consensus mock draft, we also propose a combination method based on the ideas of ranked-choice voting. We show that this method provides improved forecasts over the standard Borda count combination method used for most similar analyses in sports, and that either combination method provides a more accurate forecast across seasons than any single author. 
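For readers unfamiliar with the two baseline ideas named above, the following is a minimal sketch, not the authors' implementation. It shows a plain Borda count consensus and a truncated, renormalized form of rank-biased distance (one minus rank-biased overlap). The function names, the persistence parameter `p=0.9`, the renormalization over a finite depth, and the three toy mock drafts are all assumptions made for illustration.

```python
from collections import defaultdict


def borda_consensus(mock_drafts):
    """Combine ranked lists by Borda count.

    A player at 0-indexed position i in a list of length n earns n - i points;
    players missing from a list earn nothing from it. Ties break alphabetically.
    """
    scores = defaultdict(int)
    for draft in mock_drafts:
        n = len(draft)
        for position, player in enumerate(draft):
            scores[player] += n - position
    return sorted(scores, key=lambda player: (-scores[player], player))


def rank_biased_distance(ranking_a, ranking_b, p=0.9):
    """Truncated, renormalized rank-biased distance between two ranked lists.

    At each depth d, the agreement is the share of the top-d items the lists
    have in common. Agreements are weighted by p**(d - 1), so disagreements
    near the top of the draft cost more. Dividing by the total weight is a
    simplification (an assumption here) so identical lists score exactly 0.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    weighted_agreement = 0.0
    total_weight = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        weight = p ** (d - 1)
        weighted_agreement += weight * len(seen_a & seen_b) / d
        total_weight += weight
    return 1.0 - weighted_agreement / total_weight


# Hypothetical mock drafts from three authors (player labels are placeholders).
mocks = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
consensus = borda_consensus(mocks)
top_swap = rank_biased_distance(["A", "B", "C"], ["B", "A", "C"])
bottom_swap = rank_biased_distance(["A", "B", "C"], ["A", "C", "B"])
```

In this toy case a swap at the top of the list yields a larger distance than a swap at the bottom, which is the property that makes a top-weighted metric attractive for drafts.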

Keywords

Expert elicitation

Rank-biased overlap

Borda count

Ranked-choice voting

Instant-runoff voting

Rank aggregation 

View Abstract 2213

Co-Author

Colin Montague, Sacramento Kings

First Author

Jared Fisher, Brigham Young University

Presenting Author

Jared Fisher, Brigham Young University

Competing Risks Analysis of MLB Draft Data

Baseball is unique among the major US sports in that nearly every player who is drafted will spend significant time in the minor leagues (MiLB) before reaching the major league (MLB). Beginning in 2021, the MLB draft was cut in half from 40 rounds to 20, yet most draftees will still spend years in MiLB and retire before making it to the big leagues. This research applies competing risks analysis to investigate how different draft day factors influence the time it takes draftees to either reach MLB or retire before doing so. The results suggest that position, pick number, type (high school vs. college), and bonus as a proportion of slot are all important features. This approach can be used to quantify a draftee's likelihood of reaching MLB or retiring over time based on these features, which can be of immense use to the players, their agents, and even the teams drafting them. 
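To make the competing-risks framing concrete, here is a small sketch of the nonparametric cumulative incidence estimator (Aalen-Johansen form) for two competing outcomes. It is not the covariate-adjusted regression the abstract describes, only the unadjusted quantity such models build on. The function name, the cause coding (1 = reached MLB, 2 = retired, 0 = still active and censored), and the six toy records are hypothetical.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Nonparametric cumulative incidence for one cause among competing risks.

    times:  years from draft to the first event or to censoring
    causes: 0 for censored, any positive integer for an event type
    Returns a list of (time, cumulative incidence) pairs at each event time.
    """
    event_times = sorted({t for t, c in zip(times, causes) if c != 0})
    overall_survival = 1.0  # probability of no event of any type so far
    incidence = 0.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        events_all = sum(1 for u, c in zip(times, causes) if u == t and c != 0)
        events_k = sum(
            1 for u, c in zip(times, causes) if u == t and c == cause_of_interest
        )
        # Probability of surviving all causes up to just before t,
        # then experiencing the cause of interest at t.
        incidence += overall_survival * events_k / at_risk
        overall_survival *= 1 - events_all / at_risk
        curve.append((t, incidence))
    return curve


# Hypothetical draftees: 1 = reached MLB, 2 = retired, 0 = censored.
years = [1, 2, 2, 3, 4, 5]
outcome = [1, 2, 1, 0, 1, 2]
reach_mlb = cumulative_incidence(years, outcome, cause_of_interest=1)
retire = cumulative_incidence(years, outcome, cause_of_interest=2)
```

Treating retirement as a competing event rather than as censoring matters here: a player who retires can no longer reach MLB, so a standard Kaplan-Meier curve that censors retirees would overstate the probability of reaching the majors.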

Keywords

competing risks

Fine-Gray model

MLB draft

survival analysis

baseball 

View Abstract 2613

First Author

Eric Gerber

Presenting Author

Eric Gerber

Regression models for estimation of park effects in Major League Baseball

It is well known that some ballparks in Major League Baseball are more conducive to scoring than others. Estimation of "park factors" that quantify these differences has received considerable attention in industry and in the literature, but has not been without criticism. We make two contributions towards improving the estimation of these effects. First, we compute, for each ballpark, runs and home runs achieved by all players (home and visiting) with plate appearances at the park, when visiting all other parks. This "elsewhere" measure of performance can be used to quantify the offensive strength of schedule observed at each park. Second, we fit generalized linear models to test data to estimate probabilities of a variety of outcomes (e.g., home runs, doubles, foulouts) that are specific to batter-pitcher handedness combinations. These regression models use handedness-specific relative frequencies of events, computed using training data, as explanatory variables. The fitted models are used to compute handedness-specific event probabilities adjusted to league averages of event probabilities, which we define as park factors. 

Keywords

Generalized linear models

Regression

Analysis of covariance

Covariate adjustment

Baseball

Park factors 

View Abstract 3657

Co-Author

Richard Levine, San Diego State University

First Author

Jason Osborne, North Carolina State University

Presenting Author

Jason Osborne, North Carolina State University

Memory Learning: A Computational Approach to Estimating Memory Bias in Human Decision Making

In this paper we introduce a novel inverse decision problem formulation which we call "Memory Learning". Given a data set of human decisions and their consequences (i.e. rewards), we consider the situation in which the decisions appear to be "sub-optimal" according to a statistical analysis of the data. Our proposed method seeks to explain these deviations by learning a reweighting of observations, or 'memories', such that the analytical model trained on the reweighted observations matches the observed human behavior. We interpret the reweighting of the observations as a representation of the memory bias inherent in the decision-maker's choices. To bridge the gap between theoretical models and real-world decisions we explore various strategies for learning optimal weightings, employing both analytical and simulation methods. Finally, we introduce a unique iterative resampling approach to apply our method to the well-studied fourth down decision in professional football. Remarkably, our research reveals that our Memory Learning approach outperforms traditional classification methods in predicting coach decisions. 

Keywords

sport

inverse optimization

fourth down

bootstrap

resampling 

View Abstract 3592

Co-Author

Nathan Sandholtz, Brigham Young University

First Author

Connor Thompson

Presenting Author

Connor Thompson

Statistical Adjustment for "Prevent Defense" When Evaluating Team Performance in Soccer

Having looked at the full match statistics for the England-France 2022 FIFA World Cup quarterfinal, one could come away thinking "England lost despite having played better than France": 16 to 8 shot attempts, 8 to 5 shots on target, and 5 corners to France's 2, yet a 1-2 loss. What this disregards is the score situation: in the 40 minutes when the match was tied (0-0, 1-1), France led in all of the mentioned statistical categories, while consciously ceding the initiative to England in the 66 minutes when up a goal, a tactic we refer to as "prevent defense" (a term borrowed from American football). We use match event sequencing data across five European club leagues over the past 15 years to study the impact of prevent defense on the aforementioned statistical categories and on goal-scoring tendencies for teams when trailing, leading, or tied. To do so, we leverage categorical and count response modeling approaches with predictors that could reasonably affect the likelihood of a given team implementing a prevent defense tactic: besides scoring differential, these include prematch booking odds (to gauge the relative levels of opponents), red cards, and time in the match. 

Keywords

Sports statistics


Categorical data

Multivariate regression 

View Abstract 2694

Co-Author(s)

Andrey Skripnikov, New College of Florida
David Gillman, New College of Florida

First Author

Ahmet Cemek, New College of Florida

Presenting Author

Andrey Skripnikov, New College of Florida