CS027 Practice and Applications, Part 1

Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/26/2023: 9:50 AM - 11:20 AM CDT
Lightning 
Room: Grand Ballroom C 

Description

This session will be followed by e-poster presentations on Friday, 5/26 at 11:20 AM.

Chair

Ellen Breazel, Clemson University

Tracks

Machine Learning
Practice and Applications
Symposium on Data Science and Statistics (SDSS) 2023

Presentations

BEACON: An Industry Classification Tool

Every five years, the U.S. Census Bureau conducts the Economic Census – an extensive survey covering approximately 8 million business establishments that provides a detailed view of the U.S. economy. BEACON (Business Establishment Automated Classification of NAICS) is a machine learning tool currently being used in the 2022 Economic Census to help respondents self-classify their primary business activity in terms of NAICS (North American Industry Classification System). Using BEACON is similar to using a search engine. The respondent provides a short business description, and then BEACON returns a list of relevant NAICS codes. The methodology involves natural language processing, machine learning, and information retrieval. This presentation provides an overview of BEACON and its performance so far in the 2022 Economic Census. 
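
To give a rough, illustrative flavor of the search-engine-style retrieval BEACON performs (this is a toy sketch, not the Census Bureau's actual model), the snippet below ranks a few placeholder NAICS code titles against a free-text business description using TF-IDF and cosine similarity; the mini code list and function names are hypothetical.

```python
# Illustrative sketch only: rank placeholder NAICS descriptions against a
# respondent's business description using TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-index of NAICS codes and titles (placeholder data).
naics = {
    "722511": "full-service restaurants",
    "722513": "limited-service restaurants",
    "445110": "supermarkets and grocery stores",
    "541511": "custom computer programming services",
}

codes, titles = zip(*naics.items())
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
index = vectorizer.fit_transform(titles)          # TF-IDF matrix for the code titles

def classify(description, top_k=3):
    """Return the top_k most similar NAICS codes for a business description."""
    query = vectorizer.transform([description])
    scores = cosine_similarity(query, index).ravel()
    ranked = sorted(zip(codes, titles, scores), key=lambda t: -t[2])
    return ranked[:top_k]

print(classify("we run a small pizza restaurant with table service"))
```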

Presenting Author

Brian Dumbacher, US Census Bureau

First Author

Brian Dumbacher, US Census Bureau

Big Data Analytics, Data Science, ML & AI for Connected, Data-Driven Precision Agriculture and Smart Farming Systems: Challenges and Future Directions

Big data and data science applications in modern agriculture are rapidly evolving as data technology advances and more computational power becomes available. The adoption of big data has enabled farmers to optimize their agricultural activities sustainably with cutting-edge technologies, resulting in eco-friendly and efficient farming. Wireless Sensor Networks (WSNs) and Machine Learning (ML) have had a direct impact on smart and precision agriculture, with Deep Learning (DL) techniques applied to data collected via sensor nodes. Additionally, robotics, the Internet of Things (IoT), and drones are being incorporated into farming techniques. Digital data handling has amplified the information wave, and information and communication technology (ICT) has been used to deliver benefits for both farmers and consumers. This work highlights the technological implications and challenges that arise in data-driven agricultural practices, as well as the research problems that need to be solved.

Presenting Author

David Han, University of Texas at San Antonio

First Author

David Han, University of Texas at San Antonio

Home range and spatial interaction modelling of black bears

Interaction between individuals within the same species is an important component of population dynamics. An interaction can be either static (based on spatial overlap) or dynamic (based on movement interactions). Using GPS collar data, we can quantify both static and dynamic interactions between black bears. The goal of this work is to determine the level of black bear interactions using the 95% and 50% home ranges, as well as to model black bear spatial interactions, which could be attraction, avoidance/repulsion, or a lack of interaction altogether, in order to gain new insights and improve our understanding of ecological processes. Recent methodological developments in home range estimation, inhomogeneous multitype/cross-type summary statistics, and envelope testing methods are explored to study the nature of black bear interactions. Our findings generally indicate that black bears of one type in our data set tend to cluster around those of another type.
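
As a minimal illustration of the home range component only (not the authors' estimator, and using synthetic GPS fixes rather than collar data), the sketch below builds a kernel utilization distribution and reports the areas of the highest-density regions containing 50% and 95% of the estimated distribution.

```python
# Illustrative sketch: kernel utilization distribution from synthetic GPS fixes,
# with density thresholds enclosing 50% and 95% of the estimated distribution.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
fixes = rng.normal(loc=[0.0, 0.0], scale=[2.0, 1.0], size=(500, 2))  # fake GPS fixes (x, y)

kde = gaussian_kde(fixes.T)

# Evaluate the density on a grid covering the fixes.
x = np.linspace(fixes[:, 0].min() - 3, fixes[:, 0].max() + 3, 200)
y = np.linspace(fixes[:, 1].min() - 3, fixes[:, 1].max() + 3, 200)
X, Y = np.meshgrid(x, y)
density = kde(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)
cell_area = (x[1] - x[0]) * (y[1] - y[0])

def home_range_area(level):
    """Area of the highest-density region containing `level` of the distribution."""
    d = np.sort(density.ravel())[::-1]
    mass = np.cumsum(d) * cell_area
    threshold = d[np.searchsorted(mass, level)]
    return (density >= threshold).sum() * cell_area

print("50% core area:", home_range_area(0.50))
print("95% home range:", home_range_area(0.95))
```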

Presenting Author

Fekadu Bayisa, Auburn University

First Author

Fekadu Bayisa, Auburn University

CoAuthor(s)

Christopher L. Seals, Auburn University
Hannah J. Leeper, Auburn University
Elvan Ceyhan, Auburn University
Todd Steury, Auburn University

Integration of Statistical Analysis and Machine Learning with Two-Sided Matching to Achieve Win-Win

For many business applications, statistical modeling or machine learning is commonly employed to optimize an objective. While successful in practice, this approach is one-sided, typically reflecting the developer or corporate perspective, which may not be beneficial to the target audience (e.g., customers, employees). In this paper, we propose a mutually beneficial approach using two-sided matching. Consider the following two cases.

Customer-Product Matching in Marketing: Product recommendation engines are often developed to maximize customer purchase or engagement by selecting customers who are most responsive to a product offer or marketing intervention. On the other hand, if customer preference or experience can be quantified, a model can be trained to recommend the right products such that customer preference or value to customer is maximized. How do we integrate the models to drive both value to customer and value to business?

Employee-Project Matching in Project Assignment: If a firm has a group of employees with certain skills (e.g., data scientists) and a large number of projects, how should the employees be assigned to projects? Often it is based on historical experience, skills, and availability, which are important factors driving success, aided by statistical analysis or predictive/prescriptive analytics that benefits the firm. However, if the employees' perspective is taken as a priority, their interests can be captured to construct a machine learning or rule-based model.

We propose an approach using the deferred acceptance algorithm in conjunction with statistical modeling and machine learning methods to generate the optimized solutions, which achieve "stability" in the sense that no pair of agents (customer/product, employee/project in our cases) would prefer each other to their match recommended by the algorithm. 
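
The sketch below shows a minimal deferred acceptance (Gale-Shapley) step with toy, hand-written preference lists; in the proposed approach the preferences on each side would instead come from the statistical or machine learning models, and the employee/project names here are placeholders.

```python
# Illustrative sketch of deferred acceptance (Gale-Shapley) with toy preference
# lists: employees propose, projects tentatively accept their best proposer.
def deferred_acceptance(proposer_prefs, reviewer_prefs):
    """One-to-one stable matching; both inputs map name -> ordered preference list."""
    rank = {r: {p: i for i, p in enumerate(prefs)} for r, prefs in reviewer_prefs.items()}
    free = list(proposer_prefs)                 # proposers not yet matched
    next_choice = {p: 0 for p in proposer_prefs}
    match = {}                                  # reviewer -> proposer

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]   # best reviewer not yet proposed to
        next_choice[p] += 1
        current = match.get(r)
        if current is None:
            match[r] = p
        elif rank[r][p] < rank[r][current]:     # reviewer prefers the new proposer
            match[r] = p
            free.append(current)
        else:
            free.append(p)
    return {p: r for r, p in match.items()}

# Toy example: in practice the preference orderings are model-derived.
employees = {"ana": ["proj1", "proj2"], "bo": ["proj1", "proj2"]}
projects = {"proj1": ["bo", "ana"], "proj2": ["ana", "bo"]}
print(deferred_acceptance(employees, projects))   # {'bo': 'proj1', 'ana': 'proj2'}
```

The resulting matching is stable in the sense described above: no employee and project both prefer each other to their assigned partners.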

Presenting Author

Ping Yao

First Author

Ping Yao

CoAuthor(s)

Victor Lo, Fidelity Investments
Srikar M, Fidelity Investments
Jason Moser, Fidelity Investments
Arsalan Khursheed, Fidelity Investments

Neuromarketing and Decision-Making: A Machine Learning Approach Using EEG and Brain Region Analysis

Neuromarketing is the study of how consumers' brains respond to products and services and how these responses influence their behavior. One of the most common techniques used in neuromarketing is electroencephalography (EEG), which measures the electrical activity of the brain's surface and can provide valuable insights into consumer preferences and decision-making processes. The ultimate goal of this study is to assess the relative importance of right/left brain regions (including hemispheres and the frontal, temporal, parietal, and occipital lobes) that might be associated with consumer choice toward e-commerce products, using a publicly available neuromarketing dataset. Change in EEG signal has been evaluated using a mixed model for repeated measures for all brain regions. The objective of this study is to build a classification system that can distinguish the EEG characteristics of consumers' like versus dislike preferences based on different classifiers.
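
As a rough sketch of the classification step only (synthetic region-level features stand in for the EEG data, and the feature layout is an assumption, not the study's pipeline), the snippet below compares two off-the-shelf classifiers on a binary like/dislike label with cross-validated AUC.

```python
# Illustrative sketch: compare classifiers on synthetic region-level EEG features
# (e.g., band power per lobe/hemisphere) for a binary like/dislike label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200
# Placeholder features: one column per brain region (frontal, temporal, parietal,
# occipital, split by hemisphere). Synthetic, not the neuromarketing dataset.
X = rng.normal(size=(n, 8))
y = (X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=1.0, size=n) > 0).astype(int)  # like = 1

classifiers = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```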

Presenting Author

Ismail El Moudden, Eastern Virginia Medical School

First Author

Ismail El Moudden, Eastern Virginia Medical School

CoAuthor(s)

Mohan Pant, Eastern Virginia Medical School
Rachel L Bradley, Eastern Virginia Medical School
Mounir Ouzir, High Institute of Nursing Professions and Technical Health, Beni Mellal (ISPITS de Beni Mellal), Morocco

Population Obfuscation for Data Privacy and a Masking Problem Solved by Optimal Transport

Data managers are often charged to share data samples that have proprietary or sensitive elements--data entries, variables, or individuals' whole records--whose privacy must be maintained. Meeting these conflicting goals of data access and privacy is a challenging sample obfuscation problem that has been broadly studied from a variety of perspectives. Population obfuscation, by contrast, protects information and features of a whole statistical population of data, the population being represented by an algorithm, formula, model, or sampling plan from which unlimited numbers of data records can be produced. We propose a problem in population obfuscation in which two samples are given: a large sample from a population with a subset of variables that must be masked and a small sample of masked data. This situation can arise in the case of, for example, archived data being repurposed for new analyses. This is a data augmentation problem in which a small data set--the masked data--is supported by a large data set--the marked data--from a different, but related source. A solution to this problem based on Monge-Kantorovich-formulated optimal transport (OT) is explored. OT finds the unique optimal map, or push-forward operator, to transform one probability distribution to another. Experiments using earth mover distance to quantify learning error are conducted to determine the effectiveness of the OT solution approach relative to the masked sample size. These experiments involve five factors: covariance and shape of the population marked for masking, number of population variables, choice of variable(s) to be masked, and different types (linear/non-linear) of masking map. These experiments show 1) that marked data can effectively augment a limited set of masked data, and 2) that the OT solution's masking error decreases log-log linearly with training data sample size, with a constant log-log slope, not significantly different from −1/2 in the two-variable case. 
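
The full problem is multivariate Monge-Kantorovich OT, but the one-dimensional special case conveys the push-forward idea: under quadratic cost the 1D OT map is the monotone quantile-matching map. The sketch below (synthetic data, hypothetical masking map) estimates that map from a small masked sample, applies it to the large marked sample, and uses earth mover distance to check the result.

```python
# Illustrative 1D sketch of the push-forward idea: estimate the quantile-matching
# (1D optimal transport) map from a small masked sample and apply it to the
# large marked sample; earth mover distance quantifies the fit.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
marked = rng.normal(0.0, 1.0, size=5000)                  # large sample, variable marked for masking
true_mask = lambda x: np.tanh(x) + 0.1 * x                # hypothetical (unknown to us) masking map
masked_small = true_mask(rng.normal(0.0, 1.0, size=200))  # small masked sample

# Empirical OT map: match quantiles of the marked data to quantiles of the masked data.
q = np.linspace(0.0, 1.0, 201)
src_q = np.quantile(marked, q)
dst_q = np.quantile(masked_small, q)
pushed = np.interp(marked, src_q, dst_q)                  # push-forward of the marked sample

# Earth mover (Wasserstein-1) distance against a large held-out masked sample.
holdout = true_mask(rng.normal(0.0, 1.0, size=5000))
print("EMD(pushed, holdout):", wasserstein_distance(pushed, holdout))
print("EMD(marked, holdout):", wasserstein_distance(marked, holdout))
```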

Presenting Author

Angela Folz, University of Colorado Boulder

First Author

Angela Folz, University of Colorado Boulder

CoAuthor(s)

Michael Frey, National Institute of Standards & Technology
Adam Wunderlich, Communications Technology Laboratory, National Institute of Standards and Technology

Power Grid Data Quality Filter and Machine Learning for Event Classification

The stability and reliability of the power grid are of great importance to the nation's economic system and national security. The power grid is a complex system that has many interconnected networks. With the advent of phasor measurement unit (PMU) data, system operators can view the status of the power system from a wide-area interconnection level. Since there are constant disturbances happening in the power grid, data analytics techniques could be valuable for applications on PMU data that inform operators of interesting and significant power system events. In this study, we develop a data processing and machine learning approach that handles near-real-time PMU data for detecting and classifying power system events. This paper provides details regarding the techniques we use for filtering out common PMU data quality issues, such as frequency channel extreme values, locked frequency channels, missing data that leads to false spikes, and unreliable derived frequency. After data pre-processing, an atypicality engine is used to flag atypical minutes in the PMU data. The atypicality score is mainly based on principal component analysis and clustering. Also presented is a machine learning classifier using a gradient boosting machine (GBM) that distinguishes generator trips from other types of power system events, since generator trips are usually more significant to system operators than other event types. This classifier works on features extracted from the time series PMU channels, such as frequency and phase angle difference. For the other types of power system events, we also develop metrics based on k-means clustering to characterize them and discover interesting events. Six metrics are created that focus on sudden changes and gradual shifts in the data. Testing is conducted on real-world PMU data, showing that the approach works as expected and achieves satisfactory results.
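
The sketch below illustrates only the PCA part of the atypicality idea (the operational engine also uses clustering): score minute-level feature vectors by PCA reconstruction error and flag the largest scores. The data are synthetic, with one injected "event" minute.

```python
# Illustrative sketch: PCA reconstruction error as an atypicality score for
# minute-level PMU feature vectors (synthetic data, not the operational engine).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
minutes = rng.normal(size=(1440, 20))               # one day of minute-level features
minutes[700] += 8.0                                 # inject an "event" minute

X = StandardScaler().fit_transform(minutes)
pca = PCA(n_components=5).fit(X)
recon = pca.inverse_transform(pca.transform(X))
score = np.linalg.norm(X - recon, axis=1)           # atypicality = reconstruction error

threshold = np.percentile(score, 99.5)
print("flagged minutes:", np.where(score > threshold)[0])
```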

Presenting Author

Tianzhixi Yin, Pacific Northwest National Laboratory

First Author

Tianzhixi Yin, Pacific Northwest National Laboratory

CoAuthor(s)

Nick Betzsold, Pacific Northwest National Laboratory
James Follum, Pacific Northwest National Laboratory
Shuchismita Biswas, Pacific Northwest National Laboratory

Statistical Properties of Solar Flare Dependency

As machine learning methods become more prevalent within the solar flare prediction community, a complete understanding of the distributions which govern the flare process is needed for appropriate statistical modeling. In order to analyze the dependency structure that subsequent flares exhibit, we use hypothesis testing to identify time intervals in which flaring events are highly dependent as well as time intervals in which they appear to be independent. Information from this analysis could be used to improve operational solar flare prediction systems, where forecasts are constantly updated with the most up-to-date information.
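
One simple way to probe this kind of dependence (not necessarily the authors' test) is to ask whether consecutive waiting times between events are associated, since they would be independent under a homogeneous Poisson flaring process. The sketch below runs such a check on synthetic, deliberately correlated waiting times.

```python
# Illustrative sketch: test whether consecutive waiting times between events are
# associated, which would not be the case for an independent (Poisson) process.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
n = 1000
z = np.empty(n)
z[0] = rng.normal()
for t in range(1, n):
    z[t] = 0.6 * z[t - 1] + rng.normal()      # correlated log-waiting times (synthetic)
waits = np.exp(z)

rho, p_value = spearmanr(waits[:-1], waits[1:])   # lag-1 association of waiting times
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```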

Presenting Author

Noah Kochanski, University of Michigan

First Author

Noah Kochanski, University of Michigan

CoAuthor

Yang Chen, University of Michigan

Truth or consequences? A principled path to evaluating classifiers using survey data

Surveys are commonly used to facilitate empirical social science research. Due to several constraints, they are often not simple random samples. Therefore, respondents are usually assigned weights indicative of their relative worth in a statistical procedure. It has been proven that using weights produces unbiased estimates of population totals and accurate explanatory models of outcomes. However, predictive modeling, which has become popular in the social sciences, does not traditionally incorporate representative weighting in model development or assessment. This research investigates whether weighted performance measures on survey testing data, used with well-established model development approaches, produce reliable estimates of population performance. We test this using simulated stratified sampling, both under known relationships between predictors and outcomes and with real-world data. We show that unweighted metrics on sample testing data for models subject to default train/test cycles do not represent population performance, but weighted metrics do. We also show that the same holds for models trained using methods directly orthogonal to population representation, such as upsampling for mitigating class imbalance. Our results suggest that regardless of development procedure, weighted metrics should be used when evaluating performance on sample test data. 
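
To make the evaluation step concrete (a minimal sketch with synthetic data and stand-in survey weights, not the paper's simulation design), the snippet below compares unweighted and weighted test-set accuracy by passing the respondents' weights through the metric.

```python
# Illustrative sketch: unweighted vs. survey-weighted test-set accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 4000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)
weights = rng.uniform(0.2, 5.0, size=n)             # stand-in survey weights

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, weights, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("unweighted accuracy:", accuracy_score(y_te, pred))
print("weighted accuracy:  ", accuracy_score(y_te, pred, sample_weight=w_te))
```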

Presenting Author

Adway Wadekar

First Author

Adway Wadekar

CoAuthor

Jerome Reiter, Duke University

Unique Implementation Methods for Machine Learning Models in SQL Server

As artificial intelligence becomes more integrated into the business landscape, implementing model inference in production software environments is becoming an increasingly vital topic. Though training and refitting of a model can be done locally, inference typically needs to be performed and supported in production. This means not only must the process live in an environment different from where it was trained, but it often must also be supported by a team that did not build the initial model. Finding a method of inference that this team can effectively support is what allows the model to move to production.

Several variables go into deciding which method is appropriate. Real-time inference is typically performed with an inference-specific endpoint. Custom prediction code is occasionally used, but its benefits typically do not outweigh the significant increase in complexity. With batch-processed inference at fixed intervals, the implementation methods can vary significantly more. Inference endpoints are still common, but an FTP transfer with a triggered process is also an easily implemented method. Even so, requirements such as security concerns and infrastructure constraints can often make any of the solutions described above infeasible.

In this lightning session, we discuss the implementation of machine learning model inference using Structured Query Language (SQL). A real-world example will demonstrate the technique used to put a random forest into production when records had to be processed in limited time without the use of SQL functions and case statements. Instead, the model is built using a table of splits and dynamically generated update statements to make predictions based on an input table. The model, which predicts ideal consumer contact times, processes millions of records nightly using this approach. The method for creating this process, its pitfalls, and retraining methods will be discussed.
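
As a rough Python-side illustration of the "table of splits" idea (the production system itself is SQL, and the column names here are placeholders rather than the authors' schema), the sketch below flattens a fitted scikit-learn tree into one row per node, a form that could be loaded into a database table and walked with generated update statements.

```python
# Illustrative sketch: flatten a fitted decision tree into a "table of splits"
# (one row per node) suitable for loading into a database table.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y).tree_

rows = []
for node in range(tree.node_count):
    is_leaf = tree.children_left[node] == -1
    rows.append({
        "node_id": node,
        "feature": None if is_leaf else int(tree.feature[node]),
        "threshold": None if is_leaf else float(tree.threshold[node]),
        "left_child": int(tree.children_left[node]),
        "right_child": int(tree.children_right[node]),
        "prediction": int(np.argmax(tree.value[node])) if is_leaf else None,
    })

for r in rows[:5]:
    print(r)
```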

Presenting Author

Katie Bakewell, NLP Logix

First Author

Katie Bakewell, NLP Logix

Using Causal Inference to Inform Survey Administration

A common question asked in a survey research institute is, "What impact does an increase in survey incentive amount have on survey completion rates?" Many decisions concerning increasing completion rates are based on analyzing past survey designs and resulting responses. An analyst typically gathers as many covariates as possible, adds them to a regression model, and interprets the beta coefficient of incentives as a causal treatment effect. However, this approach is incorrect. Ignoring the survey design and how records are kept in a database can lead to selection biases and spurious associations. This poster aims to show how inverse probability weights (IPW) can inform survey administrators how large an incentive should be offered to increase completion rates for a specific survey. IPW requires knowledge of the causal relationships between variables to recover causal effects. A directed acyclic graph (DAG) can be used to represent the causal relationships visually and help with covariate selection. The weights can then be used in a regression model to obtain a causal estimate. R was used to create the DAG, IPW, and regression models. This was done with three packages: "dagitty," "twangContinuous," and "survey." Three years' worth of NORC's AmeriSpeak survey completion data was used to allocate an appropriate incentive amount to meet a client's completion rate requirement for a survey. AmeriSpeak is a nationally representative probability-based panel of survey respondents from households across the U.S. The survey completion records come from an administrative database, where the intent is record-keeping, not research. A DAG was created by combining the survey designs and knowledge of how the data were stored. The DAG helped inform the choice of covariates in creating the IPWs. Finally, the weights were used in a regression model to produce reliable estimates of incentive effects on completion rates. The estimate is compared to that of a conventional approach.
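
The poster's analysis uses R packages; as a simplified Python stand-in (synthetic data, a normal density model for the continuous incentive, and a linear probability outcome model), the sketch below shows the stabilized inverse-probability-weight idea followed by a weighted regression.

```python
# Illustrative sketch: stabilized inverse probability weights for a continuous
# "incentive" treatment, then a weighted regression of completion on incentive.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 5000
x = rng.normal(size=n)                                # confounder (e.g., prior response history)
incentive = 5 + 2 * x + rng.normal(scale=1.0, size=n)
completion = (0.3 * incentive - 1.0 * x + rng.normal(size=n) > 1.5).astype(int)

# Model the treatment given the confounder to get a conditional density estimate.
tmodel = sm.OLS(incentive, sm.add_constant(x)).fit()
mu_cond = tmodel.fittedvalues
sd_cond = np.sqrt(tmodel.mse_resid)

# Stabilized weights: marginal density of treatment / conditional density.
num = norm.pdf(incentive, incentive.mean(), incentive.std())
den = norm.pdf(incentive, mu_cond, sd_cond)
weights = num / den

# Weighted (linear probability) regression of completion on incentive.
design = sm.add_constant(incentive)
wls = sm.WLS(completion, design, weights=weights).fit()
print(wls.params)
```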

Presenting Author

Frank Rojas, NORC at the University of Chicago

First Author

Frank Rojas, NORC at the University of Chicago

Using Clustering to Analyze Positions in Professional Basketball Through Different Eras

Basketball is a sport played between two teams of five players, with each player typically assigned to one of five traditional positions. These positions then help define a player's role during a game. The style of play in basketball has evolved over time, and this has caused some to question the usefulness of labelling players using the five traditional positions. This has motivated research into defining new positional roles for professional basketball players. Most of the work in this area has focused on analyzing the modern National Basketball Association (NBA), but little has been done comparing styles of play across different eras. To investigate this further, this project uses k-means clustering to analyze multiple years of performance data from the NBA in order to study the evolution of player positions across different eras. The findings of this research can be used to better understand how the game has changed and can also aid in player evaluation and planning for team composition. 
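
The sketch below illustrates the clustering step only: standardize per-game statistics and run k-means, then inspect the average profile of each cluster. The stat columns and data are placeholders, not the project's NBA dataset; in practice the data would be split by era and the cluster profiles compared across eras.

```python
# Illustrative sketch: k-means on standardized per-game stats to define
# data-driven positions; cluster profiles would be compared across eras.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
cols = ["pts", "reb", "ast", "stl", "blk", "fg3a"]    # placeholder stat columns
players = pd.DataFrame(rng.gamma(shape=2.0, scale=3.0, size=(300, len(cols))), columns=cols)

X = StandardScaler().fit_transform(players)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

players["cluster"] = km.labels_
print(players.groupby("cluster").mean().round(2))     # per-cluster average profile
```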

Presenting Author

Tyler Cook, University of Central Oklahoma

First Author

Tyler Cook, University of Central Oklahoma

CoAuthor

Nomel Esso, University of Central Oklahoma