CS020 Practice and Applications, Part 1

Conference: Symposium on Data Science and Statistics (SDSS) 2024
06/06/2024: 1:15 PM - 2:45 PM EDT
Lightning 
Room: Shenandoah 

Description

This session will be followed by an e-poster session on June 6 from 2:45 - 3:10 PM.

Chair

Sunghwan Byun, North Carolina State University

Tracks

Practice and Applications
Symposium on Data Science and Statistics (SDSS) 2024

Presentations

A Copula Model Approach to Identify Differentially Expressed Genes

Microarray technology is instrumental in pinpointing differentially expressed genes (DEGs) from the vast number of genes on a DNA molecule. The spotted cDNA array and the oligonucleotide array are the two primary microarray types used for detecting gene expression, and our focus is on the former. Various methods have been proposed in the literature to identify DEGs, such as those by Newton et al. (2001) and Mav & Chaganty (2004). In this research, we use a Gaussian copula to construct a joint distribution for the red and green intensities in cDNA microarrays. We also incorporate a latent Bernoulli variable to indicate the presence of differential expression and use the EM algorithm to estimate the model parameters. By calculating posterior probabilities and ranking them, we identify DEGs in the analysis of five microarray E. coli samples originally studied in Richmond et al. (1999). Our findings show that, as expected, the "Control" sample has no DEGs, the IPTG samples have a few DEGs, and the Heat shock samples have many DEGs. 
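As a rough illustration of the posterior-ranking idea (not the authors' copula model), the sketch below fits a two-component univariate Gaussian mixture by EM to simulated log-ratios and ranks observations by the posterior probability of differential expression; all data and parameter choices are hypothetical:

```python
import numpy as np

def em_mixture_posteriors(x, n_iter=50):
    """EM for a two-component univariate Gaussian mixture.

    Component 0 ~ non-differentially-expressed genes (centered near 0),
    component 1 ~ differentially expressed genes (shifted). Returns the
    posterior probability that each observation belongs to component 1.
    """
    # crude initialization: seed the "differential" mean from the larger |x|
    pi = 0.5
    mu0 = 0.0
    mu1 = np.mean(x[np.abs(x) > np.median(np.abs(x))])
    s0 = s1 = np.std(x)

    def norm_pdf(v, mu, s):
        return np.exp(-0.5 * ((v - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    for _ in range(n_iter):
        # E-step: posterior responsibility of the "differential" component
        p1 = pi * norm_pdf(x, mu1, s1)
        p0 = (1 - pi) * norm_pdf(x, mu0, s0)
        tau = p1 / (p0 + p1)
        # M-step: update mixing weight and component parameters
        pi = tau.mean()
        mu1 = np.sum(tau * x) / np.sum(tau)
        mu0 = np.sum((1 - tau) * x) / np.sum(1 - tau)
        s1 = np.sqrt(np.sum(tau * (x - mu1) ** 2) / np.sum(tau))
        s0 = np.sqrt(np.sum((1 - tau) * (x - mu0) ** 2) / np.sum(1 - tau))
    return tau

rng = np.random.default_rng(0)
# simulate log-ratios: 900 null genes near 0, 100 shifted (differential)
x = np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)])
tau = em_mixture_posteriors(x)
# rank genes by posterior probability of differential expression
top = np.argsort(tau)[::-1][:100]
```

The full model replaces the simple mixture density with a Gaussian-copula joint distribution for the paired intensities, but the E-step/ranking logic is analogous.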

Presenting Author

N. Rao Chaganty, Old Dominion University

First Author

Prasansha Liyanaarachchi

A non-parametric approach to predicting recruitment for randomized clinical trials

Successfully recruiting the prespecified number of trial participants is critical to the success of clinical trials and remains challenging. Although various types of prediction models for recruitment have been developed in the past, they relied either on assumptions of parametric distributions or on prior information about the recruitment rate. We developed a recruitment model using a simulation-based non-parametric approach for clinical trials in inpatient settings, such as those taking place in acute care for the elderly (ACE) units at UTMB. We examined recruitment logs, studied patterns, and evaluated parametric assumptions. We found that violation of these assumptions is common in real-world settings. We then simulated future enrollment based on the empirical distribution from the recruitment logs. We proposed a weighted approach that places higher weights on enrollment from dates closer to the most recent enrollment dates. Using simulated distributions and resampling techniques, we calculated confidence intervals for the number recruited by the end of the allotted recruitment period and for the time needed to finalize recruitment. We compared our method with a previously published Bayesian method using our proposed measures of efficiency. The preliminary results demonstrate that a simulation-based non-parametric approach is feasible as a prediction model for clinical trial recruitment. 
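A minimal sketch of the weighted, simulation-based idea on synthetic data (the recruitment log, the linear recency weighting, and the horizon are all hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical recruitment log: daily enrollment counts over 120 observed days
observed = rng.poisson(0.8, 120)

days_left, n_sims = 150, 2000
# recency weights: days closer to the end of the log get higher weight
w = np.arange(1, observed.size + 1, dtype=float)
w /= w.sum()

totals = np.empty(n_sims)
for i in range(n_sims):
    # resample future daily counts from the weighted empirical distribution
    future = rng.choice(observed, size=days_left, replace=True, p=w)
    totals[i] = observed.sum() + future.sum()

# percentile interval for total enrollment at the end of the allotted period
lo, hi = np.percentile(totals, [2.5, 97.5])
```

No parametric form is assumed for daily accrual; everything comes from resampling the observed log.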

Presenting Author

Alejandro Villasante Tezanos, University of Texas Medical Branch

First Author

Alejandro Villasante Tezanos, University of Texas Medical Branch

CoAuthor(s)

Xiaoying Yu, University of Texas Medical Branch at Galveston
Yong-Fang Kuo, University of Texas, Medical Branch
Christopher Kurinec, University of Texas Medical Branch

Utilizing Bayesian Vector Autoregressive Model to Model and Predict the Information Cycle of Crisis Events Related to the Russia-Ukraine Conflict

During crisis events, people routinely post large amounts of information to social media. The discussions of conflict on social media platforms dominate the interest and perception of active users. Unfortunately, social media can also be manipulated to spread disinformation during crisis events that purposefully leads the public away from the truth. This work aims to explore the information cycle during crisis events related to the Ukraine conflict and help stakeholders participate in a timely manner for the public good. We use a Bayesian vector autoregressive model to investigate posting behavior across a series of crisis events that occurred during the Ukraine War. The goal of this study is to predict the time-dependent volume of discussion of new crisis events related to this conflict on social media platforms. We also detect the change points of these events, where a sharp drop or a sudden increase occurs. All this information can then be used to help policymakers decide how to react before the maximum dissemination of false information. Our results show that blogs tend to exhibit self-regulation, with a notable negative correlation between event mentions on consecutive days. This suggests a reactive pattern within the blogosphere, wherein spikes in discussion are typically followed by declines, possibly reflecting the pursuit of novelty in blog content. In contrast, news mentions demonstrate positive momentum, indicating that increases in mentions are likely to endure across multiple days. In essence, by understanding the dynamics of information dissemination on social media during crisis events like the Ukraine conflict, officials can engage with and counteract disinformation. This, in turn, safeguards the public's access to accurate information and enhances crisis response strategies for the greater good. 
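As a simplified, non-Bayesian stand-in for the model described above, the sketch below fits a two-series VAR(1) by ordinary least squares to simulated blog and news mention counts; the coefficient signs mirror the reported self-regulation (negative blog lag) and momentum (positive news lag) patterns, and all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 300
# simulate two mention series: blogs self-correct (negative lag-1 coefficient),
# news has positive momentum (positive lag-1 coefficient)
A = np.array([[-0.4, 0.1],
              [0.2,  0.5]])
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = A @ y[t - 1] + rng.normal(0, 1, 2)

# OLS estimate of the VAR(1) coefficient matrix: y_t = A y_{t-1} + e_t
X, Y = y[:-1], y[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T

# one-step-ahead forecast of mention volume for both series
forecast = A_hat @ y[-1]
```

A Bayesian VAR would place priors on the entries of A and report a posterior instead of a point estimate, but the lag structure is the same.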

Presenting Author

Jacob Britt

First Author

Jacob Britt

CoAuthor(s)

Yifei Wang, Meta
Xiaoxia Champon, North Carolina State University
William Rand, North Carolina State University
Chatura Jayalah, University of Central Florida
Ivan Garibay, University of Central Florida

Assessing Differences in Survival Distributions in Complex Survey Data: A Comparison Study

When analyzing time-to-event data, the Cox proportional hazards model is often used to compare the relative risk of a given outcome (e.g., mortality) across various characteristics. Although several statistical methods have been developed for analyzing complex survey data emanating from population-level health surveys, methods for analyzing time-to-event data from such surveys have not progressed as rapidly. For comparing differences in survival between groups, log-rank and linear rank tests are frequently utilized. As an initial step to facilitate these comparisons for complex survey data, Rader (2014) proposed an approach for comparing censored survival outcomes between two groups in complex surveys based on linear rank tests (log-rank, Peto-Peto, and Harrington-Fleming). While this approach is limited to two groups, it does attempt to control for confounding effects through the use of propensity scores (as opposed to using a stratified test). In addition, Ritter (2021) adapted the non-parametric k-sample tests developed by Gray (1988) to time-to-event data from complex surveys for comparing survival distributions in the presence of competing risks. In the absence of competing risks, Gray's test simplifies to the log-rank test. Because the methods proposed by Rader (2014) and Ritter (2021) have not been previously compared, the primary aim of this presentation is to contrast these approaches for comparing survival distributions in time-to-event data collected from complex health surveys, including differences in the use of covariates and software implementation. These comparisons will be based on a simulation approach utilizing methods developed by Rader (2014) to simulate clustered survival outcomes with a general covariance structure based on a set of covariates. These findings will be useful for extending the analytic uses of time-to-event complex survey data. 
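For reference, a bare-bones two-group log-rank statistic (the test Gray's method reduces to in the absence of competing risks) can be computed as below; this toy version ignores the complex survey design entirely and uses simulated data:

```python
import numpy as np

def logrank_stat(time, event, group):
    """Two-group log-rank chi-square statistic (1 df).

    time: event/censoring times; event: 1 = event, 0 = censored;
    group: 0/1 group labels. No survey weights or design adjustment.
    """
    times = np.unique(time[event == 1])
    O1 = E1 = V = 0.0
    for t in times:
        at_risk = time >= t
        n = at_risk.sum()                      # total at risk at t
        n1 = (at_risk & (group == 1)).sum()    # group-1 at risk at t
        d = ((time == t) & (event == 1)).sum()  # events at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        O1 += d1                               # observed group-1 events
        E1 += d * n1 / n                       # expected under the null
        if n > 1:
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (O1 - E1) ** 2 / V

rng = np.random.default_rng(6)
t0 = rng.exponential(1.0, 200)       # group 0: slower events
t1 = rng.exponential(1 / 3, 200)     # group 1: hazard 3x higher
time = np.concatenate([t0, t1])
event = np.ones(400, dtype=int)      # no censoring in this toy example
group = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])
chi2 = logrank_stat(time, event, group)
```

The survey-adapted versions discussed in the abstract replace these simple counts with design-weighted quantities and variance estimators.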

Presenting Author

John Pleis, National Center for Health Statistics

First Author

John Pleis, National Center for Health Statistics

Attributing Credit and Measuring Impact of Open Source Software Using Fractional Counting

Open source software (OSS) has become essential to knowledge production and innovation in both academic and business sectors around the globe. OSS is developed by a variety of entities and is considered a "unique scholarly activity" due to the complexity of scientific computational tasks and the necessity of cooperation and transparency for research methodology. While the developers of OSS are thought to be very widespread, there remain many questions to be answered about who these contributors are, which contributors (countries, sectors, organizations) are the largest, and how they influence each other.

Using data collected on Python and R packages from GitHub, we leverage fractional-counting methods to measure the exact contribution of each developer and use weighted counting based on the lines of code added to accurately sum the contributions of countries to OSS. We find that for both Python and R, developers from a small group of top countries account for a considerable share of code additions. Developers from the top 10 countries, which include the United States, Germany, the United Kingdom, France, and China, account for 76.1% of the total R repositories and 66.6% of Python repositories.
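A toy illustration of fractional counting weighted by lines of code (repository names, countries, and counts are invented): each repository contributes a total credit of 1.0, split across countries in proportion to lines added.

```python
from collections import defaultdict

# hypothetical commit records: (repo, developer country, lines of code added)
commits = [
    ("pkgA", "US", 300), ("pkgA", "DE", 100),
    ("pkgB", "US", 50),  ("pkgB", "CN", 150),
]

# total lines added per repository
repo_totals = defaultdict(float)
for repo, country, loc in commits:
    repo_totals[repo] += loc

# fractional credit: a country's share of each repo, summed over repos
country_credit = defaultdict(float)
for repo, country, loc in commits:
    country_credit[country] += loc / repo_totals[repo]
```

Here the US earns 300/400 of pkgA plus 50/200 of pkgB, i.e. exactly 1.0 repository-equivalent, and the credits across all countries sum to the number of repositories.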

Next, we use the dependency relationships between packages and study the pairwise connections between countries to measure their respective impact, finding that packages attributed to the United States are most frequently reused by packages from Germany, Spain, Italy, Australia, and the United Kingdom based on the total dependency fractions. In parallel, the United States mostly uses packages from Germany, France, and Denmark.

Influential contributors to OSS can contribute heavily to the priorities and practices of scientific research when their work is widely used or built upon by other researchers. In this context, studying the global distribution, collaboration, and impact of the contributors is important to understanding the landscape of innovation in scientific research. 

Presenting Author

Nick Askew, Westat

First Author

Nick Askew, Westat

CoAuthor(s)

Gizem Korkmaz, Westat
Clara Boothby, National Center for Science and Engineering Statistics

CameraTrapDetectoR: deep learning methods to detect, classify, and count animals in camera trap images

Camera traps are a popular, non-invasive, and cost-effective way to monitor animal populations and evaluate animal behavior and the ecological processes influencing populations. Examples include, but are not limited to, the detection of endangered or invasive species, determining species interactions, predicting population dynamics, and the identification of diseased animals. The time and labor required to manually classify potentially millions of images generated by a single camera array presents a significant challenge; reducing this burden facilitates implementation of larger or longer-lasting camera trap arrays, resulting in more comprehensive analyses and better decisions. To address this challenge, a multi-agency USDA team has developed CameraTrapDetectoR, a free, open-source tool that deploys a series of generalizable deep learning object detection models at the class, family, and species taxonomic levels to detect, classify, and count animals in camera trap images. The tool is available as an R package with an R Shiny interface, a desktop application, or a command-line Python script, so it can be easily integrated into many analytical pipelines. Crucially, the tool enables users to retain complete data privacy. Each model is independently trained on a dataset of 311,584 manually annotated images from 29 unique sites, representing 58 unique families and 177 unique species, currently using a Faster R-CNN model architecture with a ResNet-50 backbone. Median recall accuracy on test data for the most recent models is 87.5% for the species model (n=78, range 51.2% - 100%), 93.9% for the family model (n=33, range 56.2% - 100%), and 98.3% for the class model (n=5, range 97.1% - 100%). New models are iteratively trained using additional images and state-of-the-art computer vision approaches to increase prediction accuracy on out-of-sample, out-of-site data. The e-poster presentation will include a tool demonstration on multiple platforms. 

Presenting Author

Amira Burns, USDA - ARS - APHIS

First Author

Amira Burns, USDA - ARS - APHIS

CoAuthor(s)

Hailey Wilmer, USDA - ARS
Ryan S. Miller, USDA APHIS CEAH
Patrick E. Clark, USDA Agricultural Research Service
Jay Angerer, USDA Agricultural Research Service

Construction of Strata Boundaries in Tax Auditing

The cumulative square root of the frequency (the "cum√f") method is a generally accepted statistical technique for constructing strata boundaries in sampling. Many statistical consultants and state and federal taxing and auditing agencies use this method, originally developed by Dalenius and Hodges (1959). But there is a general lack of guidance on the determination and effects of interval (i.e., class) widths. Dalenius and Hodges proposed applying their method to frequency distributions with class widths of 5 units. In this paper, we present the results of empirical tests that contrast the Dalenius-Hodges method across different class widths and against other approximate, non-iterative methods, using several typical skewed accounting populations.

What is the problem and why? Most state revenue agencies use only the cum√f method, with no prescribed class width. The purpose of the cum√f method is to approximate optimal boundaries by minimizing the product of the stratum weight and the true stratum variance, which the method seeks to accomplish by equalizing the cumulative √f across the strata (Cochran 1977). But this frequently does not happen with typical skewed accounting data.

What additional value does the presenter's approach provide? The research supports the conclusion that interval (i.e., class) width has a meaningful effect on the cum√f method, and thus on the representativeness of the sample and the accuracy and precision of the estimate; class width should therefore be chosen dynamically rather than as one size fits all. 
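A minimal sketch of the cum√f procedure, showing how the class-width parameter changes the resulting boundaries (the simulated lognormal amounts stand in for a skewed accounting population):

```python
import numpy as np

def cum_sqrt_f_boundaries(values, n_strata, class_width):
    """Dalenius-Hodges cum-sqrt(f) strata boundaries.

    Bins the data into classes of the given width, accumulates the square
    root of each class frequency, and places boundaries where the
    cumulative sqrt(f) crosses equal steps of its total.
    """
    lo, hi = values.min(), values.max()
    edges = np.arange(lo, hi + class_width, class_width)
    freq, _ = np.histogram(values, bins=edges)
    cum = np.cumsum(np.sqrt(freq))
    step = cum[-1] / n_strata
    # boundary = upper edge of the class where cum sqrt(f) crosses k*step
    return [edges[np.searchsorted(cum, k * step) + 1]
            for k in range(1, n_strata)]

rng = np.random.default_rng(3)
amounts = rng.lognormal(mean=7, sigma=1, size=5000)  # skewed accounting-like data
b_narrow = cum_sqrt_f_boundaries(amounts, n_strata=4, class_width=50)
b_wide = cum_sqrt_f_boundaries(amounts, n_strata=4, class_width=500)
```

Running both calls on the same population yields different boundary sets, which is exactly the class-width sensitivity the paper examines.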

Presenting Author

Zachary Rhyne, Ryan, LLC

First Author

Zachary Rhyne, Ryan, LLC

CoAuthor

Roger Pfaffenberger, Ryan LLC

Deciphering Article Popularity in the Digital Era: Comprehending Public Attitude with Supervised Machine Learning Models

The age of the Internet has transformed information retrieval and engagement with the news. 2020 was marked by a series of unprecedented events, such as the pandemic, the murder of George Floyd, and the crucial 2020 U.S. presidential election. In a survey conducted by the Pew Research Center in 2020, a little over half of respondents (53%) said they got their news from social media and digital platforms. Thus, an increasing number of individuals actively participated in discussions on various platforms, including commenting on media platforms or posting on social media like Twitter. As discussions around these events proliferated across social media and news outlets, understanding the factors driving the popularity of articles became paramount. Articles were collected from The New York Times between 01/01/2020 and 12/31/2020 to understand the relationship that user-engagement features and article characteristics have with popularity. A total of 16,787 articles were included in our analysis, with information on each article's section, headline, abstract, keywords, word count, publication date, number of comments, sentiment, and popularity recorded. Supervised machine learning models, including linear, ridge, lasso, random forest, and gradient boosting regressions, were employed to understand the data. Based on feature importances from the random forest, factors like publication date (0.324), section (0.305), and word count (0.371) significantly impact article engagement, while sentiment had no influence on the popularity of an article. Using those features, top sections and keywords were identified from popular articles, while temporal trends were explored to gauge discourse intensity during specific periods. Analysis of the top sections with the highest comment counts revealed keywords centered around major events like COVID-19, the 2020 election, and the killing of George Floyd, with engagement peaking during the summer of 2020. 
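A hedged sketch of the feature-importance step using scikit-learn's random forest on synthetic article data (the feature names and effect sizes below are invented stand-ins, not the study's data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 2000
# hypothetical article features
word_count = rng.normal(900, 300, n)
section = rng.integers(0, 10, n).astype(float)
month = rng.integers(1, 13, n).astype(float)
sentiment = rng.normal(0, 1, n)

# simulate comment counts driven by word count, section, and month,
# with sentiment contributing nothing (mirroring the reported finding)
comments = 0.02 * word_count + 3 * section + 2 * month + rng.normal(0, 5, n)

X = np.column_stack([word_count, section, month, sentiment])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, comments)
importances = dict(zip(["word_count", "section", "month", "sentiment"],
                       rf.feature_importances_))
```

Because sentiment carries no signal here, its impurity-based importance comes out near zero while the true drivers dominate, which is how a feature like sentiment can be ruled out.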

Presenting Author

Anusha Natarajan

First Author

Anusha Natarajan

Enhancing Real Estate Market Prediction: A Comparative Analysis of Modeling Techniques

The real estate market holds significant interest for researchers in industry and academia due to its impact on every household. Despite numerous studies leveraging available data and emerging technologies like artificial intelligence, there remains a need for an efficient and robust approach to predict market trends. Our study conducts a comparative analysis of various deep learning and hybrid models for predicting the future price of real estate market indices. To build our models, we select several predictors, including fundamental market indicators, macroeconomic factors, and technical indicators. We then assess model performance using standard regression metrics and employ statistical analysis for model selection and validation to ensure robustness. 

Presenting Author

Ramchandra Rimal, Middle Tennessee State University

First Author

Ramchandra Rimal, Middle Tennessee State University

CoAuthor(s)

Binod Rimal, The University of Tampa
Hum Nath Bhandari, Roger Williams University
Keshab Dahal, State University of New York Cortland
Nawa Raj Pokhrel, Xavier University of Louisiana

Evaluating Assays in the Absence of a Gold Standard: The AZ Proteus / Matrix Studies of ctDNA

Circulating tumor DNA (ctDNA) consists of fragments of DNA shed by tumor cells into the bloodstream, some carrying known cancer-related mutations. Several companies are developing new assays to measure ctDNA from plasma for determining cancer presence and tumor burden. The pharmaceutical developer wants to find the best-value assays for use in screening for new cancer cases, assessing drug performance during clinical trials, and monitoring post-treatment status. However, there is no gold standard to compare against. Statistical issues that arise include agreement among assays, computing a common value for ease of comparison, missing data, creating contrived samples for evaluating assays at very low concentrations, evaluating limits of detection, estimating assay variability, and estimating accuracy in measuring change, especially change due to treatment. 

Presenting Author

David Shera

First Author

David Shera

CoAuthor

Daniel Stetson, AstraZeneca

Exploring Computational Approaches for Coding Qualitative Responses in the Medical Expenditure Panel Survey

The Medical Expenditure Panel Survey (MEPS) is a widely utilized, nationally representative survey designed to explore healthcare utilization and expenditure patterns within the U.S. Information in the MEPS, such as the use of healthcare services, is represented by both quantitative (close-ended) and qualitative (open-ended) responses. One of the primary challenges when working with MEPS data involves coding open-ended responses into standardized categories. Manual coding of text data from open-ended questions is time-consuming and costly. The manual coding data accumulated in MEPS makes it possible to train computational models to automate the coding of qualitative responses; however, such efforts have not previously been undertaken in the context of MEPS.

To accelerate the data preprocessing of MEPS data, we explored computational approaches to automatically code the qualitative responses. We began by transforming qualitative responses into word embeddings using BERT-based models. Our category prediction process involves two approaches: (1) predicting the code by identifying the most similar responses from previous years using embedding similarities and linking the current qualitative response to the coding results from those prior years, and (2) using the embeddings as features to train machine learning models for predicting the code.
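Approach (1) can be sketched as nearest-neighbor coding over precomputed embeddings (the vectors and code labels below are toy stand-ins for BERT embeddings of real coded responses):

```python
import numpy as np

def code_by_similarity(query_emb, ref_embs, ref_codes):
    """Assign a code by nearest-neighbor cosine similarity.

    ref_embs: embeddings of previously coded responses (one per row),
    ref_codes: their manually assigned codes. The query response inherits
    the code of the most similar prior response.
    """
    q = query_emb / np.linalg.norm(query_emb)
    R = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = R @ q
    best = int(np.argmax(sims))
    return ref_codes[best], float(sims[best])

# toy 2-d embeddings standing in for BERT vectors of prior-year responses
ref_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ref_codes = ["OFFICE_VISIT", "OFFICE_VISIT", "ER_VISIT"]
code, sim = code_by_similarity(np.array([0.95, 0.05]), ref_embs, ref_codes)
```

Approach (2) would instead feed the same embedding vectors as features to a supervised classifier trained on the prior years' codes.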

We evaluated our approaches on coding two open-ended questions. The responses collected for both questions, along with their coding results from 2018 to 2021, were used as the training dataset, while the data from 2022 was used as the testing dataset. Both approaches consistently achieve high accuracy, ranging from 90.7% to 95.4%, in coding responses to the two questions. Our results indicate that computational models hold significant promise for coding qualitative responses in MEPS, underscoring the need for further exploration in future studies. 

Presenting Author

Mengshi Zhou

First Author

Mengshi Zhou

CoAuthor(s)

Oliva He, Westat
Chris Barzola, Westat
Alexandra Marin, Westat
Michael Raithel, Westat
Jeannie Hudnall, Westat
Kevin Wilson, Westat

Graduate student staffing model for university library Data Science Consulting service: lessons learned and new horizons

The Data Science Consulting Service at the North Carolina State University Libraries supports the NC State community with requests spanning the entire data science lifecycle. The collaborative consulting efforts of the Libraries and NC State's Data Science Academy leverage graduate student data science consultants, who help us provide and scale data science services on campus while gaining an important learning opportunity. We have found that the role also exposes graduate students to options for future work they may otherwise not have encountered in their disciplinary programs. Our consulting service uses the information obtained from our appointment booking forms and email requests to improve service delivery, plan outreach, and recruit students with the skills our users need. We believe these data-driven insights aid in providing the best service and equip our student workers with the knowledge needed to assist with the wide-ranging requests our service receives. We will share our experiences developing this staffing model, the successes of our program, and how we plan to grow and expand data science support services at NC State. 

Presenting Author

Shannon Ricci, North Carolina State University

First Author

Shannon Ricci, North Carolina State University

CoAuthor(s)

Alp Tezbasaran
Mara Blake
Emily Griffith, NC State University

Impact of NBA Team Performance on Fan Engagement

Winning is the goal of a professional sports team, but does winning impact the local reputation of the NBA? Munoz et al. showed that winning tends to boost attendance numbers (Munoz et al., 2022). I use both localized survey data and passive behavioral data to see how the NBA's brand image and digital engagement change as the local team wins or loses.

I used data from three sources: A syndicated brand tracker (Harris Brand Platform) which measures intangibles that are integral to brand health; Samba TV data which provides us with passive viewing behavior of US TV viewers; and digital engagement data from a metered panel called Luth.

I aggregated the data by market to get the correlations of various metrics with local team performance, and also plotted these metrics over time against team performance.

Analyzing these disparate sets of data at an aggregate level revealed important insights, including:
- Performance in a single month is not as heavily correlated with digital engagement as cumulative season performance to date.
- The NBA's brand approval and reputation in an area were not correlated with the local team's performance. Being a better team increases local viewership, but it does not really impact the league's overall approval metrics in that area.
- Areas with no NBA team have considerably lower viewership than all areas with a team. This supports league expansion, as the increased footprint would most likely increase viewership.

Attendees will learn how syndicated brand tracker data can be used in conjunction with other key brand metrics to answer important business questions.

[1] Munoz, Ercio, Chen, Jiadi and Thomas, Milan. "Jumping on the bandwagon? Attendance response to recent victories in the NBA" Journal of Quantitative Analysis in Sports, vol. 18, no. 3, 2022, pp. 161-170. https://doi.org/10.1515/jqas-2020-0092 

Presenting Author

Tomer Zur, The Harris Poll

First Author

Tomer Zur, The Harris Poll

Implementing retrieval-augmented generation with survey question evaluation reports

Survey question evaluation studies play a crucial role in improving questionnaire design and enhancing the interpretation and analysis of survey data. The Collaborating Center for Questionnaire Design and Evaluation Research at the Centers for Disease Control and Prevention's National Center for Health Statistics maintains an online repository, Q-Bank, which houses extensive research reports on survey questions dating back to 1990. Many are validity studies that delineate the construct(s) captured by individual questions and relay the phenomena considered by respondents when formulating answers in an interview setting. This research enables data users to better understand the data, allowing for a more sophisticated interpretation of findings. The objective of this project is to determine the feasibility of using AI tools to enhance user navigation within Q-Bank. We developed a retrieval-augmented generation (RAG) based interface that leverages generative AI tools to facilitate user access to relevant information from Q-Bank. The RAG system indexes information about the research documents in the repository and retrieves salient details, such as citation information and links, in response to user queries. Improved indexing and information retrieval increase the usefulness of Q-Bank, allowing for a more comprehensive search of questions and enabling researchers and survey methodologists to access insights on question validity and construct capture. We also implemented an evaluation framework to derive performance metrics for the RAG. These findings can be used to inform approaches to index other sources of data, disseminate research, and streamline literature review processes, saving time and effort while ensuring informed decision-making. 
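A minimal sketch of the retrieve-then-prompt pattern behind a RAG interface (document titles, URLs, and embedding vectors are invented placeholders, not Q-Bank content):

```python
import numpy as np

# toy document index: report metadata with stand-in embedding vectors
docs = [
    {"title": "Cognitive testing of disability questions", "url": "qbank/123",
     "emb": np.array([0.9, 0.1, 0.0])},
    {"title": "Validity of self-reported diabetes items", "url": "qbank/456",
     "emb": np.array([0.1, 0.9, 0.1])},
]

def retrieve(query_emb, docs, k=1):
    """Return the k most similar reports by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    scored = sorted(
        docs,
        key=lambda d: float(d["emb"] @ q / np.linalg.norm(d["emb"])),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query_text, hits):
    """Assemble the augmented prompt passed to the generator model."""
    context = "\n".join(f"- {d['title']} ({d['url']})" for d in hits)
    return f"Answer using only these reports:\n{context}\n\nQuestion: {query_text}"

hits = retrieve(np.array([0.85, 0.2, 0.0]), docs)
prompt = build_prompt("Which reports cover disability questions?", hits)
```

In a production system the embeddings come from a language model and the prompt is sent to a generator; only the indexing/retrieval scaffold is shown here.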

Presenting Author

Priyam Patel, Centers for Disease Control and Prevention

First Author

Priyam Patel, Centers for Disease Control and Prevention

CoAuthor(s)

Justin Mezetin, Swan Solutions/NCHS
Benjamin Rogers, NCHS

Use of Current Population Survey and Cooperative Election Study in Analyzing Registered Voter Turnout

The Current Population Survey (CPS) is sponsored jointly by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics (BLS). Within the CPS infrastructure, the Census Bureau has collected voting and registration data biennially in the November CPS since 1964. It is the national statistical source of record for state- and national-level voter registration and turnout statistics. These statistics were used recently by an expert for the Plaintiffs in the case of White v. Mississippi State Board of Election Commissioners. The Plaintiffs' expert made a fundamental mathematical mistake in their calculations and concluded that Black voters in Mississippi underperformed white voters in both registration rates and turnout. In reality, according to the officially published CPS statistics, Blacks in Mississippi have outperformed whites in registration and turnout and have done so for some time.

Facing this reality, the Plaintiffs' expert sought a new dataset that could potentially support their case. They found it in what is known as the Cooperative Election Study, or "CES". The CES is a well-established and highly regarded survey conducted by a consortium of universities. It differs from the CPS insofar as it engages a separate, external service to verify respondents' reported answers as to whether they registered to vote and actually voted. The Plaintiffs' expert analyzed the CES microdata and again concluded that whites did in fact outperform Blacks in voter registration and turnout. In fact, the expert again erred in their analysis, selecting the incorrect demographic weights. Had they used the correct weights, they would have found no statistically significant difference between Black and white voters.
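The effect of mis-selected weights can be illustrated with a toy weighted estimate, assuming turnout varies across weighting classes (all numbers below are invented and unrelated to the CPS or CES):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10000
# hypothetical survey weights; here higher-weight respondents vote at a
# higher rate, so ignoring or mis-selecting the weights biases the estimate
w = rng.uniform(0.5, 2.0, n)
p_vote = np.where(w > 1.2, 0.6, 0.4)        # turnout depends on weighting class
voted = (rng.random(n) < p_vote).astype(float)

rate_weighted = np.sum(w * voted) / np.sum(w)  # design-based estimate
rate_unweighted = voted.mean()                 # what dropped/wrong weights give
```

When weights are uncorrelated with the outcome the two estimates coincide; the gap appears precisely when the weighting classes differ in turnout, which is why choosing the correct weight column matters.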

This presentation will include an assessment of the similarities and differences between the CES and CPS in Mississippi. 

Presenting Author

Thomas Bryan, BGD

First Author

Thomas Bryan, BGD

CoAuthor

David Swanson, University of California-Riverside

State and Multi-state Data Applications for the Public Good

The Coleridge Initiative is a non-profit with the goal of providing public agencies the opportunity to understand and analyze their data to develop effective policies for the public good. Three use cases will be described in which secure administrative data access and training, visualizations, and key stakeholder feedback and involvement are used to support state evidence-based policymaking.

1. Coleridge hosts restricted and public data in its enclave, the Administrative Data Research Facility (ADRF). Using this infrastructure, Coleridge hosts training classes aimed at helping state staff use their (often administrative) data more effectively. One class explored both coding and the use of rich unemployment insurance claim data to provide labor market insights, often for the equity groups thought to be most affected by the pandemic.
2. Coleridge collaborates with state and other external partners to create visualizations or research outputs. Visualizations are used to inform local policymaking, especially when decision makers have only a matter of minutes to review a plethora of data. Work with one state has linked data on Workforce Innovation and Opportunity Act (WIOA) training program participants to wage data to provide information on employment outcomes by region. Another has looked at these outcomes among those participating in reemployment services programs. In both cases, feedback from agency staff was used to iterate on and develop existing dashboards that would ultimately provide them with the necessary tools to make decisions.
3. Coleridge facilitates the exchange of ideas across states and regions, helping to build a data network that will support the public good. One example is the multi-state data collaborative. This involves six different states using a single state's administrative data to pilot various new measures and create accompanying visuals. Once these measures are created, each state will begin work on the same measures using their own data. 

Presenting Author

Allison Nunez

First Author

Allison Nunez