Wednesday, Aug 6: 10:30 AM - 12:20 PM
0755
Topic-Contributed Paper Session
Music City Center
Room: CC-209A
Applied: Yes
Main Sponsor: Section on Statistical Computing
Co-Sponsors: Quality and Productivity Section
Presentations
Modern natural language processing (NLP) applications often rely on large language models (LLMs) to automate tasks that previously required human input. Given the high cost of obtaining ground-truth labels, LLMs have recently been used as proxy models for human ratings (e.g., AutoRater), which can take the form of labels, preferences, or feedback. However, it can be challenging to fully evaluate LLM performance in NLP and natural language understanding (NLU) task settings. We investigate statistical measures of agreement and evaluate their potential for assessing the general quality of LLMs for text analysis and inference.
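As a hedged illustration of the kind of agreement measure this abstract refers to, the Python sketch below scores hypothetical AutoRater labels against human labels with Cohen's kappa; the label vectors, and the choice of kappa in particular, are assumptions made for illustration rather than the authors' method.

    # Minimal sketch: chance-corrected agreement between human and LLM (AutoRater)
    # labels. The label vectors below are hypothetical placeholders.
    from sklearn.metrics import cohen_kappa_score

    human_labels = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu"]
    llm_labels   = ["pos", "neg", "pos", "pos", "neu", "pos", "neg", "neg"]

    # Cohen's kappa corrects raw percent agreement for agreement expected by
    # chance, giving a fairer view of how closely the LLM proxy tracks humans.
    kappa = cohen_kappa_score(human_labels, llm_labels)
    print(f"Cohen's kappa between human and AutoRater labels: {kappa:.2f}")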
Many data scientists and executives fall into the trap of blindly following Null Hypothesis Significance Testing (NHST) without understanding its limitations. This "null ritual" can result in misguided business decisions. In 2016, the American Statistical Association warned against reducing scientific inference to mechanical rules such as "p < 0.05," noting that this practice can lead to poor decision-making.
This presentation proposes a "Decisions First" framework that prioritizes business objectives over rigid statistical procedures. By adopting a Bayesian perspective, we can treat data analysis as a continuous learning process, estimate decision-relevant probabilities, and properly acknowledge uncertainty.
The framework guides users through defining decisions, formulating data-driven questions, designing appropriate studies, and presenting findings transparently. Rather than seeking absolute certainty, it emphasizes aligning research with business goals and embracing the inherent uncertainty in real-world decision-making. This approach helps avoid common pitfalls and leads to more effective use of data.
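As a minimal sketch of what "estimating decision-relevant probabilities" can look like under a Bayesian perspective, the Python example below uses a Beta-Binomial model for a conversion rate and reports the probability that the rate clears a break-even threshold; the prior, counts, and 4% threshold are hypothetical assumptions, not figures from the talk.

    # Minimal sketch (hypothetical numbers): a Bayesian answer to the business
    # question "how likely is the conversion rate to exceed our break-even rate?"
    from scipy.stats import beta

    prior_a, prior_b = 1, 1          # uniform Beta(1, 1) prior
    conversions, trials = 52, 1000   # hypothetical observed data

    # Beta posterior after observing the data (conjugate update).
    posterior = beta(prior_a + conversions, prior_b + trials - conversions)

    # Decision-relevant probability, reported with its uncertainty rather than
    # as a binary "significant / not significant" verdict.
    threshold = 0.04                 # hypothetical break-even conversion rate
    print(f"P(rate > {threshold:.0%}) = {posterior.sf(threshold):.3f}")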
At YouTube, we continuously develop classifiers to detect content that violates our community guidelines. However, comparing performance across classifiers is challenging because human labels are limited.
In this talk, we discuss two approaches for increasing the statistical power to detect classifier improvements: (a) paired data sampling, which maximizes the information contained in each human label, and (b) proxy metrics with higher sensitivity for the evaluation task. With these improvements, we are able to substantially boost our efficiency in evaluating abuse classifiers.
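As a hedged sketch of why paired sampling helps, the Python example below runs a McNemar-style exact test on the discordant pairs that arise when both classifiers are scored on the same human-labeled items; the counts are invented for illustration, and the test choice is an assumption rather than necessarily the method used at YouTube.

    # Minimal sketch (invented counts): with paired labels, only items where the
    # two classifiers disagree ("discordant pairs") carry information, so a
    # McNemar-style exact binomial test can detect small improvements.
    from scipy.stats import binomtest

    old_right_new_wrong = 12   # old classifier correct, new classifier wrong
    old_wrong_new_right = 30   # new classifier correct, old classifier wrong

    discordant = old_right_new_wrong + old_wrong_new_right
    result = binomtest(old_wrong_new_right, discordant, p=0.5, alternative="greater")
    print(f"One-sided p-value that the new classifier is better: {result.pvalue:.4f}")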
As data scientists, much of the work we do is not immediately visible to colleagues and stakeholders. Ensuring that colleagues understand what we have done and why our work matters is vital to maximizing and multiplying the impact of our data science work. In this talk, I will share five simple (and familiar) principles for communicating your work effectively, with examples of each principle in a data science context.