How Product Thinking Shapes Methodological Innovation at Google

Chair: Angela Schoergendorfer, Google, Inc.

Organizer: YunChu Huang, Google
 
Wednesday, Aug 6: 10:30 AM - 12:20 PM
Session 0755: Topic-Contributed Paper Session
Music City Center, Room CC-209A

Applied: Yes

Main Sponsor: Section on Statistical Computing

Co-Sponsor: Quality and Productivity Section

Presentations

Statistical Measures for Evaluating Quality of LLM Performance in NLU Tasks

Modern natural language processing (NLP) applications often rely on large language models (LLMs) to automate tasks that previously required human input. Given the high cost of obtaining ground-truth labels, LLMs have recently been used as proxy models for human ratings (e.g., AutoRater), producing labels, preferences, or feedback. However, it can be challenging to fully evaluate LLM performance in NLP or natural language understanding (NLU) task settings. We investigate statistical measures of agreement and evaluate their potential for assessing the general quality of LLMs for text analysis and inference.
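
As a hedged illustration (not from the talk itself): chance-corrected agreement statistics such as Cohen's kappa are one standard way to compare LLM labels against human ratings. A minimal Python sketch, using hypothetical label arrays:

```python
# Minimal sketch: LLM-vs-human agreement via Cohen's kappa.
# The labels below are hypothetical; the talk's actual measures may differ.
from sklearn.metrics import cohen_kappa_score

human_labels = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu"]
llm_labels   = ["pos", "neg", "pos", "pos", "neu", "neg", "neg", "neu"]

# Raw percent agreement overstates quality when classes are imbalanced;
# kappa corrects for the agreement expected by chance alone.
raw = sum(h == m for h, m in zip(human_labels, llm_labels)) / len(human_labels)
kappa = cohen_kappa_score(human_labels, llm_labels)

print(f"raw agreement: {raw:.2f}, Cohen's kappa: {kappa:.2f}")
```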

Speaker: Grace Deng, Google

Why Bayes? A Decisions First Framework for Business Data Science

Many data scientists and executives fall into the trap of blindly following Null Hypothesis Significance Testing (NHST) without understanding its limitations. This "null ritual" can lead to misguided business decisions. In 2016, the American Statistical Association warned against reducing scientific inference to mechanical rules such as "p < 0.05," noting that doing so can lead to poor decision-making.

This presentation proposes a "Decisions First" framework that prioritizes business objectives over rigid statistical procedures. By adopting a Bayesian perspective, we can treat data analysis as a continuous learning process, estimate decision-relevant probabilities, and properly acknowledge uncertainty.

The framework guides users through defining decisions, formulating data-driven questions, designing appropriate studies, and presenting findings transparently. Rather than seeking absolute certainty, it emphasizes aligning research with business goals and embracing the inherent uncertainty in real-world decision-making. This approach helps avoid common pitfalls and leads to more effective use of data. 
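
To make "decision-relevant probabilities" concrete, here is a hedged sketch (illustrative data and thresholds, not the speaker's implementation): with a Beta-Binomial model for an A/B test, the posterior directly answers the business question "how likely is the new variant to beat control by enough to justify launching?"

```python
# Sketch: a decision-relevant probability from a Beta-Binomial posterior.
# Conversion counts are hypothetical; priors are uniform Beta(1, 1).
import numpy as np

rng = np.random.default_rng(0)

control = (120, 1000)    # (conversions, trials)
treatment = (138, 1000)

# Posterior for each rate: Beta(1 + successes, 1 + failures).
post_c = rng.beta(1 + control[0], 1 + control[1] - control[0], size=100_000)
post_t = rng.beta(1 + treatment[0], 1 + treatment[1] - treatment[0], size=100_000)

lift = post_t - post_c

# The decision question, answered directly rather than via a p-value:
# probability the lift clears a business-relevant launch threshold.
print(f"P(lift > 0):    {np.mean(lift > 0):.3f}")
print(f"P(lift > 0.01): {np.mean(lift > 0.01):.3f}")
```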

Speaker: Ignacio Martinez, Google

Improving Statistical Power of Classifier Evaluation with Limited Labels

At YouTube, we continuously develop classifiers to detect content that violates our community guidelines. However, comparing performance across classifiers is challenging because human labels are limited.

In this talk, we discuss two approaches to increasing the statistical power for detecting classifier improvements: a) paired data sampling, which maximizes the information contained in human labels, and b) proxy metrics with higher sensitivity in the evaluation task. With these improvements, we significantly boost our efficiency in evaluating abuse classifiers.
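
A hedged sketch of why pairing helps (simulated data, not YouTube's pipeline): when both classifiers are scored on the same human-labeled items, the shared item-difficulty component cancels in the paired differences, shrinking the standard error relative to scoring each classifier on a separate label budget.

```python
# Sketch: paired vs. unpaired comparison of two classifiers on the same items.
# Correctness indicators are simulated stand-ins for human-labeled outcomes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500  # human-labeled items

# Per-item difficulty is shared by both classifiers, inducing correlation;
# classifier B is simulated as slightly better than classifier A.
difficulty = rng.normal(size=n)
correct_a = ((difficulty + rng.normal(scale=0.5, size=n)) > -0.1).astype(float)
correct_b = ((difficulty + rng.normal(scale=0.5, size=n)) > -0.2).astype(float)

# Paired test: the shared difficulty cancels in the per-item differences,
# so the p-value is typically smaller for the same underlying effect.
paired = stats.ttest_rel(correct_b, correct_a)

# Unpaired test: as if each classifier used its own labeling budget.
unpaired = stats.ttest_ind(correct_b, correct_a)

print(f"paired p-value:   {paired.pvalue:.4f}")
print(f"unpaired p-value: {unpaired.pvalue:.4f}")
```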

Speaker: Yi Liu, Google

Multiplying Your Impact: Communicating the Impact of Data Science Work

As data scientists, much of the work we do is not immediately visible to colleagues and stakeholders. Ensuring that they understand what we have done and why it matters is vital to maximizing and multiplying the impact of our data science work. In this talk, I will share five simple (and familiar) principles for communicating your work effectively, with examples of each principle in a data science context.

Speaker: Jean Steiner, Google