Building Better Data Analyses: Theory, Methods, and Lessons Learned

Chair: Roger Peng, University of Texas at Austin

Organizer: Roger Peng, University of Texas at Austin

Thursday, Aug 8: 8:30 AM - 10:20 AM
Session 1355: Invited Paper Session
Oregon Convention Center, Room CC-B118
Data science has risen rapidly in the collective consciousness of our society and the ability to analyze data well is quickly becoming an essential skill. As a result, it has become urgent that data analysis training and education be scaled broadly. However, a fundamental problem in the practice of data analysis is determining how to formally evaluate the quality of a given data analysis and how to get students and practitioners to do better data analyses. We must move beyond the "know it when we see it" phase of data analysis and build more formal understandings of data analytic quality. This requires building models of various aspects of the analytic process and distilling generalizable lessons from data analytic experience. With such models and information, we can then scale the training of data analysis beyond the often-used apprenticeship model. The session will explore the theoretical, practical, and pedagogical aspects of conducting data analyses and address key areas that can lead to better data analyses and scalable training. We highlight four areas of data science activity -- the iterative cycle of analysis, the alignment of the analyst and audience to produce useful analyses, the distillation of generalizable lessons from analytic experience, and the role of reflective practice when comparing expectations to observations in data analysis.

For many, including students at both the undergraduate and graduate level, data analysis can appear to be a nebulous and mysterious process. While some eventually learn through experience, many do not, and it is worth asking whether such a process can be accelerated and made more equitable. This session will explore formal mechanisms for understanding data analysis that are analogous to the approaches taken in learning statistical theory and methods. More people than ever before are analyzing data, whether they know it or not. More students than ever before want to learn data science and get data science jobs. There is therefore a demand to develop approaches for formally discussing the quality of data analyses and for providing concrete and consistent advice on how to improve them. This session provides some of the foundational ideas upon which such a formal system can be built.

Applied: Yes

Main Sponsor

Section on Statistics and Data Science Education

Co-Sponsors

Caucus for Women in Statistics
Section on Statistical Graphics
Section on Teaching of Statistics in the Health Sciences

Presentations

Analytic fluency: What it is, who has it, and how it is learned.

Compared to novice analysts, experienced analysts often have heightened analytic fluency: the skills and intuitions for producing trustworthy, well-performing analyses that go beyond the formal skills of measurement, data formatting, modeling, and communication found in statistics textbooks. Traditionally, trainees build analytic fluency through informal mentorship, provided under the assumption that there is no substitute for experience. Unfortunately, today's analysts face new pressures requiring a faster uptake of analytic fluency than experience-based mentorship can provide, such as influxes of new data, replication crises driven by widespread misunderstanding of core statistical principles, and technological advances that expose analysts to novel ethical challenges. These pressures prompt three questions: what exactly defines analytic fluency, how can we formalize its assessment, and how can mentors help trainees build analytic fluency faster than experience alone can? Here, I present results from mixed-methods empirical investigations into the content, application, and transmission of analytic fluency. Findings and implications are discussed.

Speaker

Matthew Vanaman, University of Texas at Austin

Evaluating the Alignment of a Data Analysis between Analyst and Audience

A challenge that all data analysts face is building a data analysis that is useful for a given audience. In this talk, we will begin by proposing a set of principles for describing data analyses. We will then introduce a concept that we call the alignment of a data analysis between the data analyst and the audience. We define a successfully aligned data analysis as one in which the principles of the analyst match those of the audience for whom the analysis is developed. We will propose a statistical model and general framework for evaluating the alignment of a data analysis. This framework can serve as a guide for practicing data scientists, and for students in data science courses, on how to build better data analyses.
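As a rough, illustrative sketch of what such an alignment evaluation might look like in practice (the principle names, weights, and scoring rule below are hypothetical placeholders, not the framework presented in this talk), one could elicit how strongly the analyst and the audience each value a set of data analytic principles and summarize their agreement with a single score:

# Illustrative sketch only: quantifying analyst-audience alignment as agreement
# over weighted data analytic principles. The principle names, weights, and
# scoring rule are hypothetical placeholders, not the model from this talk.

principles = ["exhaustiveness", "skepticism", "clarity", "reproducibility"]

# Hypothetical importance weights (0 = unimportant, 1 = essential) elicited
# from the analyst and from the intended audience.
analyst  = {"exhaustiveness": 0.9, "skepticism": 0.7, "clarity": 0.5, "reproducibility": 1.0}
audience = {"exhaustiveness": 0.4, "skepticism": 0.8, "clarity": 0.9, "reproducibility": 1.0}

def alignment_score(analyst, audience, principles):
    """Return 1 minus the mean absolute difference in principle weights,
    so 1.0 indicates perfect alignment and 0.0 maximal mismatch."""
    gaps = [abs(analyst[p] - audience[p]) for p in principles]
    return 1.0 - sum(gaps) / len(gaps)

print(f"Alignment score: {alignment_score(analyst, audience, principles):.2f}")  # 0.75 here

Under these made-up weights the score is 0.75; a lower score would flag principles on which the analyst and audience should negotiate before the analysis proceeds.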

Speaker

Lucy D'Agostino McGowan, Wake Forest University

Lessons Learned from 1,000 Data Science Projects

The process of building better data analyses often begins in the classroom, and increasingly, due to growing enrollments, this is happening at scale. Since 2018, we have taught COGS 108 Data Science in Practice at UC San Diego every term to 400+ students at a time. This large-enrollment, project-based course aims to teach the critical skills needed to pursue a technical, data-focused career. Throughout this course, students complete a term-long group data science project on a topic of their choosing. Groups carry out the entire data science process: formulating a question; finding, cleaning, and analyzing data; answering their question of interest; and finally, communicating their process and findings in both a detailed technical data science report and a short oral presentation. Having advised 4,000 students through more than 1,000 projects, we summarize the key lessons we have learned about teaching with and analyzing data at scale. Our findings highlight the importance of clear instruction, project scaffolding, regular checkpoints, detailed and project-specific feedback, and careful consideration of the technical stack used.

Speaker

Shannon Ellis, UC San Diego

Modeling Analytic Iteration with Probabilistic Outcome Sets

In exploratory data analysis, data analysts use tools such as data visualizations to separate their expectations from what they observe. An under-appreciated aspect of data analysis, largely absent from statistical theory, is that the analyst must make decisions by comparing the observed data, or the output of a statistical tool, to what the analyst previously expected from the data. However, there is little formal guidance on how to make these data analytic decisions, as statistical theory generally omits any discussion of who is using the statistical methods. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. We propose a model for the iterative process of data analysis based on what we refer to as expected and anomaly probabilistic outcome sets, and on the concept of statistical information gain. Our model posits that the analyst's goal is to increase, through successive analytic iterations, the amount of information the analyst has relative to what the analyst already knows. Finally, we show how our framework can be used to characterize common situations in practical data analysis.
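As a rough, illustrative sketch (the discretized outcome set, the specific probabilities, and the use of Kullback-Leibler divergence as the information-gain measure are assumptions made here for illustration, not the formal model presented in this talk), one analytic iteration might be summarized by comparing the analyst's expected distribution over outcomes to what the data actually show:

import math

# Illustrative sketch only: the outcome set, probabilities, and the use of
# KL divergence as an "information gain" measure are hypothetical, not the
# formal model of expected and anomaly probabilistic outcome sets in the talk.

# Analyst's expectation over a discretized outcome set (e.g., the rough sign
# and size of an estimated effect) before looking at the data.
expected = {"large negative": 0.05, "near zero": 0.15,
            "moderate positive": 0.75, "large positive": 0.05}

# What a visualization or model output suggests after one analytic iteration.
observed = {"large negative": 0.01, "near zero": 0.04,
            "moderate positive": 0.25, "large positive": 0.70}

def information_gain(observed, expected):
    """Kullback-Leibler divergence D(observed || expected) in bits: how far
    the observed results deviate from what the analyst expected."""
    return sum(p * math.log2(p / expected[k]) for k, p in observed.items() if p > 0)

print(f"Information gain this iteration: {information_gain(observed, expected):.2f} bits")

# Outcomes the analyst considered unlikely but that dominate what was observed
# could be flagged for further scrutiny as potential anomalies.
anomalies = [k for k in observed if expected[k] < 0.10 and observed[k] > 0.50]
print("Potential anomalies:", anomalies)

Under these made-up numbers the iteration yields roughly 2.2 bits of information and flags the "large positive" outcome, which the analyst had largely ruled out beforehand, as worth a closer look.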

Speaker

Stephanie Hicks, Johns Hopkins University, Bloomberg School of Public Health