Modeling Analytic Iteration with Probabilistic Outcome Sets

Stephanie Hicks, Speaker
Johns Hopkins University, Bloomberg School of Public Health
 
Thursday, Aug 8: 9:50 AM - 10:15 AM
Invited Paper Session 
Oregon Convention Center 
In exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. An under-appreciated aspect of data analysis, and one that statistical theory largely leaves out, is that a data analyst must make decisions by comparing the observed data, or the output of a statistical tool, with what the analyst previously expected from the data. However, there is little formal guidance on how to make these data-analytic decisions, because statistical theory generally omits any discussion of who is using the statistical methods. Here, we extend the basic idea of comparing an analyst's expectations with what is observed in a data visualization to more general analytic situations. We propose a model for the iterative process of data analysis based on what we refer to as expected and anomaly probabilistic outcome sets, together with the concept of statistical information gain. Our model posits that the analyst's goal is to increase, through successive analytic iterations, the amount of information the analyst has relative to what the analyst already knows. Finally, we show how our framework can be used to characterize common situations in practical data analysis.