The Frontier of Statistics and the Social Sciences: Celebrating 25 Years of CS&SS

Christopher Adolph Chair
University of Washington
 
Tyler McCormick Organizer
University of Washington
 
Tuesday, Aug 6: 8:30 AM - 10:20 AM
1218 
Invited Paper Session 
Oregon Convention Center 
Room: CC-G132 
The intersection of statistics and the social sciences is now a flourishing area of research, with complex statistical problems bearing on questions and policy in areas such as economic inequality, social determinants of health, and social networks, among many (many) other topics. This session brings together dynamic speakers working at the frontier of both methodological and application-focused work in the social sciences. The session is organized to celebrate the 25th anniversary of the Center for Statistics and the Social Sciences (CS&SS) at the University of Washington.

Applied

Yes

Main Sponsor

Social Statistics Section

Co Sponsors

American Sociological Association
Business and Economic Statistics Section
Caucus for Women in Statistics

Presentations

A Bayesian Information Synthesis Framework for Opioid Use Disorder Prevalence Estimation

Identifying the prevalence of opioid use disorder (OUD) in the population is a critical public health activity for prevention and intervention, but tracking it is challenging. Survey data are most often used to assess OUD but are known to seriously underestimate its true prevalence. Administrative records and treatment datasets represent only a minority of individuals with OUD. Other methods have been proposed for OUD prevalence estimation, such as capture-recapture, venue-based methods, network methods, and multiplier methods. In this talk, I will present a Bayesian framework that connects multiple data sources that reveal widespread and heterogeneous patterns of under-diagnosis of OUD in New York State.
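As a point of reference for the capture-recapture methods mentioned above, the following is a minimal sketch of the classical two-list (Lincoln-Petersen) estimator. It is not the Bayesian framework presented in the talk, and the list names and counts are hypothetical values chosen only for illustration. The estimator assumes the two lists are independent samples from a closed population.

```python
def lincoln_petersen(n_first: int, n_second: int, n_both: int) -> float:
    """Two-list capture-recapture estimate of total population size.

    n_first  -- individuals appearing on the first list
    n_second -- individuals appearing on the second list
    n_both   -- individuals appearing on both lists
    """
    if n_both <= 0:
        raise ValueError("need at least one individual on both lists")
    return n_first * n_second / n_both


# Hypothetical counts, not taken from any real dataset.
n_treatment = 400   # people with OUD in a treatment registry
n_survey = 250      # people with OUD identified by a survey
n_overlap = 50      # people found on both lists

estimate = lincoln_petersen(n_treatment, n_survey, n_overlap)
print(estimate)  # 2000.0
```

The smaller the overlap relative to the list sizes, the larger the implied hidden population, which is why list dependence or matching errors can distort such estimates.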

Speaker

Tian Zheng, Columbia University

How to Use Generative AI in Downstream Analysis with Design-based Supervised Learning

Generative artificial intelligence (AI) has shown incredible capabilities on a range of tasks. For social scientists, one promising application is to use generative AI to automatically annotate unstructured big data, such as texts, images, audio, and videos, in order to generate variables of interest. We give an overview of a general framework, design-based supervised learning (DSL), which allows social scientists to use AI-based automated annotation and analyze AI-generated labels without bias. We first clarify the risk of directly using AI-generated labels in downstream analyses: non-random prediction errors in generative AI lead to substantial bias and invalid confidence intervals, even if the accuracy of the automated annotation is high, e.g., above 90%. We then discuss extensions, applications, and practical guidance.
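The risk described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the speakers' implementation: it estimates a simple proportion rather than a downstream regression, all rates are hypothetical, and the correction shown is a generic design-based adjustment that uses expert labels collected with a known sampling probability `pi`. Whether it matches the DSL estimator in detail is an assumption.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20_000
pi = 0.1  # known probability that a document receives an expert label (hypothetical)

true_y, pred_y, sampled = [], [], []
for _ in range(n):
    y = 1 if random.random() < 0.3 else 0
    # Non-random errors: the annotator misses 20% of positives but
    # rarely mislabels negatives, so overall accuracy is about 93%.
    if y == 1:
        y_hat = 1 if random.random() < 0.8 else 0
    else:
        y_hat = 1 if random.random() < 0.02 else 0
    r = 1 if random.random() < pi else 0  # 1 if expert-labeled
    true_y.append(y)
    pred_y.append(y_hat)
    sampled.append(r)

truth = sum(true_y) / n   # target quantity (unobserved in practice)
naive = sum(pred_y) / n   # uses the AI labels directly

# Design-based correction: subtract the prediction error, observed only on
# the expert-labeled subsample and reweighted by the known probability pi.
corrected = sum(
    y_hat - ((y_hat - y) / pi if r else 0.0)
    for y, y_hat, r in zip(true_y, pred_y, sampled)
) / n

print(f"truth={truth:.3f} naive={naive:.3f} corrected={corrected:.3f}")
```

The true labels enter the corrected estimate only for documents with `r == 1`, so the correction needs expert annotation for roughly a fraction `pi` of the corpus.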

Speaker

Brandon Stewart, Princeton University

Interpretable network-assisted prediction

Machine learning algorithms often assume that training samples are independent. When data points are connected by a network, the network creates dependence between samples. This is both a challenge, because it reduces the effective sample size, and an opportunity to improve prediction by leveraging information from network neighbors. Multiple prediction methods taking advantage of this opportunity are now available. Many of them, including graph neural networks, are not easily interpretable, which limits their usefulness in the biomedical and social sciences, where understanding how a model makes its predictions is often more important than the prediction itself. Some are interpretable, for example network-assisted linear regression, but generally do not match the prediction accuracy of more flexible models. We bridge this gap by proposing a family of flexible network-assisted models built upon a generalization of random forests (RF+), which both achieves highly competitive prediction accuracy and can be interpreted through feature importance measures. In particular, we provide a suite of novel interpretation tools that enable practitioners not only to identify important features that drive model predictions, but also to quantify the importance of the network contribution to prediction. This suite of general tools broadens the scope and applicability of network-assisted machine learning for high-impact problems where interpretability and transparency are essential. This is joint work with Tiffany Tang and Ji Zhu.

Speaker

Elizaveta Levina, University of Michigan

Normatively Backwards Rubric Scoring: Evidence from NIH Peer Review

Rubrics are thought to improve quality and decrease social bias in scientific peer review. However, rubrics cannot serve these functions if reviewers sequence their judgments in a normatively backwards order. If reviewers determine the overall merit of a submission before scoring for specific criteria, criteria scores serve as post hoc rationalizations that can, intentionally or unintentionally, mask intellectual and social biases. Despite the importance of proper sequencing in rubric review and the wide adoption of rubrics in high-stakes peer review contexts, there is little to no research on the order in which reviewers score rubric elements in practice. Using a large dataset of preliminary scores for R01 proposals submitted to the National Institutes of Health (NIH) in fiscal years 2014-2016, we employ causal discovery methodology to investigate the causal direction in which assigned reviewers tended to score criteria (Significance, Investigator(s), Innovation, Approach, and Environment) and Overall Impact before panel discussion. We find that Overall Impact tends to be evaluated before Approach – which focuses on scientific strategy, methodology, analyses, and feasibility. We also find that Investigator and Environment tend to be evaluated first, before evaluations of scientific criteria relevant to the content of the proposed research. This evidence stresses the importance of structuring and sequencing rubric review processes to minimize the potential for normatively backwards assessment.
(This is joint work with Carole J. Lee, Fan Xia, Kwun C. G. Chan, Sheridan Grant, and Thomas S. Richardson.)

Speaker

Elena Erosheva, University of Washington

Tendencies toward triadic closure: Field-experimental evidence

Empirical social networks are characterized by a high degree of triadic closure (i.e., transitivity or clustering), whereby network neighbors of the same individual are also likely to be directly connected. It is unknown to what degree this results from dispositions to form such relationships (i.e., to close open triangles) per se or whether it reflects other processes, such as homophily and more opportunities for exposure. These are difficult to disentangle in many settings, but in social media not only can they be decomposed, but platforms frequently make decisions that can depend on these distinct processes. Here, using a field experiment on Twitter, we randomize the existing network structure that a user faces when followed by a target account that we control, and we examine whether they reciprocate this tie formation. Being randomly assigned to have an existing tie to an account that follows the target user increases tie formation by 35%. Through the use of multiple control conditions in which the relevant tie is absent (never existent or removed), we attribute this effect to small variation in the stimulus that indicates the presence (or absence) of a potential mutual follower.
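To make the notion of triadic closure concrete, the sketch below computes global transitivity, the share of connected triples (two neighbors of a common node) that are closed into triangles. It assumes an undirected toy graph with hypothetical node names. Twitter follow ties are directed, so this illustrates the descriptive measure only, not the experimental design.

```python
from itertools import combinations

# Hypothetical undirected network as an adjacency map.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def transitivity(adj: dict[str, set[str]]) -> float:
    """Fraction of connected triples whose two end nodes are also tied."""
    triples = 0
    closed = 0
    for node, neighbors in adj.items():
        # Each unordered pair of neighbors forms a triple centered on `node`.
        for u, v in combinations(sorted(neighbors), 2):
            triples += 1
            if v in adj[u]:
                closed += 1
    return closed / triples if triples else 0.0


print(transitivity(graph))  # 0.6
```

Here the triangle A-B-C closes three of the five connected triples, while the two triples running through C to D remain open.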

Speaker

Dean Eckles