Causal Inference and Generative AI: Discovering Causal Relations Using Large Language Models

Chair: Summer Han, Stanford University

Organizer: Zhenjiang Fan, Stanford University School of Medicine

Sunday, Aug 3: 4:00 PM - 5:50 PM
Session 0941: Topic-Contributed Paper Session
Music City Center, Room: CC-104B

Keywords

Large language model

Causal inference and discovery

Machine learning

AI 

Applied: Yes

Main Sponsor

Section on Text Analysis

Co-Sponsors

Section on Statistical Computing

Presentations

Estimating Causal Relationships in Complex Systems from Tabular Data Using Language Models

Large language models (LLMs) are increasingly applied in scientific research because of their advanced reasoning capabilities; multi-modal LLMs, for instance, can process diverse data types as inputs, expanding their utility across domains. However, while traditional causal discovery methods focus primarily on tabular data, existing language models are largely limited to inferring causal relationships from text. In this work, we leverage the reasoning capabilities of language models to infer and discover causal relationships directly from tabular data. The proposed framework builds on the Mamba (state-space model) language model architecture with added layers for classification. To ensure robustness and generalizability, the training procedure incorporates a diverse range of simulated data and 10 curated real-world datasets. The framework is also designed to be extensible, so users can easily integrate their own data as well as additional scores and tests. Our results demonstrate that the proposed framework outperforms existing methods in accuracy.
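
The abstract describes the architecture only at a high level. As a rough illustration of the general idea (not the authors' implementation), the sketch below classifies the causal relation of a variable pair by treating its paired samples as a token sequence fed through a sequence encoder with a classification head. The encoder here is a GRU stand-in; in the described framework, Mamba (state-space) layers would take its place. All names and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class PairwiseCausalClassifier(nn.Module):
    """Classify the relation of a variable pair (X, Y) from joint samples.

    Each example is a sequence of n paired observations [(x_i, y_i)] treated
    as tokens; the head predicts {0: X -> Y, 1: Y -> X, 2: no edge}.
    """

    def __init__(self, d_model: int = 64, n_classes: int = 3):
        super().__init__()
        self.embed = nn.Linear(2, d_model)         # one "token" per (x_i, y_i)
        # Stand-in sequence encoder; the framework described in the talk
        # would use Mamba (state-space) layers here instead of a GRU.
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)  # added classification layers

    def forward(self, pairs: torch.Tensor) -> torch.Tensor:
        # pairs: (batch, n_samples, 2)
        h = self.embed(pairs)
        _, last = self.encoder(h)                  # final hidden state summarizes the pair
        return self.head(last[-1])                 # (batch, n_classes) logits

# Toy usage: simulate X -> Y pairs and score them.
torch.manual_seed(0)
x = torch.randn(8, 200, 1)
y = x ** 2 + 0.1 * torch.randn(8, 200, 1)          # Y generated from X
model = PairwiseCausalClassifier()
logits = model(torch.cat([x, y], dim=-1))
print(logits.shape)  # torch.Size([8, 3])
```

Training such a classifier on labeled simulations plus curated real-world pairs, as the abstract describes, is what lets the model generalize to unseen tabular systems.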

Keywords

Large language model

Causal inference

Causal discovery

Tabular data

Complex causal systems 

Speaker

Zhenjiang Fan, Stanford University School of Medicine

When Causal Discovery Meets LLMs

Traditional methods for causal graph recovery often rely on statistical estimation or expert input, which can be limited by bias and incomplete knowledge. In this presentation, we introduce an approach that integrates large language models (LLMs) with constraint-based causal discovery to infer causal structures from scientific literature. Our method employs LLMs as knowledge extractors to identify associational relationships among variables from extensive scientific corpora. These relationships are then refined into causal graphs via constraint-based algorithms that eliminate inconsistent connections.

Rather than depending on LLMs for complex causal reasoning, our method leverages their strength in interpreting and extracting information from large-scale scientific texts. This allows us to uncover nuanced associational and causal insights without relying solely on the models' reasoning capabilities. By integrating textual knowledge extraction with causal inference techniques, our method provides a scalable, automated solution for causal discovery, mitigating human bias and harnessing the collective knowledge embedded in scientific discourse.
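
A rough sketch of this two-stage pipeline follows (hypothetical helper names, not the speakers' code): an LLM first proposes associated variable pairs from the literature, and a constraint-based pass then removes any proposed edge that a conditional-independence test finds spurious. The `extract_associations_with_llm` function is a placeholder for the knowledge-extraction step.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def extract_associations_with_llm(variables):
    """Placeholder for the LLM knowledge-extraction step: in the described
    method, an LLM reads scientific corpora and returns variable pairs it
    deems associated. Hard-coded here for illustration."""
    return {("smoking", "cancer"), ("smoking", "coffee"), ("coffee", "cancer")}

def fisher_z_pvalue(data, cols, i, j, cond):
    """Fisher-z test of the partial correlation of i and j given cond."""
    idx = [cols.index(v) for v in (i, j, *cond)]
    prec = np.linalg.pinv(np.corrcoef(data[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z)))

def prune_with_ci_tests(edges, data, cols, alpha=0.01):
    """Constraint-based refinement: keep an LLM-proposed edge (i, j) only if
    i and j stay dependent given every subset of the remaining variables."""
    kept = set()
    for i, j in edges:
        others = [v for v in cols if v not in (i, j)]
        sets = [()] + [s for k in range(1, len(others) + 1)
                       for s in combinations(others, k)]
        if all(fisher_z_pvalue(data, cols, i, j, s) < alpha for s in sets):
            kept.add((i, j))
    return kept

# Toy data: smoking -> coffee and smoking -> cancer; coffee-cancer is spurious.
rng = np.random.default_rng(0)
smoking = rng.normal(size=5000)
coffee = smoking + rng.normal(size=5000)
cancer = 2 * smoking + rng.normal(size=5000)
data = np.column_stack([smoking, coffee, cancer])
cols = ["smoking", "coffee", "cancer"]
print(prune_with_ci_tests(extract_associations_with_llm(cols), data, cols))
# The coffee-cancer edge is dropped: independent given smoking.
```

The division of labor mirrors the abstract: the LLM supplies breadth of textual knowledge, while the statistical tests supply the rigor that turns associations into a defensible causal skeleton.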
 

Keywords

LLMs

Causal discovery and recovery 

Speaker

Lina Yao, Data61

Using LLM Predictions in Unbiased Causal Estimation with Unobserved Variables

Causal estimation requires assumptions about the underlying data-generating process. To achieve unbiased estimates, we typically assume no unobserved confounding and adjust for confounders that influence both the treatment and the outcome. In application domains such as clinical records, where text may supplement structured data, unobserved confounders can sometimes be accounted for using more complex identification and estimation strategies. However, the large language models (LLMs) that offer strong predictive performance often do not satisfy the statistical assumptions required for causal estimation. This presentation discusses two ways in which causal estimation methods can be augmented with LLMs to enable unbiased estimation of causal effects: through Double Machine Learning or through a measurement-error framework.
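
As a hedged illustration of the first strategy (a generic partially linear Double Machine Learning sketch, not the speaker's implementation): confounder proxies, here including features an LLM might extract from clinical text, are partialled out of both treatment and outcome with cross-fitted ML models, and the effect is estimated from the residuals. `llm_text_features` is a hypothetical stand-in for whatever representation the LLM provides.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_ate(X, t, y, n_splits=5, seed=0):
    """Partially linear Double ML: cross-fit E[t|X] and E[y|X], then regress
    outcome residuals on treatment residuals (Robinson-style estimator)."""
    t_res, y_res = np.zeros_like(t), np.zeros_like(y)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        m_t = RandomForestRegressor(random_state=seed).fit(X[train], t[train])
        m_y = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        t_res[test] = t[test] - m_t.predict(X[test])
        y_res[test] = y[test] - m_y.predict(X[test])
    return float(t_res @ y_res / (t_res @ t_res))

# Toy data: a confounder drives both treatment and outcome; true effect = 2.0.
rng = np.random.default_rng(0)
n = 2000
u = rng.normal(size=n)                              # unobserved confounder
# Hypothetical stand-in for an LLM-derived proxy of confounders that are
# recorded only in free-text clinical notes.
llm_text_features = u[:, None] + 0.1 * rng.normal(size=(n, 1))
t = u + rng.normal(size=n)
y = 2.0 * t + 3.0 * u + rng.normal(size=n)
print(dml_ate(llm_text_features, t, y))             # close to 2.0 despite confounding
```

The measurement-error route mentioned in the abstract addresses the residual gap in such a sketch: because the LLM proxy measures the true confounder with error, a correction model is needed for strictly unbiased estimates.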

Speaker

Zach Wood-Doughty, Northwestern University

Causal Representation Learning and Causal Generative AI

Causality is a fundamental notion in science, engineering, and even in machine learning. Uncovering the causal process behind observed data can naturally help answer 'why' and 'how' questions, inform optimal decisions, and achieve adaptive prediction. In many scenarios, observed variables (such as image pixels and questionnaire results) are often reflections of the underlying causal variables rather than being causal variables themselves. Causal representation learning aims to reveal the underlying hidden causal variables and their relations. In this talk, we show how the modularity property of causal systems makes it possible to recover the underlying causal representations from observational data with identifiability guarantees: under appropriate assumptions, the learned representations are consistent with the underlying causal process. We demonstrate how identifiable causal representation learning can naturally benefit generative AI, with image generation, image editing, and text generation as particular examples. 
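
For readers who want the setup in symbols, here is a common formalization of causal representation learning (standard notation from the literature, not necessarily the speaker's):

```latex
% Observations x are generated from hidden causal variables z: a mixing
% function g maps latents to observations, and each latent follows a
% structural equation with independent noise \epsilon_i.
\[
  x = g(z), \qquad
  z_i = f_i\!\big(\mathrm{pa}(z_i),\, \epsilon_i\big), \quad i = 1, \dots, n.
\]
% An identifiability guarantee states that any estimate (\hat{g}, \hat{z})
% matching the observed distribution recovers the true latents up to simple
% indeterminacies, e.g. a permutation P and component-wise invertible maps h:
\[
  \hat{z} = P \, h(z),
\]
% so the learned representation is consistent with the underlying causal
% process, which is the sense of "identifiability" used in the talk.
```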

Speaker

Kun Zhang, Carnegie Mellon University & MBZUAI