Contributed Poster Presentations: Section on Text Analysis

Shirin Golchi, Chair
McGill University
 
Tuesday, Aug 5: 2:00 PM - 3:50 PM
4124 
Contributed Posters 
Music City Center 
Room: CC-Hall B 

Main Sponsor

Section on Text Analysis

Presentations

63: Benchmarking Graph-based RAG for Open-domain Question Answering

We benchmark various graph-based retrieval-augmented generation (RAG) systems across a broad spectrum of query types, including OLTP-style (fact-based) and OLAP-style (thematic) queries, to address the complex demands of open-domain question answering (QA). Traditional RAG methods often fall short in handling nuanced, multi-document synthesis tasks. By structuring knowledge as graphs, we can facilitate the retrieval of context that captures greater semantic depth and enhances language model operations. We explore various graph-based RAG methodologies and introduce TREX, a novel, cost-effective alternative that combines graph-based and vector-based retrieval techniques. Our extensive benchmarking across four diverse datasets highlights scenarios where each approach excels and reveals the limitations of current evaluation methods, motivating new metrics for assessing answer correctness. In a real-world technical support case study, we demonstrate how graph-based RAG can surpass conventional vector-based RAG in efficiently synthesizing data from heterogeneous sources. 

Keywords

GraphRAG

TREX

question answering

LLM

Large Language Models

benchmarking 

Co-Author(s)

Prerna Singh, Microsoft
Nick Litombe, Microsoft
Mirco Milletari, Microsoft
Jonathan Larson, Microsoft
Ha Trinh, Microsoft
Yiwen Zhu, Microsoft
Andreas Mueller, Microsoft
Fotis Psallidas, Microsoft
Carlo Curino, Microsoft

First Author

Joyce Cahoon, Microsoft

Presenting Author

Joyce Cahoon, Microsoft

64: Integrating Users with Machine Learning and Large Language Models to Healthcare Data and Text

This paper presents an interface that uses machine learning (ML) and large language models (LLMs) to enhance predictive analytics, patient experience, and performance improvement; ultimately, this user-friendly tool will provide actionable insights through an interactive and shareable dashboard. The paper first illustrates the importance of a comprehensive healthcare analytics platform and advanced analytical tools, driven by the exponential growth in the availability of meaningful data and text. It then outlines our team's approach to integrating various data sources, applying ML algorithms, and deploying LLMs for natural language processing. The interface shows potential to improve predictive analytics by identifying patterns and trends, which can then be used to address complexities in the patient experience. The impact of integrating ML and LLMs in healthcare analytics is transformative and warrants broader adoption.

Keywords

Comprehensive Analytics Platform

Machine Learning (ML)

Large Language Models (LLMs)

Predictive Analytics

Patient Experience

Performance Improvement 

Co-Author(s)

Julius Hahn, Keck Medicine of USC
Amanda Schmitz, Keck Medicine of USC
Holly Hallman, Keck Medicine of USC

First Author

Chien-chih Lin, Keck Medicine of USC

Presenting Author

Chien-chih Lin, Keck Medicine of USC

65: Leveraging Collective Entity Resolution to Improve OpenAlex's Citation Network Repository

OpenAlex, a free and open-source citation network repository, presents an exciting alternative to Elsevier's proprietary SCOPUS database, offering broader coverage of scholarly work, including papers in diverse languages. However, OpenAlex suffers from metadata inconsistencies, such as missing institutional affiliations and incomplete author records, which limit its utility for research purposes. This project explores the application of collective entity resolution to clean and enhance OpenAlex's citation network data. Unlike traditional methods, collective entity resolution utilizes co-authorship ties in addition to author attributes to disambiguate and enrich metadata. Using a subset of authors (recent recipients of the NSF Graduate Research Fellowship Program, GRFP), we evaluate the approach's ability to improve data quality and offer insights on network structure. Our findings highlight the potential of collective entity resolution to transform OpenAlex into a high-quality, open-access alternative to SCOPUS, fostering equity and accessibility in academic research.
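As a rough illustration of the idea in this abstract, the sketch below scores a pair of author records on both name similarity and overlap between their already-resolved co-authors, then repeats until no further merges occur. It is not the presenter's method: the record layout, the weight alpha, the threshold, the function names, and the toy records are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def name_sim(a, b):
    """Attribute evidence: character-level similarity of two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(s, t):
    """Relational evidence: overlap between two sets of co-author clusters."""
    if not s and not t:
        return 0.0
    return len(s & t) / len(s | t)


def collective_resolve(records, alpha=0.6, threshold=0.55):
    """records maps record_id -> (name, set of co-author record_ids).

    alpha and threshold are illustrative values, not tuned ones.
    Returns a dict record_id -> representative id of its cluster.
    """
    parent = {r: r for r in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    changed = True
    while changed:  # repeat: a merge can unlock further merges
        changed = False
        ids = sorted(records)
        for i, r in enumerate(ids):
            for s in ids[i + 1:]:
                if find(r) == find(s):
                    continue
                # Compare co-authors by their current cluster, not raw id.
                nr = {find(c) for c in records[r][1]}
                ns = {find(c) for c in records[s][1]}
                score = (alpha * name_sim(records[r][0], records[s][0])
                         + (1 - alpha) * jaccard(nr, ns))
                if score >= threshold:
                    parent[find(s)] = find(r)
                    changed = True
    return {r: find(r) for r in records}


# Invented toy records, for illustration only.
records = {
    "a1": ("J. Smith", {"b1"}),
    "a2": ("Jane Smith", {"b2"}),
    "a3": ("John Smith", {"c1"}),
    "b1": ("Wei Chen", {"a1"}),
    "b2": ("Wei Chen", {"a2"}),
    "c1": ("Maria Lopez", {"a3"}),
}
clusters = collective_resolve(records)
```

In this toy data, the two "Wei Chen" records merge on name alone in the first pass. That merge supplies the shared co-author evidence that links "J. Smith" to "Jane Smith" in the second pass, while "John Smith" stays separate because his co-author differs.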

Keywords

entity resolution

bibliometrics

open source 

First Author

Abigail Smith, NORC at the University of Chicago

Presenting Author

Abigail Smith, NORC at the University of Chicago

66: Supervised Matrix Factorization for Estimating Individualized Treatment Rule

Precision medicine aims to tailor treatments to the unique characteristics of individual patients. In this paper, we develop a classification-based approach to estimate an individualized treatment rule (ITR) by leveraging both structured quantitative data and high-dimensional unstructured textual documents. To tackle the challenge of incorporating text data, we propose an outcome-driven supervised nonnegative matrix factorization method that extracts relevant topics for ITR estimation in a one-step procedure. Our proposed method factorizes vectorized documents into a document-topic matrix and a topic-word matrix, guided by the outcome. For estimation, we construct a weighted and penalized objective function, solved by a projected gradient approach, to jointly estimate the document representation and the ITR. Our formulation enables the interpretability of the effect of the learned topics on the ITR. We demonstrate the performance of our method through simulation studies and a real-world example from the MIMIC-IV intensive care unit dataset.
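The general shape of such a procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' formulation: it assumes a squared reconstruction loss, a weighted logistic surrogate for the treatment rule, a ridge penalty, and a fixed step size, and every name and default (supervised_nmf, lam, ridge, step) is hypothetical.

```python
import numpy as np


def supervised_nmf(X, a, w, k=2, lam=1.0, ridge=0.1, step=1e-3,
                   iters=500, seed=0):
    """Projected-gradient sketch of an outcome-guided NMF.

    X: (n, p) nonnegative document-term matrix
    a: (n,) observed treatments coded as -1 / +1
    w: (n,) nonnegative weights (assumed here to be outcome-based)
    Assumed objective:
        ||X - W H||_F^2
        + lam * sum_i w_i * log(1 + exp(-a_i * W_i @ beta))
        + ridge * ||beta||^2
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, k))   # document-topic matrix
    H = rng.random((k, p))   # topic-word matrix
    beta = np.zeros(k)       # linear rule: recommend sign(W_i @ beta)
    history = []
    for _ in range(iters):
        R = W @ H - X
        margin = a * (W @ beta)
        obj = ((R ** 2).sum()
               + lam * (w * np.logaddexp(0.0, -margin)).sum()
               + ridge * (beta ** 2).sum())
        history.append(obj)
        # Derivative of the weighted logistic loss w.r.t. W_i @ beta;
        # 0.5 * (1 - tanh(m / 2)) is a stable form of 1 / (1 + exp(m)).
        g = -w * a * 0.5 * (1.0 - np.tanh(margin / 2.0))
        grad_W = 2.0 * R @ H.T + lam * np.outer(g, beta)
        grad_H = 2.0 * W.T @ R
        grad_beta = lam * (W.T @ g) + 2.0 * ridge * beta
        # Gradient step, then projection onto the nonnegative orthant.
        W = np.maximum(W - step * grad_W, 0.0)
        H = np.maximum(H - step * grad_H, 0.0)
        beta = beta - step * grad_beta
    return W, H, beta, history


# Synthetic inputs, for illustration only.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.random((20, 10))
a_demo = demo_rng.choice([-1, 1], size=20)
w_demo = demo_rng.random(20)
W, H, beta, history = supervised_nmf(X_demo, a_demo, w_demo)
```

Because the topic loadings W and the rule coefficients beta are updated in the same loop, the topics are shaped by the treatment-rule loss as well as by reconstruction, which is the one-step idea the abstract describes. The sign and size of each entry of beta then indicate how a topic pushes the recommended treatment.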

Keywords

electronic health record

individualized treatment rule

natural language processing

topic modeling

precision medicine 

Co-Author(s)

Lu Tang, University of Pittsburgh
Rebecca Deek, University of Pittsburgh

Presenting Author

Crystal Zang