Tuesday, Aug 5: 2:00 PM - 3:50 PM
4124
Contributed Posters
Music City Center
Room: CC-Hall B
Main Sponsor
Section on Text Analysis
Presentations
We benchmark various graph-based retrieval-augmented generation (RAG) systems across a broad spectrum of query types, including OLTP-style (fact-based) and OLAP-style (thematic) queries, to address the complex demands of open-domain question answering (QA). Traditional RAG methods often fall short in handling nuanced, multi-document synthesis tasks. By structuring knowledge as graphs, we can facilitate the retrieval of context that captures greater semantic depth and enhances language model operations. We explore various graph-based RAG methodologies and introduce TREX, a novel, cost-effective alternative that combines graph-based and vector-based retrieval techniques. Our extensive benchmarking across four diverse datasets highlights scenarios where each approach excels and reveals the limitations of current evaluation methods, motivating new metrics for assessing answer correctness. In a real-world technical support case study, we demonstrate how graph-based RAG can surpass conventional vector-based RAG in efficiently synthesizing data from heterogeneous sources.
Keywords
GraphRAG
TREX
question answering
LLM
Large Language Models
benchmarking
Co-Author(s)
Prerna Singh, Microsoft
Nick Litombe, Microsoft
Mirco Milletari, Microsoft
Jonathan Larson, Microsoft
Ha Trinh, Microsoft
Yiwen Zhu, Microsoft
Andreas Mueller, Microsoft
Fotis Psallidas, Microsoft
Carlo Curino, Microsoft
First Author
Joyce Cahoon, Microsoft
Presenting Author
Joyce Cahoon, Microsoft
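As a rough illustration of what combining vector-based and graph-based retrieval can mean in general, the sketch below ranks documents by bag-of-words cosine similarity and then expands the top hit along graph edges. It is not TREX or any system benchmarked in the abstract above; the documents, the graph, and the function names `cosine` and `hybrid_retrieve` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_retrieve(query, docs, graph, k=1):
    """Pick the top-k documents by vector similarity, then add their graph neighbors."""
    q = Counter(query.lower().split())
    vecs = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    ranked = sorted(docs, key=lambda d: cosine(q, vecs[d]), reverse=True)
    context = list(ranked[:k])
    # Graph expansion: pull in linked documents the query terms alone would miss.
    for seed in ranked[:k]:
        for neighbor in graph.get(seed, []):
            if neighbor not in context:
                context.append(neighbor)
    return context


# Hypothetical toy corpus and link graph (assumptions for illustration only).
docs = {
    "d1": "printer driver install fails on windows",
    "d2": "driver package requires firmware update",
    "d3": "office coffee machine schedule",
}
graph = {"d1": ["d2"]}

context = hybrid_retrieve("printer install fails", docs, graph, k=1)
print(context)
```

Here the query shares no terms with `d2`, so similarity alone would not surface it; the edge from `d1` does.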
This paper presents an interface that uses machine learning (ML) and large language models (LLMs) to enhance predictive analytics, patient experience, and performance improvement; ultimately, this user-friendly tool will deliver actionable insights through an interactive and shareable dashboard. The paper first illustrates the importance of a comprehensive healthcare analytics platform and advanced analytical tools, driven by the exponential growth in the availability of meaningful data and text. It then outlines our team's approach to integrating various data sources, applying ML algorithms, and deploying LLMs for natural language processing. The interface shows potential to improve predictive analytics by identifying patterns and trends, which can then be used to address complexities related to the patient experience. The impact of integrating ML and LLMs in healthcare analytics is transformative and warrants broader adoption.
Keywords
Comprehensive Analytics Platform
Machine Learning (ML)
Large Language Models (LLMs)
Predictive Analytics
Patient Experience
Performance Improvement
OpenAlex, a free and open-source citation network repository, presents an exciting alternative to Elsevier's proprietary SCOPUS database, offering broader coverage of scholarly work, including papers in diverse languages. However, OpenAlex suffers from metadata inconsistencies, such as missing institutional affiliations and incomplete author records, which limit its utility for research purposes. This project explores the application of collective entity resolution to clean and enhance OpenAlex's citation network data. Unlike traditional methods, collective entity resolution utilizes co-authorship ties in addition to author attributes to disambiguate and enrich metadata. Using a subset of authors (recent recipients of the NSF Graduate Research Fellowship Program, GRFP), we evaluate the approach's ability to improve data quality and offer insights on network structure. Our findings highlight the potential of collective entity resolution to transform OpenAlex into a high-quality, open-access alternative to SCOPUS, fostering equity and accessibility in academic research.
Keywords
entity resolution
bibliometrics
open source
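The idea of using co-authorship ties alongside author attributes, as described in the OpenAlex abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the record layout, the weight `alpha`, the `threshold`, and the function names are all hypothetical, and the records do not follow the OpenAlex schema.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_sim(a, b):
    """Attribute evidence: string similarity of two author names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    """Overlap of two sets, 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve(records, alpha=0.6, threshold=0.55):
    """records maps record id -> (name, set of co-author record ids)."""
    cluster = {rid: rid for rid in records}  # record id -> cluster label
    changed = True
    while changed:  # each merge removes one cluster, so this terminates
        changed = False
        for a, b in combinations(sorted(records), 2):
            if cluster[a] == cluster[b]:
                continue
            # Relational evidence: compare co-authors by their *current* cluster
            # labels, so earlier merges can unlock later ones.
            co_a = {cluster[c] for c in records[a][1]}
            co_b = {cluster[c] for c in records[b][1]}
            score = alpha * name_sim(records[a][0], records[b][0])
            score += (1 - alpha) * jaccard(co_a, co_b)
            if score >= threshold:
                old, new = cluster[b], cluster[a]
                for rid in cluster:
                    if cluster[rid] == old:
                        cluster[rid] = new
                changed = True
    return cluster


# Hypothetical author records (assumptions for illustration only).
records = {
    "r1": ("M. Lopez", {"r2"}),
    "r2": ("Wei Chen", {"r1"}),
    "r3": ("Maria Lopez", {"r4"}),
    "r4": ("Wei Chen", {"r3"}),
}

clusters = resolve(records)
print(clusters)
```

In this toy input, "M. Lopez" and "Maria Lopez" are too dissimilar to merge on names alone at the chosen threshold. Once the two "Wei Chen" records are merged, the shared co-author raises the score enough to link them.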
Precision medicine aims to tailor treatments to the unique characteristics of individual patients. In this paper, we develop a classification-based approach to estimate an individualized treatment rule (ITR) by leveraging both structured quantitative data and high-dimensional unstructured textual documents. To tackle the challenge of incorporating text data, we propose an outcome-driven supervised nonnegative matrix factorization method that extracts relevant topics for ITR estimation in a one-step procedure. Our proposed method factorizes vectorized documents into a document-topic matrix and a topic-word matrix, guided by the outcome. For estimation, we construct a weighted and penalized objective function, solved by a projected gradient approach, to jointly estimate the document representation and the ITR. Our formulation makes the effect of the learned topics on the ITR interpretable. We demonstrate the performance of our method through simulation studies and a real-world example from the MIMIC-IV intensive care unit dataset.
Keywords
electronic health record
individualized treatment rule
natural language processing
topic modeling
precision medicine
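To make the factorization step in the precision medicine abstract above more concrete, the sketch below factorizes a small document-word count matrix into a document-topic matrix and a topic-word matrix using projected gradient updates. It covers only plain, unsupervised nonnegative matrix factorization: the outcome guidance, weighting, penalty, and ITR estimation from the abstract are not reproduced, since the abstract does not specify them. The data, the step-size rule, and all function names are assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def loss(x, w, h):
    """Squared Frobenius reconstruction error ||X - WH||^2."""
    wh = matmul(w, h)
    return sum((xv - rv) ** 2 for xr, rr in zip(x, wh) for xv, rv in zip(xr, rr))


def projected_step(x, w, h):
    """One projected gradient step on W for ||X - WH||^2, with H held fixed."""
    ht = transpose(h)
    hht = matmul(h, ht)
    # Step 1/L, where L = 2 * lambda_max(HH^T) is bounded by 2 * trace(HH^T).
    step = 1.0 / (2.0 * sum(hht[i][i] for i in range(len(hht))) + 1e-12)
    resid = [[rv - xv for xv, rv in zip(xr, rr)] for xr, rr in zip(x, matmul(w, h))]
    grad = [[2.0 * g for g in row] for row in matmul(resid, ht)]
    # Gradient step, then projection onto the nonnegative orthant.
    return [[max(0.0, wv - step * gv) for wv, gv in zip(wr, gr)]
            for wr, gr in zip(w, grad)]


def nmf(x, k, iters=200, seed=0):
    """Alternate projected gradient steps on W (doc-topic) and H (topic-word)."""
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(k)] for _ in x]
    h = [[rng.random() for _ in x[0]] for _ in range(k)]
    losses = [loss(x, w, h)]
    for _ in range(iters):
        w = projected_step(x, w, h)
        # The H update is the same problem on the transposed matrices.
        h = transpose(projected_step(transpose(x), transpose(h), transpose(w)))
        losses.append(loss(x, w, h))
    return w, h, losses


# Hypothetical 4-document, 5-word count matrix (assumption for illustration only).
X = [
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 4, 3, 0],
    [0, 1, 3, 4, 0],
]

W, H, losses = nmf(X, k=2)
print(round(losses[0], 3), round(losses[-1], 3))
```

The projection (clipping at zero) is what keeps both factors nonnegative, which in turn is what lets rows of `H` be read as topics and rows of `W` as topic loadings per document.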