
Text Cluster Profiling using Generative Language Models and Vector Search

Presented During: SPEED 5: Machine Learning, Visualization, and Nonparametric Statistical Approaches, Part 1

Peter Baumgartner Co-Author
RTI International

Anthony Berghammer Co-Author
RTI International

Alexander Preiss First Author
RTI International

Alexander Preiss Presenting Author
RTI International

Tuesday, Aug 6: 10:05 AM - 10:10 AM
2563
Contributed Speed

Oregon Convention Center

Text clustering is a common tool for identifying natural groupings in a set of documents. But once you have the clusters, how do you know what they represent? The answer is often manual review by subject matter experts, which introduces a bottleneck. In prior work, we showed how generative language models can be used to name and describe text clusters. Here, we add a vector search step, which assesses the quality of both the cluster and the cluster's description. First, we use a generative language model to generate a brief description of each cluster. Next, we query a vector database of document embeddings to identify the documents most similar to each cluster description. Finally, we calculate the F1 score of the query results, treating the documents in each cluster as the relevant set. As a proof of concept, we fit five HDBSCAN models to the 20 Newsgroups dataset: one with the correct number of clusters (20), and others with 5, 10, 40, and 80 clusters. We ran this pipeline for each clustering model, as well as for the true 20 Newsgroups classes. Results show how our approach can be used to profile clusters and compare models, and what values to expect relative to a ground truth.
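The scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cosine similarity as the search metric, retrieves as many documents as the cluster contains (the abstract does not state how many results are returned per query), and uses a small in-memory dictionary of made-up two-dimensional vectors in place of a real vector database and embedding model. The names `top_k`, `f1_score`, `doc_vecs`, and `description_vec` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return set(ranked[:k])


def f1_score(retrieved, relevant):
    """F1 of a retrieved set, with the cluster's members as the relevant set."""
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Toy stand-in for a vector database: document id -> embedding (made-up values).
doc_vecs = {
    "d0": [1.0, 0.1],
    "d1": [0.9, 0.2],
    "d2": [0.8, 0.3],
    "d3": [0.1, 1.0],
    "d4": [0.2, 0.9],
    "d5": [0.7, 0.6],
}

# Documents the clustering model assigned to the cluster being profiled.
cluster_members = {"d0", "d1", "d5"}

# Stand-in for the embedding of the generated cluster description.
description_vec = [1.0, 0.0]

# Assumption: retrieve as many documents as the cluster has members.
retrieved = top_k(description_vec, doc_vecs, k=len(cluster_members))
score = f1_score(retrieved, cluster_members)
print(sorted(retrieved), round(score, 3))
```

In this toy case the query returns two of the three cluster members plus one outsider, so precision and recall are both 2/3. A low score for a cluster could point to a poor description, an incoherent cluster, or both, which is why the abstract frames the score as assessing the two together.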

Keywords

Machine Learning

Generative AI

Large Language Models

Vector Databases

Natural Language Processing


Main Sponsor

Section on Statistical Learning and Data Science