Text Cluster Profiling using Generative Language Models and Vector Search

Peter Baumgartner Co-Author
RTI International
 
Anthony Berghammer Co-Author
RTI International
 
Alexander Preiss First Author
RTI International
 
Alexander Preiss Presenting Author
RTI International
 
Tuesday, Aug 6: 10:05 AM - 10:10 AM
2563 
Contributed Speed 
Oregon Convention Center 
Text clustering is a common tool for identifying natural groupings in a set of documents. But once you have the clusters, how do you know what they represent? The answer is often manual review by subject matter experts, which introduces a bottleneck. In prior work, we showed how generative language models can be used to name and describe text clusters. Here, we add a vector search step that assesses the quality of both the cluster and the cluster's description. First, we use a generative language model to generate a brief description of each cluster. Next, we query a vector database of document embeddings to identify the documents most similar to each cluster description. Finally, we calculate the F1 score of the query results relative to the documents in each cluster. As a proof of concept, we fit five HDBSCAN models to the 20 Newsgroups dataset: one with the correct number of clusters (20), and others with 5, 10, 40, and 80 clusters. We ran this pipeline for each clustering model, as well as for the true 20 Newsgroups classes. Results show how our approach can be used to profile clusters and compare models, and what values to expect relative to a ground truth.
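The scoring step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a brute-force cosine-similarity search in NumPy stands in for the vector database, the number of documents retrieved per query is set to the cluster size (the abstract does not state a cutoff), and the function names `top_k_by_cosine`, `cluster_f1`, and `profile_clusters` are hypothetical. The cluster descriptions are assumed to have already been generated and embedded with the same model as the documents.

```python
import numpy as np


def top_k_by_cosine(doc_vecs, query_vec, k):
    """Return indices of the k documents most similar to query_vec.

    Brute-force cosine similarity, standing in for a vector database query.
    """
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = docs @ query
    return np.argsort(-scores)[:k]


def cluster_f1(retrieved, members):
    """F1 of the retrieved document ids, with cluster members as the relevant set."""
    retrieved, members = set(retrieved), set(members)
    true_positives = len(retrieved & members)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(members)
    return 2 * precision * recall / (precision + recall)


def profile_clusters(doc_vecs, labels, description_vecs):
    """Score each cluster description by how well it retrieves its own cluster.

    description_vecs maps cluster id -> embedding of that cluster's description.
    Assumption: retrieve as many documents as the cluster contains.
    """
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    labels = np.asarray(labels)
    scores = {}
    for cluster_id, query_vec in description_vecs.items():
        members = np.flatnonzero(labels == cluster_id)
        retrieved = top_k_by_cosine(
            doc_vecs, np.asarray(query_vec, dtype=float), k=len(members)
        )
        scores[cluster_id] = cluster_f1(retrieved.tolist(), members.tolist())
    return scores
```

Under this sketch, a high score means the description retrieves mostly the documents in its own cluster, while a low score points to a vague description, a diffuse cluster, or both. Averaging the per-cluster scores would give one number per clustering model for comparison.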

Keywords

Machine Learning

Generative AI

Large Language Models

Vector Databases

Natural Language Processing 

Main Sponsor

Section on Statistical Learning and Data Science