Text Cluster Profiling using Generative Language Models and Vector Search

Abstract Number:

2563 

Submission Type:

Contributed Abstract 

Contributed Abstract Type:

Speed 

Participants:

Alexander Preiss (1), Peter Baumgartner (1), Anthony Berghammer (1)

Institutions:

(1) RTI International, N/A

Co-Author(s):

Peter Baumgartner  
RTI International
Anthony Berghammer  
RTI International

First Author:

Alexander Preiss  
RTI International

Presenting Author:

Alexander Preiss  
RTI International

Abstract Text:

Text clustering is commonly used to identify natural groupings in a set of documents. But once you have the clusters, how do you know what they represent? The answer is often manual review by subject matter experts, which creates a bottleneck. In prior work, we showed how generative language models can be used to name and describe text clusters. Here, we add a vector search step that assesses the quality of both the cluster and its description. First, we use a generative language model to write a brief description of each cluster. Next, we query a vector database of document embeddings to identify the documents most similar to each cluster description. Finally, we calculate the F1 score of the query results against the documents assigned to each cluster. As a proof of concept, we fit five HDBSCAN models to the 20 Newsgroups dataset: one with the correct number of clusters (20), and others with 5, 10, 40, and 80 clusters. We ran this pipeline for each clustering model, as well as for the true 20 Newsgroups classes. Results show how our approach can be used to profile clusters and compare models, and what F1 values to expect relative to a ground truth.
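
The scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: brute-force cosine similarity in NumPy stands in for the vector database query, the number of retrieved documents k is assumed to default to the cluster size, and the function name retrieval_f1, the toy embeddings, and the description embedding are all hypothetical. In practice the description would come from a generative language model and be embedded with the same model as the documents.

```python
import numpy as np


def retrieval_f1(doc_embeddings, query_embedding, cluster_labels, cluster_id, k=None):
    """F1 of the top-k documents nearest to a cluster description.

    Assumptions (not from the abstract): brute-force cosine similarity
    stands in for the vector database query, and k defaults to the
    number of documents in the cluster.
    """
    docs = np.asarray(doc_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    in_cluster = np.asarray(cluster_labels) == cluster_id

    if k is None:
        k = int(in_cluster.sum())
    if k == 0:
        return 0.0

    # Cosine similarity between the description and every document.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarity = docs_unit @ query_unit

    # Indices of the k most similar documents (the "query results").
    retrieved = np.argsort(-similarity)[:k]

    true_positives = int(in_cluster[retrieved].sum())
    if true_positives == 0:
        return 0.0
    precision = true_positives / k
    recall = true_positives / int(in_cluster.sum())
    return 2 * precision * recall / (precision + recall)


# Toy data: four documents in a 2-d embedding space, two clusters.
doc_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
cluster_labels = np.array([0, 0, 1, 1])
# Hypothetical embedding of the generated description for cluster 0.
description_embedding = np.array([1.0, 0.05])

score = retrieval_f1(doc_embeddings, description_embedding, cluster_labels, cluster_id=0)
print(score)
```

Under these assumptions, a description that retrieves exactly the documents in its cluster scores 1, and the score falls as the retrieved set and the cluster diverge, whether because the cluster is incoherent or because the description fails to capture it.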

Keywords:

Machine Learning|Generative AI|Large Language Models|Vector Databases|Natural Language Processing

Sponsors:

Section on Statistical Learning and Data Science

Tracks:

Machine Learning

I have read and understand that JSM participants must abide by the Participant Guidelines.

Yes

I understand that JSM participants must register and pay the appropriate registration fee by June 1, 2024. The registration fee is non-refundable.

I understand