Text Cluster Profiling using Generative Language Models and Vector Search
Abstract Number:
2563
Submission Type:
Contributed Abstract
Contributed Abstract Type:
Speed
Participants:
Alexander Preiss (1), Peter Baumgartner (1), Anthony Berghammer (1)
Institutions:
(1) RTI International, N/A
Abstract Text:
Text clustering is commonly used to identify natural groupings in a set of documents. But once you have the clusters, how do you know what they represent? The answer is often manual review by subject matter experts, which introduces a bottleneck. In prior work, we showed how generative language models can be used to name and describe text clusters. Here, we add a vector search step that assesses the quality of both each cluster and its description. First, we use a generative language model to generate a brief description of each cluster. Next, we query a vector database of document embeddings to identify the documents most similar to each cluster description. Finally, we calculate the F1 score of the query results relative to the documents in each cluster. As a proof of concept, we fit five HDBSCAN models to the 20 Newsgroups dataset: one with the correct number of clusters (20), and others with 5, 10, 40, and 80 clusters. We ran this pipeline for each clustering model, as well as for the true 20 Newsgroups classes. Results show how our approach can be used to profile clusters and compare models, and what values to expect relative to a ground truth.
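The scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity ranking, the choice to retrieve as many documents as the cluster contains, the toy two-dimensional embeddings, and all function names (`cosine`, `top_k`, `f1`, `cluster_f1`) are assumptions made for illustration. An in-memory list stands in for the vector database, and the description embedding is supplied directly rather than produced by a language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, doc_vectors, k):
    """Indices of the k documents most similar to the query vector."""
    ranked = sorted(
        range(len(doc_vectors)),
        key=lambda i: cosine(query, doc_vectors[i]),
        reverse=True,
    )
    return set(ranked[:k])


def f1(retrieved, relevant):
    """F1 score of a retrieved set against a relevant set."""
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


def cluster_f1(description_vector, doc_vectors, labels, cluster_id, k=None):
    """F1 of the search results for one cluster description.

    Assumption: when k is not given, retrieve as many documents as the
    cluster has members. The abstract does not state how k is chosen.
    """
    members = {i for i, label in enumerate(labels) if label == cluster_id}
    k = len(members) if k is None else k
    return f1(top_k(description_vector, doc_vectors, k), members)


# Hypothetical toy data: four documents in two clusters.
doc_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

good_description = [1.0, 0.0]  # points toward cluster 0
poor_description = [0.0, 1.0]  # points toward cluster 1

print(cluster_f1(good_description, doc_vectors, labels, cluster_id=0))
print(cluster_f1(poor_description, doc_vectors, labels, cluster_id=0))
```

In this sketch, a description whose embedding lies close to its own cluster's documents scores high, while a description that retrieves documents from other clusters scores low. A low score can therefore reflect either a poorly separated cluster or a poor description.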
Keywords:
Machine Learning|Generative AI|Large Language Models|Vector Databases|Natural Language Processing
Sponsors:
Section on Statistical Learning and Data Science
Tracks:
Machine Learning