Impact of Text Preprocessing Techniques on Fake News Detection

Conference: Women in Statistics and Data Science 2024
10/17/2024: 11:45 AM - 1:15 PM EDT
Speed 

Description

The journey from raw text to actionable insights in natural language processing involves several critical preprocessing stages. These stages prepare textual data for analysis through strategies such as eliminating infrequently occurring words, removing stopwords, removing numerical tokens, and standardizing text to lowercase. The processed text then undergoes word embedding using algorithms such as Word2Vec and BERT. This study examines how various text preprocessing and word embedding techniques influence the effectiveness of fake news detection systems. Specifically, it considers the roles that the choice of classification, embedding, and preprocessing techniques play in optimizing key metrics, namely accuracy, precision, sensitivity, and specificity, for fake news identification. Our findings indicate that retaining stopwords, particularly in conjunction with BERT embeddings, improves the performance of fake news detection models, as does careful selection of the word-frequency threshold used to filter rare words.
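The preprocessing steps named above can be sketched as a small pipeline. This is an illustrative sketch only: the stopword list, the `min_count` threshold, and the `keep_stopwords` flag are assumptions for demonstration, not the authors' actual settings.

```python
from collections import Counter

# Illustrative stopword list; in practice a full list (e.g. from a
# standard corpus) would be used.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def preprocess(docs, min_count=2, keep_stopwords=False):
    """Lowercase each document, strip numeric tokens, optionally remove
    stopwords, and drop words occurring fewer than `min_count` times
    across the whole corpus (the word-frequency threshold)."""
    tokenized = [[w.lower() for w in doc.split()] for doc in docs]
    # Remove purely numerical tokens
    tokenized = [[w for w in doc if not w.isdigit()] for doc in tokenized]
    # The study found keeping stopwords can help with BERT embeddings,
    # so removal is made optional here.
    if not keep_stopwords:
        tokenized = [[w for w in doc if w not in STOPWORDS]
                     for doc in tokenized]
    # Corpus-wide frequency threshold for infrequently occurring words
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]

docs = ["The economy grew 3 percent", "The economy shrank in 2020"]
print(preprocess(docs))  # → [['economy'], ['economy']]
```

The cleaned token lists would then be passed to an embedding step such as Word2Vec or a BERT tokenizer before classification.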

Presenting Author

Jessica Hauschild, United States Air Force Academy

First Author

Jessica Hauschild, United States Air Force Academy

Co-Author

Kent Eskridge, University of Nebraska, Statistics Department

Target Audience

Mid-Level

Tracks

Knowledge
Women in Statistics and Data Science 2024