Text Analysis for Statisticians: Introduction to Advanced Language Modeling

Abstract Number:

1161 

Submission Type:

Professional Development Course/CE  

Participants:

Karl Pazdernik (1), Robin Cosbey (1)

Institutions:

(1) Pacific Northwest National Laboratory, N/A

Co-Instructor:

Robin Cosbey  
Pacific Northwest National Laboratory

Primary Instructor:

Karl Pazdernik  
Pacific Northwest National Laboratory

Description:

This course will provide a broad overview of text analysis and natural language processing (NLP), including a significant amount of introductory material with extensions to state-of-the-art methods. All aspects of the text analysis pipeline will be covered, including data preprocessing, converting text to numeric representations (from simple aggregation methods to more complex embeddings), and training supervised and unsupervised learning methods for standard text-based tasks such as named entity recognition (NER), sentiment analysis, topic modeling, and text generation using Large Language Models (LLMs). The course will alternate between presentations and hands-on exercises in Python. Translations from Python to R will be provided for students more comfortable in R, and support will be given for both Mac and Windows users. Attendees should be familiar with Python (preferably), R, or both, and should have a basic understanding of statistics and/or machine learning. Attendees will gain the practical skills necessary to begin using text analysis tools for their own tasks, an understanding of the strengths and weaknesses of these tools, and an appreciation for the ethical considerations of using these tools in practice.

Instructor Background:

Dr. Karl Pazdernik is a senior data scientist at Pacific Northwest National Laboratory and a research assistant professor at North Carolina State University (NCSU). He received his Ph.D. in Statistics from Iowa State University. His research has focused on the dynamic modeling of multi-modal data with a particular interest in text analytics, spatial statistics, pattern recognition, anomaly detection, Bayesian statistics, and computer vision.

Robin Cosbey is a research data scientist at Pacific Northwest National Laboratory focusing on natural language processing and foundation modeling. She received her M.S. in Computer Science from Western Washington University with a focus on applied deep learning. Her research interests include model evaluation and explainability, including the identification of biases and their broader impacts.

This course was previously taught at JSM in 2022 and 2023 with exemplary survey results.

Course Outline:

1. Introduction - Students will be introduced to examples of applied text analysis and the challenges of text data.
2. Ethics and Bias - The variety of ethical concerns in text analysis will be discussed, including concerns related to gender, race, finances, the environment, privacy, and tool misuse.
3. Text Data Preprocessing - Students will learn standard text preprocessing methods such as tokenization, stemming, and cleaning capitalization and punctuation.
4. Text Representations - Numeric representations for text will be introduced. Aggregation methods such as bag-of-words, n-grams, and TF-IDF will be covered along with a brief introduction to neural network embedding approaches. Analytic goals such as regression, classification, dimension reduction, and clustering will be introduced in the context of text analysis.
5. Modeling - Students will be exposed to supervised and self-supervised deep learning approaches for tasks such as NER, sentiment analysis, topic modeling, and text generation.
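
As a rough illustration of the preprocessing and text representation steps in the outline above (items 3 and 4), the following is a minimal Python sketch using only the standard library. It is not taken from the course materials. The function names (preprocess, bag_of_words, tf_idf) and the three example sentences are hypothetical, stemming is omitted, and the TF-IDF weighting shown is the plain unsmoothed form (term frequency times the log of inverse document frequency); software packages commonly use smoothed or normalized variants.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)


def tf_idf(docs_tokens):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs_tokens)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log!",
    "Cats and dogs are pets.",
]
tokenized = [preprocess(d) for d in docs]
counts = [bag_of_words(t) for t in tokenized]
weights = tf_idf(tokenized)
```

In this toy corpus, "cat" occurs in only one document while "sat" occurs in two, so "cat" receives the larger weight in the first document. This is the basic idea behind TF-IDF: terms that are rare across the corpus are treated as more informative than terms that appear in most documents.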

Learning Outcomes:

1. Introduction - At the end of this section, students will understand what makes text data unique and will be able to identify areas where text analysis can be a powerful tool.
2. Ethics and Bias - At the end of this section, students will have a greater appreciation for the sensitivity of models built through text analysis and ethical concerns regarding their usage. Students will understand the importance of training data and the context surrounding words when making interpretations.
3. Text Data Preprocessing - At the end of this section, students will be able to identify common pitfalls in text preprocessing and will be able to properly clean their text data for downstream processes with well-known software tools.
4. Text Representations - At the end of this section, students will be able to convert their text data into a numeric representation useful for follow-on analytical methods. Students will know the difference between bag-of-words and embedding approaches and will be able to identify the strengths and weaknesses of each approach. Students will be able to fit both supervised and unsupervised methods, assessing model performance for each.
5. Modeling - At the end of this section, students will have a deeper understanding of deep learning, transformers, and large language models. At this point, students will be able to piece together all components of text analysis, applying both supervised and unsupervised learning methods. Students will be able to fit deep learning models and assess their performance, all while being mindful of the ethical implications of the models that they create.

Sponsors:

Section on Statistical Learning and Data Science 2
Section on Statistics in Defense and National Security 3
Section on Text Analysis 1

Do you need additional equipment for your course?

Yes

Length of Course (pick 1)

Full Day Course