Wednesday, Aug 6: 2:00 PM - 3:50 PM
0643
Topic-Contributed Paper Session
Music City Center
Room: CC-Davidson Ballroom A3
Applied
Yes
Main Sponsor
Survey Research Methods Section
Co Sponsors
Section on Text Analysis
Social Statistics Section
Presentations
The recent development and wider accessibility of large language models (LLMs) have spurred discussions about how these models can be used in survey research, including for classifying open-ended survey responses. Given their linguistic capacities, LLMs may be an efficient alternative to time-consuming manual coding and to pre-training supervised machine learning models. Because most existing research on this topic has focused on English-language responses to non-complex topics or on single LLMs, it is unclear whether those findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs from the GPT, Llama, and Mistral families and several approaches, including zero-shot prompting, few-shot prompting, and fine-tuning, and evaluate the LLMs' performance against human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Moreover, the LLMs' unequal classification performance across categories of reasons for survey participation results in different categorical distributions when fine-tuning is not used. We discuss the implications of these findings for methodological research on coding open-ended responses, for their substantive analysis, and for practitioners processing or analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research on the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research and on their impact on data quality.
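As an illustration only, the sketch below shows what zero-shot coding of a single open-ended answer with an LLM API can look like. It is not the authors' pipeline: the model name, the category scheme, and the prompt wording are all assumptions made for the example.

```python
# Minimal zero-shot coding sketch (assumed category scheme and model; not the study's setup).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical coding scheme for "reasons for survey participation"
CATEGORIES = ["interest in the topic", "monetary incentive",
              "sense of obligation", "curiosity", "other"]

def code_response(answer_text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model to assign exactly one category to an open-ended answer."""
    prompt = (
        "You are coding open-ended survey answers about reasons for survey "
        "participation. Assign the answer to exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + ". Reply with the category name only.\n\nAnswer: " + answer_text
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic coding
    )
    return resp.choices[0].message.content.strip()

# Example with a German answer ("I participate because the topic interests me.")
print(code_response("Ich nehme teil, weil mich das Thema interessiert."))
```

A few-shot variant would prepend a handful of expert-coded example answers to the prompt; fine-tuning instead adapts the model's weights on a larger set of coded responses.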
Keywords
open-ended questions
labeling
LLM application
Standardized survey interviews are straightforward to administer but fail to address idiosyncratic data quality issues for individual respondents. Conversational interviews, meanwhile, enable personalized interactions and richer data collection, but they do not scale easily or readily allow quantitative comparison. To bridge this divide, we introduce a framework for AI-assisted conversational survey interviewing. Among other things, AI 'textbots' can dynamically probe respondents and live code open-ended responses with real-time respondent validation. To evaluate these capabilities, we conducted an experiment on a conversational AI platform, randomly assigning participants to textbots performing probing and coding on open-ended questions. Our findings show that, even without further fine-tuning, textbots perform moderately well in live coding and can improve the specificity, detail, and informativeness of open-ended responses. These gains come with slight negative impacts on user experience, as measured by self-reported evaluations and respondent attrition. Our investigation demonstrates the potential of AI-assisted conversational interviewing to enhance data quality for open-ended questions.
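To make the probe-then-live-code idea concrete, here is a hypothetical single-turn sketch. The platform, prompts, vagueness threshold, and model name are assumptions, not the study's implementation.

```python
# Hypothetical probe + live-code + validation turn (assumed prompts and thresholds).
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def interview_turn(question: str, answer: str) -> dict:
    """One open-ended turn: probe if the answer is thin, then live code it
    and ask the respondent to validate the code in real time."""
    if len(answer.split()) < 8:  # crude vagueness check (illustrative assumption)
        probe = llm(f"The respondent answered '{answer}' to '{question}'. "
                    "Write one short, neutral follow-up probe.")
        answer += " " + input(probe + " ")  # dynamic probing
    code = llm(f"Summarize this answer to '{question}' as one short label: {answer}")
    ok = input(f"We understood your answer as: '{code}'. Is that right? (y/n) ")
    return {"answer": answer, "code": code, "validated": ok.lower().startswith("y")}
```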
Web survey data is key for social and political decision-making, including official statistics. However, respondents are often recruited through online access panels or social media platforms, making it difficult to verify that answers come from humans. As a consequence, bots, i.e., programs that autonomously interact with systems, may shift web survey outcomes and thus social and political decisions. Bot and human answers often differ in word choice and lexical structure, which may allow researchers to identify bots by predicting robotic language in open narrative answers. In this study, we therefore investigate the following research question: Can we predict robotic language in open narrative answers? We conducted a web survey on equal gender partnerships, including three open narrative questions, and recruited 1,512 respondents through Facebook ads. We also programmed two LLM-driven bots that each ran through our web survey 200 times: the first bot is linked to the LLM Gemini Pro, and the second bot additionally includes a memory feature and adopts personas (e.g., age and gender). Using a transformer model (BERT), we attempt to predict robotic language in the open narrative answers.
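For readers unfamiliar with this kind of setup, the sketch below shows one common way to fine-tune a BERT-style classifier to separate human from bot-generated open answers. The checkpoint, toy examples, and hyperparameters are assumptions; they are not the authors' pipeline.

```python
# Hypothetical sketch: fine-tune a BERT classifier to flag bot-like open answers.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy examples; 0 = human respondent, 1 = LLM-driven bot
texts = ["Ich finde gleichberechtigte Partnerschaften wichtig, weil ...",
         "As an AI language model, equal partnerships are characterized by ..."]
labels = [0, 1]

# Multilingual BERT handles German; a German-specific checkpoint could be swapped in.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                          max_length=128),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bot-detector", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
)
trainer.train()  # in practice: proper train/validation split and evaluation
```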
Co-Author
Jan Karem Höhne, German Centre for Higher Education Research and Science Studies (DZHW)
Speaker
Jan Karem Höhne, German Centre for Higher Education Research and Science Studies (DZHW)
Multi-label or check-all-that-apply open-ended questions allow for multiple answers. Previous research on the design of such questions found that providing multiple small answer boxes yields more and richer answers than providing one larger answer box. Using a series of classifiers based on the BERT language model, we empirically study how this design choice affects the classification of such answers. We design a 2x2 factorial experiment: (1) analysis with a multi-label vs. a single-label classifier and (2) answers obtained from one larger answer box vs. multiple smaller answer boxes. We find that the multi-label classifier gives more accurate results than the single-label classifier (1% vs. 9% misclassification of individual labels) regardless of how the answers were obtained, so, perhaps surprisingly, analysis with a multi-label classifier is preferable in both cases. We attribute this success to the classifier's ability to take advantage of label correlations. We conclude that multi-label open-ended questions should continue to provide multiple answer boxes because of the better data quality, but that the answer boxes should be concatenated for analysis to improve classification performance.
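The contrast between the two analysis conditions can be sketched as follows: a multi-label head scores each label independently (sigmoid per label), while a single-label head forces the labels to compete (softmax). Model name, label set, and example text are illustrative assumptions, and the heads below are untrained, so the outputs only demonstrate the two output layers.

```python
# Hypothetical sketch: multi-label vs. single-label BERT classification heads.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["price", "quality", "service", "other"]  # hypothetical coding scheme
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Multi-label: one independent probability per label (trained with BCE loss).
multi = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS),
    problem_type="multi_label_classification")

# Single-label: probabilities sum to one, so co-occurring codes compete.
single = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS),
    problem_type="single_label_classification")

text = "Cheap, and the support team was very helpful."  # concatenated answer boxes
enc = tok(text, return_tensors="pt")

with torch.no_grad():
    multi_probs = torch.sigmoid(multi(**enc).logits)[0]        # threshold per label
    single_probs = torch.softmax(single(**enc).logits, -1)[0]  # forced single choice

print({l: round(p.item(), 2) for l, p in zip(LABELS, multi_probs)})
print({l: round(p.item(), 2) for l, p in zip(LABELS, single_probs)})
```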
Keywords
open-ended question
multi-label
large language model
survey methodology
statistical learning
check-all-that-apply
Increasing smartphone usage in web surveys paves the way for new answer collection methods, as smartphones are equipped with numerous sensors. More specifically, smartphones contain built-in microphones that facilitate the collection of voice responses to open questions, resembling the voice input functions of popular instant messaging apps such as WhatsApp and WeChat. Even though voice responses can trigger open narrations that yield nuanced and in-depth information from respondents, they usually require complex data processing for which no best practices exist yet. In this study, we contribute to expanding the data analysis toolkit in mobile web survey research by leveraging acoustic features and Large Language Models (LLMs). Specifically, we explore the potential of integrating LLMs with acoustic feature analysis to automate and improve the evaluation of voice response quality. Our work draws on a smartphone survey (N = 501) that includes two open comprehension probing questions with requests for voice answers. Voice responses are categorized into five quality types: (1) uninterpretable responses, (2) probe-response mismatches, (3) soft nonresponses, (4) hard nonresponses, and (5) substantive responses. Our ongoing analysis employs three approaches: (1) encoded linguistic information generated by LLMs, (2) acoustic features derived from speech characteristics, and (3) a multi-modal fusion of linguistic and acoustic features. We evaluate the effectiveness of each analytical approach individually and assess the synergistic benefits of combining them for voice response quality classification. Our findings will potentially have important implications for social science research in general and mobile web survey research in particular, offering scalable and automated tools for response quality analysis and evaluation. Importantly, the methods developed can be applied to other research domains reliant on voice input, including interview analysis, customer feedback evaluation, and conversational AI systems.
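One simple way to realize the multi-modal fusion described above is to concatenate summary acoustic features with a text embedding of the transcript and feed the result to a standard classifier. The feature choices, model names, and variable names below are assumptions for illustration, not the authors' method.

```python
# Hypothetical sketch: fuse acoustic features with a transcript embedding.
import numpy as np
import librosa
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

text_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def fused_features(wav_path: str, transcript: str) -> np.ndarray:
    """Concatenate mean MFCCs (acoustic part) with a text embedding (linguistic part)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    emb = text_encoder.encode(transcript)
    return np.concatenate([mfcc, emb])

# Illustrative training step over already-transcribed voice answers (placeholders):
# X = np.vstack([fused_features(p, t) for p, t in zip(wav_paths, transcripts)])
# y = quality_labels  # 1-5: uninterpretable ... substantive
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```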
Keywords
Data Quality
Machine Learning
Speech Processing