Integrating Large Language Models in Political Discourse Studies on Social Media: Challenges of Validating an LLMs-in-the-loop Pipeline
DOI: https://doi.org/10.6092/issn.1971-8853/19524

Keywords: Large Language Models (LLMs), Political Discourse, Social Media, Natural Language Processing (NLP)

Abstract
The integration of Large Language Models (LLMs) into research workflows has the potential to transform the study of political content on social media. This essay discusses a validation protocol addressing three key aspects of LLM-integrated research: the versatility of LLMs as general-purpose models, the granularity and nuance of LLM-uncovered narratives, and the limits of human assessment capabilities. The protocol comprises phases for fine-tuning and validating a binary political classifier, evaluating cluster coherence, and assessing the accuracy of machine-generated cluster labels. We applied this protocol to validate an LLMs-in-the-loop research pipeline designed to analyze political content on Facebook during the Italian general elections of 2018 and 2022. Our approach classifies political links, clusters them by similarity, and generates descriptive labels for each cluster. This methodology presents unique validation challenges, prompting a reevaluation of accuracy assessment strategies. By sharing our experiences, this essay aims to guide social scientists in employing LLM-based methodologies, highlighting challenges and offering recommendations to colleagues who intend to integrate these tools for political content analysis on social media.
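The three-stage pipeline described in the abstract (classify political links, cluster them by similarity, label each cluster) can be sketched in miniature. This is a hedged illustration, not the authors' implementation: it substitutes TF-IDF vectors for LLM embeddings, a trivial keyword matcher for the fine-tuned binary political classifier, and top-term extraction for LLM-generated labels; the corpus and all names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Stage 1: binary political classifier.
# A keyword matcher stands in for a fine-tuned LLM classifier.
POLITICAL_TERMS = {"election", "party", "minister", "vote", "coalition"}

def is_political(text: str) -> bool:
    return any(term in text.lower() for term in POLITICAL_TERMS)

# Toy corpus of link titles (invented examples).
links = [
    "Coalition talks stall ahead of the general election",
    "Party leaders clash in final TV debate before the vote",
    "Minister resigns amid budget dispute",
    "Ten pasta recipes for summer",
]
political = [t for t in links if is_political(t)]

# Stage 2: cluster similar items.
# TF-IDF vectors stand in for LLM text embeddings.
X = TfidfVectorizer().fit_transform(political).toarray()
n_clusters = min(2, len(political))
cluster_ids = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)

# Stage 3: describe each cluster.
# A real pipeline would prompt an LLM for a descriptive label;
# here we simply list the member titles as a crude summary.
for c in range(n_clusters):
    members = [t for t, cid in zip(political, cluster_ids) if cid == c]
    print(f"Cluster {c}: {members}")
```

In the actual pipeline each stage would be replaced by an LLM-backed component, which is precisely why the essay's validation protocol addresses classifier accuracy, cluster coherence, and label accuracy as three separate phases.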
License
Copyright (c) 2024 Giada Marino, Fabio Giglietto
This work is licensed under a Creative Commons Attribution 4.0 International License.