Integrating Large Language Models in Political Discourse Studies on Social Media: Challenges of Validating an LLMs-in-the-loop Pipeline

Authors

  • Giada Marino Department of Communication Sciences, Humanities and International Studies, University of Urbino Carlo Bo https://orcid.org/0000-0002-9087-2608
  • Fabio Giglietto Department of Communication Sciences, Humanities and International Studies, University of Urbino Carlo Bo https://orcid.org/0000-0001-8019-1035

DOI:

https://doi.org/10.6092/issn.1971-8853/19524

Keywords:

Large Language Models (LLMs), Political Discourse, Social Media, Natural Language Processing (NLP)

Abstract

The integration of Large Language Models (LLMs) into research workflows has the potential to transform the study of political content on social media. This essay discusses a validation protocol addressing three key aspects of LLM-integrated research: the versatility of LLMs as general-purpose models, the granularity and nuance of LLM-uncovered narratives, and the limitations of human assessment capabilities. The protocol includes phases for fine-tuning and validating a binary political classifier, evaluating cluster coherence, and assessing the accuracy of machine-generated cluster labels. We applied this protocol to validate an LLMs-in-the-loop research pipeline designed to analyze political content on Facebook during the Italian general elections of 2018 and 2022. Our approach classifies political links, clusters them by similarity, and generates descriptive labels for the clusters. This methodology presents unique validation challenges, prompting a reevaluation of accuracy assessment strategies. By sharing our experiences, this essay aims to guide social scientists in employing LLM-based methodologies, highlighting challenges and offering recommendations for colleagues who intend to integrate these tools into political content analysis on social media.

References

Abdelrazek, A., Eid, Y., Gawish, E., Medhat, W., & Hassan, A. (2023). Topic Modeling Algorithms and Applications: A Survey. Information Systems, 112, 1–17. https://doi.org/10.1016/j.is.2022.102131

Bail, C.A. (2024). Can Generative AI Improve Social Science? Proceedings of the National Academy of Sciences of the United States of America, 121(21), 1–10. https://doi.org/10.1073/pnas.2314021121

Balloccu, S., Schmidtová, P., Lango, M., & Dušek, O. (2024). Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs. arXiv. https://doi.org/10.48550/arXiv.2402.03927

Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922

Boydstun, A.E. (2013). Making the News: Politics, the Media & Agenda Setting. Chicago, IL: University of Chicago Press.

Bradshaw, S., Elswah, M., Haque, M., & Quelle, D. (2024). Strategic Storytelling: Russian State-Backed Media Coverage of the Ukraine War. International Journal of Public Opinion Research, 36(3), edae028. https://doi.org/10.1093/ijpor/edae028

Chen, L., Zaharia, M., & Zou, J. (2023). How is ChatGPT’s Behavior Changing over Time? arXiv. https://doi.org/10.48550/arXiv.2307.09009

Chiang, C.-H., & Lee, H.-Y. (2023). Can Large Language Models Be an Alternative to Human Evaluations? arXiv. https://doi.org/10.48550/arXiv.2305.01937

Clark, E., August, T., Serrano, S., Haduong, N., Gururangan, S., & Smith, N. A. (2021). All That’s “Human” Is Not Gold: Evaluating Human Evaluation of Generated Text. arXiv. https://doi.org/10.48550/arXiv.2107.00061

Clinciu, M., Eshghi, A., & Hastie, H. (2021). A Study of Automatic Metrics for the Evaluation of Natural Language Explanations. arXiv. https://doi.org/10.48550/arXiv.2103.08545

Dyjack, N., Baker, D.N., Braverman, V., Langmead, B., & Hicks, S.C. (2023). A Scalable and Unbiased Discordance Metric with H+. Biostatistics, 25(1), 188–202. https://doi.org/10.1093/biostatistics/kxac035

Eagleton, T. (1979). Ideology, Fiction, Narrative. Social Text, 2, 62–80. https://doi.org/10.2307/466398

European Digital Media Observatory (EDMO). (2024). Disinformation Narratives during the 2023 Elections in Europe. https://edmo.eu/publications/second-edition-march-2024-disinformation-narratives-during-the-2023-elections-in-europe/

Gagolewski, M. (2021). genieclust: Fast and Robust Hierarchical Clustering. SoftwareX, 15, 100722. https://doi.org/10.1016/j.softx.2021.100722

Genette, G. (1980). Narrative Discourse: An Essay in Method. (J.E. Lewin, Trans.). Ithaca, NY: Cornell University Press. (Original work published 1972)

Giglietto, F. (2024). Evaluating Embedding Models for Clustering Italian Political News: A Comparative Study of Text-Embedding-3-Large and UmBERTo. OSF Preprints. https://doi.org/10.31219/osf.io/2j9ed

Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT Outperforms Crowd Workers for Text-annotation Tasks. Proceedings of the National Academy of Sciences of the United States of America, 120(30), e2305016120. https://doi.org/10.1073/pnas.2305016120

Gillick, D., & Liu, Y. (2010). Non-expert Evaluation of Summarization Systems is Risky. In C. Callison-Burch, & M. Dredze (Eds.), Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk (pp. 148–151). Association for Computational Linguistics. https://aclanthology.org/W10-0722

Gillings, M., & Hardie, A. (2022). The Interpretation of Topic Models for Scholarly Analysis: An Evaluation and Critique of Current Practice. Digital Scholarship in the Humanities, 38(2), 530–543. https://doi.org/10.1093/llc/fqac075

Grimmer, J., & King, G. (2011). General Purpose Computer-assisted Clustering and Conceptualization. Proceedings of the National Academy of Sciences of the United States of America, 108(7), 2643–2650. https://doi.org/10.1073/pnas.1018067108

Grossmann, I., Feinberg, M., Parker, D.C., Christakis, N.A., Tetlock, P.E., & Cunningham, W.A. (2023). AI and the Transformation of Social Science Research. Science, 380(6650), 1108–1109. https://doi.org/10.1126/science.adi1778

Groth, S. (2019). Political Narratives / Narrations of the Political: An Introduction. Narrative Culture, 6(1), 1–18. https://doi.org/10.13110/narrcult.6.1.0001

Gupta, S., Bolden, S., & Kachhadia, J. (2020). PoliBERT: Classifying Political Social Media Messages with BERT (Working paper, SBP-BRiMS 2020 conference). https://news.illuminating.ischool.syr.edu/2020/11/24/polibert-classifying-political-social-media

Guzmán, F., Abdelali, A., Temnikova, I., Sajjad, H., & Vogel, S. (2015). How Do Humans Evaluate Machine Translation. In O. Bojar, R. Chatterjee, C. Federmann, B. Haddow, C. Hokamp, M. Huck, V. Logacheva, & P. Pecina (Eds.), Proceedings of the Tenth Workshop on Statistical Machine Translation (pp. 457–466). Association for Computational Linguistics. https://aclanthology.org/W15-3059/

Huang, F., Kwak, H., & An, J. (2023). Is ChatGPT Better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech. In Y. Ding, J. Tang, J. Sequeda, L. Aroyo, C. Castillo, & G.-J. Houben (Eds.), Companion Proceedings of the ACM Web Conference 2023 (pp. 294–297). ACM Digital Library. https://doi.org/10.1145/3543873.3587368

Illuminating. (2020). 2020 Presidential Campaign Facebook and Instagram Ads. https://illuminating.ischool.syr.edu/campaign_2020/

Iskender, N., Polzehl, T., & Möller, S. (2020). Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation. In S. Eger, Y. Gao, M. Peyrard, W. Zhao, & E. Hovy (Eds.), Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems (pp. 164–175). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.eval4nlp-1.16

Jahan, I., Laskar, M.T.R., Peng, C., & Huang, J. (2023). Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers. arXiv. https://doi.org/10.48550/arXiv.2306.04504

Karpinska, M., Akoury, N., & Iyyer, M. (2021). The Perils of Using Mechanical Turk to Evaluate Open-ended Text Generation. In M.F. Moens, X. Huang, L. Specia, & S. Wen-tau Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1265–1285). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.97

Kasthuriarachchy, B., Chetty, M., Shatte, A., & Walls, D. (2021). Cost Effective Annotation Framework Using Zero-shot Text Classification. Proceedings of the 2021 International Joint Conference on Neural Networks (pp. 1–8). IEEE. https://doi.org/10.1109/IJCNN52387.2021.9534335

Kotseva, B., Vianini, I., Nikolaidis, N., Faggiani, N., Potapova, K., Gasparro, C., Steiner, Y., Scornavacche, J., Jacquet, G., Dragu, V., Della Rocca, L., Bucci, S., Podavini, A., Verile, M., Macmillan, C., & Linge, J. (2023). Trend Analysis of COVID-19 Mis/Disinformation Narratives: A 3-year Study. PLOS ONE, 18(11), 1–26. https://doi.org/10.1371/journal.pone.0291423

Kuzman, T., Mozetič, I., & Ljubešić, N. (2023). ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification. arXiv. https://doi.org/10.48550/arXiv.2303.03953

Matthes, J., & Kohring, M. (2008). The Content Analysis of Media Frames: Toward Improving Reliability and Validity. The Journal of Communication, 58(2), 258–279. https://doi.org/10.1111/j.1460-2466.2008.00384.x

McCombs, M., Lopez-Escobar, E., & Llamas, J.P. (2000). Setting the Agenda of Attributes in the 1996 Spanish General Election. The Journal of Communication, 50(2), 77–92. https://doi.org/10.1111/j.1460-2466.2000.tb02842.x

Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2022). MTEB: Massive Text Embedding Benchmark. arXiv. https://doi.org/10.48550/arXiv.2210.07316

Mu, Y., Dong, C., Bontcheva, K., & Song, X. (2024). Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling. arXiv. https://doi.org/10.48550/arXiv.2403.16248

OpenAI. (2024). Prompt Engineering. https://platform.openai.com/docs/guides/prompt-engineering/prompt-engineering

Pangakis, N., Wolken, S., & Fasching, N. (2023). Automated Annotation with Generative AI Requires Validation. arXiv. https://doi.org/10.48550/arXiv.2306.00176

Pianzola, F. (2018). Looking at Narrative as a Complex System: The Proteus Principle. In R. Walsh & S. Stepney (Eds.), Narrating Complexity (pp. 101–122). New York, NY: Springer International Publishing.

Piper, A., So, R.J., & Bamman, D. (2021). Narrative Theory for Computational Narrative Understanding. In M.-F. Moens, X. Huang, L. Specia, & S. Wen-tau Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 298–311). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.26

Popkova, A. (2023). Strategic Narratives of Russiagate on Russian Mainstream and Alternative Television. In O. Boyd-Barrett & S. Marmura (Eds.), Russiagate Revisited: The Aftermath of a Hoax (pp. 203–223). New York, NY: Springer International Publishing.

Rask, M., & Shimizu, K. (2024). Beyond the Average: Exploring the Potential and Challenges of Large Language Models in Social Science Research. Proceedings of the 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (pp. 1–5). IEEE. https://doi.org/10.1109/ACDSA59508.2024.10467341

Reese, S.D. (2007). The Framing Project: A Bridging Model for Media Research Revisited. The Journal of Communication, 57(1), 148–154. https://doi.org/10.1111/j.1460-2466.2006.00334.x

Scheufele, D.A. (2000). Agenda-Setting, Priming, and Framing Revisited: Another Look at Cognitive Effects of Political Communication. Mass Communication and Society, 3(2–3), 297–316. https://doi.org/10.1207/S15327825MCS0323_07

Schmitt, O. (2018). When Are Strategic Narratives Effective? The Shaping of Political Discourse through the Interaction between Political Myths and Strategic Narratives. Contemporary Security Policy, 39(4), 487–511. https://doi.org/10.1080/13523260.2018.1448925

Schuff, H., Vanderlyn, L., Adel, H., & Vu, N.T. (2023). How to Do Human Evaluation: A Brief Introduction to User Studies in NLP. Natural Language Engineering, 29(5), 1199–1222. https://doi.org/10.1017/S1351324922000535

Silwal, S., Ahmadian, S., Nystrom, A., McCallum, A., Ramachandran, D., & Kazemi, S.M. (2023). KwikBucks: Correlation Clustering with Cheap-weak and Expensive-strong Signals. In N.S. Moosavi, I. Gurevych, Y. Hou, G. Kim, Y.J. Kim, T. Schuster, & A. Agrawal (Eds.), Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (pp. 1–31). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.sustainlp-1.1

Törnberg, P. (2023). ChatGPT-4 Outperforms Experts and Crowd Workers in Annotating Political Twitter Messages with Zero-Shot Learning. arXiv. https://doi.org/10.48550/arXiv.2304.06588

Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., & Wei, F. (2023). Improving Text Embeddings with Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2401.00368

Watters, C., & Lemanski, M.K. (2023). Universal Skepticism of ChatGPT: A Review of Early Literature on Chat Generative Pre-trained Transformer. Frontiers in Big Data, 6. https://doi.org/10.3389/fdata.2023.1224976

Wlezien, C. (2005). On the Salience of Political Issues: The Problem with “Most Important Problem.” Electoral Studies, 24(4), 555–579. https://doi.org/10.1016/j.electstud.2005.01.009

Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., & Hashimoto, T. (2024). Benchmarking Large Language Models for News Summarization. Transactions of the Association for Computational Linguistics, 12, 39–57. https://doi.org/10.1162/tacl_a_00632

Published

2024-10-30

How to Cite

Marino, G., & Giglietto, F. (2024). Integrating Large Language Models in Political Discourse Studies on Social Media: Challenges of Validating an LLMs-in-the-loop Pipeline. Sociologica, 18(2), 87–107. https://doi.org/10.6092/issn.1971-8853/19524

Issue

Section

Symposium