Best Practices for Text Annotation with Large Language Models
DOI: https://doi.org/10.6092/issn.1971-8853/19461

Keywords: Text labeling, classification, data annotation, large language models, text-as-data

Abstract
Large Language Models (LLMs) have ushered in a new era of text annotation: their ease of use, high accuracy, and relatively low cost have driven an explosion in their adoption in recent months. This rapid growth has, however, turned LLM-based annotation into something of an academic Wild West, where the lack of established practices and standards raises concerns about the quality and validity of research. Researchers have warned that the ostensible simplicity of LLMs can be misleading, as the models are prone to bias and misunderstanding and can produce unreliable results. Recognizing the transformative potential of LLMs, this essay proposes a comprehensive set of standards and best practices for their reliable, reproducible, and ethical use. These guidelines span critical areas such as model selection, prompt engineering, structured prompting, prompt stability analysis, rigorous model validation, and the consideration of ethical and legal implications. The essay emphasizes the need for a structured, directed, and formalized approach to using LLMs, aiming to ensure the integrity and robustness of text annotation practices, and advocates for a nuanced and critical engagement with LLMs in social scientific research.
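To make these guidelines concrete, below is a minimal Python sketch combining three of the practices the abstract names: a structured prompt with a constrained output format, a rudimentary prompt stability check over repeated runs, and a simple validation step against human-coded labels. It is an illustration under stated assumptions, not the essay's own method: call_llm is a hypothetical placeholder for whichever model or API a researcher uses, and CATEGORIES stands in for a project-specific codebook.

from collections import Counter

# Example codebook; a real project would substitute its own categories.
CATEGORIES = ["positive", "negative", "neutral"]

# Structured prompt: fixed instructions, constrained output format.
PROMPT = (
    "You are a careful annotator. Classify the text into exactly one of: "
    "{categories}. Respond with the category name only.\n\n"
    "Text: {text}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in an actual model call (a local
    open-weight model or a hosted API), ideally with temperature 0."""
    raise NotImplementedError

def annotate(text: str, runs: int = 5) -> tuple[str, float]:
    """Annotate one text over repeated runs; return the majority label and
    the share of runs agreeing with it, a crude stability score."""
    labels = []
    for _ in range(runs):
        raw = call_llm(PROMPT.format(categories=", ".join(CATEGORIES), text=text))
        label = raw.strip().lower()
        # Off-codebook answers are flagged rather than silently coerced.
        labels.append(label if label in CATEGORIES else "invalid")
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / runs

def percent_agreement(llm_labels: list[str], human_labels: list[str]) -> float:
    """Validation against a human-coded gold set; for publication,
    chance-corrected metrics such as Krippendorff's alpha are preferable."""
    return sum(a == b for a, b in zip(llm_labels, human_labels)) / len(human_labels)

Fixing the prompt template and sampling temperature makes repeated runs comparable, so a low stability score flags texts, or prompts, where a single model run should not be trusted.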
License
Copyright (c) 2024 Petter Törnberg
This work is licensed under a Creative Commons Attribution 4.0 International License.