Best Practices for Text Annotation with Large Language Models

Authors

  • Petter Törnberg, Institute for Logic, Language and Computation, University of Amsterdam, https://orcid.org/0000-0001-8722-8646

DOI:

https://doi.org/10.6092/issn.1971-8853/19461

Keywords:

Text labeling, classification, data annotation, large language models, text-as-data

Abstract

Large Language Models (LLMs) have ushered in a new era of text annotation: their ease of use, high accuracy, and relatively low cost have driven an explosion in their use in recent months. However, this rapid growth has turned LLM-based annotation into something of an academic Wild West: the lack of established practices and standards has raised concerns about the quality and validity of research. Researchers have warned that the ostensible simplicity of LLMs can be misleading, as they are prone to bias, misunderstandings, and unreliable results. Recognizing the transformative potential of LLMs, this essay proposes a comprehensive set of standards and best practices for their reliable, reproducible, and ethical use. These guidelines span critical areas such as model selection, prompt engineering, structured prompting, prompt stability analysis, rigorous model validation, and the consideration of ethical and legal implications. The essay emphasizes the need for a structured, directed, and formalized approach to using LLMs, aiming to ensure the integrity and robustness of text annotation practices, and advocates for a nuanced and critical engagement with LLMs in social scientific research.
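As a concrete illustration of the validation practice the abstract recommends (this is a minimal sketch, not the essay's own code), LLM annotations can be compared against a small human-coded gold standard using raw accuracy and a chance-corrected agreement measure such as Cohen's kappa. All labels and values below are invented for illustration:

```python
# Hypothetical sketch: validating LLM labels against human gold labels.
from collections import Counter

def accuracy(gold, pred):
    """Share of items where the LLM label matches the human label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def cohens_kappa(gold, pred):
    """Chance-corrected agreement between two label sets."""
    n = len(gold)
    po = accuracy(gold, pred)  # observed agreement
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Expected agreement if both annotators labeled at random
    # with their observed marginal frequencies.
    pe = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / n**2
    return (po - pe) / (1 - pe)

# Invented example labels (e.g., "pop" = populist, "neu" = neutral).
human = ["pop", "neu", "pop", "neu", "pop", "neu"]
llm   = ["pop", "neu", "pop", "pop", "pop", "neu"]

print(f"accuracy: {accuracy(human, llm):.3f}")      # 0.833
print(f"kappa:    {cohens_kappa(human, llm):.3f}")  # 0.667
```

The gap between accuracy and kappa is the point of the exercise: chance-corrected agreement guards against labels that look accurate only because one category dominates the data.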

References

Alizadeh, M., Kubli, M., Samei, Z., Dehghani, S., Zahedivafa, M., Bermeo, J.D., Korobeynikova, M., & Gilardi, F. (2023). Open-Source Large Language Models for Text-Annotation: A Practical Guide for Model Setting and Fine-Tuning. arXiv, 2307.02179. https://doi.org/10.48550/arXiv.2307.02179

Alpaydin, E. (2021). Machine Learning. Cambridge, MA: MIT Press.

Barrie, C., Palaiologou, E., & Törnberg, P. (2024). Prompt Stability Scoring for Text Annotation with Large Language Models. arXiv, 2407.02039. https://doi.org/10.48550/arXiv.2407.02039

Bommasani, R., Liang, P., & Lee, T. (2023). Holistic Evaluation of Language Models. Annals of the New York Academy of Sciences, 1525(1), 140–146. https://doi.org/10.1111/nyas.15007

BSA. (2017). Statement of Ethical Practice. British Sociological Association. https://www.britsoc.co.uk/media/24310/bsa_statement_of_ethical_practice.pdf

Byrne, D.S. (2002). Interpreting Quantitative Data. London: Sage. https://doi.org/10.4135/9781849209311

Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P.S., Yang, Q., & Xie, X. (2023b). A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology, 15(3), 39, 1–45. https://doi.org/10.1145/3641289

Chen, L., Zaharia, M., & Zou, J. (2023). How Is ChatGPT’s Behavior Changing over Time?. arXiv, 2307.09009. https://doi.org/10.48550/arXiv.2307.09009

Chia, Y.K., Hong, P., Bing, L., & Poria, S. (2024). InstructEval: Towards Holistic Evaluation of Instruction-Tuned Large Language Models. In A.V. Miceli-Barone, F. Barez, S. Cohen, E. Voita, U. Germann, & M. Lukasik (Eds.), Proceedings of the First edition of the Workshop on the Scaling Behavior of Large Language Models (pp. 35–64). Association for Computational Linguistics. https://doi.org/10.48550/arXiv.2306.04757

Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tai, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., … Wei, J. (2024). Scaling Instruction-Finetuned Language Models. Journal of Machine Learning Research, 25(70), 1–53. http://jmlr.org/papers/v25/23-0870.html

Fernandes, P., Madaan, A., Liu, E., Farinhas, A., Martins, P.H., Bertsch, A., de Souza, J.G.C., Zhou, S., Wu, T., Neubig, G., & Martins, A.F.T. (2023). Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation. Transactions of the Association for Computational Linguistics, 11, 1643–1668. https://doi.org/10.1162/tacl_a_00626

Franzke, A.S., Bechmann, A., Zimmer, M., & Ess, C. (2020). Internet Research: Ethical Guidelines 3.0. Association of Internet Researchers. https://aoir.org/reports/ethics3.pdf

Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT Outperforms Crowd Workers for Text-annotation Tasks. Proceedings of the National Academy of Sciences of the United States of America, 120(30), e2305016120. https://doi.org/10.1073/pnas.2305016120

Glaser, B.G., & Strauss, A.L. (2009). The Discovery of Grounded Theory: Strategies for Qualitative Research (4th ed.). New Brunswick, NJ: Aldine Transaction. (Original work published 1967)

Grimmer, J., Roberts, M.E., & Stewart, B.M. (2021). Machine Learning for Social Science: An Agnostic Approach. Annual Review of Political Science, 24(1), 395–419. https://doi.org/10.1146/annurev-polisci-053119-015921

Heseltine, M., & Clemm Von Hohenberg, B. (2024). Large Language Models as a Substitute for Human Experts in Annotating Political Text. Research & Politics, 11(1), 20531680241236239. https://doi.org/10.1177/20531680241236239

HuggingFace. (2024). Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Jiang, Z., Xu, F.F., Araki, J., & Neubig, G. (2020). How Can We Know What Language Models Know? Transactions of the Association for Computational Linguistics, 8, 423–438. https://doi.org/10.1162/tacl_a_00324

Kaddour, J., Harris, J., Mozes, M., Bradley, H., Raileanu, R., & McHardy, R. (2023). Challenges and Applications of Large Language Models. arXiv, 2307.10169. https://doi.org/10.48550/arXiv.2307.10169

Karjus, A. (2023). Machine-Assisted Mixed Methods: Augmenting Humanities and Social Sciences with Artificial Intelligence. arXiv, 2309.14379. https://doi.org/10.48550/arXiv.2309.14379

Kim, H., Mitra, K., Chen, R.L., Rahman, S., & Zhang, D. (2024). MEGAnno+: A Human-LLM Collaborative Annotation System. In N. Aletras, O. De Clercq (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (pp. 168–176). Association for Computational Linguistics. https://doi.org/10.48550/arXiv.2402.18050

Krippendorff, K. (2004). Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research, 30(3), 411–433. https://doi.org/10.1093/hcr/30.3.411

Kristensen-McLachlan, R.D., Canavan, M., Kardos, M., Jacobsen, M., & Aarøe, L. (2023). Chatbots Are Not Reliable Text Annotators. arXiv, 2311.05769. https://doi.org/10.48550/arXiv.2311.05769

Kuhn, T.S. (1962). The Structure of Scientific Revolutions. Chicago, IL: University of Chicago Press.

Latour, B., & Woolgar, S. (2013). Laboratory Life: The Construction of Scientific Facts. Princeton, NJ: Princeton University Press. https://doi.org/10.2307/j.ctt32bbxc

Liesenfeld, A., Lopez, A., & Dingemanse, M. (2023). Opening Up ChatGPT: Tracking Openness, Transparency, and Accountability in Instruction-Tuned Text Generators. In M. Lee, C. Munteanu, M. Porcheron, J. Trippas, S.T. Völkel (Eds.), Proceedings of the 5th International Conference on Conversational User Interfaces (pp. 1–6). New York, NY: ACM Press. https://doi.org/10.1145/3571884.3604316

Mudde, C., & Kaltwasser, C.R. (2017). Populism: A Very Short Introduction. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780198803560.013.1

Neuendorf, K.A. (2017). The Content Analysis Guidebook. London: Sage. https://doi.org/10.4135/9781071802878

Ollion, É., Shen, R., Macanovic, A., & Chatelain, A. (2024). The Dangers of Using Proprietary LLMs for Research. Nature Machine Intelligence, 6, 4–5. https://doi.org/10.1038/s42256-023-00783-6

OpenAI. (2024). Prompt Engineering. https://platform.openai.com/docs/guides/prompt-engineering/strategy-write-clear-instructions

Pangakis, N., Wolken, S., & Fasching, N. (2023). Automated Annotation with Generative AI Requires Validation. arXiv, 2306.00176. https://doi.org/10.48550/arXiv.2306.00176

Pryzant, R., Iter, D., Li, J., Lee, Y.T., Zhu, C., & Zeng, M. (2023). Automatic Prompt Optimization with “Gradient Descent” and Beam Search. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 7957–7968). Association for Computational Linguistics. https://doi.org/10.48550/arXiv.2305.03495

Rathje, S., Mirea, D.M., Sucholutsky, I., Marjieh, R., Robertson, C.E., & Van Bavel, J.J. (2024). GPT Is an Effective Tool for Multilingual Psychological Text Analysis. Proceedings of the National Academy of Sciences of the United States of America, 121(34), e2308950121. https://doi.org/10.1073/pnas.2308950121

Rogers, R. (2013). Digital Methods. Cambridge, MA: The MIT Press.

Sanh, V., Webson, A., Raffel, C., Bach, S., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Raja, A., & Dey, M. (2022). Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv, 2110.08207. https://doi.org/10.48550/arXiv.2110.08207

Saravia, E. (2022). Prompt Engineering Guide. GitHub. https://github.com/dair-ai/Prompt-Engineering-Guide

Sartor, G., & Lagioia, F. (2020). The Impact of the General Data Protection Regulation (GDPR) on Artificial Intelligence. Study. Panel for the Future of Science and Technology. EPRS, European Parliamentary Research Service. https://doi.org/10.2861/293

Sharma, S. (2019). Data Privacy and GDPR Handbook. Hoboken, NJ: Wiley. https://doi.org/10.1002/9781119594307

Spirling, A. (2023). Why Open-source Generative AI Models Are an Ethical Way Forward for Science. Nature, 616, 413. https://doi.org/10.1038/d41586-023-01295-4

Tan, Z., Li, D., Wang, S., Beigi, A., Jiang, B., Bhattacharjee, A., Karami, M., Li, J., Cheng, L., & Liu, H. (2024). Large Language Models for Data Annotation: A Survey. arXiv, 2402.13446. https://doi.org/10.48550/arXiv.2402.13446

Törnberg, P. (2024a). How to Use Large-Language Models for Text Analysis. London: Sage. https://doi.org/10.4135/9781529683707

Törnberg, P. (2024b). Large Language Models Outperform Expert Coders and Supervised Classifiers at Annotating Political Social Media Messages. Social Science Computer Review. https://doi.org/10.1177/08944393241286471

Törnberg, P., & Uitermark, J. (2021). For a Heterodox Computational Social Science. Big Data & Society, 8(2). https://doi.org/10.1177/20539517211047725

Weber, M., & Reichardt, M. (2023). Evaluation Is All You Need. Prompting Generative Large Language Models for Annotation Tasks in the Social Sciences. A Primer Using Open Models. arXiv, 2401.00284. https://doi.org/10.48550/arXiv.2401.00284

Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., & Le, Q.V. (2022). Finetuned Language Models Are Zero-Shot Learners. arXiv, 2109.01652. https://doi.org/10.48550/arXiv.2109.01652

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2024). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), NIPS’22: Proceedings of the 36th International Conference on Neural Information Processing Systems (pp. 24824–24837). New Orleans, LA: Curran.

White, J., Hays, S., Fu, Q., Spencer-Smith, J., & Schmidt, D.C. (2024). ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design. In A. Nguyen-Duc, P. Abrahamsson, & F. Khomh (Eds.), Generative AI for Effective Software Development. Cham: Springer. https://doi.org/10.1007/978-3-031-55642-5_4

Chun Tie, Y., Birks, M., & Francis, K. (2019). Grounded Theory Research: A Design Framework for Novice Researchers. Sage Open Medicine, 7, 2050312118822927. https://doi.org/10.1177/2050312118822927

Yu, H., Yang, Z., Pelrine, K., Godbout, J.F., & Rabbany, R. (2023). Open, Closed, or Small Language Models for Text Classification?. arXiv, 2308.10092. https://doi.org/10.48550/arXiv.2308.10092

Zhao, Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate Before Use: Improving Few-Shot Performance of Language Models. In M. Meila & T. Zhang (Eds.), Proceedings of the International Conference on Machine Learning (pp. 12697–12706). https://doi.org/10.48550/arXiv.2102.09690

Ziegler, D.M., Stiennon, N., Wu, J., Brown, T.B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-Tuning Language Models from Human Preferences. arXiv, 1909.08593. https://doi.org/10.48550/arXiv.1909.08593

Zimmer, M. (2020). “But the Data Is Already Public”: On the Ethics of Research in Facebook. In K.W. Miller & M. Taddeo (Eds.), The Ethics of Information Technologies (pp. 229–241). London: Routledge. https://doi.org/10.4324/9781003075011-17

Published

2024-10-30

How to Cite

Törnberg, P. (2024). Best Practices for Text Annotation with Large Language Models. Sociologica, 18(2), 67–85. https://doi.org/10.6092/issn.1971-8853/19461

Section

Symposium