Measuring LLM Self-consistency: Unknown Unknowns in Knowing Machines
DOI: https://doi.org/10.6092/issn.1971-8853/19488
Keywords: Large language models, robustness analysis, prompt engineering, critical technical practice, knowledge analysis
Abstract
This essay critically examines some limitations and misconceptions of Large Language Models (LLMs) in relation to knowledge and self-knowledge, particularly in the context of social sciences and humanities (SSH) research. Using an experimental approach, we evaluate the self-consistency of LLM responses by introducing variations in prompts during knowledge retrieval tasks. Our results indicate that self-consistency tends to align with correct responses, yet errors persist, calling into question the reliability of LLMs as “knowing” agents. Drawing on epistemological frameworks, we argue that LLMs exhibit the capacity to know only when random factors, or epistemic luck, can be excluded, yet they lack self-awareness of their inconsistencies. Whereas human ignorance often involves many “known unknowns”, LLMs exhibit a form of ignorance manifested through inconsistency, where the ignorance remains a complete “unknown unknown”: LLMs always “assume” they “know”. We repurpose these insights into a pedagogical experiment, encouraging SSH scholars and students to critically engage with LLMs in educational settings. We propose a hands-on approach based on critical technical practice, aiming to balance the practical utility of LLMs with an informed understanding of their limitations. This approach equips researchers with the skills to use LLMs effectively while promoting a deeper understanding of their operational principles and epistemic constraints.
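The measurement described in the abstract can be illustrated with a minimal sketch: ask a model the same factual question phrased in several ways and score how often the answers agree. This is an illustrative reconstruction under stated assumptions, not the paper’s actual pipeline; `query_model` is a hypothetical placeholder for whatever LLM client is available, and the example answers are hard-coded.

```python
from collections import Counter

def self_consistency(answers):
    """Share of responses agreeing with the most frequent answer:
    1.0 means fully self-consistent; lower values signal contradictions."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Hypothetical prompt variations for one knowledge-retrieval task.
prompt_variants = [
    "Who wrote 'The Old Man and the Sea'?",
    "Name the author of 'The Old Man and the Sea'.",
    "'The Old Man and the Sea' was written by whom?",
]

# `query_model` is a stand-in for any LLM client, not a real library call.
# answers = [query_model(p) for p in prompt_variants]
answers = ["Ernest Hemingway", "ernest hemingway", "William Faulkner"]

print(self_consistency(answers))  # ~0.67: the model contradicts itself on one variant
```

A score near 1.0 across many paraphrases suggests the answer does not hinge on surface wording, whereas a low score exposes exactly the kind of inconsistency the essay treats as an “unknown unknown”.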
License
Copyright (c) 2024 Mathieu Jacomy, Erik Borra
This work is licensed under a Creative Commons Attribution 4.0 International License.