The Problems of LLM-generated Data in Social Science Research
DOI: https://doi.org/10.6092/issn.1971-8853/19576
Keywords: LLM, synthetic data, social science, research methods
Abstract
Beyond being used as fast and cheap annotators for otherwise complex classification tasks, LLMs have seen growing adoption as generators of synthetic data for social science and design research. Researchers have used LLM-generated data for data augmentation and prototyping, as well as for direct analysis in which LLMs act as proxies for real human subjects. LLM-based synthetic data rest on fundamentally different epistemological assumptions from those underlying earlier forms of synthetic data and are justified by a different set of considerations. In this essay, we explore the various ways in which LLMs have been used to generate research data and examine the underlying epistemological (and accompanying methodological) assumptions. We challenge some of the assumptions made about LLM-generated data and highlight the main challenges that the social sciences and humanities must address if they are to adopt LLMs as synthetic data generators.
License
Copyright (c) 2024 Luca Rossi, Katherine Harrison, Irina Shklovski
This work is licensed under a Creative Commons Attribution 4.0 International License.