The Problems of LLM-generated Data in Social Science Research
DOI: https://doi.org/10.6092/issn.1971-8853/19576
Keywords: LLM, synthetic data, social science, research methods
Abstract
Beyond being used as fast and cheap annotators for otherwise complex classification tasks, LLMs have seen growing adoption as generators of synthetic data for social science and design research. Researchers have used LLM-generated data for data augmentation and prototyping, as well as for direct analysis in which LLMs act as proxies for real human subjects. LLM-based synthetic data rest on fundamentally different epistemological assumptions from those underlying earlier forms of synthetic data and are justified by a different set of considerations. In this essay, we explore the various ways in which LLMs have been used to generate research data and examine the underlying epistemological (and accompanying methodological) assumptions. We challenge some of the assumptions made about LLM-generated data and highlight the main challenges that the social sciences and humanities must address if they are to adopt LLMs as synthetic data generators.
License
Copyright (c) 2024 Luca Rossi, Katherine Harrison, Irina Shklovski
This work is licensed under a Creative Commons Attribution 4.0 International License.