Data-driven research is currently being strengthened at the University of Eastern Finland, and social media offers enormous opportunities for it.
“Yes, this is an unapologetically populist topic of research. Yet, it is also an excellent example of the opportunities that big data from social media can offer for our research,” says Professor of English Language and Culture Mikko Laitinen, describing a recent study that examined whether people swear more on social media when interacting with close friends, or with acquaintances.
Small sample sizes often pose a challenge for humanities research, as they may limit the extent to which findings can be generalised. English even has an acronym for a related bias: WEIRD. It stands for Western, Educated, Industrialised, Rich and Democratic, highlighting how samples often overrepresent the perspectives of Western, white, educated and relatively affluent individuals.
This is why Laitinen’s research utilising social media datasets represents something new entirely.
“60 per cent of the world’s population already uses social media. For researchers, this offers enormous potential for approaching any research question in the humanities on an unprecedented scale.”
Building on this idea, Laitinen’s team are currently conducting fundamental research into how people use language on social media. What emerges from people’s linguistic behaviour will open new perspectives on many other areas as well.
Access to virtually limitless data
The profanity study by Laitinen and his team was covered by hundreds of news outlets around the world. Described as “shamelessly populist” by the authors of the study, the seemingly light topic pointed to something much more significant. So, to recap, what did the profanity analysis reveal?
“Our primary interest lies in how language is used in different networks. Swearing is just one example of language usage, but a very illustrative one. However, much more will be learnt about people’s linguistic behaviour in social networks.”
Humanity revolves around networks, and social relationships are a major determinant of our well-being.
“In linguistics, however, social networks have often been neglected simply because collecting such data has been extremely challenging.”
Social media now solves that problem. It offers virtually limitless sets of data that are easy to access, with social network information readily embedded within them.
Expertise from different fields is crucial
The profanity study drew on social media updates from nearly half a million individuals across thousands of networks. Using various computational methods, the researchers were able to assess how tightly or loosely connected the networks were. The insights and methods can be applied much more broadly.
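How tightly or loosely a network is connected can be illustrated with a simple measure called density: the share of all possible ties between members that actually exist. The article does not say which computational methods the researchers used, so the following is only a minimal sketch of one common measure. The function name `network_density` and the two toy networks are hypothetical and are not data from the study.

```python
def network_density(edges):
    """Return the density of an undirected network given as (a, b) pairs.

    Density = existing ties / possible ties, where a network of n members
    has n * (n - 1) / 2 possible ties. This is an illustrative measure,
    not necessarily the one used in the study.
    """
    nodes = set()
    unique_ties = set()
    for a, b in edges:
        if a == b:
            continue  # ignore ties from a member to themselves
        nodes.update((a, b))
        unique_ties.add(frozenset((a, b)))  # (a, b) and (b, a) count once
    n = len(nodes)
    if n < 2:
        return 0.0
    possible_ties = n * (n - 1) / 2
    return len(unique_ties) / possible_ties


# Hypothetical toy networks of four people each.
close_friends = [("ann", "bo"), ("ann", "cy"), ("ann", "di"),
                 ("bo", "cy"), ("bo", "di"), ("cy", "di")]
acquaintances = [("ann", "bo"), ("bo", "cy"), ("cy", "di")]

print(network_density(close_friends))   # 1.0: everyone knows everyone
print(network_density(acquaintances))   # 0.5: half of the possible ties exist
```

A value close to 1 indicates a tightly knit network, while a value close to 0 indicates a loose one. Comparing language use across networks with different densities is one way such a measure could be put to work.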
“Because the datasets are so vast and complex, we also need expertise from outside linguistics. In the profanity study, and in our research project more broadly, we collaborate closely with computer scientists.”
Laitinen points out that working with extensive sets of data requires a range of competences: someone must know how to collect data, someone must know how to process and enrich them, and someone must be able to mine what is essential and analyse the results.
“And most importantly, someone must know how to pose the correct research questions.”
New career opportunities
Big data research is also expanding linguists’ horizons.
“This is something I’m happy to talk about. Career paths for language students have long followed the same pattern. In today’s world, however, there is a growing need for linguists who are fascinated not only by language but also by numbers.”
Laitinen refers to the fact that interdisciplinary collaboration involving big data is most effective when everyone involved understands certain key concepts.
“Such capabilities and collaborative skills would prepare language students for the needs of modern society. I would argue that linguists and computer scientists have a great deal to offer across a wide range of research topics.”
Finding answers to broad societal questions
As both academic research and labour market expectations are changing in response to the technological transformation, the University of Eastern Finland has made a strategic decision to strengthen the conditions for data-driven research. A research infrastructure for data-intensive humanities and social sciences, known as DITLab, is currently being established at the university.
DITLab aims to bring together expertise in data-intensive research from two faculties. At the same time, it strengthens and supports education and research in the field.
“At DITLab, researchers process large datasets using efficient analyses combined with computational expertise. Its centralised data-related services allow researchers to focus on actual research, so no one needs to reinvent the wheel in terms of technical solutions,” Laitinen explains.
The service infrastructure also supports researchers in applying for competitive research funding.
“Our university’s expertise in data-driven research is growing all the time. As linguists, we will soon be able to offer even better answers to pressing societal questions, thanks to multidisciplinary collaboration.”
Analysing large linguistic datasets requires an understanding of linguistic components and structures and of language variation, as well as insight into what must be taken into account or ignored in the analysis.
“In many fields, tools for the automatic analysis of linguistic data have so far been developed mainly from the perspective of computer science, for example for use in surveys or customer feedback analysis,” Mikko Laitinen says.
One such tool is sentiment analysis, tasked with classifying the text being analysed, or parts of it, as positive, negative or neutral.
“This kind of purely mechanical categorisation of language does not necessarily work. To extract the desired information from the collected data, the analysis must also involve experts in the language in question.”
As an example, Laitinen highlights the word “dead”. In sentiment analysis, it would most likely be categorised as a negative word, meaning that texts containing the word would be interpreted as relating to something unpleasant or undesirable.
“Yet in the English-speaking world, the phrase ‘dead funny’ flips the meaning of the term, rendering it very positive.”
Although the example is very simple, it illustrates how important it is for linguists to be involved from the outset when formulating research questions for mining linguistic datasets.
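The “dead funny” case can be sketched in a few lines of code. The example below is a toy illustration, not a real sentiment analysis tool: the word scores in `LEXICON`, the `INTENSIFIERS` set and both scoring functions are hypothetical and were made up for this sketch. It contrasts a purely mechanical word-by-word scorer with one that encodes a single piece of linguistic knowledge, namely that “dead” can act as an intensifier before an evaluative word.

```python
import re

# Hypothetical toy lexicon: negative words score -1, positive words +1.
LEXICON = {"dead": -1, "awful": -1, "funny": 1, "great": 1}

# Assumption for this sketch: words that can intensify the following word
# in some varieties of English, as in "dead funny".
INTENSIFIERS = {"dead"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_score(text):
    """Purely mechanical scoring: add up the score of every known word."""
    return sum(LEXICON.get(token, 0) for token in tokenize(text))


def informed_score(text):
    """Like naive_score, but an intensifier directly before a scored word
    doubles that word's score instead of contributing its own."""
    tokens = tokenize(text)
    score = 0
    i = 0
    while i < len(tokens):
        token = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if (token in INTENSIFIERS and following in LEXICON
                and following not in INTENSIFIERS):
            score += 2 * LEXICON[following]
            i += 2  # the intensifier and the word it modifies are both used
        else:
            score += LEXICON.get(token, 0)
            i += 1
    return score


def label(score):
    """Map a numeric score to positive, negative or neutral."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


print(label(naive_score("That film was dead funny")))     # neutral
print(label(informed_score("That film was dead funny")))  # positive
print(label(informed_score("The battery is dead")))       # negative
```

In the mechanical version, the negative score of “dead” cancels out “funny” and the clearly positive sentence is labelled neutral. The rule added here covers only this one pattern, and deciding which such patterns matter in a given text type is where knowledge of the language comes in.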
“Every field has its own terminology. The study of massive datasets could well be extended to almost any discipline, provided experts in the field are included in the work. If knowledge of the text type is omitted, the study can easily get derailed.”
The story refers to the COMET project led by Mikko Laitinen and funded by the Research Council of Finland. The four-year project investigates language change in digital social networks. The profanity study was part of the project.
DITLab is a digital service infrastructure that supports research and education in the Philosophical Faculty and in the Faculty of Social Sciences and Business Studies. It provides a framework for data-driven methodological development in the social sciences and humanities (SSH) at the University of Eastern Finland. DITLab focuses particularly on textual data, identified as a key data type by the university. In 2026–2027, DITLab will be led by University Lecturer Kimmo Elo and Professor Mikko Laitinen. The project also involves Professor Kati Launis and Staff Scientist Tomi Oinas.