Preslav Nakov has established himself as one of the leading experts on the use of AI against propaganda and disinformation. He has been highly influential in natural language processing and text mining, publishing hundreds of peer-reviewed research papers. He spoke to us about his work on the ongoing problem of online misinformation.
1. What do you think about the ongoing infodemic? With your extensive work on fake news, do you think there will be a point where we see a decrease in such content?
Indeed, the global COVID-19 pandemic has also brought us the first global social media infodemic. At the beginning of the pandemic, the World Health Organization already realized the importance of the problem and ranked fighting the infodemic second on its list of top five priorities. The infodemic is an interesting blend of political and medical misinformation and disinformation. Now, a year and a half later, both the pandemic and the infodemic persist. Yet, I am an optimist. What initially fueled the infodemic was that so little was known about COVID-19, leaving a large void to be filled. Later on, with the emergence of the vaccines, the infodemic got a new boost from the re-emerging anti-vaxxer movement, which has grown much more powerful than before. However, the severity of the pandemic has now started to decrease, to a large extent thanks to the vaccines: we see full stadiums at EURO 2020 with no masks and little social distancing. I expect the infodemic will soon follow a similar downward trajectory. It will not die out completely, but it will decline.
2. What drove you to pursue research in the fake news and misinformation domain?
As part of a collaboration between the Qatar Computing Research Institute (QCRI) and MIT, I was working on question answering in community forums, where the goal was to detect which answers in the forum were good, in the sense of trying to answer the question directly, as opposed to giving indirectly related information, discussing other topics, or talking to other users. We developed a strong system, which we deployed in production in a local forum, Qatar Living, where it remains operational to this day, but we soon realized that not all good answers were factually true. This got me interested in the factuality of user-generated content. Soon after came the 2016 US Presidential election, and fake news and factuality became a global concern. Thus, I started the Tanbih mega-project, which we are developing at QCRI in collaboration with MIT and other partners. The aim of the project is to help fight fake news, propaganda, and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking. At Checkstep, we’re currently building AI-first tools to tackle hate speech, spam, and misinformation.
3. What do you think about the upcoming regulations — the EU’s Digital Services Act (DSA) and the UK’s Online Safety Bill (OSB)?
These upcoming EU and U.K. regulations (and related proposals being discussed in the USA and other countries) have the potential to be as transformative as GDPR was. Platforms would suddenly become responsible for their content and would have a legal obligation to enforce their own terms of service as well as to comply with legislation on certain kinds of malicious content. They would also have an obligation to be able to explain their moderation decisions to their users as well as to external regulatory authorities. I see this as a hugely positive development.
Legislators should be careful though to keep a good balance between trying to limit the spread of malicious content and protecting free speech. Moreover, we all should be cautious and remember that fake news and hate speech are complex problems and that legislation is only part of the overall solution. We would still need human moderators, research and development of tools that can help automate the process of content moderation at scale, fact-checking initiatives, high-quality journalism, teaching media literacy, and cooperation with platforms where user-generated content spreads.
4. How should platforms better prepare themselves?
Big Tech companies are already taking this seriously and have been developing in-house solutions for years. However, complying with the new legislation will be a challenge for small and mid-size companies (though it is also true that it affects them less), as well as for large ones for which user-generated content is important but is not their core business. For example, a small fitness club that also hosts a forum on its website could not afford to hire and train its own content moderators. Such companies have two main options: (a) shut down their fora to avoid any issues, or (b) outsource content moderation, partially or completely. When it comes to content moderation at scale, there is a clear need for automation, which can take care of a large number of easy cases, but the final decision in hard cases should be made by humans, not machines.
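To make that division of labor concrete, here is a minimal triage sketch in Python. It assumes a single hypothetical classifier that returns the probability that a post violates policy; the thresholds, the toy model, and the Decision type are illustrative only, not any platform's actual pipeline.

```python
# A minimal triage sketch (not any platform's actual pipeline): automation
# handles clear-cut cases, and everything uncertain goes to a human moderator.
# The classifier and the thresholds below are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str    # "remove", "approve", or "human_review"
    score: float   # model's estimated probability that the post violates policy

def triage(post: str,
           violation_prob: Callable[[str], float],
           remove_above: float = 0.95,
           approve_below: float = 0.05) -> Decision:
    """Route a post based on model confidence; defer hard cases to humans."""
    p = violation_prob(post)
    if p >= remove_above:
        return Decision("remove", p)
    if p <= approve_below:
        return Decision("approve", p)
    return Decision("human_review", p)

# Toy stand-in for a real model, for illustration only.
def toy_model(post: str) -> float:
    return 0.99 if "buy followers now" in post.lower() else 0.5

print(triage("Buy followers now!!!", toy_model))    # clear case: auto-removed
print(triage("Great match last night", toy_model))  # uncertain: sent to a human
```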
5. Any recent talks or research you’d like to tell us about? Feel free to mention future talks as well.
Fighting the infodemic is typically thought of in terms of factuality, but it is a much broader problem. In February 2020, MIT Technology Review published an article pointing out characteristics of the infodemic that go beyond factuality, such as fueling panic and racism.
Indeed, if the 2016 U.S. Presidential election gave us the term “fake news”, the 2020 one got the USA and the world concerned about a range of other types of malicious content online. The infodemic has demonstrated that this is all part of the same problem, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading racism, xenophobia, and panic. Addressing these issues requires solving a number of challenging problems, such as identifying messages that make claims, determining whether those claims are check-worthy and factual, and assessing their potential to do harm as well as the nature of that harm, to mention just a few. Thus, as part of Tanbih, we have been working on a system that analyzes user-generated content in Arabic, Bulgarian, English, and Dutch, covers all these aspects, and combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society as a whole. A preliminary version of this work appeared at ICWSM-2021 last week.
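As a rough illustration of what such a multi-question analysis can look like in code (a sketch only, not the Tanbih system), the snippet below frames each question as an independent, hypothetical classifier over the same post and routes check-worthy claims onward; every name and threshold here is a placeholder.

```python
# Sketch of a multi-question content analysis: one hypothetical classifier per
# question, scores collected per post. Not the Tanbih system; placeholders only.
from typing import Callable, Dict

QUESTIONS = [
    "contains_verifiable_claim",
    "worth_fact_checking",
    "likely_to_cause_harm",
]

def analyze(post: str, classifiers: Dict[str, Callable[[str], float]]) -> dict:
    """Score one post against every question, then decide on routing."""
    scores = {q: classifiers[q](post) for q in QUESTIONS}
    # Only verifiable, check-worthy claims get routed to human fact-checkers.
    scores["route_to_fact_checker"] = (
        scores["contains_verifiable_claim"] > 0.5
        and scores["worth_fact_checking"] > 0.5
    )
    return scores

# Toy usage with keyword-based stand-ins for trained models.
toy = {
    "contains_verifiable_claim": lambda t: 0.9 if "cures" in t else 0.1,
    "worth_fact_checking": lambda t: 0.8 if "COVID" in t else 0.2,
    "likely_to_cause_harm": lambda t: 0.7 if "cures" in t else 0.1,
}
print(analyze("Garlic cures COVID-19, share before it gets deleted!", toy))
```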
We have also been looking into supporting fact-checkers and journalists by developing tools for predicting which claims are check-worthy and which ones have been previously fact-checked. We have an upcoming paper at IJCAI-2021 that surveys the AI technology that can help fact-checkers. This has been the mission of the CLEF CheckThat! lab, which we have been organizing for four years now; see also our recent ECIR-2021 paper about the lab.
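As an illustration of the “previously fact-checked” task, the sketch below treats it as plain semantic retrieval over a tiny toy database of fact-check titles. It assumes the open-source sentence-transformers library and an off-the-shelf encoder; real systems, as described in the papers above, are considerably more involved.

```python
# Hedged sketch: "has this claim been fact-checked before?" as semantic
# retrieval. The encoder choice and the two-entry database are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

# Titles of already-published fact-checks (toy examples).
fact_checked = [
    "No, drinking hot water does not cure COVID-19",
    "5G towers do not spread the coronavirus",
]
db_embeddings = model.encode(fact_checked, convert_to_tensor=True)

def previously_fact_checked(claim: str, threshold: float = 0.6):
    """Return the closest prior fact-check if it is similar enough, else None."""
    query = model.encode(claim, convert_to_tensor=True)
    scores = util.cos_sim(query, db_embeddings)[0]
    best = int(scores.argmax())
    return fact_checked[best] if float(scores[best]) >= threshold else None

print(previously_fact_checked("Hot water kills the coronavirus, doctors say"))
```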
Another research line I have been involved in aims to detect the use of propaganda techniques in memes. Memes matter because a large fraction of propaganda on social media is multimodal, mixing textual and visual content. Moreover, by focusing on the specific techniques (e.g., name calling, loaded language, flag-waving, whataboutism, black & white fallacy, etc.), we can train people to recognize how they are being manipulated. Recognizing twenty-two such techniques in memes was the subject of a recent SemEval-2021 shared task; there is also an upcoming paper at ACL-2021.
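For readers curious about the modeling setup, the toy PyTorch model below shows the shape of the problem: a text embedding and an image embedding go in, one score per technique comes out, and several techniques can fire on the same meme. It is purely illustrative, with arbitrary dimensions, and is not the SemEval-2021 baseline or any published system.

```python
# Toy fusion model for multi-label propaganda-technique detection in memes.
# Illustrative only: in practice the embeddings would come from pretrained
# text and image encoders, and the dimensions below are arbitrary.
import torch
import torch.nn as nn

NUM_TECHNIQUES = 22  # e.g., name calling, loaded language, whataboutism, ...

class MemeClassifier(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_TECHNIQUES),
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Multi-label: a sigmoid per technique, so a meme can use several at once.
        return torch.sigmoid(self.fuse(torch.cat([text_emb, image_emb], dim=-1)))

model = MemeClassifier()
scores = model(torch.randn(1, 768), torch.randn(1, 512))
print(scores.shape)  # torch.Size([1, 22])
```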
In terms of content moderation, we recently wrote a survey studying the dichotomy between the types of abusive language that online platforms seek to curb and the types that research efforts on automatic detection actually target.
6. Any personal anecdotes where you fell prey to fake news?
I have fallen prey to fake news many times, and I still get fooled from time to time. Many friends and relatives send me articles asking: is this fake news? In most cases, it is easy to tell: maybe the article is just two or three sentences long and does not give much support to the claim in the title, maybe the website is a known fake news or satirical outlet, maybe a simple reverse image search reveals that the photo in the article is from a different event, or maybe the claim was previously fact-checked and is known to be true or false. Yet, in many cases, this is very hard, and my answer is: I am sorry, but I do not have a crystal ball. In fact, several studies in different countries have shown the same thing: most people cannot distinguish fake from real news; in the EU, this is true for 75% of young people.
Yet, with proper training, people can improve quite a bit. Indeed, two years ago Finland declared that it had won the war on fake news thanks to its massive media literacy program targeting all levels of society, but primarily the schools. It took them five years, which shows that real results are possible and achievable in a realistic time frame. We should be careful when setting our expectations, though: the goal should not be to eradicate all fake news online; it should rather be to limit its impact and thus make it irrelevant. This has already happened to spam, which is still around but is no longer the kind of problem it was some 15–20 years ago; now Finland has shown that we can achieve the same with fake news. Thus, while the short-term solution should focus on content moderation and on limiting the spread of malicious content, the mid-term and long-term solution would do better to look at explainability and training users: this is fake news because …, this is hateful/offensive language because …, etc.
An edited version of this story originally appeared in The Checkstep Round-up newsletter https://checkstep.substack.com/p/calls-for-more-transparency-and-safety