How we’re using AI to scale up global fact checking
With much of the world adapting to a new normal, it is more important than ever that the public is aware of the dangers of false or misleading information on the coronavirus pandemic.
Fact checking organisations around the world are stepping up and standing up for our right to high-quality information, holding to account those who mislead us and helping citizens to make informed decisions. Since the outbreak began, Full Fact has used technology to detect well over 500,000 claims about the virus made this year in the UK media alone.
We can monitor at this scale because, for the past few years, we have been developing AI tools to increase the speed, reach and impact of our fact checking. We intend to share these tools with like-minded organisations around the world.
Now we are teaming up with globally renowned fact checkers Africa Check and Chequeado to help people, media outlets, civil society, platforms and public policy makers scale up the work of fact checking, and to bring the benefits of our tools to everyone.
In this project, we are not attempting to replace fact checkers with AI, but to empower fact checkers with the best AI-driven tools. We expect most fact checks to be completed by a highly trained human, but we want to use technology to help fact checkers:
- Know the most important thing to be fact checking each day
- Know when someone repeats something fact checkers already know to be false
- Check things in as close to real-time as possible
Collecting and monitoring the data
We start by collecting a range of data from leading news sites and social media platforms that may contain claims we want to fact check. The data can come from speech on live TV (including the BBC), online news sites and social media pages.
Once we have all the input available as text, we split it into individual sentences, which are our atomic unit for fact checking. The sentences then pass through a number of steps that enrich them and make them progressively more useful to fact checkers.
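A minimal sketch of the sentence-splitting step, assuming plain English text and a simple punctuation rule. The function name and the regular expression are illustrative assumptions, not the tooling described here; live TV transcripts and social media posts would need something more robust.

```python
import re

# Naive rule (an assumption for illustration): a sentence ends at ., ! or ?
# when it is followed by whitespace and then a capital letter or opening quote.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"“])')

def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]

sentences = split_sentences(
    "GDP has risen by 2%. Unemployment is falling! Will the economy grow?"
)
```

Each element of `sentences` can then be passed on to the later enrichment steps independently.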
Identifying and labelling claims
We define a claim as the checkable part of any sentence made by a politician, by a journalist or online.
There are many different types of claim, including claims about quantities (“GDP has risen by x%”), claims about cause and effect (“this policy leads to y”), predictive claims about the future (“the economy will grow by z”) and more.
We have developed a claim-type classifier to guide fact checkers towards claims that might be worth investigating. It helps us to identify and label every new sentence according to what type of claim it contains (whether it is about cause and effect, quantities, etc.).
We started building this with the recent BERT model published by Google Research and fine-tuned it using our own annotated data. BERT has been pre-trained on hundreds of millions of sentences in over 100 languages, which makes it a broad statistical model of language as it is actually used.
Labelling claims in this way filters the volume of data we could fact check from hundreds of thousands to tens of thousands. It is a vital first step in ensuring that the users of our tools have a chance to make sense of all the information.
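To make the idea of claim-type labels concrete, here is a toy rule-based labeller. It is deliberately not the fine-tuned BERT classifier described above: the label names and keyword patterns are hypothetical, chosen only to mirror the three example claim types, and a learned model would replace the rules entirely.

```python
import re

# Hypothetical labels and patterns, for illustration only.
RULES = [
    ("quantity", re.compile(r"\d|\bpercent\b|%", re.I)),
    ("cause_effect", re.compile(r"\b(leads? to|causes?|because of|results? in)\b", re.I)),
    ("prediction", re.compile(r"\b(will|is going to|is expected to)\b", re.I)),
]

def label_claim_types(sentence):
    """Return every claim-type label whose pattern appears in the sentence."""
    return [label for label, pattern in RULES if pattern.search(sentence)]

def filter_claims(sentences):
    """Keep only sentences that received at least one claim-type label."""
    return [s for s in sentences if label_claim_types(s)]
```

A sentence can carry more than one label, and sentences with no label are dropped, which is the filtering effect described above.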
Once claims are labelled, we check each sentence to see whether it matches something we have previously fact checked. Some claims are easier to model than others, depending on how specific or ambiguous the language used to describe them is.
The plan is to train a BERT-style model to predict match/no-match for sentences and then add in entity analysis (e.g. checking whether both sentences contain the same numbers, people, organisations etc.). In combination, we hope these two stages will find repeats of a claim even if different words are used to describe it.
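The two-stage idea can be sketched as follows. This is an assumption-laden illustration: a word-overlap (Jaccard) score stands in for the learned match/no-match model, the entity check only compares numbers, and the threshold of 0.5 is arbitrary.

```python
import re

def tokens(sentence):
    """Lower-cased alphabetic words in the sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def numbers(sentence):
    """Numeric values mentioned in the sentence, kept as strings."""
    return set(re.findall(r"\d+(?:\.\d+)?", sentence))

def similarity(a, b):
    """Jaccard overlap of word sets; a crude stand-in for a model score."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def is_match(new_sentence, checked_claim, threshold=0.5):
    """Stage 1: wording is similar enough. Stage 2: the numbers agree."""
    similar = similarity(new_sentence, checked_claim) >= threshold
    return similar and numbers(new_sentence) == numbers(checked_claim)
```

The second stage matters because two sentences can be almost identical in wording yet make different claims, for example when only the percentage changes.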
Taking matching and identifying a step further
Finally, we use external processes to help spot more claims and further identify patterns of language that can be automatically checked.
Given a sentence, our tool attempts to identify the topic, trend, values, dates and location. If that succeeds, it compares the extracted information with the corresponding data via the UK Office for National Statistics API. It knows about 15 topics and c.60 verbs that define trends (e.g. rising, falling). This means our technology can automatically match claims against significantly more data to identify whether they are correct.
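A minimal sketch of the trend check, under loud assumptions: the topic list, the verb lexicon and the figures in `SERIES` are all made up for illustration, and a local dictionary stands in for a lookup against official statistics. No real API is called.

```python
import re

# Hypothetical mini-lexicons; +1 means "going up", -1 means "going down".
TREND_VERBS = {"rising": 1, "risen": 1, "increasing": 1,
               "falling": -1, "fallen": -1, "decreasing": -1}
TOPICS = {"unemployment", "inflation", "gdp"}

# Made-up figures standing in for an official time series per topic.
SERIES = {"unemployment": [4.1, 3.9, 3.8], "inflation": [1.5, 1.7, 2.0]}

def check_trend(sentence):
    """Return True or False if the stated trend agrees with the data,
    or None when the topic, the trend or the data cannot be found."""
    words = re.findall(r"[a-z]+", sentence.lower())
    topic = next((w for w in words if w in TOPICS), None)
    direction = next((TREND_VERBS[w] for w in words if w in TREND_VERBS), None)
    if topic is None or direction is None or topic not in SERIES:
        return None
    values = SERIES[topic]
    change = values[-1] - values[0]
    return change * direction > 0
```

Returning `None` rather than `False` when extraction fails keeps "could not check" separate from "checked and found wrong".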
Additionally, we enrich the content so that our model can detect semantically similar words and phrases. The first step is to identify people, places and other entities of interest and match them to external URIs. We then deduplicate the information across multiple sentences, grouping together references to the same entity (e.g. ‘the prime minister’ and ‘Boris Johnson’). This allows us to extract greater value from the data we process and means we can build sophisticated interfaces showing all statements made by an individual.
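The linking and grouping steps can be sketched with a hand-written alias table. Everything here is hypothetical: the `example.org` URIs are placeholders, and a real system would resolve mentions against a knowledge base and account for the fact that a role such as ‘the prime minister’ refers to different people at different times.

```python
import re
from collections import defaultdict

# Hypothetical alias table mapping surface forms to placeholder URIs.
ALIASES = {
    "boris johnson": "https://example.org/entity/boris-johnson",
    "the prime minister": "https://example.org/entity/boris-johnson",
    "office for national statistics": "https://example.org/entity/ons",
    "the ons": "https://example.org/entity/ons",
}

def link_entities(sentence):
    """Return the set of URIs whose alias appears as whole words in the sentence."""
    lowered = sentence.lower()
    return {uri for alias, uri in ALIASES.items()
            if re.search(r"\b" + re.escape(alias) + r"\b", lowered)}

def group_by_entity(sentences):
    """Group sentences under each entity URI they mention."""
    groups = defaultdict(list)
    for sentence in sentences:
        for uri in sorted(link_entities(sentence)):
            groups[uri].append(sentence)
    return dict(groups)
```

Because both aliases resolve to the same URI, statements that name the person and statements that name the role end up in one group.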
Once the claim has been identified
The fact checking process itself is often undertaken offline. We then publish the results on our website and describe each fact check with some very specific markup called ClaimReview, which is part of the wider schema.org project. Schema.org describes content on a range of topics in domain-specific terms. Describing our content this specifically helps ensure that our fact checks can travel further than our own platforms.
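As a rough illustration of what ClaimReview markup looks like, the sketch below builds a minimal JSON-LD object in Python. The property names come from schema.org; the helper function, the URL, the publisher name, the date and the verdict are all placeholder assumptions, and a published fact check would normally carry more fields.

```python
import json

def claim_review(claim, verdict, review_url, publisher, date_published):
    """Build a minimal schema.org ClaimReview object as a Python dict."""
    return {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "url": review_url,
        "datePublished": date_published,
        "claimReviewed": claim,
        "author": {"@type": "Organization", "name": publisher},
        "reviewRating": {"@type": "Rating", "alternateName": verdict},
    }

# Placeholder values only; none of these describe a real fact check.
markup = json.dumps(claim_review(
    claim="GDP has risen by x%",
    verdict="Incorrect",
    review_url="https://example.org/fact-checks/gdp-claim",
    publisher="Example Fact Checkers",
    date_published="2020-01-01",
), indent=2)
```

The resulting string would typically be embedded in the fact check's web page inside a `script` element of type `application/ld+json`, where other platforms can read it.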
These tools are already effective, but we still have ground to cover.
We are already using these tools with fact checkers on a daily basis. In the UK they sped up the process of fact checking party manifestos to help voters last winter. In South Africa an Africa Check fact check identified by our AI has informed live debate around health care choices. In Argentina, half of the fact checks published each week address claims identified by AI tools. However, parts of the process described above are very experimental, and we are putting in the hard engineering yards to make them robust. This matters to the fact checkers already using the tools and also because of the scale of our ambition.
We want these tools to be used by many more fact checkers. We are part of a community of 200 fact checkers worldwide. Together we have published over 50,000 fact checks, but we believe our tools can make our global effort more effective. This leaves us with one big challenge: how do we do this at true web scale?