The Big Monitoring Project

21 February 2015 | Will Moy

[Last updated 9 Mar: progress is here, but we still need lots of help.]

There’s a myth in politics that what the Prime Minister says matters. Actually, as he knows, and everybody who wants his job knows, it only matters much if he says it enough times. ‘Long term economic plan’, anyone?

The Big Monitoring Project

So how will we spot and get ahead of all those claims? We need your help to build the tools that can monitor this election properly and level the playing field between big-spending political parties and the rest of us.

Here are some things we want to cover, and how you can help if you've got technical skills:

  • Television. To pull out subtitles, we need someone to spend some time getting dvbstream, VLC or similar to work with CCExtractor. Not too hard, but it needs patience and decent Linux skills. We want to save the actual recordings too. (UPDATE: we have a working machine, and now need to stitch the software together.)
  • Email. A tool that will scrape an email account, given account details (we’d like to record what the party and candidate mailing lists are saying). Not too hard.
  • Twitter. A robust scraper for ongoing tweets, given a list of Twitter handles and API keys. (UPDATE: we have working but not terribly robust code for this, which I will get onto GitHub asap.)
  • Past tweets. A robust scraper for past tweets, given a list of Twitter handles and API keys. You can go back about 3,000 tweets with the API. (UPDATE: ditto.)
  • Facebook. A scraper for Facebook pages, if that’s possible.
  • Pictures. Pictures are going to be important: infographics, twitpics, and other wannabe-viral content. Maybe Tesseract can pull out the text to make them more searchable.
  • Websites. We think the software archive.org has built might be able to handle this.
  • What websites look like. You sometimes see full-length screenshots of websites. Regular visual records of news websites would be immensely valuable for tracking how the priority given to stories changes with the news agenda. (UPDATE: here's a start: http://partypages.johnre.es/ by @john_rees)
  • Blogs / RSS. Scrape an RSS feed, given a URL.
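
To give a flavour of the simplest item on that list, here is a minimal sketch of an RSS scraper using only the Python 3 standard library. It assumes a plain RSS 2.0 feed; the function names (`parse_rss`, `fetch_rss`) and the dictionary keys are ours for illustration, and a robust version would also need to cope with Atom feeds, namespaces and malformed XML.

```python
import urllib.request
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return one dict per <item> found in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child tag is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


def fetch_rss(url, timeout=30):
    """Download a feed and parse it. Network errors are left to the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_rss(response.read())
```

Keeping the parsing separate from the download means the parser can be tested against saved feeds without touching the network.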

Thanks to YourNextMP.com we have a decent list of candidates and their URLs to target these tools at. ElectionLeaflets.org and other projects are taking care of tracking things that can't be automated. But developers — for the things that can be automated, your country needs you.

What difference will it make?

We can't speak for all the other people who might use this data: journalists, campaigners, researchers, and anyone else who's curious.

During the election we're running an 18 hour/day election centre doing rapid analysis of the campaigns' claims. We'll use this data to make sure we're focusing on the things that make most difference, and to make sure that when we've spotted something that doesn't stack up, people know and can choose not to repeat it.

It’s much easier to stop people repeating a mistake than to stop them making it for the first time. We can get the original source to clarify, and journalists can choose not to cover a claim when they see we’ve checked it and aren't impressed with what we found:

Nice DH people call to explain source of J Hunt's £200m health tourism figure. Turns out it's a report from 2003: links to Full Fact

The Big Monitoring Project will let us do that at scale, and make sure the election campaigns are grounded in reality.

Get stuck in

Pick a topic, set up a GitHub repository, create a project and put your code in. We prefer Python 3 but more than that we prefer working code. Have a README file. Have a directory called data/raw which you download the raw content into, data/staging for the processed version, and data/working for anything in between the two.
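
If it helps, the directory layout can be created with a few lines of Python. This is only a sketch: the helper name `make_data_dirs` is made up for illustration, and the three directory names are the ones described above.

```python
from pathlib import Path

# raw: content as downloaded; working: anything in between;
# staging: the processed version.
STAGES = ("raw", "working", "staging")


def make_data_dirs(project_root):
    """Create data/raw, data/working and data/staging under project_root.

    Returns a dict mapping each stage name to its Path. Existing
    directories are left alone, so it is safe to call on every run.
    """
    root = Path(project_root)
    paths = {}
    for stage in STAGES:
        path = root / "data" / stage
        path.mkdir(parents=True, exist_ok=True)
        paths[stage] = path
    return paths
```

Calling `make_data_dirs(".")` at the start of a scraper gives you somewhere to put the raw downloads before any processing happens.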

And what format would we like for the processed version? Here's an example, a simple adaptation of House of Commons Hansard from mySociety. Basically, each item should be an XML file, with <item> as the root tag, broken into <section>s where necessary (for example if the URL changes or a different person is known to be speaking) and each separate sentence in its own <s> tag. We used nltk to split the sentences.
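
As a rough sketch of that structure, the code below builds an <item> with <section>s and one <s> per sentence using the standard library. Two things here are assumptions rather than part of our format: the attribute names on <section> (`speaker` in the usage example) are invented for illustration, and the sentence splitter is a naive regular expression standing in for nltk so that the example runs without any downloads.

```python
import re
import xml.etree.ElementTree as ET

# Naive stand-in for nltk: split on whitespace that follows ., ! or ?
# It will get abbreviations such as "Mr. Smith" wrong.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, dropping empty pieces."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def build_item(sections):
    """Build an <item> element from (attributes, text) pairs.

    Each pair becomes one <section>; the attributes dict is hypothetical
    metadata (for example a speaker or URL), and each sentence of the
    text goes into its own <s> element.
    """
    item = ET.Element("item")
    for attributes, text in sections:
        section = ET.SubElement(item, "section", attributes)
        for sentence in split_sentences(text):
            ET.SubElement(section, "s").text = sentence
    return item


def item_to_xml(sections):
    """Serialise the item as a unicode XML string."""
    return ET.tostring(build_item(sections), encoding="unicode")
```

For example, `item_to_xml([({"speaker": "A"}, "First claim. Second claim!")])` produces one <section> containing two <s> elements. Swapping `split_sentences` for nltk's sentence tokenizer should only require changing that one function.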

Shout at us with any questions: team@fullfact.org or @FullFact.

 

* H/T Emily Randall for 'The Big Monitoring Project'

