[Last updated 5 Mar: still looking for lots of help, and someone to wrangle Linux drivers.]
There’s a myth in politics that what the Prime Minister says matters. Actually, as he knows, and everybody who wants his job knows, it only matters much if he says it enough times. ‘Long term economic plan’, anyone?
The Big Monitoring Project
So how will we spot and get ahead of all those claims? We need your help to build the tools that can monitor this election properly and level the playing field between big-spending political parties and the rest of us.
Here are some things we want to cover, and how you can help if you’ve got technical skills:
- Television. To pull out subtitles, we need someone to spend some time getting dvbstream, VLC or similar to work with CCExtractor. Not too hard, but it needs patience and decent Linux skills. We also want to save the actual recordings. (UPDATE: currently blocked on getting a TBS 6284 working under Linux. Any takers?)
- Email. A tool that will scrape an email account, given account details (we’d like to record what the party and candidate mailing lists are saying). Not too hard.
- Twitter. A robust scraper for ongoing tweets, given a list of Twitter handles and API keys. (UPDATE: we have working but not terribly robust code for this, which I will get onto GitHub as soon as possible.)
- Past tweets. A robust scraper for past tweets, given a list of Twitter handles and API keys. You can go back about 3,000 tweets with the API. (UPDATE: ditto.)
- Facebook. A scraper for Facebook pages, if that’s possible.
- Pictures. Pictures are going to be important: infographics, twitpics, and other wannabe-viral content. Maybe Tesseract can pull out the text to make them more searchable.
- Websites. We think the software archive.org has built might be able to handle this.
- What websites look like. You sometimes see full-length screenshots of websites. Regular visual records of news websites would be immensely valuable for tracking how the priority given to each story changes with the news agenda. (UPDATE: here’s a start: http://partypages.johnre.es/ by @john_rees)
- Blogs / RSS. Scrape an RSS feed, given a URL.
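To give a flavour of how small some of these tools can be, here is a minimal sketch of the RSS item, using only the Python standard library. The function names `fetch_feed` and `parse_rss` are ours, made up for illustration, and the sketch assumes a plain RSS 2.0 feed; Atom feeds and badly formed XML would need more care.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=30):
    """Download the raw feed bytes from the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_rss(xml_data):
    """Return one dict per <item> element in an RSS 2.0 feed.

    Missing fields come back as empty strings rather than raising.
    """
    root = ET.fromstring(xml_data)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return entries
```

Calling `parse_rss(fetch_feed(url))` on a candidate's blog feed would give a list of posts to save and process. A robust version would also want retries, polite request rates and a record of what has already been fetched.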
Thanks to YourNextMP.com we have a decent list of candidates and their URLs to target these tools at. ElectionLeaflets.org and other projects are taking care of tracking things that can’t be automated. But developers — for the things that can be automated, your country needs you.
What difference will it make?
We can’t speak for all the other people who might use this data: journalists, campaigners, researchers, and anyone else who’s curious.
During the election we’re running an 18 hour/day election centre doing rapid analysis of the campaigns’ claims. We’ll use this data to make sure we’re focusing on the things that make most difference, and to make sure that when we’ve spotted something that doesn’t stack up, people know and can choose not to repeat it.
It’s much easier to stop people repeating a mistake than to stop them making it for the first time. We can get the original source to clarify, and journalists can choose not to cover a claim when they see we’ve checked it and aren’t impressed with what we found.
The Big Monitoring Project will let us do that at scale, and make sure the election campaigns are grounded in reality.
Get stuck in
Pick a topic, set up a GitHub repository, create a project and put your code in. We prefer Python 3, but more than that we prefer working code. Have a README file. Have a directory called data/raw which you download the raw content into, data/staging for the processed version, and data/working for anything in between the two.
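As a rough sketch of that layout, something like the following would create the three directories and drop a download into data/raw. The helper names `make_data_dirs` and `save_raw` are hypothetical, not part of any existing project code.

```python
from pathlib import Path

# The three stages described above: raw downloads, intermediate
# files, and the processed output.
STAGES = ("raw", "working", "staging")


def make_data_dirs(project_root):
    """Create data/raw, data/working and data/staging if missing."""
    root = Path(project_root)
    created = {}
    for stage in STAGES:
        path = root / "data" / stage
        path.mkdir(parents=True, exist_ok=True)
        created[stage] = path
    return created


def save_raw(project_root, filename, content):
    """Write downloaded bytes, untouched, into data/raw."""
    dirs = make_data_dirs(project_root)
    target = dirs["raw"] / filename
    target.write_bytes(content)
    return target
```

Keeping the raw download untouched means the processing step can be rerun later if the code that produces the staging version improves.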
And what format would we like for the processed version? Here’s an example, a simple adaptation of House of Commons Hansard from mySociety. Basically, each item should be an xml file, with <item> as the root tag, broken into <section>s where necessary (for example if the URL changes or a different person is known to be speaking) and each separate sentence in its own <s> tag — for which we used nltk.
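Here is a minimal sketch of producing that shape with the standard library. Two things in it are assumptions of ours: the attribute names on each section (such as `speaker` or `url`) are illustrative, since the format above does not fix them, and the sentence splitter is a crude regular expression standing in for nltk, which will do a much better job on abbreviations and quotations.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    A stand-in for a proper tokenizer such as nltk's.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def build_item(sections):
    """Build an <item> tree from (attributes, text) pairs.

    Each pair becomes one <section>; each sentence gets its own <s>.
    The attribute names are up to the caller and are hypothetical here.
    """
    item = ET.Element("item")
    for attrs, text in sections:
        section = ET.SubElement(item, "section", attrs)
        for sentence in split_sentences(text):
            ET.SubElement(section, "s").text = sentence
    return item


def to_xml(item):
    """Serialise the tree to a unicode string."""
    return ET.tostring(item, encoding="unicode")
```

For example, `build_item([({"speaker": "A"}, "First claim. Second claim!")])` gives one section containing two sentences, and `to_xml` turns it into text ready to write into data/staging.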
Shout at us with any questions: email@example.com or @FullFact.
* H/T Emily Randall for ‘The Big Monitoring Project’