Winning a Debate: Scraping the Intelligence Squared Dataset

This is a brief companion to the post analyzing methods of assigning a winner to a debate using the Intelligence Squared dataset. Here I outline how I assembled that dataset, for transparency.

Compiling the Pages

The results from each Intelligence Squared debate are posted online in pages such as this, including video of the debate, a description of the major positions of each side, the qualifications of the debaters, and, most importantly, the results of the audience polling. Unfortunately, there doesn't seem to be a central hub page that neatly lists all the URLs. However, the desired dataset isn't huge (about 90 total debates), so the most practical approach is simply to trawl through the website manually, recording the date, name, and URL of each debate in question.
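The hand-compiled index is just a small table with one row per debate. A minimal sketch of what it might look like, assuming placeholder column names and made-up example rows (the real index simply has roughly 90 such rows):

```r
# Hypothetical sketch of the hand-compiled index; column names and
# example rows are placeholders, not the actual project's data.
index <- data.frame(
  date = as.Date(c("2014-09-09", "2016-02-03")),
  name = c("Example Motion A", "Example Motion B"),
  url  = c("https://example.org/debate-a", "https://example.org/debate-b"),
  stringsAsFactors = FALSE
)
nrow(index)  # one row per debate
```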

Scraping the Numbers

Once we have a full list of the relevant URLs, the results themselves are luckily presented in a consistent format, so some simple work with regular expressions gathers the data we need.


In case anyone wants to borrow this sort of simple scrape for their own projects, you can find the code here, although the approach is extremely messy. Luckily, for a one-off R scrape like this, being fast matters more than being clean, and awkward code is fine as long as it remains readable and clear. Regular expressions like the following can grab the relevant numbers, which we then store in one large data frame.

gsub(".*\"f\":\\{\"f\":(\\d+\\.*\\d*),\"a\":(\\d+\\.*\\d*),\"u\":(\\d+\\.*\\d*).*", "\\1 \\2 \\3", post)

where the three capture groups identify the numbers we're interested in. The numbers are compiled into one large data frame, which can be viewed in raw form here, for those interested in examining the data themselves. In total, there are 88 debates, stretching back to 2012, with all the information needed. The program itself stretches back further, but they only began tracking the subgroup movements more recently.
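To make the extraction step concrete, here is a minimal sketch in R, run against a made-up snippet mimicking the embedded results format (the snippet, variable names, and the reading of "f"/"a"/"u" as for/against/undecided percentages are my own assumptions, not the actual page source):

```r
# Made-up snippet mimicking the embedded results on a debate page;
# "f", "a", "u" presumably hold the for/against/undecided percentages.
post <- '{"f":{"f":25.5,"a":60.2,"u":14.3},"other":"fields"}'

# Capture the three numbers and separate them with spaces, as in the post.
nums <- gsub(".*\"f\":\\{\"f\":(\\d+\\.*\\d*),\"a\":(\\d+\\.*\\d*),\"u\":(\\d+\\.*\\d*).*",
             "\\1 \\2 \\3", post)

# Split on spaces and convert to numeric, ready to bind into the data frame.
result <- as.numeric(strsplit(nums, " ")[[1]])
# result is c(25.5, 60.2, 14.3)
```

Repeating this over every URL in the index and row-binding the results yields the one large data frame described above.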

In the next post, we can actually dive into the data itself.

Written on July 26, 2018