Grokking The Election
I started watching the VP debates last night and, after about 15 minutes of enjoying the sniping back and forth, I came to two inescapable conclusions:
- I want to see more of Sarah Palin
- Hoo, I'm Bored!
In my line of work I sometimes have to take immense blocks of data, distill them down to their essential components, and make decisions based on what they contain. I also enjoy visualizing information, since a picture makes it easier to get an overview of what's in there.
So, I grabbed a transcript of the Obama / McCain debate (since the VP debate was still going on) and did a little math on it. Here are my results.
Raw Data
The first thing I did was pull the full text of the debate from here. This became my raw information pool for the task at hand.
Next, I grouped all the text together by speaker (there are three, all helpfully labeled).
To do that, I grepped for the name of each speaker and placed the matching text into its own file. Since there are three speakers (Lehrer, Obama, and McCain), I ended up with three files to work with.
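If you'd rather do the split in code than with grep, here's a minimal Python sketch. It assumes each turn starts with an upper-case label such as "OBAMA:" at the beginning of a line, and that any unlabeled lines belong to whoever spoke last. The label format and the sample text are assumptions for illustration, so adjust them to match the transcript you actually have.

```python
import re
from collections import defaultdict

# Assumed label format: "LEHRER:", "OBAMA:", "MCCAIN:" at the start of a line.
SPEAKERS = ("LEHRER", "OBAMA", "MCCAIN")
LABEL = re.compile(r"^(%s):\s*(.*)$" % "|".join(SPEAKERS))


def split_by_speaker(transcript):
    """Return {speaker: [lines]}; unlabeled lines go to the last speaker seen."""
    by_speaker = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LABEL.match(line)
        if match:
            # A new label switches the current speaker; keep the rest of the line.
            current, line = match.group(1), match.group(2)
        if current is not None and line:
            by_speaker[current].append(line)
    return dict(by_speaker)


# Made-up sample text, only to show the shape of the input.
sample = """LEHRER: Good evening.
OBAMA: Thank you, Jim.
And thanks to the commission.
MCCAIN: Thank you."""

parts = split_by_speaker(sample)
for speaker, lines in parts.items():
    print(speaker, len(lines))
```

From there, writing each speaker's lines out to a separate file gives you the same three files described above.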
Readability
The first thing I ended up doing was dropping the files on my web server and sending the text to Juicy Studio's Readability Test, just for giggles to see how it would look to a machine.
Lehrer

Obama

McCain

The main bit to pay attention to is the “Gunning-Fog Index”. Here's what the fine folks over at Juicy say about it:
The result is your Gunning-Fog index, which is a rough measure of how many years of schooling it would take someone to understand the content. The lower the number, the more understandable the content will be to your visitors. Results over seventeen are reported as seventeen, where seventeen is considered post-graduate level.
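For the curious, the standard Gunning-Fog formula is 0.4 times the sum of the average sentence length and the percentage of words with three or more syllables. Here's a rough Python sketch of it. The syllable counter is a crude vowel-run heuristic I'm assuming for illustration, not whatever Juicy Studio actually uses, so expect its numbers to differ somewhat from theirs. The cap at seventeen mirrors the quote above.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count runs of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """Approximate Gunning-Fog index, capped at 17 as described in the quote."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100.0 * len(complex_words) / len(words)
    return min(17.0, 0.4 * (avg_sentence_len + pct_complex))


print(gunning_fog("The cat sat. The dog ran."))
```

Short sentences made of short words score low; long sentences full of multi-syllable words push the number up toward the cap.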
So, overall it's a wash. Lehrer's scores are a bit lower than the others, since he was only there to ask questions and to break things up whenever someone was on the ropes.
Word Clouds
I've always enjoyed word clouds and how they represent information. In short, word clouds take a volume of words and perform a frequency analysis on them. Once the analysis is completed, the words are ranked by the number of times they appear in the text.
The larger the word, the more often it appears in the target text; conversely, the smaller the word, the less often it appears.
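The frequency analysis behind a word cloud is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not how any particular word cloud tool works: the stopword list, the linear size scaling, and the point sizes are all assumptions I picked for the example.

```python
import re
from collections import Counter

# A tiny, assumed stopword list; real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "is", "it", "we", "i"}


def word_frequencies(text, top_n=50):
    """Return the top_n (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)


def font_sizes(ranked, min_pt=10, max_pt=48):
    """Scale counts linearly so the most frequent word gets max_pt."""
    if not ranked:
        return {}
    highest = ranked[0][1]
    lowest = ranked[-1][1]
    spread = max(1, highest - lowest)  # avoid dividing by zero
    return {w: min_pt + (max_pt - min_pt) * (c - lowest) / spread
            for w, c in ranked}


ranked = word_frequencies("tax tax tax spending spending the the the the senator")
sizes = font_sizes(ranked)
for word, count in ranked:
    print(word, count, sizes[word])
```

Note that "the" never makes the list even though it appears most often, which is exactly why stopword filtering matters for clouds like these.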
Instead of writing my own parser, I ran across TagCrowd, which can parse raw text, a URL, or even an uploaded file. This was extremely handy for getting through all of that text, and the output was quite attractive as well.
For each of the candidates, as well as the full text of the debate, I ran TagCrowd with these settings:

Word Cloud Results
Conclusion
I'll hold off on giving political insight on the word clouds. Just check them out, and let your eyes land on the key words you find interesting. You may come to conclusions of your own that differ from mine.
Anyway, I hope you enjoyed this bit. I'll probably hunt down the Biden / Palin transcript and try the same thing on it as well, just to see if anything looks out of the ordinary compared with the baseline I'm releasing today.
Now, if you want to try this on your own, here's a download of the text I was working with. It should save you the trouble of finding a transcript online and cleaning it up.
Enjoy,
tom