Lexical analysis of debates finds Obama and McCain startlingly similar

My colleague, Dr. Sidicious Bonesparkle, had a bit of sport last week with a Global Language Monitor analysis showing that Sarah Palin spoke at a higher grade level in the VP “debate” than did Joe Biden. As it turns out, the GLM isn’t alone in examining the language used by the presidential and vice presidential candidates.

A new report from Martin Krzywinski of the Genome Sciences Centre in Vancouver examines … well, let’s let him explain it:

The analysis presented here explores word usage in the 2008 US presidential and vice-presidential debates. In other words, I try to get at

* – who said what?
* – what was unique about each candidate’s speech?
* – what was word and concept frequency for each candidate?
* – did any part of a candidate’s speech pattern stand out?


The transcript for each debate is parsed: for each line in the transcript the speaker is identified, stop words (words such as “do”, “and” and “it”) are removed and the remaining non-stop words are assigned a part of speech (this process is called tagging).

The parsed and tagged transcript is then analyzed to characterize the following

* – word frequency distribution for each candidate
* – sentence size and proportion of unique words
* – words exclusive to a candidate and words used by both candidates
* – frequency of concepts, as defined by part of speech pairings (e.g. noun/verb)
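
For the curious, here is a rough Python sketch of what a pipeline along those lines might look like. To be clear, this is not Krzywinski's code. The `SPEAKER: text` line format, the tiny stop-word list, and the function names `parse_transcript`, `lexical_stats` and `compare` are all assumptions made for illustration, and the part-of-speech tagging step is left out entirely because it needs a real tagger.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; a real analysis
# would use a much longer one).
STOP_WORDS = {"a", "an", "and", "the", "do", "it", "is", "to", "of",
              "we", "i", "in", "that", "will", "my"}


def parse_transcript(text):
    """Return {speaker: [words]}, assuming lines look like 'SPEAKER: words...'."""
    words_by_speaker = defaultdict(list)
    for line in text.splitlines():
        speaker, sep, speech = line.partition(":")
        if not sep:
            continue  # no speaker label on this line, so skip it
        words = re.findall(r"[a-z']+", speech.lower())
        words_by_speaker[speaker.strip().upper()].extend(words)
    return dict(words_by_speaker)


def lexical_stats(words):
    """Vocabulary size, non-stop word fraction and word counts for one speaker."""
    non_stop = [w for w in words if w not in STOP_WORDS]
    frequency = Counter(non_stop)
    return {
        # number of unique non-stop words
        "vocabulary_size": len(frequency),
        # share of all spoken words that are not stop words
        "non_stop_fraction": len(non_stop) / len(words) if words else 0.0,
        "frequency": frequency,
    }


def compare(words_a, words_b):
    """Non-stop words exclusive to each speaker, and those used by both."""
    vocab_a = set(words_a) - STOP_WORDS
    vocab_b = set(words_b) - STOP_WORDS
    return {
        "only_a": vocab_a - vocab_b,
        "only_b": vocab_b - vocab_a,
        "shared": vocab_a & vocab_b,
    }


if __name__ == "__main__":
    # Made-up lines, not quotes from the actual debates.
    sample = (
        "OBAMA: We will cut taxes and fix the economy.\n"
        "MCCAIN: My friends, I will cut spending and fix the economy."
    )
    speakers = parse_transcript(sample)
    for name, words in speakers.items():
        stats = lexical_stats(words)
        print(name, stats["vocabulary_size"], stats["non_stop_fraction"])
    print(compare(speakers["OBAMA"], speakers["MCCAIN"]))
```

Run against a full transcript, something like this would give per-candidate numbers that could be set side by side, though the results would depend heavily on which stop-word list is used.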

I’m still trying to sort out what the results tell us. Less, I think, than I’d hoped, but at least one interesting bit does jump out at me.

The central theme that quickly became evident when I was doing this analysis was that the candidates’ speech conformed to nearly identical lexical patterns. For example, vocabulary size for Obama and McCain (number of unique non-stop words used) is identical at 1,243. Their non-stop word fraction is also nearly identical at 43.4% and 44.3% for Obama and McCain, respectively. Likewise, the difference in their unique word fraction and average word frequency is only 4.3% and 4.8%, respectively.

The reason for such conformity is anyone’s guess. It could be a product of political selection, or premeditated wordsmithing aimed at reaching a segment of the population whose comprehension has been profiled in detail. I suspect that while both play a role, all candidates spend significant time rehearsing the exact word combinations for answers to likely questions, with the word combinations chosen for simplicity and effect.

If we look at some of the word charts generated by the cool software he uses, it’s clear that both Obama and McCain are pushing particular buttons, but the structural lexical similarity is fascinating. Do the respective camps and their “debate” coaches have the audience dialed in that precisely?

Or maybe it’s just a fascinating coincidence?

Have a look for yourself…

4 replies »

  1. Both sides are puppet masters, and have their respective constituencies dancing on a string, whether they believe it or not.


  2. Eh. It made me uncomfortable that they omitted words like “and”, since that’s a basic logical operator, but I can see what sort of insights they could glean from this.

    The similarity probably has something to do with the two-party duopoly in America.