So, we have these historical match archives, and we've learned how to get started inspecting the data they contain. But inspecting individual matches only takes us so far -- we really want to analyze many matches, to discover unexpected insights, or confirm hypotheses. How can we do that?

Let's look at an example of an analysis that we might run. This example will use the jq JSON processor command line tool. You can do similar things using the built-in JSON library in Python. Similar tools exist for many other languages.

Say we want to know which opening moves are most popular in Connect Four. Let's also consider only matches signed by the hosting system, so we know where they came from. This can be done with the following command:

$ cat matchesFrom2012 | jq 'if .data.gameMetaURL == "" and .data.matchHostPK != null then .data.moves[0][0] else null end' | grep -v null | sort | uniq -c

Before we run this, how does it work? First, it lists all of the matches from 2012. For every match where Connect Four was played, and a signature was recorded, we output the first move for the first player. For every other match, we output null. Then we drop all of the nulls. Then we sort the output, and print out a count for each unique entry: a histogram, essentially.

Okay, let's run that command:

$ cat matchesFrom2012 | jq 'if .data.gameMetaURL == "" and .data.matchHostPK != null then .data.moves[0][0] else null end' | grep -v null | sort | uniq -c
    340 "( drop 1 )"
    356 "( drop 2 )"
    357 "( drop 3 )"
   1672 "( drop 4 )"
   1644 "( drop 5 )"
    333 "( drop 6 )"
    311 "( drop 7 )"
    321 "( drop 8 )"

Success! And best of all, these results make sense -- the two center columns are the strongest opening plays in Connect Four, and this confirms that they're chosen most frequently. Also, since the game is symmetric, it makes sense that there's a symmetric distribution of opening moves.

Let's compare that result with the distribution of opening moves for Connect Four from 2011.

$ cat matchesFrom2011 | jq 'if .data.gameMetaURL == "" and .data.matchHostPK != null then .data.moves[0][0] else null end' | grep -v null | sort | uniq -c
   1447 "( drop 1 )"
   1413 "( drop 2 )"
   1446 "( drop 3 )"
   1592 "( drop 4 )"
   1506 "( drop 5 )"
   1400 "( drop 6 )"
   1478 "( drop 7 )"
   1470 "( drop 8 )"

That's different! What happened?

Well, to begin with, there are similarities between 2011 and 2012. The distributions of observed moves are roughly symmetric in both. And both have a peak in the center, for the moves we know to be stronger. But in 2012 the peak is much more pronounced than in 2011. It's almost as though the players in 2012 were noticeably better than those in 2011...

... and indeed, that's true. For the majority of 2011, there were several robot players on Tiltyard playing randomly, doing load testing of the hosting, publishing, archiving, etc. These random players were shut off once real players began to play on Tiltyard consistently. That change explains the sudden improvement in the choice of opening moves in Connect Four between 2011 and 2012. This can be confirmed by looking at the data for 2013, which matches the data for 2012.

Thus concludes our first discovery based on the historical match archives!