## Hall of Fame ballots: some quick and dirty clustering

Since all the public Hall of Fame ballots are now available online in machine-readable form, thanks to Ryan Thibodeaux, I thought I’d mess around with the built-in clustering functions in sklearn and see what I could see.

The natural metric on ballots is Hamming distance, I guess.  I first tried sklearn’s AgglomerativeClustering class.  I didn’t tell it what metric to use on the ballots, but I’m assuming it’s using Euclidean distance, which in this case is as good as Hamming distance: for 0-1 vectors, squared Euclidean distance is exactly Hamming distance.  I asked AgglomerativeClustering to break the Hall of Fame voters into 2 clusters.  Guess what I found?  There’s a cluster of 159 voters who almost entirely voted for Barry Bonds and Roger Clemens, and a cluster of 83 who universally didn’t.  You won’t be surprised to hear that those who voted for Bonds and Clemens were also a lot more likely to vote for Manny Ramirez, Sammy Sosa, and Curt Schilling than the other cluster.
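A quick check of the Hamming-versus-Euclidean point, as a minimal sketch.  The two toy ballots and the function names are made up for illustration; nothing here comes from the real ballot data.

```python
import numpy as np

def hamming(a, b):
    # Number of positions where the two 0-1 ballots disagree.
    return int(np.sum(a != b))

def squared_euclidean(a, b):
    # Each disagreement contributes (1 - 0)^2 = 1, each agreement contributes 0.
    return int(np.sum((a - b) ** 2))

# Two hypothetical ballots over five candidates.
ballot_a = np.array([1, 1, 0, 0, 1])
ballot_b = np.array([1, 0, 0, 1, 1])

h = hamming(ballot_a, ballot_b)
e2 = squared_euclidean(ballot_a, ballot_b)
print(h, e2)
```

So ordinary Euclidean distance on ballots is the square root of Hamming distance, and the two always agree about which ballots are closer together.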

Which candidate was most notably unpopular among the Bonds-forgivers?  That would be Omar Vizquel.  He was on 53% of the steroid-rejector ballots!  Only 24% of the Bonds cluster thought Omar deserved Cooperstown.

Then I tried asking AgglomerativeClustering for four clusters.  The 83 anti-steroids folks all stayed together.  But the bigger group now split into Cluster 1 (61 ballots), Cluster 2 (46), and Cluster 3 (52).  Cluster 1 is the Joe Posnanski cluster.  Cluster 2 is the Nick Cafardo cluster.  Cluster 3 is the Tim Kurkjian cluster.

What differentiates these?  Cluster 1 is basically “people who voted for Larry Walker.”  The difference between Cluster 2 and Cluster 3 is more complicated.  The Cluster 2 ballots were much more likely to have:

Manny Ramirez, Sammy Sosa

and much less likely to have

Mike Mussina, Edgar Martinez, Curt Schilling

I’m not sure how to read this!  My best guess is that there’s no animus towards pitchers and DHs here; if you’re voting for Bonds and Clemens and Sosa and Ramirez and the guys who basically everybody voted for, you just don’t have that many votes left.  So I’d call Cluster 2 the “2000s-slugger loving cluster” and Cluster 3 everybody else.

Maybe I should say how you actually do this?  OK, first of all you munge the spreadsheet until you have a 0-1 matrix X where the rows are voters and the columns are baseball players.  Then your code looks like:

from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=4)

model.fit(X)

model.labels_

which outputs

array([1, 0, 3, 1, 1, 1, 0, 0, 0, 0, 2, 1, 2, 1, 3, 0, 0, 0, 2, 1, 0, 3, 2,
1, 2, 1, 1, 3, 1, 3, 3, 0, 2, 2, 0, 1, 1, 1, 0, 2, 0, 0, 1, 2, 1, 3,
2, 2, 1, 3, 0, 2, 0, 3, 1, 0, 0, 2, 0, 2, 1, 2, 1, 0, 0, 0, 1, 0, 2,
0, 1, 1, 2, 0, 1, 3, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 1, 0,
0, 0, 3, 1, 1, 0, 1, 0, 3, 1, 3, 3, 2, 0, 2, 1, 0, 2, 2, 3, 2, 3, 1,
3, 0, 3, 1, 0, 2, 1, 0, 0, 0, 1, 3, 1, 1, 3, 2, 3, 3, 2, 2, 0, 3, 3,
1, 0, 0, 2, 2, 3, 1, 3, 1, 2, 0, 1, 3, 1, 0, 0, 2, 3, 0, 2, 1, 0, 2,
1, 3, 3, 0, 1, 3, 1, 1, 0, 0, 2, 0, 1, 2, 0, 2, 1, 0, 0, 3, 3, 1, 1,
2, 3, 2, 0, 2, 0, 0, 1, 2, 1, 0, 3, 1, 3, 0, 3, 3, 3, 0, 3, 3, 3, 0,
2, 0, 3, 3, 0, 1, 0, 1, 2, 3, 2, 2, 0, 0, 0, 1, 3, 3, 1, 0, 0, 1, 3,
0, 2, 3, 1, 0, 0, 0, 0, 0, 3, 3, 3])

i.e. a partition of the voters into four groups.
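For the munging step and the "which players separate the clusters" comparisons, here is a rough sketch on toy data.  The `ballots` dictionary, the voter names, and the helper `vote_share_by_cluster` are all hypothetical; the real spreadsheet has its own layout, so treat this as one way it could go rather than what I actually ran.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy ballots: voter -> set of players named on the ballot.
ballots = {
    "voter_a": {"Bonds", "Clemens", "Walker"},
    "voter_b": {"Bonds", "Clemens", "Ramirez"},
    "voter_c": {"Bonds", "Clemens", "Sosa", "Ramirez"},
    "voter_d": {"Vizquel", "Walker"},
    "voter_e": {"Vizquel"},
    "voter_f": {"Vizquel", "Mussina"},
}

voters = sorted(ballots)
players = sorted(set().union(*ballots.values()))

# 0-1 matrix: rows are voters, columns are players.
X_toy = np.array([[1.0 if p in ballots[v] else 0.0 for p in players]
                  for v in voters])

labels = AgglomerativeClustering(n_clusters=2).fit(X_toy).labels_

def vote_share_by_cluster(X, labels):
    # For each cluster, the fraction of its ballots naming each player.
    return {int(c): X[labels == c].mean(axis=0) for c in np.unique(labels)}

shares = vote_share_by_cluster(X_toy, labels)
for c, share in shares.items():
    print(c, dict(zip(players, share.round(2))))
```

Comparing the per-cluster shares column by column is how you would spot a player like Vizquel, who is popular in one cluster and not in the other.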

(Agglomerative clustering naturally generates a hierarchical clustering, i.e. a tree with the HoF voters on the leaves; there must be some way to get sklearn to output this directly, but I don’t know it!)
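One possible route to the tree, offered as a sketch.  A fitted AgglomerativeClustering model has a `children_` attribute listing the merges, and, if you pass `compute_distances=True` (available in reasonably recent sklearn versions), a `distances_` attribute as well.  The helper name `linkage_from_sklearn` and the random demo matrix are my own inventions; the idea is to repackage the merges in the four-column format that scipy's dendrogram function expects.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def linkage_from_sklearn(model):
    # Build a scipy-style linkage matrix: [child_1, child_2, distance, size].
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, (left, right) in enumerate(model.children_):
        for child in (left, right):
            # Indices below n_samples are leaves; larger ones are earlier merges.
            counts[i] += 1 if child < n_samples else counts[child - n_samples]
    return np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

# Random 0-1 demo data standing in for the ballot matrix.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(12, 6)).astype(float)

model_demo = AgglomerativeClustering(n_clusters=4, compute_distances=True).fit(X_demo)
Z_demo = linkage_from_sklearn(model_demo)
print(Z_demo.shape)
```

The resulting matrix can be handed to `scipy.cluster.hierarchy.dendrogram` to draw the tree.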

Of course, if you have a 0-1 matrix, you don’t have to cluster the rows — you can cluster the columns! This time, just for kicks, I used the hierarchical clustering package in scipy.  I think this one is just agglomerating too.  But it has a nice output package!  Here, Y is the transpose of X above, a 0-1 matrix telling us which players were on which ballots.  Then:

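>>> import scipy.cluster.hierarchy
>>> import matplotlib.pyplot as plt
>>> Z = scipy.cluster.hierarchy.linkage(Y)
>>> Dend = scipy.cluster.hierarchy.dendrogram(Z, labels=player_names)  # player_names: a list of player names
>>> plt.xticks(ha='right')
>>> plt.show()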

gives

Not bad! You can see that Bonds and Clemens form their own little cluster in red.  There is not that much structure here — maybe worth noting that this method may be dominated by the variable “number of votes received.”  Still, the two stories we’ve told here do seem to have some coarse features in common:  Bonds/Clemens are a cluster, and Larry Walker voting is kind of its own variable independent of the rest of the ballot.

OK, this picture was nice so I couldn’t resist doing one for the voters:

Pretty hard to read!  I think that black cluster on the end is probably the no-Bonds-no-Clemens gang.  But what I want you to notice is that there’s one outlying node all the way over to the left, which the clustering algorithm identifies as the weirdest ballot made public.  It’s Sadiel LeBron, who voted for Clemens, Sosa, and Ramirez, but not Bonds.  And also he voted for Andruw Jones and Omar Vizquel!  Yeah, that’s a weird ballot.
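If you wanted to hunt for the weirdest ballot without squinting at a dendrogram, one crude stand-in is to find the ballot whose nearest neighbor in Hamming distance is farthest away.  This is a sketch of that idea, not what the dendrogram computes exactly; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def most_isolated_row(X):
    X = np.asarray(X)
    # Pairwise Hamming distances between all rows.
    dist = (X[:, None, :] != X[None, :, :]).sum(axis=2)
    # Make each row's distance to itself too large to be picked as the minimum.
    np.fill_diagonal(dist, dist.max() + 1)
    nearest = dist.min(axis=1)
    return int(nearest.argmax()), nearest

# Toy ballots: the last row disagrees with everyone else.
toy = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
])

idx, nearest = most_isolated_row(toy)
print(idx, nearest)
```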

I’m sure this isn’t the right way to visualize a 0-1 matrix.  Here’s what I’d do if I felt like spending a little more time:  write something up to look for a rank-2 matrix A, with rows indexed by voters and columns indexed by players, such that

$A_{ij} > A_{ik}$

whenever voter i voted for player j but not player k.  That models the idea of each player being a point in R^2 and each voter being a point in R^2 and then voters vote for every player whose dot product with them is large enough.  My intuition is that this would be a better way of plotting ballots in a plane than just doing PCA on the 0-1 matrix.  But maybe it would come out roughly the same, who knows?
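Here is one way that search might be sketched.  I haven't run this on the ballots, and it swaps the hard inequalities for a softer target: put each voter and each player at a point in R^2, score each pair by the dot product, and run gradient descent on a logistic penalty for every (voted-for, not-voted-for) pair that is in the wrong order.  The function name, step size, number of steps, and planted test data are all assumptions made for illustration.

```python
import numpy as np

def fit_rank2(X, steps=1000, lr=1.0, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_voters, n_players = X.shape
    U = 0.1 * rng.standard_normal((n_voters, 2))   # voter points
    V = 0.1 * rng.standard_normal((n_players, 2))  # player points
    # mask[i, j, k] = 1 when voter i voted for player j but not player k.
    mask = X[:, :, None] * (1.0 - X[:, None, :])
    n_pairs = max(mask.sum(), 1.0)
    losses = []
    for _ in range(steps):
        A = U @ V.T                                # rank-2 score matrix
        diff = A[:, :, None] - A[:, None, :]       # A_ij - A_ik
        # Mean logistic penalty log(1 + exp(-(A_ij - A_ik))) over masked pairs.
        losses.append(float((mask * np.logaddexp(0.0, -diff)).sum() / n_pairs))
        # sigmoid(-diff), written with tanh for numerical stability.
        G = mask * 0.5 * (1.0 - np.tanh(diff / 2.0)) / n_pairs
        # Gradient of the loss with respect to the score matrix A.
        dA = G.sum(axis=1) - G.sum(axis=2)
        U, V = U - lr * dA @ V, V - lr * dA.T @ U
    return U, V, losses

# Planted example: ballots generated from a true rank-2 model.
rng = np.random.default_rng(1)
U0 = rng.standard_normal((40, 2))
V0 = rng.standard_normal((15, 2))
X_planted = (U0 @ V0.T > 0).astype(float)

U_fit, V_fit, losses = fit_rank2(X_planted)
print(losses[0], losses[-1])
```

The rows of `U_fit` and `V_fit` are then the voter and player coordinates you would scatter-plot in the plane.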

Presumably there are known best practices for clustering subsets of a fixed set (or, more generally, finding good embeddings into visualizable metric spaces like the plane).  Tell me about them!

## The steroid era is over but nobody told the hitters

Via the invaluable High Heat Stats, a chart of home runs as a proportion of plate appearances over time.  Home run hitting increased sharply in the mid-1990s, and has stayed elevated ever since.  If you think that increase marks the beginning of “The Steroid Era,” do you think The Steroid Era is still going on?  Because home runs are still flying out of the park at an elevated rate.

There’s good stuff on this at Fangraphs, too.
