Hall of Fame ballots: some quick and dirty clustering

Since all the public Hall of Fame ballots are now available online in machine-readable form, thanks to Ryan Thibodeaux, I thought I’d mess around with the built-in clustering functions in sklearn and see what I could see.

The natural metric on ballots is Hamming distance, I guess.   I first tried the AgglomerativeClustering package.  I didn’t tell it what metric to use on the ballots, but I’m assuming it’s using Hamming distance, aka Euclidean in this case.  I asked AgglomerativeClustering to break the Hall of Fame voters into 2 clusters.  Guess what I found?  There’s a cluster of 159 voters who almost entirely voted for Barry Bonds and Roger Clemens, and a cluster of 83 who universally didn’t.  You won’t be surprised to hear that those who voted for Bonds and Clemens were also a lot more likely to vote for Manny Ramirez, Sammy Sosa, and Curt Schilling than the other cluster.

Which candidate was most notably unpopular among the Bonds-forgivers?  That would be Omar Vizquel.  He was on 53% of the steroid-rejector ballots!  Only 24% of the Bonds cluster thought Omar deserved Cooperstown.

Then I tried asking AgglomerativeClustering for four clusters.  The 83 anti-steroids folks all stayed together.  But the bigger group now split into Cluster 1 (61 ballots), Cluster 2 (46), and Cluster 3 (52).  Cluster 1 is the Joe Posnanski cluster.  Cluster 2 is the Nick Cafardo cluster.  Cluster 3 is the Tim Kurkjian cluster.

What differentiates these?  Cluster 1 is basically “people who voted for Larry Walker.”  The difference between Cluster 2 and Cluster 3 is more complicated.  The Cluster 2 ballots were much more likely to have:

Manny Ramirez, Sammy Sosa

and much less likely to have

Mike Mussina, Edgar Martinez, Curt Schilling

I’m not sure how to read this!  My best guess is that there’s no animus towards pitchers and DHs here; if you’re voting for Bonds and Clemens and Sosa and Ramirez and the guys who basically everybody voted for, you just don’t have that many votes left.  So I’d call Cluster 2 the “2000s-slugger loving cluster” and Cluster 3 everybody else.

Maybe I should say how you actually do this?  OK, first of all you munge the spreadsheet until you have a 0-1 matrix X where the rows are voters and the columns are baseball players.  Then your code looks like:

import sklearn

model = AgglomerativeClustering(n_clusters=4)

modplay.labels_

which outputs

array([1, 0, 3, 1, 1, 1, 0, 0, 0, 0, 2, 1, 2, 1, 3, 0, 0, 0, 2, 1, 0, 3, 2,
1, 2, 1, 1, 3, 1, 3, 3, 0, 2, 2, 0, 1, 1, 1, 0, 2, 0, 0, 1, 2, 1, 3,
2, 2, 1, 3, 0, 2, 0, 3, 1, 0, 0, 2, 0, 2, 1, 2, 1, 0, 0, 0, 1, 0, 2,
0, 1, 1, 2, 0, 1, 3, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 1, 0,
0, 0, 3, 1, 1, 0, 1, 0, 3, 1, 3, 3, 2, 0, 2, 1, 0, 2, 2, 3, 2, 3, 1,
3, 0, 3, 1, 0, 2, 1, 0, 0, 0, 1, 3, 1, 1, 3, 2, 3, 3, 2, 2, 0, 3, 3,
1, 0, 0, 2, 2, 3, 1, 3, 1, 2, 0, 1, 3, 1, 0, 0, 2, 3, 0, 2, 1, 0, 2,
1, 3, 3, 0, 1, 3, 1, 1, 0, 0, 2, 0, 1, 2, 0, 2, 1, 0, 0, 3, 3, 1, 1,
2, 3, 2, 0, 2, 0, 0, 1, 2, 1, 0, 3, 1, 3, 0, 3, 3, 3, 0, 3, 3, 3, 0,
2, 0, 3, 3, 0, 1, 0, 1, 2, 3, 2, 2, 0, 0, 0, 1, 3, 3, 1, 0, 0, 1, 3,
0, 2, 3, 1, 0, 0, 0, 0, 0, 3, 3, 3])

i.e. a partition of the voters into four groups.

(Agglomerative clustering naturally generates a hierarchical clustering, i.e. a tree with the HoF voters on the leaves; there must be some way to get sklearn to output this directly, but I don’t know it!

Of course, if you have a 0-1 matrix, you don’t have to cluster the rows — you can cluster the columns! This time, just for kicks, I used the hierarchical clustering package in scipy.  I think this one is just agglomerating too.  But it has a nice output package!  Here, Y is the transpose of X above, a 0-1 matrix telling us which players were on which ballots.  Then:

>> import scipy
>>> Z = scipy.cluster.hierarchy.linkage(Y)
>>> Dend = scipy.cluster.hierarchy.dendrogram(Z,labels=(a list of player names))
>>> plt.xticks(ha=’right’)
>>> plt.show()

gives

Not bad! You can see that Bonds and Clemens form their own little cluster in red.  There is not that much structure here — maybe worth noting that this method may be dominated by the variable “number of votes received.”  Still, the two stories we’ve told here do seem to have some coarse features in common:  Bonds/Clemens are a cluster, and Larry Walker voting is kind of its own variable independent of the rest of the ballot.

OK, this picture was nice so I couldn’t resist doing one for the voters:

Pretty hard to read!  I think that black cluster on the end is probably the no-Bonds-no-Clemens gang.  But what I want you to notice is that there’s one outlying node all the way over to the left, which the clustering algorithm identifies as the weirdest ballot made public.  It’s Sadiel LeBron, who voted for Clemens, Sosa, and Ramirez, but not Bonds.  And also he voted for Andruw Jones and Omar Vizquel!  Yeah, that’s a weird ballot.

I’m sure this isn’t the right way to visualize a 0-1 matrix.  Here’s what I’d do if I felt like spending a little more time:  write something up to look for a positive definite rank-2 matrix A such that

A_{ij} > A_{ik}

whenever voter i voted for player j but not player k.  That models the idea of each player being a point in R^2 and each voter being a point in R^2 and then voters vote for every player whose dot product with them is large enough.  My intuition is that this would be a better way of plotting ballots in a plane than just doing PCA on the 0-1 matrix.  But maybe it would come out roughly the same, who knows?

Presumably there are known best practices for clustering subsets of a fixed set (or, more generally, finding good embeddings into visualizable metric spaces like the plane.)  Tell me about them!

 

Tagged , , , ,

5 thoughts on “Hall of Fame ballots: some quick and dirty clustering

  1. Tom Church says:

    “Cluster 1 is the Joe Posnanski cluster. Cluster 2 is the Nick Cafardo cluster. Cluster 3 is the Tim Kurkjian cluster.” is a bit confusing if one doesn’t know that Posnanski, Cafardo, and Kurkjian are journalists rather than players, since the next paragraph then doesn’t make sense.

  2. JSE says:

    Sorry — usually nobody reads my baseball posts so I figured the die-hards who got that far would know those names…!

  3. Even though this is a 0-1 matrix, another thing to try is PCA: compute the correlation matrix for the players C=\frac{1}{m}A^t.A (perhaps subtract the mean first) where m is the number of voters.

    I guess the first few eigenvectors would show the same patterns you found.

  4. JSE says:

    I actually did that but then I accidentally closed the window and lost it and felt too lazy to do it all again so I could post the pic. But yep, projecting on the two biggest eigenvalues you see the two big clusters (presumably pro-Bonds/Clemens and anti) and you do see one dot that’s kind of isolated which I’m guessing was LeBron.

  5. mcsc2016 says:

    I guess that there’s more information on how to cluster the votes given: (a) Journalists’ votes on previous years (b) Players’ votes on previous years (c) Years in which player is in the ballot, given how some will adapt their votes to prevailing public sentiment (e.g. Tim Raines)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: