## Hall of Fame ballots: some quick and dirty clustering

Since all the public Hall of Fame ballots are now available online in machine-readable form, thanks to Ryan Thibodeaux, I thought I’d mess around with the built-in clustering functions in sklearn and see what I could see.

The natural metric on ballots is Hamming distance, I guess.  I first tried sklearn’s AgglomerativeClustering.  I didn’t tell it what metric to use on the ballots, so it used its defaults: Euclidean distance with Ward linkage.  On 0-1 vectors, squared Euclidean distance is exactly Hamming distance, so that’s close to what I wanted.  I asked AgglomerativeClustering to break the Hall of Fame voters into 2 clusters.  Guess what I found?  There’s a cluster of 159 voters who almost entirely voted for Barry Bonds and Roger Clemens, and a cluster of 83 who universally didn’t.  You won’t be surprised to hear that those who voted for Bonds and Clemens were also a lot more likely to vote for Manny Ramirez, Sammy Sosa, and Curt Schilling than the other cluster.
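
Here is a quick check, on two made-up ballots over six hypothetical candidates, of how Hamming and Euclidean distance relate on 0-1 vectors.  The ballots are invented for illustration and are not from the real data.

```python
import numpy as np

# Two made-up ballots over 6 hypothetical candidates (1 = voted for).
a = np.array([1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 0, 1, 1, 0])

hamming = int(np.sum(a != b))              # number of candidates they disagree on
sq_euclidean = int(np.sum((a - b) ** 2))   # squared Euclidean distance
euclidean = float(np.sqrt(sq_euclidean))   # plain Euclidean distance

print(hamming, sq_euclidean, euclidean)
```

Since each coordinate difference is 0 or 1, squaring changes nothing, so the squared Euclidean distance just counts disagreements.  Plain Euclidean distance is the square root of the Hamming distance, which puts pairs of ballots in the same order.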

Which candidate was most notably unpopular among the Bonds-forgivers?  That would be Omar Vizquel.  He was on 53% of the steroid-rejector ballots!  Only 24% of the Bonds cluster thought Omar deserved Cooperstown.

Then I tried asking AgglomerativeClustering for four clusters.  The 83 anti-steroids folks all stayed together.  But the bigger group now split into Cluster 1 (61 ballots), Cluster 2 (46), and Cluster 3 (52).  Cluster 1 is the Joe Posnanski cluster.  Cluster 2 is the Nick Cafardo cluster.  Cluster 3 is the Tim Kurkjian cluster.

What differentiates these?  Cluster 1 is basically “people who voted for Larry Walker.”  The difference between Cluster 2 and Cluster 3 is more complicated.  The Cluster 2 ballots were much more likely to have:

Manny Ramirez, Sammy Sosa

and much less likely to have

Mike Mussina, Edgar Martinez, Curt Schilling

I’m not sure how to read this!  My best guess is that there’s no animus towards pitchers and DHs here; if you’re voting for Bonds and Clemens and Sosa and Ramirez and the guys who basically everybody voted for, you just don’t have that many votes left.  So I’d call Cluster 2 the “2000s-slugger loving cluster” and Cluster 3 everybody else.

Maybe I should say how you actually do this?  OK, first of all you munge the spreadsheet until you have a 0-1 matrix X where the rows are voters and the columns are baseball players.  Then your code looks like:

from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=4)

modplay = model.fit(X)  # fit returns the fitted model

modplay.labels_

which outputs

array([1, 0, 3, 1, 1, 1, 0, 0, 0, 0, 2, 1, 2, 1, 3, 0, 0, 0, 2, 1, 0, 3, 2,
1, 2, 1, 1, 3, 1, 3, 3, 0, 2, 2, 0, 1, 1, 1, 0, 2, 0, 0, 1, 2, 1, 3,
2, 2, 1, 3, 0, 2, 0, 3, 1, 0, 0, 2, 0, 2, 1, 2, 1, 0, 0, 0, 1, 0, 2,
0, 1, 1, 2, 0, 1, 3, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 1, 0,
0, 0, 3, 1, 1, 0, 1, 0, 3, 1, 3, 3, 2, 0, 2, 1, 0, 2, 2, 3, 2, 3, 1,
3, 0, 3, 1, 0, 2, 1, 0, 0, 0, 1, 3, 1, 1, 3, 2, 3, 3, 2, 2, 0, 3, 3,
1, 0, 0, 2, 2, 3, 1, 3, 1, 2, 0, 1, 3, 1, 0, 0, 2, 3, 0, 2, 1, 0, 2,
1, 3, 3, 0, 1, 3, 1, 1, 0, 0, 2, 0, 1, 2, 0, 2, 1, 0, 0, 3, 3, 1, 1,
2, 3, 2, 0, 2, 0, 0, 1, 2, 1, 0, 3, 1, 3, 0, 3, 3, 3, 0, 3, 3, 3, 0,
2, 0, 3, 3, 0, 1, 0, 1, 2, 3, 2, 2, 0, 0, 0, 1, 3, 3, 1, 0, 0, 1, 3,
0, 2, 3, 1, 0, 0, 0, 0, 0, 3, 3, 3])

i.e. a partition of the voters into four groups.
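
Once you have the labels, the "which players separate the clusters" comparisons are just column means within each cluster.  Here is a minimal sketch on random stand-in data; the player names `P1`..`P5` and the matrix are made up, so the output says nothing about the real ballots.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Stand-in data: 40 made-up ballots over 5 hypothetical players.
players = ["P1", "P2", "P3", "P4", "P5"]
X = rng.integers(0, 2, size=(40, len(players)))

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Fraction of each cluster's ballots that name each player.
rates = np.vstack([X[labels == c].mean(axis=0) for c in range(2)])

# List players from "most favored by cluster 1" to "most favored by cluster 0".
gap = rates[0] - rates[1]
for j in np.argsort(gap):
    print(f"{players[j]}: cluster0={rates[0, j]:.2f} cluster1={rates[1, j]:.2f}")
```

The players at the two ends of that list are the ones that best distinguish the clusters, which is how a comparison like 53% versus 24% would show up.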

(Agglomerative clustering naturally generates a hierarchical clustering, i.e. a tree with the HoF voters on the leaves; there must be some way to get sklearn to output this directly, but I don’t know it!)
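
For what it's worth, one route to the tree in sklearn seems to be the fitted model's `children_` attribute.  This is a sketch on made-up ballots, assuming a reasonably recent sklearn; setting `distance_threshold=0` with `n_clusters=None` asks it to build the full merge tree.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(12, 6))  # made-up 0-1 ballots: 12 voters x 6 players

# distance_threshold=0 with n_clusters=None makes sklearn compute the whole tree.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)

n = X.shape[0]
# children_[i] holds the two nodes merged at step i.  Ids below n are voters;
# id n + j refers to the cluster created at merge step j.
for step, (left, right) in enumerate(model.children_):
    print(f"step {step}: merge {left} and {right} -> node {n + step}")
```

With n voters there are n - 1 merges, and the last one is the root of the tree.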

Of course, if you have a 0-1 matrix, you don’t have to cluster the rows — you can cluster the columns! This time, just for kicks, I used the hierarchical clustering package in scipy.  I think this one is just agglomerating too.  But it has a nice output package!  Here, Y is the transpose of X above, a 0-1 matrix telling us which players were on which ballots.  Then:

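>>> import scipy.cluster.hierarchy
>>> import matplotlib.pyplot as plt
>>> # Z is the linkage (merge tree) built from Y; default linkage settings assumed here
>>> Z = scipy.cluster.hierarchy.linkage(Y)
>>> Dend = scipy.cluster.hierarchy.dendrogram(Z, labels=player_names)  # player_names: a list of player names
>>> plt.xticks(ha='right')
>>> plt.show()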

gives

Not bad! You can see that Bonds and Clemens form their own little cluster in red.  There is not that much structure here — maybe worth noting that this method may be dominated by the variable “number of votes received.”  Still, the two stories we’ve told here do seem to have some coarse features in common:  Bonds/Clemens are a cluster, and Larry Walker voting is kind of its own variable independent of the rest of the ballot.

OK, this picture was nice so I couldn’t resist doing one for the voters:

Pretty hard to read!  I think that black cluster on the end is probably the no-Bonds-no-Clemens gang.  But what I want you to notice is that there’s one outlying node all the way over to the left, which the clustering algorithm identifies as the weirdest ballot made public.  It’s Sadiel LeBron, who voted for Clemens, Sosa, and Ramirez, but not Bonds.  And also he voted for Andruw Jones and Omar Vizquel!  Yeah, that’s a weird ballot.

I’m sure this isn’t the right way to visualize a 0-1 matrix.  Here’s what I’d do if I felt like spending a little more time:  write something up to look for a rank-2 matrix A, with rows indexed by voters and columns by players, such that

$A_{ij} > A_{ik}$

whenever voter i voted for player j but not player k.  That models the idea of each player being a point in R^2 and each voter being a point in R^2 and then voters vote for every player whose dot product with them is large enough.  My intuition is that this would be a better way of plotting ballots in a plane than just doing PCA on the 0-1 matrix.  But maybe it would come out roughly the same, who knows?
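
To make the condition concrete, here is a sketch that measures how badly a candidate matrix A violates it.  Everything here is an assumption for illustration: the ballots are generated from made-up points in R^2, with each voter naming the three players with the largest dot product, and the truncated SVD is only a stand-in baseline, not a solver for the problem above.

```python
import numpy as np

def violation_rate(X, A):
    """Fraction of (voter i, voted-for j, not-voted-for k) triples with A[i, j] <= A[i, k]."""
    bad = total = 0
    for i in range(X.shape[0]):
        yes = A[i, X[i] == 1]
        no = A[i, X[i] == 0]
        # Compare every voted-for score with every not-voted-for score.
        bad += int(np.sum(yes[:, None] <= no[None, :]))
        total += yes.size * no.size
    return bad / total if total else 0.0

rng = np.random.default_rng(0)

# Made-up ground truth: 30 voters and 8 players as points in R^2.
U_true = rng.normal(size=(30, 2))
V_true = rng.normal(size=(8, 2))
scores = U_true @ V_true.T            # a rank-2 voters-by-players matrix

# Each voter names the 3 players with the largest dot product.
X = np.zeros(scores.shape, dtype=int)
top3 = np.argsort(scores, axis=1)[:, -3:]
np.put_along_axis(X, top3, 1, axis=1)

# The generating matrix satisfies every constraint by construction.
print(violation_rate(X, scores))

# Baseline: best rank-2 least-squares approximation of X (truncated SVD).
U, s, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
A_svd = (U[:, :2] * s[:2]) @ Vt[:2]
print(violation_rate(X, A_svd))
```

The gap between the two rates is one rough way to see whether an SVD-style picture and the ordering-based picture come out about the same on a given set of ballots.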

Presumably there are known best practices for clustering subsets of a fixed set (or, more generally, finding good embeddings into visualizable metric spaces like the plane).  Tell me about them!

## Stephen Burt interviewed in Publishers Weekly

Steve Burt interviewed in the PW series, “The Art of the Review:”

Classes can reveal the properties of their members more fully (to understand the differences between calcium and magnesium, for example, you should know why they are both alkaline earths) but classes can also obscure them (the Pagans and the Germs were both American punk rock bands, but to me their songs sound nothing alike). Classes should be used with care everywhere; there’s probably no way to fully avoid them.

But you aren’t asking about classes in general; you are asking why poetry critics and reviewers seem to classify and classify, whereas fiction reviews try to avoid it. Perhaps it’s because few books of poetry can count on a buzz produced by their authors, or by a publicity campaign, or by grassroots, independent-bookstore-sales-driven chatter, all of which can justify (to assigning editors, to casual readers) space and time for extensive reviews of single volumes. Poetry reviewers, poetry critics, even very academic ones, need other pegs on which to hang their claims.

Novelists, necessarily, work in sustained solitude, when they are working (however gregarious they become otherwise), whereas poets can work in solitude in short bursts and then come together to discuss—and make programs and slogans about—what they made.

Poets also seem to attach themselves and their work more often either to their peer group, or to their teachers; some poets can tell you where and with whom they studied almost in the way that classical musicians can tell you about their teachers, and their teachers’ teachers.  If novelists do that, I haven’t seen it.

For more, buy Steve’s book, Close Calls With Nonsense.

## Active clustering

Just a note — the paper “Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities,” by Eriksson, Dasarathy, Singh, and Nowak, describing the algorithm which I (but not they) refer to as “compressed clustering,” is now on the arXiv.

## Narcissistic personality disorder, the NRC rankings, and finite metric spaces in Slate

I have a piece in Slate today about the classification of personality disorders in the new DSM, and the NRC graduate school rankings.  OK, they don’t really let me mention finite metric spaces in Slate.  But that’s what’s going on behind the lines, and it’s a problem I’ve been wrestling with.  Let’s say you have a finite metric space M; that is, a finite set of points with assigned distances between them.  Now there’s a whole world of algorithms (multidimensional scaling and its many cousins) to embed M in a Euclidean space of some reasonably small dimension without messing up the metric too much.  And there’s a whole world of hierarchical clustering algorithms that embed M in the set of leaves of a tree.

But I don’t really know a principled way to decide which one of these things to do.
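
To make the two options concrete, here is a sketch that applies both to the same tiny finite metric space.  The distance matrix is made up, and classical MDS and average linkage are just one representative from each family.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# A made-up finite metric space on 5 points: two tight groups, far apart.
D = np.array([
    [0, 1, 1, 6, 6],
    [1, 0, 1, 6, 6],
    [1, 1, 0, 6, 6],
    [6, 6, 6, 0, 1],
    [6, 6, 6, 1, 0],
], dtype=float)

# Option 1: classical multidimensional scaling into the plane.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix implied by D
vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
top = np.argsort(vals)[::-1][:2]             # indices of the two largest
coords = vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))

# Option 2: hierarchical clustering, so the points become leaves of a tree.
Z = linkage(squareform(D), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups

print(coords.round(2))
print(groups)
```

On a space this clean both pictures tell the same story; the hard cases are the ones where they disagree.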

Stuff there wasn’t room for in the piece:  I should have mentioned Ian Hacking’s book Mad Travelers, which gives a very rich humanistic account of the process by which categories of mental illness are generated.  And when I talked about the difficulty of crushing a finite metric down to one dimension, I should have linked to Cosma Shalizi’s “g, a statistical myth.”

## MALBEC: Jerry Zhu, Michael Coen, how to say snake in gibbon

Jerry Zhu will give the  last MALBEC seminar of the year tomorrow (Wednesday) afternoon, at 4pm, in Van Vleck B102:

Jerry Zhu (UW, computer sciences)

HAMLET (Human, Animal, and Machine Learning: Experiment and Theory)

Machine learning studies the principles governing all learning systems. Human beings and animals are learning systems too, and can be explored using the same mathematical tools.  This approach has been fruitful in the last few decades with standard tools such as reinforcement learning, artificial neural networks, and non-parametric Bayesian statistics.  We bring the approach one step further with some latest tools in machine learning, and uncover new quantitative findings.  In this talk, I will present three examples: (1) Human semi-supervised learning. Consider a child learning animal names.  Dad occasionally points to an animal and says “Dog!” (labeled data). But mostly the child observes the world by herself without explicit feedback (unlabeled data).  We show that humans learn from both labeled and unlabeled data, and that a simple Gaussian Mixture Model trained using the EM algorithm provides a nice fit to human behaviors.  (2) Human active learning.  The child may ask “What’s that?”, i.e. actively selecting items to query the target labels.  We show that humans are able to perform good active learning, achieving fast exponential error convergence as predicted by machine learning theory.  In contrast, when passively given i.i.d. training data humans learn much slower (polynomial convergence), also predicted by learning theory.  (3) Monkey online learning.  Rhesus monkeys can learn a “target concept”, in the form of a certain shape or color.  What if the target concept keeps changing?  Adversarial online learning model provides a polynomial mistake bound.  Although monkeys perform worse than theory, anecdotal evidence suggests that they follow the concepts better than some graduate students. Finally, I will speculate on a few lessons learned in order to create better machine learning algorithms.

In the third MALBEC lecture, Michael Coen talked about his work on clustering; he asked me afterwards whether I thought the talk was “mathy enough” for the audience, which was funny, because I thought it was 100% math from start to finish!  Here’s a cartoon of the main idea.  When presented with a giant set of data points, one of the first things you might want to do is cluster it:  that is, partition the points into some disjoint collection of subsets, each one of which consists of points which resemble their clustermates more than they do the points in the other clusters.  You might, for instance, want to identify clusters among U.S. legislators, or images, or gene expression patterns. As is so often the case, Cosma Shalizi supplies a good, succinct introduction to the topic from a statistician’s perspective.

How do you know when your clustering algorithm is good?  Sometimes there’s a natural way to evaluate; if your algorithm for clustering legislators reliably separates Democrats from Republicans, you’re probably doing something right.  But with other data, you might not have any pre-existing classification that helps you gauge the reasonableness of your clustering.  Let’s say, for instance, you have lots of short recordings of a gibbon; maybe you think that rather than being scattered structurelessly around the space of 1-second sound clips, they fall into a small finite set of clusters, which you would naturally be tempted to call phonemes. You can run a clustering algorithm on the clips, and you’ll get an answer.  But is it meaningful?  It’s hard to tell without a population of clips which are classified in advance.  Unfortunately, there’s no corpus of written gibbon texts which you can ask gibbons to read aloud.  So you have to do something else.

The trick, as Coen observes, is to replace the difficult and not-very-well-defined question “Is clustering X good?” with the much more tractable question “Are clusterings X and Y similar?”  Coen presented a really nice, principled way of answering this latter question, which allows him to do something like the following:  given your set of audio clips, apply your clustering algorithm separately  to two random samples of 50% of the data points.  These two samples will overlap in around 25% of the data.  Now you can use Coen’s measure to compare the two clusterings induced on this 25% subsample.  If you do this a lot, and you always get two clusterings which are almost exactly the same in Coen’s sense, that’s a good sign that your clustering algorithm is actually capturing real features of the data.
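
The subsampling procedure can be sketched as follows.  Two loud assumptions: Coen's similarity measure isn't described here, so the adjusted Rand index stands in for it, and the data are made-up blobs in the plane rather than sound clips.  The choice of AgglomerativeClustering is also just a placeholder for "your clustering algorithm."

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Made-up stand-in for the clips: three well-separated blobs in R^2.
centers = np.array([[0, 0], [10, 0], [0, 10]])
data = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

def stability(data, n_clusters, trials=20):
    """Average agreement between clusterings of two random half-samples."""
    n = len(data)
    scores = []
    for _ in range(trials):
        # Two independent random samples, each containing half the points.
        s1 = rng.choice(n, size=n // 2, replace=False)
        s2 = rng.choice(n, size=n // 2, replace=False)
        lab1 = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(data[s1])
        lab2 = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(data[s2])
        # Compare the two clusterings on the points both samples contain.
        common = np.intersect1d(s1, s2)
        pos1 = {int(idx): p for p, idx in enumerate(s1)}
        pos2 = {int(idx): p for p, idx in enumerate(s2)}
        a = [lab1[pos1[int(i)]] for i in common]
        b = [lab2[pos2[int(i)]] for i in common]
        # Adjusted Rand index: a stand-in here, not Coen's measure.
        scores.append(adjusted_rand_score(a, b))
    return float(np.mean(scores))

print(stability(data, n_clusters=3))
```

A score near 1 across many trials means the two half-sample clusterings keep agreeing on their shared points, which is the "good sign" described above.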

So it turns out that gibbon utterances really do seem to be organized into phonemes.  (A cursory google search suggests that this contradicts conventional wisdom about primate vocalization — can any primatologists weigh in?)  Once you have this finding, and the ability to classify the sound clips, you can do some remarkable things:  for instance, you can look at what combinations of phonemes gibbons emit when a snake comes by.  It turns out that the vocalization elicited by a snake isn’t a consistent combination of phonemes, as it would be in a human language.  Rather, you can write down a finite state automaton, any one of whose outputs seems to be a legitimate gibbon word for “snake”!

Coen had a picture of the automaton on a slide, which is truly cool, but which he is keeping to himself until the paper’s published.  I promise to tell you exactly how to say “snake” in gibbon in a later post.