## Hall of Fame ballots: some quick and dirty clustering

Since all the public Hall of Fame ballots are now available online in machine-readable form, thanks to Ryan Thibodeaux, I thought I’d mess around with the built-in clustering functions in sklearn and see what I could see.

The natural metric on ballots is Hamming distance, I guess.   I first tried the AgglomerativeClustering package.  I didn’t tell it what metric to use on the ballots, but I’m assuming it’s using Hamming distance, aka Euclidean in this case.  I asked AgglomerativeClustering to break the Hall of Fame voters into 2 clusters.  Guess what I found?  There’s a cluster of 159 voters who almost entirely voted for Barry Bonds and Roger Clemens, and a cluster of 83 who universally didn’t.  You won’t be surprised to hear that those who voted for Bonds and Clemens were also a lot more likely to vote for Manny Ramirez, Sammy Sosa, and Curt Schilling than the other cluster.

Which candidate was most notably unpopular among the Bonds-forgivers?  That would be Omar Vizquel.  He was on 53% of the steroid-rejector ballots!  Only 24% of the Bonds cluster thought Omar deserved Cooperstown.

Then I tried asking AgglomerativeClustering for four clusters.  The 83 anti-steroids folks all stayed together.  But the bigger group now split into Cluster 1 (61 ballots), Cluster 2 (46), and Cluster 3 (52).  Cluster 1 is the Joe Posnanski cluster.  Cluster 2 is the Nick Cafardo cluster.  Cluster 3 is the Tim Kurkjian cluster.

What differentiates these?  Cluster 1 is basically “people who voted for Larry Walker.”  The difference between Cluster 2 and Cluster 3 is more complicated.  The Cluster 2 ballots were much more likely to have:

Manny Ramirez, Sammy Sosa

and much less likely to have

Mike Mussina, Edgar Martinez, Curt Schilling

I’m not sure how to read this!  My best guess is that there’s no animus towards pitchers and DHs here; if you’re voting for Bonds and Clemens and Sosa and Ramirez and the guys who basically everybody voted for, you just don’t have that many votes left.  So I’d call Cluster 2 the “2000s-slugger loving cluster” and Cluster 3 everybody else.

Maybe I should say how you actually do this?  OK, first of all you munge the spreadsheet until you have a 0-1 matrix X where the rows are voters and the columns are baseball players.  Then your code looks like:

import sklearn

model = AgglomerativeClustering(n_clusters=4)

modplay.labels_

which outputs

array([1, 0, 3, 1, 1, 1, 0, 0, 0, 0, 2, 1, 2, 1, 3, 0, 0, 0, 2, 1, 0, 3, 2,
1, 2, 1, 1, 3, 1, 3, 3, 0, 2, 2, 0, 1, 1, 1, 0, 2, 0, 0, 1, 2, 1, 3,
2, 2, 1, 3, 0, 2, 0, 3, 1, 0, 0, 2, 0, 2, 1, 2, 1, 0, 0, 0, 1, 0, 2,
0, 1, 1, 2, 0, 1, 3, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 1, 0,
0, 0, 3, 1, 1, 0, 1, 0, 3, 1, 3, 3, 2, 0, 2, 1, 0, 2, 2, 3, 2, 3, 1,
3, 0, 3, 1, 0, 2, 1, 0, 0, 0, 1, 3, 1, 1, 3, 2, 3, 3, 2, 2, 0, 3, 3,
1, 0, 0, 2, 2, 3, 1, 3, 1, 2, 0, 1, 3, 1, 0, 0, 2, 3, 0, 2, 1, 0, 2,
1, 3, 3, 0, 1, 3, 1, 1, 0, 0, 2, 0, 1, 2, 0, 2, 1, 0, 0, 3, 3, 1, 1,
2, 3, 2, 0, 2, 0, 0, 1, 2, 1, 0, 3, 1, 3, 0, 3, 3, 3, 0, 3, 3, 3, 0,
2, 0, 3, 3, 0, 1, 0, 1, 2, 3, 2, 2, 0, 0, 0, 1, 3, 3, 1, 0, 0, 1, 3,
0, 2, 3, 1, 0, 0, 0, 0, 0, 3, 3, 3])

i.e. a partition of the voters into four groups.

(Agglomerative clustering naturally generates a hierarchical clustering, i.e. a tree with the HoF voters on the leaves; there must be some way to get sklearn to output this directly, but I don’t know it!

Of course, if you have a 0-1 matrix, you don’t have to cluster the rows — you can cluster the columns! This time, just for kicks, I used the hierarchical clustering package in scipy.  I think this one is just agglomerating too.  But it has a nice output package!  Here, Y is the transpose of X above, a 0-1 matrix telling us which players were on which ballots.  Then:

>> import scipy
>>> Dend = scipy.cluster.hierarchy.dendrogram(Z,labels=(a list of player names))
>>> plt.xticks(ha=’right’)
>>> plt.show()

gives

Not bad! You can see that Bonds and Clemens form their own little cluster in red.  There is not that much structure here — maybe worth noting that this method may be dominated by the variable “number of votes received.”  Still, the two stories we’ve told here do seem to have some coarse features in common:  Bonds/Clemens are a cluster, and Larry Walker voting is kind of its own variable independent of the rest of the ballot.

OK, this picture was nice so I couldn’t resist doing one for the voters:

Pretty hard to read!  I think that black cluster on the end is probably the no-Bonds-no-Clemens gang.  But what I want you to notice is that there’s one outlying node all the way over to the left, which the clustering algorithm identifies as the weirdest ballot made public.  It’s Sadiel LeBron, who voted for Clemens, Sosa, and Ramirez, but not Bonds.  And also he voted for Andruw Jones and Omar Vizquel!  Yeah, that’s a weird ballot.

I’m sure this isn’t the right way to visualize a 0-1 matrix.  Here’s what I’d do if I felt like spending a little more time:  write something up to look for a positive definite rank-2 matrix A such that

$A_{ij} > A_{ik}$

whenever voter i voted for player j but not player k.  That models the idea of each player being a point in R^2 and each voter being a point in R^2 and then voters vote for every player whose dot product with them is large enough.  My intuition is that this would be a better way of plotting ballots in a plane than just doing PCA on the 0-1 matrix.  But maybe it would come out roughly the same, who knows?

Presumably there are known best practices for clustering subsets of a fixed set (or, more generally, finding good embeddings into visualizable metric spaces like the plane.)  Tell me about them!

## Kevin Jamieson, hyperparameter optimization, playoffs

Kevin Jamieson gave a great seminar here on Hyperband, his algorithm for hyperparameter optimization.

Here’s the idea.  Doing machine learning involves making a lot of choices.  You set up your deep learning neural thingamajig but that’s not exactly one size fits all:  How many layers do you want in your net?  How fast do you want your gradient descents to step?  And etc. and etc.  The parameters are the structures your thingamajig learns.  The hyperparameters are the decisions you make about your thingamajig before you start learning.  And it turns out these decisions can actually affect performance a lot.  So how do you know how to make them?

Well, one option is to pick N choices of hyperparameters at random, run your algorithm on your test set with each choice, and see how you do.  The problem is, thingamajigs take a long time to converge.  This is expensive to do, and when N is small, you’re not really seeing very much of hyperparameter space (which might have dozens of dimensions.)

A more popular choice is to place some prior on the function

F:[hyperparameter space] -> [performance on test set]

You make a choice of hyperparameters, you run the thingamajig, based on the output you update your distribution on F, based on your new distribution you choose a likely-to-be-informative hyperparameter and run again, etc.

This is called “Bayesian optimization of hyperparameters” — it works pretty well — but really only about as well as taking twice as many random choices of hyperparameters, in practice.  A 2x speedup is nothing to sneeze at, but it still means you can’t get N large enough to search much of the space.

Kevin thinks you should think of this as a multi-armed bandit problem.  You have a hyperparameter whose performance you’d like to judge.  You could run your thingamajig with those parameters until it seems to be converging, and see how well it does.  But that’s expensive.  Alternatively, you could run your thingamajig (1/c) times as long; then you have time to consider Nc values of the hyperparameters, much better.  But of course you have a much less accurate assessment of the performance:  maybe the best performer in that first (1/c) time segment is actually pretty bad, and just got off to a good start!

So you do this instead.  Run the thingamajig for time (1/c) on Nc values.  That costs you N.  Then throw out all values of the hyperparameters that came in below median on performance.  You still have (1/2)Nc values left, so continue running those processes for another time (1/c).  That costs you (1/2)N.  Throw out everything below the median.  And so on.  When you get to the end you’ve spent N log Nc, not bad at all but instead of looking at only N hyperparameters, you’ve looked at Nc, where c might be pretty big.  And you haven’t wasted lots of processor time following unpromising choices all the way to the end; rather, you’ve mercilessly culled the low performers along the way.

But how do you choose c?  I insisted to Kevin that he call c a hyperhyperparameter but he wasn’t into it.  No fun!  Maybe the reason Kevin resisted my choice is that he doesn’t actually choose c; he just carries out his procedure once for each c as c ranges over 1,2,4,8,…. N; this costs you only another log N.

In practice, this seems to find hyperparameters just as well as more fancy Bayesian methods, and much faster.  Very cool!  You can imagine doing the same things in simpler situations (e.g. I want to do a gradient descent, where should I start?) and Kevin says this works too.

In some sense this is how a single-elimination tournament works!  In the NCAA men’s basketball finals, 64 teams each play a game; the teams above the median are 1-0, while the teams below the median, at 0-1, get cut.  Then the 1-0 teams each play one more game:  the teams above the median at 2-0 stay, the teams below the median at 1-1 get cut.

What if the regular season worked like this?  Like if in June, the bottom half of major league baseball just stopped playing, and the remaining 15 teams duked it out until August, then down to 8… It would be impossible to schedule, of course.  But in a way we have some form of it:  at the July 31 trade deadline, teams sufficiently far out of the running can just give up on the season and trade their best players for contending teams’ prospects.  Of course the bad teams keep playing games, but in some sense, the competition has narrowed to a smaller field.

## Ranking mathematicians by hinge loss

As I mentioned, I’m reading Ph.D. admission files.  Each file is read by two committee members and thus each file has two numerical scores.

How to put all this information together into a preliminary ranking?

The traditional way is to assign to each applicant their mean score.  But there’s a problem: different raters have different scales.  My 7 might be your 5.

You could just normalize the scores by subtracting that rater’s overall mean.  But that’s problematic too.  What if one rater actually happens to have looked at stronger files?  Or even if not:  what if the relation between rater A’s scale and rater B’s scale isn’t linear?  Maybe, for instance, rater A gives everyone she doesn’t think should get in a 0, while rater A uses a range of low scores to express the same opinion, depending on just how unsuitable the candidate seems.

Here’s what I did last year.  If (r,a,a’) is a triple with r is a rater and a and a’ are two applicants, such that r rated a higher than a’, you can think of that as a judgment that a is more admittable than a’.  And you can put all those judgments from all the raters in a big bag, and then see if you can find a ranking of the applicants (or, if you like, a real-valued function f on the applicants) such that, for every judgment a > a’, we have f(a) > f(a’).

Of course, this might not be possible — two raters might disagree!  Or there might be more complicated incompatibilities generated by multiple raters.  Still, you can ask:  what if I tried to minimize the number of “mistakes”, i.e. the number of judgments in your bag that your choice of ranking contradicts?

Well, you can ask that, but you may not get an answer, because that’s a highly non-convex minimization problem, and is as far as we know completely intractable.

But here’s a way out, or at least a way part of the way out — we can use a convex relaxation.  Set it up this way.  Let V be the space of real-valued functions on applicants.  For each judgment j, let mistake_j(f) be the step function

mistake_j(f) = 1 if f(a) < f(a’) + 1

mistake_j(f) = 0 if f(a) >= f(a’) + 1

Then “minimize total number of mistakes” is the problem of minimizing

M = sum_j mistake_j(f)

over V.  And M is terribly nonconvex.  If you try to gradient-descend (e.g. start with a random ranking and then switch two adjacent applicants whenever doing so reduces the total number of mistakes) you are likely to get caught in a local minimum that’s far from optimal.  (Or at least that can happen; whether this typically actually happens in practice, I haven’t checked!)

So here’s the move:  replace mistake_j(f) with a function that’s “close enough,” but is convex.  It acts as a sort of tractable proxy for the optimization you’re actually after.  The customary choice here is the hinge loss:

hinge_j(f) = min(0, f(a)-f(a’) -1).

Then H := sum_j hinge_j(f) is a convex function on f, which you can easily minimize in Matlab or python.  If you can actually find an f with H(f) = 0, you’ve found a ranking which agrees with every judgment in your bag.  Usually you can’t, but that’s OK!  You’ve very quickly found a function H which does a decent job aggregating the committee scores. and which you can use as your starting point.

Now here’s a paper by Nihal Shah and Martin Wainwright commenter Dustin Mixon linked in my last ranking post.  It suggests doing something much simpler:  using a linear function as a proxy for mistake_j.  What this amounts to is:  score each applicant by the number of times they were placed above another applicant.  Should I be doing this instead?  My first instinct is no.  It looks like Shah and Wainwright assume that each pair of applicants is equally likely to be compared; I think I don’t want to assume that, and I think (but correct me if I’m wrong!) the optimality they get may not be robust to that?

Anyway, all thoughts on this question — or suggestions as to something totally different I could be doing — welcome, of course.

## Gendercycle: a dynamical system on words

By the way, here’s another fun word2vec trick.  Following Ben Schmidt, you can try to find “gender-neutralized synonyms” — words which are close to each other except for the fact that they have different gender connotations.   A quick and dirty way to “mascify” a word is to find its nearest neighbor which is closer to “he” than “she”:

def mascify(y): return [x[0] for x in model.most_similar(y,topn=200) if model.similarity(x[0],’she’) < model.similarity(x[0],’he’)][0]

“femify” is defined similarly.  We could put a threshold away from 0 in there, if we wanted to restrict to more strongly gender-coded words.

Anyway, if you start with a word and run mascify and femify alternately, you can ask whether you eventually wind up in a 2-cycle:  a pair of words which are each others gender counterparts in this loose sense.

e.g.

gentle -> easygoing -> chatty -> talkative -> chatty -> …..

So “chatty” and “talkative” are a pair, with “chatty” female-coded and “talkative” male-coded.

beautiful -> magnificent -> wonderful -> marvelous -> wonderful -> …

So far, I keep hitting 2-cycles, and pretty quickly, though I don’t see why a longer cycle wouldn’t be possible or likely.  Update:  Kevin in comments explains very nicely why it has to terminate in a 2-cycle!

Some other pairs, female-coded word first:

overjoyed / elated

strident / vehement

fearful / worried

furious / livid

distraught / despondent

hilarious / funny

exquisite / sumptuous

thought_provoking / insightful

Sometimes it’s basically giving the same word, in two different forms or with one word misspelled:

intuitive / intuitively

buoyant / bouyant

You can do this for names, too, though you have to make the “topn” a little longer to find matches.  I found:

Jamie / Chris

Deborah / Jeffrey

Fran / Pat

Mary / Joseph

Pretty good pairs!  Note that you hit a lot of gender-mixed names (Jamie, Chris, Pat), just as you might expect:  the male-biased name word2vec-closest to a female name is likely to be a name at least some women have!  You can deal with this by putting in a threshold:

>> def mascify(y): return [x[0] for x in model.most_similar(y,topn=1000) if model.similarity(x[0],’she’) < model.similarity(x[0],’he’) – 0.1][0]

This eliminates “Jamie” and “Pat” (though “Chris” still reads as male.)

Now we get some new pairs:

Jody / Steve (this one seems to have a big basis of attraction, it shows up from a lot of initial conditions)

Kasey / Zach

Peter / Catherine (is this a Russia thing?)

Nicola / Dominic

Alison / Ian

## Bobrowski-Kahle-Skraba on the null hypothesis in persistent homology

I really like persistent homology; it’s a very beautiful idea, a way to look for structure in data when you really don’t have any principled way to embed it in Euclidean space (or, even when it does come embedded in Euclidean space, to find the kind of structure that doesn’t depend too much on the embedding.)

But because I like it, I want to see it done well, so I have some minor complaints!

Complaint one:  Persistent homology, applied to H_0 only, is clustering, and we know a lot about clustering already.  (Update:  As commenters point out, this is really only so for persistent homology computed on the Vietoris-Rips complex of a point cloud, the “classical case…”!)  Not to say that the ideas of persistence can’t be useful here at all (I have some ideas about directed graphs I want to eventually work out) but my sense is that people are not craving new clustering algorithms.  I really like the work that tries to grapple with the topology of the data in its fullness; I was really charmed, for instance, by Ezra Miller’s piece about the persistent homology of fruit fly wings.  (There’s a lot of nice stuff about geometric probability theory, too — e.g., how do you take the “average” of a bunch of graded modules for k[x,y], which you may think of as noisy measurements of some true module you want to estimate?)

My second complaint is the lack of understanding of the null hypothesis.  You have some point cloud, you make a barcode, you see some bars that look long, you say they’re features — but why are you so sure?  How long would bars be under the null hypothesis that the data has no topological structure at all?  You kind of have to know this in order to do good inference.  Laura Balzano and I did a little numerical investigation of this years ago but now Omer Bobrowski, Matthew Kahle, and Primoz Skraba have proved a theorem!  (Kahle’s cool work in probabilistic topology has appeared several times before on Quomodocumque…)

They show that if you sample points from a uniform Poisson process on the unit cube of intensity n (i.e. you expect n points) the longest bar in the H_k barcode has

That is really short!  And it makes me feel like there actually is something going on, when you see a long barcode in practice.

## Raw polling data as playground

This is a picture of the American electorate!

More precisely; this is a scatterplot I just made using the dataset recently released by PPP, a major political polling firm.  (They’re the outfit that did the “is your state hot or not” poll I blogged about last week.)  PPP has made available the raw responses from 46 polls with 1000 responses each, conducted more or less weekly over the course of 2011.  Here’s the whole thing as a .zip file.

Analyzing data sets like this is in some sense not hard.  But there’s a learning curve.  Little things, like:  you have to know that the .csv format is beautifully portable and universal — it’s the ASCII of data.  You have to know how to get your .csv file into your math package of choice (in my case, python, but I think I could easily have done this in r or MatLab as well) and you have to know where to get a PCA package, if it’s not already installed.  And you have to know how to output a new .csv file and make a graphic from it when you’re done.  (As you can see, I haven’t quite mastered this last part, and have presented you with a cruddy Excel scatterplot.)  In total, this probably took me about three hours to do, and now that I have a data-to-picture path I understand how to use, I think I could do it again in about 30 minutes.  It’s fun and I highly recommend it.  There’s a lot of data out there.

So what is this picture?  The scatterplot has 1000 points, one for each person polled in the December 15, 2011 PPP survey.  The respondents answered a bunch of questions, mostly about politics:

Q1: Do you have a favorable or unfavorable opinion of Barack Obama?
Q2: Do you approve or disapprove of Barack Obama‚Äôs job performance?
Q3: Do you think Barack Obama is too liberal, too conservative, or about right?
Q4: Do you approve or disapprove of the job Harry Reid is doing?
Q5: Do you approve or disapprove of the job Mitch McConnell is doing?
Q6: Do you have a favorable or unfavorable opinion of the Democratic Party?
Q7: Do you have a favorable or unfavorable opinion of the Republican Party?
Q8: Generally speaking, if there was an election today, would you vote to reelect Barack Obama, or would you vote for his Republican opponent?
Q9: Are you very excited, somewhat excited, or not at all excited about voting in the 2012 elections?
Q10: If passed into law one version of immigration reform that people have discussed would secure the border and crack down on employers who hire illegal immigrants. It would also require illegal immigrants to register for legal immigration status, pay back taxes, and learn English in order to be eligible for U.S. citizenship. Do you favor or oppose Congress passing this version of immigration reform?
Q11: Have you heard about the \$10,000 bet Mitt Romney challenged Rick Perry to in last week‚Äôs Republican Presidential debate?
Q12: (Asked only of those who say ‘yes’ to Q11:) Did Romney‚Äôs bet make you more or less likely to vote for him next year, or did it not make a difference either way?
Q13: Do you believe that there’s a “War on Christmas” or not?
Q14: Do you consider yourself to be a liberal, moderate, or conservative?
Q15: Do you consider yourself to be a supporter of the Tea Party or not?
Q16: Are you or is anyone in your household a member of a labor union?
Q17: If you are a woman, press 1. If a man, press 2.
Q18: If you are a Democrat, press 1. If a Republican, press 2. If you are an independent or a member of another party, press 3.
Q19: If you are Hispanic, press 1. If white, press 2. If African American, press 3. If Asian, press 4. If you are an American Indian, press 5. If other, press 6.
Q20: (Asked only of people who say American Indian on Q19:) Are you enrolled in a federally recognized tribe?
Q21: If you are 18 to 29 years old, press 1. If 30 to 45, press 2. If 46 to 65, press 3. If you are older than 65, press 4.
Q22: What part of the country do you live in NOW – the Northeast, the Midwest, the South, or the West?
Q23: What is your household’s annual income?

The answers to these questions, which are coded as integers, now give us 1000 points in R^{23}.  Our eyes are not good at looking at point clouds in 23-dimensional space.  So it’s useful to project down to R^2, that mos bloggable of Euclidean spaces.  But how?  We could just look at two coordinates and see what we get.  But this requires careful choice.  Suppose I map the voters onto the plane via their answers to Q1 and Q2.  The problem is, almost everyone who has a favorable opinion of Barack Obama approves of his job performance, and vice versa.  Considering these two features is hardly better than considering only one feature.  Better would be to look at Q8 and Q21; these two variables are surely less correlated, and studying both together would give us good information on how support for Obama varies with age.  But still, we’re throwing out a lot.  Principal component analysis is a very popular quick-n-dirty method of dimension reduction; it finds the projection onto R^2 (or a Euclidean space of any desired dimension) which best captures the variance in the original dataset.  In particular, the two axes in the PCA projection have correlation zero with each other.

A projection from R^23 to R^2 can be expressed by two vectors, each one of which is some linear combination of the original 23 variables.  The hope is always that, when you stare at the entries of these vectors, the corresponding axis has some “meaning” that jumps out at you.  And that’s just what happens here.

The horizontal axis is “left vs. right.”  It assigns positive weight to approving of Obama, identifying as a liberal, and approving of the Democratic Party, and negative weight to supporting the Tea Party and believing in a “War on Christmas.”  It would be very weird if any analysis of this kind of polling data didn’t pull out political affiliation as the dominant determinant of poll answers.

The second axis is “low-information voter vs. high-information voter,” I think.  It assigns a negative value to all answers of the form “don’t know / won’t answer,” and positive value to saying you are “very excited to vote” and having heard about Mitt Romney’s \$10,000 bet.  (Remember that?)

And now the picture already tells you something interesting.  These two variables are uncorrelated, by definition, but they are not unrelated.  The voters split roughly into two clusters, the Democrats and the Republicans.  But the plot is “heart-shaped” — the farther you go into the low-information voters, the less polarization there is between the two parties, until in the lower third of the graph it is hard to tell there are two parties at all.  This phenomenon is not surprising — but I think it’s pretty cool that it pops right out of a completely automatic process.

(I am less sure about the third-strongest axis, which I didn’t include in the plot.  High scorers here, like low scorers on axis 2, tend to give a lot of “don’t know” answers, except when asked about Harry Reid and Mitch McConnell, whom they dislike.  They are more likely to say they’re “not at all excited to vote” and more likely to be independents.  So I think one might call this the “to hell with all those crooks” axis.)

A few technical notes:  I removed questions, like “region of residence,” that didn’t really map on a linear scale, and others, like “income,” that not everyone answered.  I normalized all the columns to have equal variance.  I made new 0-1-valued columns to record “don’t know” answers.  Yes, I know that many people consider it bad news to run PCA on binary variables, but I decided that since I was just trying to draw pictures and not infer anything, it would be OK.

## Why not data science?

Addendum to the previous post: if the goal — surely a worthwhile one — is to promote NSF-DMS funding for data sciences, why not change the name to Division of Mathematical and Data Sciences?  My experience at the very interesting “High-dimensional phenomena” workshop at IMA was that good work in this area is being done not only by self-described statisticians, but by mathematicians, computer scientists, and electrical engineers; it seems reasonable to use a name that doesn’t suggest the field is the property of a single academic department.

Also, a colleague points out to me that DMSS would inevitably pronounced “Dumbass.”  So there’s that.

Tagged , ,