## Hall of Fame ballots: some quick and dirty clustering

Since all the public Hall of Fame ballots are now available online in machine-readable form, thanks to Ryan Thibodeaux, I thought I’d mess around with the built-in clustering functions in sklearn and see what I could see.

The natural metric on ballots is Hamming distance, I guess.  I first tried sklearn’s AgglomerativeClustering class.  I didn’t tell it what metric to use on the ballots; the default is Euclidean distance with Ward linkage, and on 0-1 vectors squared Euclidean distance is the same thing as Hamming distance.  I asked AgglomerativeClustering to break the Hall of Fame voters into 2 clusters.  Guess what I found?  There’s a cluster of 159 voters who almost entirely voted for Barry Bonds and Roger Clemens, and a cluster of 83 who universally didn’t.  You won’t be surprised to hear that those who voted for Bonds and Clemens were also a lot more likely to vote for Manny Ramirez, Sammy Sosa, and Curt Schilling than the other cluster.
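A quick sanity check on the Hamming/Euclidean remark, using two made-up ballots (the vectors below are illustrative, not real votes): on 0-1 vectors, the squared Euclidean distance is just the number of candidates on which the two ballots disagree.

```python
import numpy as np

# Two hypothetical ballots over six candidates (1 = voted for, 0 = did not).
a = np.array([1, 1, 0, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1, 1])

hamming = int(np.sum(a != b))           # number of candidates where the ballots differ
sq_euclid = int(np.sum((a - b) ** 2))   # squared Euclidean distance

print(hamming, sq_euclid)
```

The two agree because each coordinate difference is 0 or ±1, so squaring it just records whether the ballots differ there.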

Which candidate was most notably unpopular among the Bonds-forgivers?  That would be Omar Vizquel.  He was on 53% of the steroid-rejector ballots!  Only 24% of the Bonds cluster thought Omar deserved Cooperstown.

Then I tried asking AgglomerativeClustering for four clusters.  The 83 anti-steroids folks all stayed together.  But the bigger group now split into Cluster 1 (61 ballots), Cluster 2 (46), and Cluster 3 (52).  Cluster 1 is the Joe Posnanski cluster.  Cluster 2 is the Nick Cafardo cluster.  Cluster 3 is the Tim Kurkjian cluster.

What differentiates these?  Cluster 1 is basically “people who voted for Larry Walker.”  The difference between Cluster 2 and Cluster 3 is more complicated.  The Cluster 2 ballots were much more likely to have:

Manny Ramirez, Sammy Sosa

and much less likely to have

Mike Mussina, Edgar Martinez, Curt Schilling

I’m not sure how to read this!  My best guess is that there’s no animus towards pitchers and DHs here; if you’re voting for Bonds and Clemens and Sosa and Ramirez and the guys who basically everybody voted for, you just don’t have that many votes left.  So I’d call Cluster 2 the “2000s-slugger loving cluster” and Cluster 3 everybody else.

Maybe I should say how you actually do this?  OK, first of all you munge the spreadsheet until you have a 0-1 matrix X where the rows are voters and the columns are baseball players.  Then your code looks like:

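from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=4).fit(X)  # X: voters-by-players 0-1 matrix

model.labels_  # cluster index for each voter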

which outputs

array([1, 0, 3, 1, 1, 1, 0, 0, 0, 0, 2, 1, 2, 1, 3, 0, 0, 0, 2, 1, 0, 3, 2,
1, 2, 1, 1, 3, 1, 3, 3, 0, 2, 2, 0, 1, 1, 1, 0, 2, 0, 0, 1, 2, 1, 3,
2, 2, 1, 3, 0, 2, 0, 3, 1, 0, 0, 2, 0, 2, 1, 2, 1, 0, 0, 0, 1, 0, 2,
0, 1, 1, 2, 0, 1, 3, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 1, 0,
0, 0, 3, 1, 1, 0, 1, 0, 3, 1, 3, 3, 2, 0, 2, 1, 0, 2, 2, 3, 2, 3, 1,
3, 0, 3, 1, 0, 2, 1, 0, 0, 0, 1, 3, 1, 1, 3, 2, 3, 3, 2, 2, 0, 3, 3,
1, 0, 0, 2, 2, 3, 1, 3, 1, 2, 0, 1, 3, 1, 0, 0, 2, 3, 0, 2, 1, 0, 2,
1, 3, 3, 0, 1, 3, 1, 1, 0, 0, 2, 0, 1, 2, 0, 2, 1, 0, 0, 3, 3, 1, 1,
2, 3, 2, 0, 2, 0, 0, 1, 2, 1, 0, 3, 1, 3, 0, 3, 3, 3, 0, 3, 3, 3, 0,
2, 0, 3, 3, 0, 1, 0, 1, 2, 3, 2, 2, 0, 0, 0, 1, 3, 3, 1, 0, 0, 1, 3,
0, 2, 3, 1, 0, 0, 0, 0, 0, 3, 3, 3])

i.e. a partition of the voters into four groups.
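Here’s a self-contained sketch of the whole loop, from matrix to per-cluster vote shares. The ballot matrix is randomly generated with two planted groups, and the player names, group sizes, and probabilities are all invented for illustration; the real X comes from the spreadsheet.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Made-up stand-in for the real ballot matrix: 60 voters x 6 players.
players = ["A", "B", "C", "D", "E", "F"]  # hypothetical names
# Planted structure: the first 40 voters mostly name A, B, C; the last 20 mostly don't.
probs = np.vstack([
    np.tile([0.95, 0.95, 0.90, 0.5, 0.3, 0.6], (40, 1)),
    np.tile([0.05, 0.05, 0.10, 0.5, 0.6, 0.6], (20, 1)),
])
X = (rng.random(probs.shape) < probs).astype(int)

model = AgglomerativeClustering(n_clusters=2).fit(X)
labels = model.labels_

# For each cluster: its size and the share of its ballots naming each player.
for c in range(2):
    share = X[labels == c].mean(axis=0)
    summary = {p: round(float(s), 2) for p, s in zip(players, share)}
    print(c, int((labels == c).sum()), summary)
```

Comparing the per-cluster shares column by column is how you’d spot which players separate the clusters, the way the Vizquel percentages were read off above.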

(Agglomerative clustering naturally generates a hierarchical clustering, i.e. a tree with the HoF voters on the leaves; there must be some way to get sklearn to output this directly, but I don’t know it!)
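For what it’s worth, a fitted AgglomerativeClustering object does carry its merge tree in the `children_` attribute. A minimal sketch on a tiny made-up matrix (the data and the choice of two clusters are placeholders, and this is only one way to get at the tree):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Tiny made-up 0-1 matrix: 5 hypothetical voters x 4 hypothetical players.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
])
n = X.shape[0]

model = AgglomerativeClustering(n_clusters=2, compute_full_tree=True).fit(X)

# Row m of children_ lists the two nodes joined at merge step m.
# Indices below n are original rows (leaves); index n + m is the node created at step m.
for m, (left, right) in enumerate(model.children_):
    print(f"node {n + m}: merges {left} and {right}")
```

The last row is the root of the tree, so walking `children_` from the bottom up recovers the full hierarchy.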

Of course, if you have a 0-1 matrix, you don’t have to cluster the rows — you can cluster the columns! This time, just for kicks, I used the hierarchical clustering package in scipy.  I think this one is just agglomerating too.  But it has a nice output package!  Here, Y is the transpose of X above, a 0-1 matrix telling us which players were on which ballots.  Then:

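>>> import matplotlib.pyplot as plt
>>> import scipy.cluster.hierarchy
>>> Z = scipy.cluster.hierarchy.linkage(Y)  # Z has to come from linkage(); default settings assumed here
>>> Dend = scipy.cluster.hierarchy.dendrogram(Z, labels=player_names)  # player_names: a list of player names
>>> plt.xticks(ha='right')
>>> plt.show()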

gives

Not bad! You can see that Bonds and Clemens form their own little cluster in red.  There is not that much structure here — maybe worth noting that this method may be dominated by the variable “number of votes received.”  Still, the two stories we’ve told here do seem to have some coarse features in common:  Bonds/Clemens are a cluster, and Larry Walker voting is kind of its own variable independent of the rest of the ballot.
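To make the column clustering reproducible end to end, here is a minimal sketch on a made-up players-by-ballots matrix. The names, the data, and the choice of average linkage with the Hamming metric are all assumptions for illustration, not necessarily what produced the dendrogram discussed here.

```python
import numpy as np
from scipy.cluster import hierarchy

# Made-up stand-in for Y: 4 hypothetical players (rows) x 6 ballots (columns).
names = ["P1", "P2", "P3", "P4"]
Y = np.array([
    [1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 1],
])

# linkage() builds the merge tree; scipy's "hamming" metric is the fraction of
# ballots on which two players differ.
Z = hierarchy.linkage(Y, method="average", metric="hamming")

# no_plot=True returns the layout without drawing anything; drop it (and call
# matplotlib's plt.show()) to get the actual picture.
dend = hierarchy.dendrogram(Z, labels=names, no_plot=True)
print(dend["ivl"])  # leaf labels in left-to-right order
```

Each row of Z records one merge: the two nodes joined, the distance at which they joined, and the size of the new cluster.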

OK, this picture was nice so I couldn’t resist doing one for the voters:

Pretty hard to read!  I think that black cluster on the end is probably the no-Bonds-no-Clemens gang.  But what I want you to notice is that there’s one outlying node all the way over to the left, which the clustering algorithm identifies as the weirdest ballot made public.  It’s Sadiel LeBron, who voted for Clemens, Sosa, and Ramirez, but not Bonds.  And also he voted for Andruw Jones and Omar Vizquel!  Yeah, that’s a weird ballot.

I’m sure this isn’t the right way to visualize a 0-1 matrix.  Here’s what I’d do if I felt like spending a little more time:  write something up to look for a positive definite rank-2 matrix A such that

$A_{ij} > A_{ik}$

whenever voter i voted for player j but not player k.  That models the idea of each player being a point in R^2 and each voter being a point in R^2 and then voters vote for every player whose dot product with them is large enough.  My intuition is that this would be a better way of plotting ballots in a plane than just doing PCA on the 0-1 matrix.  But maybe it would come out roughly the same, who knows?
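A rough sketch of what that search might look like, under several assumptions of mine: write A = UVᵀ with one point in R^2 per voter (rows of U) and per player (rows of V), ignore the positive-definiteness condition, and run plain subgradient descent on a hinge loss that penalizes each violated constraint. The ballots are synthetic, generated from a planted 2-d model, and the step size and iteration count are untuned guesses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voters, n_players = 30, 8

# Synthetic ballots from a planted 2-d model (made-up data, not real votes):
# each voter names the players whose dot product with them beats that voter's median.
U0 = rng.normal(size=(n_voters, 2))
V0 = rng.normal(size=(n_players, 2))
A0 = U0 @ V0.T
X = (A0 > np.median(A0, axis=1, keepdims=True)).astype(int)

# mask[i, j, k] is True when voter i voted for player j but not player k.
mask = (X[:, :, None] == 1) & (X[:, None, :] == 0)

def fraction_satisfied(U, V):
    """Share of the constraints A_ij > A_ik that hold for A = U V^T."""
    A = U @ V.T
    diff = A[:, :, None] - A[:, None, :]   # diff[i, j, k] = A_ij - A_ik
    return (mask & (diff > 0)).sum() / mask.sum()

# Subgradient descent on the mean hinge loss max(0, 1 - (A_ij - A_ik)).
U = 0.1 * rng.normal(size=(n_voters, 2))
V = 0.1 * rng.normal(size=(n_players, 2))
lr = 0.5
for _ in range(2000):
    A = U @ V.T
    diff = A[:, :, None] - A[:, None, :]
    viol = mask & (diff < 1)               # constraints not yet met with margin 1
    # dLoss/dA: +1 for each violated pair where the player is the "k", -1 where it is the "j".
    G = (viol.sum(axis=1) - viol.sum(axis=2)) / mask.sum()
    U, V = U - lr * G @ V, V - lr * G.T @ U

print(fraction_satisfied(U, V))
```

The rows of U and V can then be scattered in the plane; whether that picture beats PCA on the 0-1 matrix is exactly the open question.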

Presumably there are known best practices for clustering subsets of a fixed set (or, more generally, finding good embeddings into visualizable metric spaces like the plane).  Tell me about them!

## The greatest Astro/Dodger

The World Series is here and so it’s time again to figure out which player in the history of baseball has had the most distinguished joint record of contributions to both teams in contention for the title.  (Last year:  Riggs Stephenson was the greatest Cub/Indian.)  Astros history just isn’t that long, so it’s a little surprising to find we come up with a really solid winner this year:  Jimmy Wynn, “The Toy Cannon,” a longtime Astro who moved to LA in 1974 and had arguably his best season, finishing 5th in MVP voting and leading the Dodgers to a pennant.  Real three-true-outcomes guy:  led the league in walks twice and strikeouts once, and was top-10 in the National League in home runs four times in the Astrodome.  Career total of 41.4 WAR for the Astros, and 12.3 for the Dodgers in just two years there.

As always, thanks to the indispensable Baseball Reference Play Index for making this search possible.

Other contenders:  Don Sutton is clearly tops among pitchers.  Sutton was the flip side of Wynn; he had just two seasons for Houston but they were pretty good.  Beyond that it’s slim pickings.  Jeff Kent put in some years for both teams.  So did Joe Ferguson.

Who are we rooting for?  On the “ex-Orioles on the WS Roster” front I guess the Dodgers have the advantage, with Rich Hill and Justin Turner (I have to admit I have no memory of Turner playing for the Orioles at all, even though it wasn’t that long ago!  It was in 2009, a season I have few occasions to recall.)  But both these teams are stocked with players I just plain like:  Kershaw, Puig, Altuve, the great Carlos Beltran…

## Game report: Cubs 5, Brewers 0

• I guess the most dominant pitching performance I’ve seen in person?  Quintana never seemed dominant.  The Brewers hit a lot of balls hard.  But a 3-hit complete game shutout is a 3-hit complete game shutout.
• A lot of Cubs fans. A lot a lot.  My kids both agreed there were more Cubs than Brewers fans there, in a game that probably mattered more to Milwaukee.
• For Cubs fans to boo Ryan Braun in Wrigley Field is OK, I guess.  To come to Miller Park and boo Ryan Braun is classless.  Some of those people were wearing Sammy Sosa jerseys!
• This is the first time I’ve sat high up in the outfield.  And the view was great, as it’s been from every other seat I’ve ever occupied there.  A really nice design.  If only the food were better.

## The greatest Cub/Indian

Congratulations to the Cubs, the Indians, and their fanbases, one of which will enjoy a long-awaited championship!

Now here’s the question.  Which player in baseball history was the best combined Cub/Indian?  My methodology, as it was last year, is to draw the top 200 position players and pitchers from each team by Wins Above Replacement, using the Baseball Reference Play Index.  Then I find the players with the highest value of

(WAR for team 1 * WAR for team 2)

Now I have to admit I couldn’t actually think of a player who played for both the Cubs and the Indians!  And this was borne out by the Play Index results:  there were only five position players and no pitchers who ranked in the top 200 all-time contributors to each team.  Pretty surprising, considering how long both teams have been around!  And here are your top five Cub/Indians:

1.  Riggs Stephenson (193.6)
2.  Andre Thornton (98.8)
3.  Jose Cardenal (47.4)
4.  Mel Hall (9.0)
5.  Mitch Webster (7.8)

I almost wonder whether I did something wrong here.  There was so much more overlap last year between the Royals and the Mets!  But until you tell me otherwise, it’s the Riggs Stephenson Series.

## Orioles postmortem 2016

What is there to say?  Should Showalter have used Britton?  Probably.  When?  Probably when O’Day came in, when the Orioles desperately needed a double play to keep the game tied.  (But O’Day got the double play ball anyway.)  Barring that, you bring Britton in to start the 11th, I think, because Britton doesn’t give up home runs and you’ve got the home run guys coming up.  But what difference does it make?  The Orioles weren’t hitting, not off Liriano, not off anybody.  Jimenez would have come in to pitch the 12th or 13th anyway.  The real mistake was pinch hitting Reimold for Kim.  Why?  Kim is the only guy on the team who gets on base.  Maybe he walks and Machado comes up and you have an actual chance.  Reimold is a bad defender, too; his misplay in the 11th, letting Devon Travis get to third, could have been decisive if Encarnacion had hit a single instead of a home run.

I said on Twitter it reminded me of the last game of the 1997 ALCS, but when I think it over, this one was a lot less heartbreaking.  In that game, Mike Mussina delivered one of the best Orioles playoff starts of my lifetime, and we wasted it.  Ten hits and five walks and we couldn’t push one run across.

And here’s the thing.  That 1997 team was the best Orioles squad in 15 years, and you had the real sense it was a one-shot deal.  The next year we were back to losing.  The 2016 team is probably the third-best in the last five years, and the main contributors will all be back next year.  It’s been a big adjustment, rooting for a team that’s consistently good, but my ability to absorb this loss makes me think it’s finally starting to sink in.

## Orioles 1, Red Sox 0

Was this it?  The game of the year, the game we’ll remember?  Gausman v. Porcello, both starters going 8 innings.  Gausman didn’t allow a run, Porcello just one, a home run to Mark Trumbo (of course, Trumbo).  Adam Jones got all of a Porcello pitch in the 3rd that looked to clear the Green Monster with yards to spare, but a brutal inbound wind knocked it out of the sky like a snipe.  Gausman hit 96 on his 109th pitch.  Jonathan Schoop backhanded a tough chance that took a weird hop and rolled partway up his wrist, and still managed to somehow flip the ball into his hand, like David Bowie in Labyrinth, and get the runner at first.  Schoop has the sweetest little “I made the play” smile in baseball, I think.  Manny Machado tagged up from first on a very deep fly by Chris Davis; Mookie Betts’s astonishing throw got to second base with Machado no more than 2/3 of the way there.  He almost seemed to laugh at how out he was.  Zach Britton (of course, Britton) came in for the bottom of the 9th.  Battled with David Ortiz for 8 pitches, finally getting him to ground out to Chris Davis, who raced Ortiz, slow man versus slower man, to the bag.  Slow man won.  With two outs, Britton faced Hanley Ramirez, who swung three times, each time at a pitch farther removed from his person.  Orioles 1, Red Sox 0.  Nothing but must-win series from here onwards.

## Marilyn Sachs, Amy and Laura, how to date a Communist

While I was in Seattle for the Joint Meeting, I stopped in to see my cousin Marilyn Sachs, the children’s author, who’s now 88.  She signed a copy of CJ’s favorite book of hers, Amy and Laura.  I re-read it on the plane and it made me cry just like it did when I was a kid.

We talked about writing and the past.  She and her husband, Morris, started dating in 1946, in Brooklyn.  Morris had recently returned from the war in the Pacific and was a Communist.  He thought movies were too expensive, so on their dates they went block to block ringing doorbells, trying to get signatures on a petition demanding that the Dodgers bring up Jackie Robinson from their minor-league affiliate in Montreal.  Now that is how you date, young people.

## The greatest Royal/Met

A while ago I wrote a little Python code that used career data from Baseball-Reference Play Index (the best \$36/year a number-loving baseball fan can spend) to answer the question:  given a pair of teams, which player contributed the most to both teams?  My metric for this is

(WAR for team 1 * WAR for team 2)

in order to privilege players who balanced their contributions to both teams.
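A minimal sketch of how that score could be computed, assuming you already have each team’s WAR totals as a dictionary keyed by player. The names and numbers below are invented for illustration (not Play Index data), and `joint_scores` is a hypothetical helper, not the code referred to above.

```python
# Hypothetical WAR totals, invented for illustration (not real Play Index numbers).
war_team1 = {"Player A": 30.0, "Player B": 10.0, "Player C": 20.0}
war_team2 = {"Player A": 1.0, "Player B": 10.0, "Player C": 0.5, "Player D": 40.0}

def joint_scores(team1, team2):
    """Return (player, WAR1 * WAR2) for players on both lists, best first."""
    both = team1.keys() & team2.keys()
    scores = [(p, team1[p] * team2[p]) for p in both]
    return sorted(scores, key=lambda t: t[1], reverse=True)

ranking = joint_scores(war_team1, war_team2)
print(ranking)
```

In this toy data Player A has the most combined WAR (31) but almost all of it for one team, so the evenly split Player B (10 and 10) comes out on top. That’s the balancing effect of multiplying instead of adding: for a fixed total, the product is largest when the two pieces are equal.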

So who was the greatest Royal/Met?  In retrospect, this should have been obvious.  How many of the top 5 can you guess?

## Why does Indianapolis like the Cubs?

When two baseball teams share a city, one of them dominates the geographic region with the city as its center.  Greater New York, upper Jersey, lower CT like the Yankees, not the Mets.  Northern California likes the Giants, not the A’s.  In SoCal you won’t find many Angels fans outside Orange County itself.  And the whole mid-northern Midwest, from Iowa across to central Indiana, roots for the Cubs, not the White Sox, whose fanbase consists of southern Chicago and a few adjacent suburbs.

(Go here for an amazing, data-rich, zoomable interactive of this NYT UpShot map, but be prepared to be depressed about how many Yankee fans there are freaking everywhere.)

Why?  For NYC, LA, SF it’s pretty clear; one team is older and has a historic base that the other lacks.  But for Chicago it’s less clear.

One friend suggested that Iowa has a, um, relevant ethnic similarity with the part of Chicago containing Wrigley Field.  But Chicagoans tell me that the ethnic identification of White Sox fandom is historically Irish, not African-American.

My best guess is that it’s WGN, a mainstay of basic cable for decades which may have spread Cubs fandom across the nation the way TBS did for the Braves.  But then one asks:  in 1950, before TV, was there more parity between Cubs and White Sox fans?  Who did people in Des Moines and Indianapolis (and for that matter Milwaukee and Minneapolis) mostly cheer for back then?

And what about New York, back when there were three native teams of about the same age?  Did fans in Poughkeepsie and Rahway split evenly between Yankees, Dodgers, and Giants?  What about the Phillies and the original A’s?

## Mariners 6, Orioles 5

I took CJ with me to Seattle, where I was giving a talk at the American Statistical Association meetings, and what luck — the Orioles were in town!  So we took in this game.

Observations:

• I’ve never seen so many Orioles fans at an away game.  In fact, I kept seeing people in O’s gear all over Seattle.  Are they strangely popular in the PNW?  Or is it just that four years of winning has made it safe to wear orange and black in public?
• First trip to SafeCo, a great field on the underrated Miller Park model.  The retractable roof here doesn’t open and close; it slides over the top of the stadium like an umbrella.  When it’s open, the roof hangs over the railroad tracks adjoining the park, and when a train comes by, the whistle echoes off the roof into the stadium, and it is awesome.
• The Mariner dog is an unusually good ballpark dog.  As big as a brat, nicely blackened, good snap.  Well worth seven dollars.  The signature SafeCo food — at least, everyone around us had it — was garlic fries.  I’m sorry Seattle but these are not that good.  Huge heap of fries with a bunch of minced garlic and parsley on top.  Impressive to look at, but impossible to keep the garlic on the fries as you eat, and the fries get cold and depressing very quickly.
• Nice sunburned-looking blond couple in front of us turned out to be Dutch people whose son, they said, played for the Orioles in the Netherlands.  What could they have meant?  I think maybe he plays for these guys? But are they actually affiliated with the Orioles?  Mysteries of honkbal.
• “Dad has to catch a fly ball in a cowboy hat to win him and his kid Mariners tickets” is a great pregame promotion.  Every team should do this.

The game started out looking like a laugher; terrible defense and baserunning on both ends and the first inning ended with the Mariners up 4-2.  Then nothing happened for a long time.  Seattle’s Taijuan Walker wasn’t really dominant but the Orioles couldn’t really get a big hit.  Tillman got hit in the arm with a batted ball, and was bad anyway, and was out after 2 1/3, but the usual succession of long relievers shut down Seattle.  I told CJ “this team has an explosive offense and can score a bunch of runs at any time” and just then Adam Jones sneaked one over the left field fence to make it 5-4 and then Chris Davis came up.  He has grown a super-weird mustache, which CJ and I had been admiring on TV at the end of the previous night’s contest.

Davis says it helps him hit home runs and I guess so because he immediately launched a no-doubter so far into right it could have beat Ted Cruz in a primary. Maybe the best home run I’ve seen since the grand slam Jim Thome hit against the Orioles at U.S. Cellular. Did I blog that? Oh yeah, I did.

So we’re tied at 5, and we go into extras, T.J. McFarland coming in for his third inning of work.  He faces the bottom of the order and loads the bases with one out.  Britton pitched 1 2/3 the previous night and is unavailable.  But you have O’Day warmed and ready.  Yes I know you want to save him to close, but at what point do you bring him in?  Would you rather lose with your best reliever waiting in the bullpen?  That’s what happened; McFarland stayed in to face Austin Jackson, who lashed a ball that landed about a centimeter inside the foul line and that was the ballgame.

Unusually bearable loss; much easier to take than if the Orioles had lain down and accepted that they were going to get beat by the runs they allowed in the first.

