## Kevin Jamieson, hyperparameter optimization, playoffs

Kevin Jamieson gave a great seminar here on Hyperband, his algorithm for hyperparameter optimization.

Here’s the idea.  Doing machine learning involves making a lot of choices.  You set up your deep learning neural thingamajig, but that’s not exactly one-size-fits-all:  How many layers do you want in your net?  How fast do you want your gradient descent to step?  Etc., etc.  The parameters are the structures your thingamajig learns.  The hyperparameters are the decisions you make about your thingamajig before you start learning.  And it turns out these decisions can affect performance a lot.  So how do you know how to make them?

Well, one option is to pick N choices of hyperparameters at random, run your algorithm on your test set with each choice, and see how you do.  The problem is, thingamajigs take a long time to converge.  This is expensive to do, and when N is small, you’re not really seeing very much of hyperparameter space (which might have dozens of dimensions).
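That baseline is easy to write down.  Here’s a minimal sketch of random search in Python; `evaluate` and `sample_config` are hypothetical stand-ins for your real training-and-scoring routine and your hyperparameter sampler:

```python
import random

def random_search(evaluate, sample_config, n_trials=20, seed=0):
    """Try n_trials random hyperparameter settings and keep the best.
    `evaluate` and `sample_config` are placeholders for a real
    training routine and hyperparameter sampler."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)  # the expensive part: train to convergence
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Toy stand-in: one hyperparameter ("lr"), with a made-up score
# function whose best value sits near 0.3.
best, score = random_search(
    evaluate=lambda cfg: -(cfg["lr"] - 0.3) ** 2,
    sample_config=lambda rng: {"lr": rng.uniform(0.0, 1.0)},
)
```

The cost is one full training run per trial, which is exactly the problem: N has to stay small.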

A more popular choice is to place some prior on the function

F:[hyperparameter space] -> [performance on test set]

You make a choice of hyperparameters and run the thingamajig; based on the output, you update your distribution on F; based on your new distribution, you choose a likely-to-be-informative choice of hyperparameters and run again; and so on.

This is called “Bayesian optimization of hyperparameters” — it works pretty well — but really only about as well as taking twice as many random choices of hyperparameters, in practice.  A 2x speedup is nothing to sneeze at, but it still means you can’t get N large enough to search much of the space.

Kevin thinks you should think of this as a multi-armed bandit problem.  You have a choice of hyperparameters whose performance you’d like to judge.  You could run your thingamajig with those hyperparameters until it seems to be converging, and see how well it does.  But that’s expensive.  Alternatively, you could run your thingamajig for only (1/c) of that time; then you have time to consider Nc values of the hyperparameters, which is much better.  But of course you get a much less accurate assessment of performance:  maybe the best performer in that first (1/c) time segment is actually pretty bad, and just got off to a good start!

So you do this instead.  Run the thingamajig for time (1/c) on Nc values.  That costs you N.  Then throw out all the values of the hyperparameters that came in below the median on performance.  You still have (1/2)Nc values left, so continue running those processes for another time (1/c).  That costs you (1/2)N.  Throw out everything below the median.  And so on.  When you get to the end you’ve spent N log Nc, which is not bad at all; and instead of looking at only N hyperparameters, you’ve looked at Nc, where c might be pretty big.  And you haven’t wasted lots of processor time following unpromising choices all the way to the end; rather, you’ve mercilessly culled the low performers along the way.
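Here’s a toy sketch of that culling loop (my own illustration, not Kevin’s actual implementation; `partial_eval` stands in for “performance of a configuration after a given amount of training”):

```python
import random

def successive_halving(configs, partial_eval, rounds):
    """Run every surviving configuration for one more unit of time,
    rank them, and throw out everything below the median; repeat.
    `partial_eval(config, elapsed)` is a stand-in for the score of a
    configuration after training for `elapsed` time units."""
    survivors = list(configs)
    elapsed = 0
    for _ in range(rounds):
        if len(survivors) <= 1:
            break
        elapsed += 1
        ranked = sorted(survivors,
                        key=lambda c: partial_eval(c, elapsed),
                        reverse=True)
        survivors = ranked[: len(ranked) // 2]  # cull the bottom half
    return survivors

# Toy example: a "config" is just a number, and its score is
# -(x - 0.5)^2, so the best config is the one closest to 0.5.
rng = random.Random(0)
configs = [rng.uniform(0.0, 1.0) for _ in range(64)]
winner = successive_halving(configs, lambda c, t: -(c - 0.5) ** 2, rounds=6)[0]
```

With 64 starting values, six halvings leave a single survivor; in this deterministic toy it’s exactly the top scorer, while a real run would be noisier in early rounds.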

But how do you choose c?  I insisted to Kevin that he call c a hyperhyperparameter, but he wasn’t into it.  No fun!  Maybe the reason Kevin resisted my choice is that he doesn’t actually choose c; he just carries out his procedure once for each c as c ranges over 1, 2, 4, 8, …, N; this costs you only another factor of log N.

In practice, this seems to find hyperparameters just as well as fancier Bayesian methods, and much faster.  Very cool!  You can imagine doing the same thing in simpler situations (e.g. I want to do a gradient descent; where should I start?) and Kevin says this works too.

In some sense this is how a single-elimination tournament works!  In the NCAA men’s basketball finals, 64 teams each play a game; the teams above the median are 1-0, while the teams below the median, at 0-1, get cut.  Then the 1-0 teams each play one more game:  the teams above the median at 2-0 stay, the teams below the median at 1-1 get cut.

What if the regular season worked like this?  Like if in June, the bottom half of major league baseball just stopped playing, and the remaining 15 teams duked it out until August, then down to 8… It would be impossible to schedule, of course.  But in a way we have some form of it:  at the July 31 trade deadline, teams sufficiently far out of the running can just give up on the season and trade their best players for contending teams’ prospects.  Of course the bad teams keep playing games, but in some sense, the competition has narrowed to a smaller field.

## Yakov Sinai wins the Abel Prize

I had the job of delivering, in a format suitable for non-mathematicians, a half-hour summary of Sinai’s work.  A tough task, especially since you can’t ask any experts for help without breaking the secrecy!  I like what Tim Gowers wrote in 2011 about doing the same job the year Milnor won.

I was very happy when I learned (after agreeing to make the presentation) that Sinai had won — mainly for the obvious reason that he’s such a deserving recipient, but selfishly because neither of my two main fears was realized.  On the one hand, I feared that the laureate would be someone whose mathematics was so deeply different from anything I know that I would really struggle to say anything at all that I felt confident was correct.  On the other hand, if the winner were someone in number theory, I would feel an intense responsibility to convey the full picture of the winner’s work and how it fit into the entire sweep of the subject; I would feel terribly guilty about any simplifications I made, and the thing would be a mess.  As it was, the talk was not exactly easy to prepare, but I never worried that I actually couldn’t do it.  And I learned a lot!

Anyway, the video of the whole ceremony, including my talk starting at about 9:00, is here.

(Note:  All the sound on this is coming from my mike.  So I know it seems like every joke I crack on here is followed by some seconds of uncomfortable silence, but no, seriously, some people laughed, you just couldn’t hear it!)


## Anti-TED

Cathy goes off on TED talks today, calling them shallow, one-directional, and slick.

I was thinking about TED the other day, while I was watching Jared Weinstein give a great lecture at the Arizona Winter School.  At AWS, they felt like people were leaning too much on prepped slides, and the rule is now that you have to handwrite your slides in real time, using an opaque projector to show the slides on the big screen.

Would TED talks be better if the speakers were restricted to visuals they could write or draw by hand in 18 minutes?

## Idle question: cluster algebras over finite fields and spectral gaps

Yet another great talk at the JMM:  Lauren Williams gave an introduction to cluster algebras in the Current Events section which was perfect for people, like me, who didn’t know the definition.  (The talks by Wei Ho, Sam Payne, and Mladen Bestvina were equally good, but I don’t have any idle questions about them!)

This post will be too long if I try to include the definitions myself, and I wouldn’t do as good a job of exposition as Williams did, so it’s good news that she’s arXived a survey paper which covers roughly the same ground as her talk.

Setup for idle question:  a cluster algebra comes from a process called “seed mutation” — given a rational function field K = k(x_1, …, x_m), a labelled seed is a pair (Q,f), where Q is a quiver on m vertices and f = (f_1, …, f_m) is a labelling of the vertices of Q with rational functions in K.  For each i, there’s a seed mutation mu_i, which is an involution on the labelled seeds; see Williams’s paper for the definition.

Now start with a labelled seed (Q, (x_1, …, x_m)) and let T be the set of labelled seeds obtainable from the starting seed by repeated application of the seed mutations mu_1, …, mu_n for some n < m.  (I didn’t think carefully about the meaning of this special subset of n vertices, which are called the mutable vertices.)

It’s called T because it’s a tree, regular of degree n; each vertex is indexed by a word in the n seed mutations with no letter appearing twice in succession.

Anyway, for each vertex of T and each mutable vertex i you have a rational function f_i.  The cluster algebra is the algebra generated by all these rational functions.

The great miracle — rather, one of the great miracles — is that, by a theorem of Fomin and Zelevinsky, the f_i are all Laurent; that is, their denominators are just monomials in the original functions x_i.
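You can see the miracle concretely in the smallest interesting case, the rank-2 cluster algebra of type A_2 (a standard toy example, not something from Williams’s talk itself), where mutation produces the recurrence x_{n+1} = (1 + x_n)/x_{n-1}.  A priori each term is a complicated rational function of the two starting values; Laurentness says every denominator is a monomial in x_1, x_2, so in particular evaluating at x_1 = x_2 = 1 must always give integers.  It does, and the sequence is even periodic with period 5:

```python
from fractions import Fraction

def a2_orbit(x1, x2, steps=10):
    """Iterate the rank-2 ('A2') cluster exchange relation
    x_{n+1} = (1 + x_n) / x_{n-1}, in exact rational arithmetic."""
    seq = [Fraction(x1), Fraction(x2)]
    for _ in range(steps):
        seq.append((1 + seq[-1]) / seq[-2])
    return seq

orbit = a2_orbit(1, 1)
# 1, 1, 2, 3, 2, 1, 1, 2, 3, 2, 1, 1: all integers, period 5
```

The integrality at (1, 1, …, 1) is exactly what makes the evaluation game later in this post possible.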

We are now ready for the idle question!

Let’s take k to be a finite field F_q, and let U be (F_q^*)^m, the rational points of the m-torus over F_q.  Choose a point u = (u_1, …, u_m) in (F_q^*)^m.

Then for any vertex of T, we can (thanks to the Laurent condition!) evaluate the functions (f_1, …, f_m) at u, getting an element of F_q^m.

So a random walk on the tree T, combined with evaluation at u, gives you a random walk on F_q^m.

Idle question:  Is there a spectral gap for this family of walks, independent of q?

Update:  As David Speyer explains in the comments, this finite random walk is not in general well-defined.  Let me try another phrasing which I think makes sense.

Let t be the endpoint of a length-R random walk on T; then evaluation at (1, 1, …, 1) gives a probability distribution P_{R,N} on (Z/NZ)^m.  Let U_N be the uniform distribution on (Z/NZ)^m.  Now for each N we can ask about the limit

$\Lambda_N = \lim_{R \rightarrow \infty} ||P_{R,N} - U_{N}||^{1/R}$

(I don’t think it matters what norm we use.)

The idea is that the random walk on the tree should be equidistributing mod N, and the speed of convergence is governed by Λ_N.  Then we can ask

Idle question mark 2:  Is Λ_N bounded away from 1 by a constant independent of N?

This is a question in “spectral gap” style, which, if I did it right, doesn’t a priori have to do with a sequence of finite graphs.

Motivation:  this setup reminds me of a very well-known story in arithmetic groups; you have a discrete group Gamma which comes equipped with an action on a set of objects “over Z” — then reducing mod p for all p gives you a family of actions of Gamma on larger and larger finite sets, and a critical organizing question is:  do the corresponding graphs have a spectral gap?

For that matter, what happens if you, say, keep k = C and then evaluate your f_i at (1, 1, …, 1)?  Looking at a bigger and bigger ball in the tree you get bigger and bigger sets of elements of C^m; what do these look like?  Do they shoot off to infinity, accumulate, equidistribute…?

## Slides from my JMM talk, “How to Count With Topology”

Back from San Diego, recovering from the redeye.  It was a terrific Joint Math Meetings this year; I saw lots of old friends and great talks, but had to miss a lot of both, too.

A couple of people asked me for the slides of my talk, “How To Count with Topology.”  Here they are, in .pdf:

How To Count With Topology

If you find this stuff interesting, these blog posts give a somewhat more detailed sketch of the papers with Venkatesh and Westerland I talked about.

## Me and Adam Phillips, Friday 29 Sep, 11:30am

I have often been told I needed to sit down and have a conversation with a psychoanalyst, and now I’m doing it — in public!  Adam Phillips and I will be at Hillel Friday morning to talk about the challenges of writing about technical material for a general audience.  Feel free to suggest questions for Phillips in the comments.


## John Doyle on handwaving and universal laws

John Doyle gave this year’s J. Barkley Rosser Lecture at the Wisconsin Institute for Discovery; his talk was dedicated to the proposition that tradeoffs between flexibility and robustness in control systems with significant delays are in the end going to be bound by universal laws, just as the operation of a classical Turing machine is bound by laws coming from information theory and complexity theory.  (A simple such one:  a machine that has the potential to produce N different outputs is going to have a worst-case run time of at least log N steps.)

Doyle believes the robustness-flexibility tradeoff should be fundamental to our way of thinking of both biological and technological devices.  He gave the following very illustrative example, which is so simple that you can play along as you read my blog.

Hold your hand in front of your face and wave your hand vigorously back and forth.  It looks blurry, right?  Now hold your hand still and shake your head back and forth instead.  Your hand stays in focus.

Which is strange, because the optical problem is in some sense exactly the same.  But the mechanism is different, and so the delay time is different.  When your hand moves, you’re using the same general-purpose apparatus you use to track moving objects more generally.  It’s a pretty good apparatus!  But because it’s so flexible, working well for all kinds of optical challenges, it is slow, and like any system with a long delay, input that oscillates pretty fast — like your waving hand — can cross it up.

When your head moves, it’s a different story:  we have a vestibulo-ocular reflex which moves our eyes in sync with our head to fix the images on our retina in place.  This doesn’t pass through cognition at all — it’s a direct neural connection from the vestibular sensors in the inner ear to the muscles that control eye movement.  This system isn’t flexible or adaptable at all.  It does just one thing — but it does it fast.

(All this material derived from my notes on Doyle’s talk, which went pretty fast:  all mistakes are mine.)

Here are the slides from Doyle’s talk.  (TooManySlides.pdf is the best filename ever!)

Here’s a paper from Science that Doyle said would be especially useful for mathematicians who want to see how the tradeoffs in question can be precisely formalized.  (Authors:  Chandra, Buzi, Doyle.)

## David Mumford says we should replace plane geometry with programming and I’m not sure he’s wrong

MAA Mathfest is in Madison this week — lots of interesting talks about mathematics, lots of interesting talks about mathematics education.  Yesterday morning, David Mumford gave a plenary lecture in which he decried the lack of engagement between pure and applied mathematics — lots of nodding heads — and then suggested that two-column proofs in plane geometry should be replaced with basic programming — lots of muttering and concerned looks.

But there’s something to what he’s saying.  The task of writing a good two-column proof has a lot in common with the task of writing a correct program.  Both ask you to construct a sequence that accomplishes a complicated goal from a relatively small set of simple elements.  Both have the useful feature that there’s not a unique “correct answer” — there are different proofs of the same proposition, and different programs delivering the same output.  Both are quite difficult for novices, and both are difficult to teach.  Both have the “weakest link” feature:  one wrong step means the whole thing is wrong.

Most importantly:  both provide the training in formalization of mental process that we mathematicians mostly consider a non-negotiable part of general education.

But teaching coding instead of two-column proofs has certain advantages.  I am not, in general, of the opinion that everything in school has to lead to employable skills.  But I can’t deny that “can’t write five lines of code” closes more doors to a kid than “can’t write or identify a correct proof.”  People say that really understanding what it means to prove a theorem helps you assess people’s deductive reasoning in domains outside mathematics, and I think that’s true; but really understanding what it means to write a five-line program helps you understand and construct deterministic processes in domains outside a terminal window, and that’s surely just as important!

Computer programs are easier to check, for the teacher and more importantly the student — you can tell whether the program is correct by running it, which means that the student can iterate the try-check-fail-try-again process many times without the need for intervention.

And then there’s this:  a computer program does something.  When you ask a kid to prove that a right triangle is similar to the triangle cut off by an altitude to the hypotenuse, she may well say “but that’s obvious, I can just see that it’s true.”  And she’s not exactly wrong!  “I know you know this, but you don’t really know this, despite the fact that it’s completely clear” is a hard sell, and it devalues the geometric intuition we should be working to encourage.

## Roch on phylogenetic trees, learning ultrametrics from noisy measurements, and the shrimp-dog

Sebastien Roch gave a beautiful and inspiring talk here yesterday about the problem of reconstructing an evolutionary tree given genetic data about present-day species.  It was generally thought that keeping track of pairwise comparisons between species was not going to be sufficient to determine the tree efficiently; Roch has proven that it’s just the opposite.  His talk gave me a lot to think about.  I’m going to try to record an account of Roch’s idea, one that is probably corrupted and certainly filtered through my own viewpoint.

So let’s say we have n points P_1, … P_n, which we believe are secretly the leaves of a tree.  In fact, let’s say that the edges of the tree are assigned lengths.  In other words, there is a secret ultrametric on the finite set P_1, … P_n, which we wish to learn.  In the phylogenetic case, the points are species, and the ultrametric distance d(P_i, P_j) between P_i and P_j measures how far back in the evolutionary tree we need to go to find a common ancestor between species i and species j.

One way to estimate d(P_i, P_j) is to study the correlation between various markers on the genomes of the two species.  This correlation, in Roch’s model, is going to be on the order of

exp(-d(P_i,P_j))

which is to say that it is very close to 0 when P_i and P_j are far apart, and close to 1 when the two species have a recent common ancestor.  What that means is that short distances are way easier to measure than long distances — you have no chance of telling the difference between a correlation of exp(-10) and exp(-11) unless you have a huge number of measurements at hand.  Another way to put it:  the error bar around your measurement of d(P_i,P_j) is much greater when the distance is large than when it is small; in particular, at great enough distances you’ll have no real confidence in any upper bound for the distance.

So the problem of estimating the metric accurately seems impossible except in small neighborhoods.  But it isn’t.  Because metrics are not just arbitrary symmetric n x n matrices.  And ultrametrics are not just arbitrary metrics.  They satisfy the ultrametric inequality

d(x,y) <= max(d(x,z),d(y,z)).

And this helps a lot.  For instance, suppose the number of measurements I have is sufficient to estimate with high confidence whether or not a distance is less than 1, but is totally helpless with distances on the order of 5.  If my measurements give me an estimate d(P_1, P_2) = 5, I have no real idea whether that distance is actually 5, or maybe 4, or maybe 100 — I can say, though, that it’s probably not 1.

So am I stuck?  I am not stuck!  Because the distances are not independent of each other; they are yoked together under the unforgiving harness of the ultrametric inequality.  Let’s say, for instance, that I find 10 other points Q_1, …, Q_10 which I can confidently say are within 1 of P_1, and 10 other points R_1, …, R_10 which are within 1 of P_2.  Then the ultrametric inequality tells us that

d(Q_i, R_j) = d(P_1, P_2)

for any one of the 100 ordered pairs (i,j)!  So I have 100 times as many measurements as I thought I did — and this might be enough to confidently estimate d(P_1,P_2).
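A back-of-the-envelope simulation makes the point (with a made-up Gaussian noise model, purely for illustration):  a single noisy correlation measurement at distance 5 tells you almost nothing, because exp(-5) is comparable to the noise, but averaging over 100 pairs that the ultrametric inequality forces to sit at the same distance recovers the distance pretty well.

```python
import math
import random

rng = random.Random(1)
true_d = 5.0    # the true ultrametric distance d(P_1, P_2)
sigma = 0.01    # noise on each correlation measurement (made up)

def measure(d):
    """One noisy correlation measurement: exp(-d) plus Gaussian noise.
    (A hypothetical noise model, chosen only to illustrate the point.)"""
    return math.exp(-d) + rng.gauss(0.0, sigma)

# A single measurement: exp(-5) is about 0.0067, the same size as the
# noise, so -log of one measurement is a terrible estimate of d.
one_shot = measure(true_d)

# 100 pairs (Q_i, R_j), all at the same true distance thanks to the
# ultrametric inequality: average first, then take -log.
avg = sum(measure(true_d) for _ in range(100)) / 100
boosted_estimate = -math.log(max(avg, 1e-12))
```

Averaging shrinks the noise on the correlation by a factor of 10, which is the difference between no information and a usable estimate of d(P_1, P_2).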

In biological terms:  if I look at a bunch of genetic markers in a shrimp and a dog, it may be hard to estimate how far back in time one has to go to find their common ancestor.  But the common ancestor of a shrimp and a dog is presumably also the common ancestor of a lobster and a wolf, or a clam and a jackal!  So even if we’re only measuring a few markers per species, we can still end up with a reasonable estimate for the age of the proto-shrimp-dog.

What do you need if you want this to work?  You need a reasonably large population of points which are close together.  In other words, you want small neighborhoods to have a lot of points in them.  And what Roch finds is that there’s a threshold effect; if the mutation rate is too fast relative to the amount of measurement per species you do, then you don’t hit “critical mass” and you can’t bootstrap your way up to a full high-confidence reconstruction of the metric.

This leads one to a host of interesting questions — interesting to me, that is, albeit not necessarily interesting for biology.  What if you want to estimate a metric from pairwise distances but you don’t know it’s an ultrametric? Maybe instead you have some kind of hyperbolicity constraint; or maybe you have a prior on possible metrics which weights “closer to ultrametric” distances more highly.  For that matter, is there a principled way to test the hypothesis that a measured distance is in fact an ultrametric in the first place?  All of this is somehow related to this previous post about metric embeddings and the work of Eriksson, Dasarathy, Singh, and Nowak.

## Set is not the only version of Set

I gave the Serge Lang lecture today at Berkeley about the card game Set and the related “affine cap problem” — how many cards is it possible to have on the table with no legal play?  You can ask this question for the actual card game Set (in which case the answer is known) or for the d-dimensional generalization of Set, where there are 3^d cards.  In this case, the question is:  how big is the largest subset of (F_3)^d in which no three points are collinear?  Our knowledge about the general case is embarrassingly weak.

Years ago I really thought I was going to make progress on this by devising very clever algebraic subvarieties of affine n-space which contained no three collinear F_3-rational points; but this didn’t work.  (It works great for d=3 and 4, though!  You can write the maximal cap set in (F_3)^4 as the set of nonzero solutions to x^2 + y^2 = zw.)
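The d=4 construction is small enough to check by brute force.  Here’s a quick sketch, using the fact that in (F_3)^d three distinct points are collinear exactly when they sum to zero:

```python
from itertools import product

# The cap: all nonzero (x, y, z, w) in (F_3)^4 with x^2 + y^2 = zw (mod 3).
cap = {p for p in product(range(3), repeat=4)
       if p != (0, 0, 0, 0)
       and (p[0] ** 2 + p[1] ** 2) % 3 == (p[2] * p[3]) % 3}

def is_cap(points):
    """Check that no three distinct points are collinear: in (F_3)^4 the
    third point on the line through a and b is -(a + b)."""
    for a in points:
        for b in points:
            if a == b:
                continue
            c = tuple((-a[i] - b[i]) % 3 for i in range(4))
            if c != a and c != b and c in points:
                return False
    return True

# cap has 20 points and no three are collinear: a maximal cap in (F_3)^4.
```

Twenty is the known maximum for a cap in (F_3)^4, so the quadric really does achieve the record.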
