Kevin Jamieson gave a great seminar here on Hyperband, his algorithm for hyperparameter optimization.

Here’s the idea. Doing machine learning involves making a lot of choices. You set up your deep learning neural thingamajig but that’s not exactly one size fits all: How many layers do you want in your net? How fast do you want your gradient descents to step? And etc. and etc. The *parameters *are the structures your thingamajig learns. The hyperparameters are the decisions you make about your thingamajig before you start learning. And it turns out these decisions can actually affect performance a lot. So how do you know how to make them?

Well, one option is to pick N choices of hyperparameters at random, run your algorithm on your test set with each choice, and see how you do. The problem is, thingamajigs take a long time to converge. This is expensive to do, and when N is small, you’re not really seeing very much of hyperparameter space (which might have dozens of dimensions.)

A more popular choice is to place some prior on the function

F:[hyperparameter space] -> [performance on test set]

You make a choice of hyperparameters, you run the thingamajig, based on the output you update your distribution on F, based on your new distribution you choose a likely-to-be-informative hyperparameter and run again, etc.

This is called “Bayesian optimization of hyperparameters” — it works pretty well — but really only about as well as taking twice as many random choices of hyperparameters, in practice. A 2x speedup is nothing to sneeze at, but it still means you can’t get N large enough to search much of the space.

Kevin thinks you should think of this as a multi-armed bandit problem. You have a hyperparameter whose performance you’d like to judge. You could run your thingamajig with those parameters until it seems to be converging, and see how well it does. But that’s expensive. Alternatively, you could run your thingamajig (1/c) times as long; then you have time to consider Nc values of the hyperparameters, much better. But of course you have a much less accurate assessment of the performance: maybe the best performer in that first (1/c) time segment is actually pretty bad, and just got off to a good start!

So you do this instead. Run the thingamajig for time (1/c) on Nc values. That costs you N. Then throw out all values of the hyperparameters that came in *below median* on performance. You still have (1/2)Nc values left, so continue running those processes for another time (1/c). That costs you (1/2)N. Throw out everything below the median. And so on. When you get to the end you’ve spent N log Nc, not bad at all but instead of looking at only N hyperparameters, you’ve looked at Nc, where c might be pretty big. And you haven’t wasted lots of processor time following unpromising choices all the way to the end; rather, you’ve mercilessly culled the low performers along the way.

But how do you choose c? I insisted to Kevin that he call c a *hyperhyperparameter* but he wasn’t into it. No fun! Maybe the reason Kevin resisted my choice is that he *doesn’t *actually choose c; he just carries out his procedure once for each c as c ranges over 1,2,4,8,…. N; this costs you only another log N.

In practice, this seems to find hyperparameters just as well as more fancy Bayesian methods, and much faster. Very cool! You can imagine doing the same things in simpler situations (e.g. I want to do a gradient descent, where should I start?) and Kevin says this works too.

In some sense this is how a single-elimination tournament works! In the NCAA men’s basketball finals, 64 teams each play a game; the teams above the median are 1-0, while the teams below the median, at 0-1, get cut. Then the 1-0 teams each play one more game: the teams above the median at 2-0 stay, the teams below the median at 1-1 get cut.

What if the regular season worked like this? Like if in June, the bottom half of major league baseball just stopped playing, and the remaining 15 teams duked it out until August, then down to 8… It would be impossible to schedule, of course. But in a way we have some form of it: at the July 31 trade deadline, teams sufficiently far out of the running can just give up on the season and trade their best players for contending teams’ prospects. Of course the bad teams keep playing games, but in some sense, the competition has narrowed to a smaller field.

Maybe the 1, 2, 4, 8, 16, . . . choice (for an extra log factor) then corresponds to a double-elimination tournament?