## Which pictures do children draw with straight lines?

Edray Goins gave a great colloquium today about his work on dessins d’enfants.  And in this talk there was a picture that surprised me.  It was one of the ones on the lower right of this poster.  Here, I’ll put in a screen shot:

Let me tell you what you’re looking at.  You are looking for elliptic curves E admitting a Belyi map f: E -> P^1, which is to say a map ramified only over 0, 1, and infinity.  For each such map, the blue graph is f^{-1}([0,1]), the preimage of the line segment joining 0 and 1 in P^1(R).

In four of these cases, the graph is piecewise linear!  I didn’t know there were examples like this.  Don’t know if this is easy, but:  for which Belyi maps (of any genus, not just genus 1) is f^{-1}([0,1]) a union of geodesics?


## Suriya Gunasekar, optimization geometry, loss minimization as dynamical system

Awesome SILO seminar this week by Suriya Gunasekar of TTI Chicago.  Here’s the idea, as I understand it.  In a classical optimization problem, like linear regression, you are trying to solve a problem which typically has no solution (draw a line that passes through every point in this cloud!) and the challenge is to find the best approximate solution.  Algebraically speaking:  you might be asked to solve

$Ax = b$

for x; but since b may not be in the image of the linear transformation A, you settle for minimizing

$||Ax-b||$

in whatever norm you like (L^2 for standard linear regression.)

In many modern optimization problems, on the other hand, the problem you’re trying to solve may have a lot more degrees of freedom.  Maybe you’re setting up an RNN with lots and lots and lots of parameters.  Or maybe, to bring this down to earth, you’re trying to pass a curve through lots of points but the curve is allowed to have very high degree.  This has the advantage that you can definitely find a curve that passes through all the points.  But it also has the disadvantage that you can definitely find a curve that passes through all the points.  You are likely to overfit!  Your wildly wiggly curve, engineered to exactly fit the data you trained on, is unlikely to generalize well to future data.

Everybody knows about this problem, everybody knows to worry about it.  But here’s the thing.  A lot of modern problems are of this form, and yet the optima we find on training data often do generalize pretty well to test data!  Why?

Make this more formal.  Let’s say for the sake of argument you’re trying to learn a real-valued function F, which you hypothesize is drawn from some giant space X.  (Not necessarily a vector space, just any old space.)  You have N training pairs (x_i, y_i), and a good choice for F might be one such that F(x_i) = y_i.  So you might try to find F such that

$F(x_i) = y_i$

for all i.  But if X is big enough, there will be a whole space of functions F which do the trick!  The solution set to

$F(\mathbf{x}) = \mathbf{y}$

will be some big subspace F_{x,y} of X.  How do you know which of these F’s to pick?

One popular way is to regularize; you decide that some elements of X are just better than others, and choose the point of F_{x,y} that optimizes that objective.  For instance, if you’re curve-fitting, you might try to find, among those curves passing through your N points, the least wiggly one (e.g. the one with the least total curvature.)  Or you might optimize for some combination of hitting the points and non-wiggliness, arriving at a compromise curve that wiggles only mildly and still passes near most of the points.  (The ultimate version of this strategy would be to retreat all the way back to linear regression.)

But it’s not obvious what regularization objective to choose, and maybe trying to optimize that objective is yet another hard computational problem, and so on and so on.  What’s really surprising is that something much simpler often works pretty well.  Namely:  how would you find F such that F(x) = y in the first place?  You would choose some random F in X, then do some version of gradient descent.  Find the direction in the tangent space to X at F that decreases $||F(\mathbf{x})-\mathbf{y}||$ most steeply, perturb F a bit in that direction, lather, rinse, repeat.

If this process converges, it ought to get you somewhere on the solution space F_{x,y}. But where?  And this is really what Gunasekar’s work is about.  Even if your starting F is distributed broadly, the distribution of the spot where gradient descent “lands” on F_{x,y} can be much more sharply focused.  In some cases, it’s concentrated on a single point!  The “likely targets of gradient descent” seem to generalize better to test data, and in some cases Gunasekar et al can prove gradient descent likes to find the points on F_{x,y} which optimize some regularizer.
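
Here is a minimal numpy sketch of this implicit bias in the simplest possible setting (my toy example, not Gunasekar’s actual experiments): for an underdetermined linear system, gradient descent started at the origin doesn’t land just anywhere on the solution space; it lands at one particular point, the minimum-L^2-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))   # 5 equations, 20 unknowns: many exact solutions
b = rng.normal(size=5)

# Plain gradient descent on the loss ||Ax - b||^2, started at the origin.
x = np.zeros(20)
for _ in range(20000):
    x -= 0.01 * A.T @ (A @ x - b)

# The solution set {x : Ax = b} is a 15-dimensional affine subspace, but
# gradient descent from 0 converges to one special point of it: the
# minimum-norm solution, computable directly via the pseudoinverse.
x_min_norm = np.linalg.pinv(A) @ b
print(np.linalg.norm(A @ x - b))       # essentially zero: we landed on F_{x,y}
print(np.linalg.norm(x - x_min_norm))  # essentially zero: at the min-norm point
```

The mechanism in the linear case is easy to see: every gradient A^T(Ax - b) lies in the row space of A, so an iterate started at 0 can never leave that row space, and the unique solution inside it is the minimum-norm one.  The interesting results are about when analogous statements survive in nonlinear settings.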

I was really struck by this outlook.  I have tended to think of function learning as a problem of optimization; how can you effectively minimize the training loss $||F(\mathbf{x}) - \mathbf{y}||$?  But Gunasekar asks us instead to think about the much richer mathematical structure of the dynamical system of gradient descent on X guided by the loss function.  (Or I should say dynamical systems; gradient descent comes in many flavors.)

The dynamical system has a lot more stuff in it!  Think about iterating a function; knowing the fixed points is one thing, but knowing which fixed points are stable and which aren’t, and knowing which stable points have big basins of attraction, tells you way more.

What’s more, the dynamical system formulation is much more natural for learning problems as they are so often encountered in life, with streaming rather than static training data.  If you are constantly observing more pairs (x_i,y_i), you don’t want to have to start over every second and optimize a new loss function!  But if you take the primary object of study to be, not the loss function, but the dynamical system on the hypothesis space X, new data is no problem; your gradient is just a longer and longer sum with each timestep (or you exponentially deweight the older data, whatever you want my friend, the world is yours.)

Anyway.  Loved this talk.  Maybe this dynamical framework is the way other people are already accustomed to think of it but it was news to me.

Slides for a talk of Gunasekar’s similar to the one she gave here

“Characterizing Implicit Bias in terms of Optimization Geometry” (2018)

“Convergence of Gradient Descent on Separable Data” (2018)

A little googling for gradient descent and dynamical systems shows me that, unsurprisingly, Ben Recht is on this train.

## Farblandia

The job fell to me of giving an overview talk about Benson Farb’s entire career at his birthday conference last fall.  Hard task!  I didn’t cover nearly everything but I think I gave a decent impression of what Farbisme is all about.

Update:  For WordPress reasons, you can’t watch the video within this page, but if you click through to Vimeo you can watch it!


## Wanlin Li, “Vanishing of hyperelliptic L-functions at the central point”

My Ph.D. student Wanlin Li has posted her first paper!  And it’s very cool.  Here’s the idea.  If chi is a real quadratic Dirichlet character, there’s no reason the special value L(1/2,chi) should vanish; the functional equation doesn’t enforce it, there’s no group whose rank is supposed to be the order of vanishing, etc.  And there’s an old conjecture of Chowla which says the special value never vanishes.  On the very useful principle that what needn’t happen doesn’t happen.

Alexandra Florea (last seen on the blog here) gave a great seminar here last year about quadratic L-functions over function fields, which gave Wanlin the idea of thinking about Chowla’s conjecture in that setting.  And something interesting developed — it turns out that Chowla’s conjecture is totally false!  OK, well, maybe not totally false.  Let’s put it this way.  If you count quadratic extensions of F_q(t) up to conductor N, Wanlin shows that at least c N^a of the corresponding L-functions vanish at the center of the critical strip.  The exponent a is either 1/2, 1/3, or 1/5, depending on q.  But it is never 1.  Which is to say that Wanlin’s theorem leaves open the possibility that only o(N) of the first N hyperelliptic L-functions vanish at the critical point.  In other words, a density form of Chowla’s conjecture over function fields might still be true — in fact, I’d guess it probably is.

The main idea is to use some algebraic geometry.  To say an L-function vanishes at 1/2 is to say some Frobenius eigenvalue which has to have absolute value q^{1/2} is actually equal to q^{1/2}.  In turn, this is telling you that the hyperelliptic curve over F_q whose L-function you’re studying has a map to some fixed elliptic curve.  Well, that’s something you can make happen by physically writing down equations!  Of course you also need a lower bound for the number of distinct quadratic extensions of F_q(t) that arise this way; this is the most delicate part.

I think it’s very interesting to wonder what the truth of the matter is.  I hope I’ll be back in a few months to tell you what new things Wanlin has discovered about it!

## How I won the talent show

So I don’t want to brag or anything but I won the neighborhood talent show tonight.  Look, here’s my trophy:

My act was “The Great Squarerootio.”  I drank beer and computed square roots in my head.  So I thought it might be fun to explain how to do this!  Old hat for my mathematician readers, but this is elementary enough for everyone.

Here’s how it works.  Somebody in the audience said “752.”  First of all, you need to know your squares.  I think I know them all up to 36^2 = 1296.  In this case, I can see that 752 is between 27^2 = 729 and 28^2 = 784, a little closer to 729.  So my first estimate is 27.  This is not too bad:  27^2 = 729 is 23 away from 752, so my error is 23.

Now here’s the rule that makes things go:

new estimate = old estimate + error/(2*old estimate).

In this case, the error is 23 and my old estimate is 27, so I should calculate 23/(2*27) = 23/54; well, 23 is just a bit less than 27, between 10 and 20% less, so 23/54 is a bit less than 1/2, but should be bigger than 0.4.

So our new estimate is “27.4 something.”

Actual value of 27 + 23/54: 27.4259…

so I probably would have gotten pretty close on the hundredths place if I’d taken more care with estimating the fraction.  Of course, another way to get even closer is to take 27.4 as your “old estimate” and repeat the process!  But then I’d have to know the square of 27.4, which I don’t know by heart, and computing it mentally would give me some trouble.
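
The whole act fits in a few lines of Python (a sketch of the mental routine above; `round(math.sqrt(n))` is just a stand-in for knowing your squares by heart):

```python
import math

def squarerootio(n, steps=1):
    """Estimate sqrt(n): start from the nearest integer, then
    repeatedly apply  new = old + error / (2 * old)."""
    est = float(round(math.sqrt(n)))   # stand-in for knowing your squares
    for _ in range(steps):
        error = n - est * est
        est = est + error / (2 * est)
    return est

print(squarerootio(752))           # 27.4259..., i.e. 27 + 23/54
print(squarerootio(752, steps=3))  # agrees with sqrt(752) = 27.4226... to many digits
```

Each extra step roughly doubles the number of correct digits, which is Newton’s method’s quadratic convergence in action.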

Why does this work?  The two-word answer is “Newton’s method.”  But let me give a more elementary answer.

Suppose x is our original estimate, and n is the number we’re trying to find the square root of, and e is the error; that is, n = x^2 + e.

We want to know how much we should change x in order to get a better estimate.  So let’s say we modify x by d.  Then

$(x+d)^2 = x^2 + 2xd + d^2$

Now let’s say we’ve wisely chosen x to be the closest integer to the square root of n, so d should be pretty small, less than 1 at any rate; then d^2 is really small.  So small I will decide to ignore it, and estimate

$(x+d)^2 \approx x^2 + 2xd$.

What we want is

$(x+d)^2 = n = x^2 +e$.

For this to be the case, we need e = 2xd, which tells us we should take d = e/2x; that’s our rule above!

Here’s another way to think about it.  We’re trying to compute 752^(1/2), right?  And 752 is 729+23.  So what is (729+23)^(1/2)?

If you remember the binomial theorem from Algebra 2, you might remember that it tells you how to compute (x+y)^n for any n.  Well, any positive whole number n, right?  That’s the thing!  No!  Any n!  It works for fractions too!  Your Algebra 2 teacher may have concealed this awesome fact from you!  I will not be so tight-lipped.

The binomial theorem says

$(x+y)^n = x^n + {n \choose 1} x^{n-1} y + {n \choose 2} x^{n-2} y^2 + \ldots$

Plugging in x=729,y=23,n=1/2 we get

$(729+23)^{1/2} = 729^{1/2} + {1/2 \choose 1} 729^{-1/2} \cdot 23 + {1/2 \choose 2} 729^{-3/2} \cdot 23^2 + \ldots$

Now 729^{1/2} we know; that’s 27.  What is the binomial coefficient 1/2 choose 1?  Is it the number of ways to choose 1 item out of 1/2 choices?  No, because that makes no sense.  Rather:  by “n choose 1” we just mean n, by “n choose 2” we just mean (1/2)n(n-1), etc.

So we get

$(729+23)^{1/2} = 27 + (1/2) \cdot 23 / 27 + (-1/8) \cdot 23^2 / 27^3 + \ldots$

And look, that second term is 23/54 again!  So the Great Squarerootio algorithm is really just “use the first two terms in the binomial expansion.”
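
If you want to check this numerically, here’s a sketch (`gen_binom` is the generalized binomial coefficient just described; the names are mine):

```python
from math import factorial

def gen_binom(n, k):
    """Generalized binomial coefficient n(n-1)...(n-k+1) / k!,
    which makes sense for fractional n."""
    prod = 1.0
    for i in range(k):
        prod *= n - i
    return prod / factorial(k)

def binom_sqrt(x, y, terms):
    """Partial sum of the series (x+y)^(1/2) = sum_k C(1/2,k) x^(1/2-k) y^k."""
    return sum(gen_binom(0.5, k) * x ** (0.5 - k) * y ** k for k in range(terms))

print(binom_sqrt(729, 23, 2))   # 27.4259..., the two-term Squarerootio answer
print(binom_sqrt(729, 23, 6))   # already sqrt(752) to about ten digits
```

The series converges fast because 23/729 is small: each successive term shrinks by a factor of roughly 30.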

To sum up:  if you know your squares up to 1000, and you can estimate fractions reasonably fast, you can get a pretty good mental approximation to the square root of any three-digit number.  Even while drinking beer!  You might even be able to beat out cute kids in your neighborhood and win a big honking cup!

## Trace test

Jose Rodriguez gave a great seminar here yesterday about his work on the trace test, a numerical way of identifying irreducible components of varieties.  In Jose’s world, you do a lot of work with homotopy; if a variety X intersects a linear subspace V in points p1, p2, .. pk, you can move V a little bit and numerically follow those k points around.  If you move V more than a little bit — say in a nice long path in the Grassmannian that loops around and returns to its starting point — you’ll come back to p1, p2, .. pk, but maybe in a different order.  In this way you can compute the monodromy of those points; if it’s transitive, and if you’re careful about avoiding some kind of discriminant locus, you’ve proven that p1, p2, .. pk are all on the same component of X.

But the trace test is another thing; it’s about short paths, not long paths.  For somebody like me, who never thinks about numerical methods, this means “oh we should work in the local ring.”  And then it says something interesting!  It comes down to this.  Suppose F(x,y) is a polynomial (not necessarily homogeneous) of total degree at most d over a field k.  Hitting it with a linear transformation if need be, we can assume the x^d term is nonzero.  Now think of F as an element of k((y))[x]:  namely

$F = x^d + a_1(y) x^{d-1} + \ldots + a_d(y).$

Letting K be the algebraic closure of k((y)), we can then factor F as (x-r_1) … (x-r_d).  Each of these roots can be computed explicitly to any desired precision by Hensel’s lemma.  While the r_i individually may live in some algebraic extension of k((y)), the sum of the r_i is -a_1(y), which is a linear function A+by.

Suppose you are wondering whether F factors in k[x,y], and whether, for instance, r_1 and r_2 are the roots of an irreducible factor of F.  For that to be true, r_1 + r_2 must be a linear function of y!  (In Jose’s world, you grab a set of points, you homotopy them around, and observe that if they lie on an irreducible component, their centroid moves linearly as you translate the plane V.)

Anyway, you can falsify this easily; it’s enough for e.g. the quadratic term of r_1 + r_2 to be nonzero.  If you want to prove F is irreducible, you just check that every proper subset of the r_i sums to something nonlinear.

1.  Is this something I already know in another guise?
2.  Is there a nice way to express the condition (which implies irreducibility) that no proper subset of the r_i sums to something with zero quadratic term?
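
Here’s a numerical toy version of this test (my own sketch, not Jose’s software): take a visibly reducible F, track its roots at sample values of y, then fit each subset-sum with a quadratic in y and look at the leading coefficient.

```python
import numpy as np

# F(x, y) = (x^2 - (2 + y)) * (x - (3 + y)), a reducible cubic in x of
# total degree 3, so a_1(y) (and hence the sum of all roots) is linear in y.
def coeffs_in_x(y):
    a = 2 + y        # the quadratic factor is x^2 - a
    b = 3 + y        # the linear factor is x - b
    return [1.0, -b, -a, a * b]   # x^3 - b x^2 - a x + a b

ys = np.linspace(-0.1, 0.1, 9)
base = np.sort(np.roots(coeffs_in_x(0.0)).real)   # roots at y = 0, fixed order
tracked = []
for y in ys:
    r = np.roots(coeffs_in_x(y))
    order = [int(np.argmin(np.abs(r - b0))) for b0 in base]  # match to nearest
    tracked.append(r[order].real)
tracked = np.array(tracked)   # row = a value of y, column = a tracked root r_i(y)

def quad_coeff(values):
    # leading coefficient of a degree-2 polynomial fit in y
    return np.polyfit(ys, values, 2)[0]

# r_1 + r_2 for the two roots of the quadratic factor: linear in y (in fact 0)
print(quad_coeff(tracked[:, 0] + tracked[:, 1]))  # essentially zero
# a mixed subset, not the root set of any factor: -sqrt(2+y) + (3+y)
print(quad_coeff(tracked[:, 0] + tracked[:, 2]))  # nonzero: quadratic term survives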

## Good math days

I have good math days and bad math days; we all do.  An outsider might think the good math days are the days when you have good ideas.  That’s not how it works, at least for me.  You have good ideas on the bad math days, too; but one at a time.  You have an idea, you try it, you make some progress, it doesn’t work, your mind says “too bad.”

On the good math days, you have an idea, you try it, it doesn’t work, you click over to the next idea, you get over the obstacle that was blocking you, then you’re stuck again, you ask your mind “What’s the next thing to do?” you get the next idea, you take another step, and you just keep going.

You don’t feel smarter on the good math days.  It’s not even momentum, exactly, because it’s not a feeling of speed.  More like:  the feeling of being a big, heavy, not very fast vehicle, with very large tires, that’s just going to keep on traveling, over a bump, across a ditch, through a river, continually and inexorably moving in a roughly fixed direction.


## When random people give money to random other people

A post on Decision Science about a problem of Uri Wilensky’s has been making the rounds:

Imagine a room full of 100 people with 100 dollars each. With every tick of the clock, every person with money gives a dollar to one randomly chosen other person. After some time progresses, how will the money be distributed?

People often expect the distribution to be close to uniform.  But this isn’t right; the simulations in the post show clearly that inequality of wealth rapidly appears and then persists (though each individual person bobs up and down from rich to poor.)  What’s going on?  Why would this utterly fair and random process generate winners and losers?
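
The process is easy to simulate; here is a minimal sketch of the setup described above (parameters as in the quote):

```python
import random

random.seed(0)
N, start = 100, 100
wealth = [start] * N

for _ in range(5000):
    # each person with money picks a random *other* person to receive $1;
    # all the gifts in a tick happen simultaneously
    recipients = []
    for i in range(N):
        if wealth[i] > 0:
            j = random.randrange(N - 1)
            if j >= i:
                j += 1            # a random person other than i
            recipients.append(j)
            wealth[i] -= 1
    for j in recipients:
        wealth[j] += 1

wealth.sort()
print(wealth[0], wealth[-1])   # a wide spread: somebody near broke, somebody rich
```

Total money is conserved, but after a few thousand ticks the gap between the poorest and richest player is large and stays large.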

Here’s one way to think about it.  The possible states of the system are the tuples of nonnegative integers (m_1, .. m_100) summing to 10,000; if you like, the lattice points inside a simplex.  (From now on, let’s write N for 100 because who cares if it’s 100?)

The process is a random walk on a graph G, whose vertices are these states and where two vertices are connected if you can get from one to the other by taking a dollar from one person and giving it to another.  We are asking:  when you run the random walk for a long time, where are you on this graph?  Well, we know what the stationary distribution for random walk on an undirected graph is; it gives each vertex a probability proportional to its degree.  On a regular graph, you get uniform distribution.

Our state graph G isn’t regular, but the irregularity is easy to describe:  the nodes of maximal degree N(N-1) are exactly the states in which nobody’s out of money, and these make up about 1/e of the total.  Indeed, the number of states is

${N^2 + N - 1 \choose N-1}$

and, of these, the ones where nobody’s out of money are counted by handing each person a dollar and then distributing the remaining N^2 - N dollars freely, giving

${N^2 - 1 \choose N-1}$

of them; so the proportion of states where nobody’s out of money is about

$\frac{(N^2 - 1)^N}{(N^2 + N - 1)^N} \sim (1-1/N)^N \sim 1/e$.

So, apart from those states where somebody’s broke, in the long run every possible state is equally likely; we are just as likely to see \$9,901 in one person’s hands and everybody else with \$1 as we are to see exact equidistribution again.

What is a random lattice point in this simplex like?  Good question!  An argument just like the one above shows that the probability nobody goes below \$c is on the order of e^{-c}, at least when c is small relative to N; in other words, it’s highly likely that somebody’s very nearly out of money.
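
Both estimates are easy to spot-check by sampling uniform lattice points of the simplex via the stars-and-bars bijection (my sketch, not from the original post):

```python
import random
from math import exp

random.seed(1)
N, T, trials = 100, 100 ** 2, 4000

def random_state():
    """Uniform composition of T into N nonnegative parts, via stars and bars:
    choose N-1 'bar' positions among the T+N-1 slots."""
    cuts = sorted(random.sample(range(1, T + N), N - 1))
    bounds = [0] + cuts + [T + N]
    return [b - a - 1 for a, b in zip(bounds, bounds[1:])]

mins = [min(random_state()) for _ in range(trials)]
for c in (1, 2):
    frac = sum(m >= c for m in mins) / trials
    print(c, frac, exp(-c))   # the last two columns should roughly agree
```

With c = 1 this is the 1/e computation above; larger c just repeats the trick c times.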

If X is the maximal amount of money held by any player, what’s the distribution of X?  I didn’t immediately see how to figure this out.  You might consider the continuous version, where you pick a point at random from the real simplex

$\{ (x_1, \ldots, x_N) \in \mathbf{R}^N : x_i \geq 0, \sum x_i = N^2 \}$.

Equivalently; break a stick at N-1 randomly chosen points; what is the length of the longest piece?  This is a well-studied problem; the mean size of the longest piece is about N log N.  So I guess I think maybe that’s the expected value of the net worth of the richest player?
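
A quick simulation of the stick-breaking problem (mine): for a stick of length N^2 broken into N pieces, the classical exact answer for the mean longest piece is (N^2/N) H_N = N H_N, which for N = 100 is about 519, the same order as N log N ≈ 461.

```python
import random

random.seed(2)
N, total, trials = 100, 100 ** 2, 2000

longest = []
for _ in range(trials):
    cuts = sorted(random.uniform(0, total) for _ in range(N - 1))
    edges = [0.0] + cuts + [float(total)]
    longest.append(max(b - a for a, b in zip(edges, edges[1:])))

mean_longest = sum(longest) / trials
H = sum(1 / k for k in range(1, N + 1))
print(mean_longest, total / N * H)   # both around 519
```

So if the continuous approximation is legitimate, the richest player should typically hold on the order of \$500 out of the \$10,000 in play.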

But it’s not obvious to me whether you can safely approximate the finite problem by its continuous limit (which corresponds to the case where we keep the number of players at N but reduce the step size so that each player can give each other a cent, or a picocent, or whatever.)

What happens if you give each of the N players just one dollar?  Now the uniformity really breaks down, because it’s incredibly unlikely that nobody’s broke.  The probability distribution on the set of (m_1, .. m_N) summing to N assigns each vector a probability proportional to the size of its support (i.e. the number of m_i that are nonzero.)  That must be a well-known distribution, right?  What does the corresponding distribution on partitions of N look like?

Update:  Kenny Easwaran points out that this is basically the same computation physicists do when they compute the Boltzmann distribution, which was new to me.


## Rational points on solvable curves over Q via non-abelian Chabauty (with Daniel Hast)

New paper up!  With my Ph.D. student Daniel Hast (last seen on the blog here.)

We prove that hyperelliptic curves over Q of genus at least 2 have only finitely many rational points.  Actually, we prove this for a more general class of high-genus curves over Q, including all solvable covers of P^1.

But wait, don’t we already know that, by Faltings?  Of course we do.  So the point of the paper is to show that you can get this finiteness in a different way, via the non-abelian Chabauty method pioneered by Kim.  And I think it seems possible in principle to get Faltings for all curves over Q this way; though I don’t know how to do it.

## Multiple height zeta functions?

Idle speculation ensues.

Let X be a projective variety over a global field K, which is Fano — that is, its anticanonical bundle is ample.  Then we expect, and in lots of cases know, that X has lots of rational points over K.  We can put these points together into a height zeta function

$\zeta_X(s) = \sum_{x \in X(K)} H(x)^{-s}$

where H(x) is the height of x with respect to the given projective embedding.  The height zeta function organizes information about the distribution of the rational points of X, and in favorable circumstances (e.g. if X is a homogeneous space) it has the handsome analytic properties we have come to expect from something called a zeta function.  (Nice survey by Chambert-Loir.)

What if X is a variety with two (or more) natural ample line bundles, e.g. a variety that sits inside P^m x P^n?  Then there are two natural height functions H_1 and H_2 on X(K), and we can form a “multiple height zeta function”

$\zeta_X(s,t) = \sum_{x \in X(K)} H_1(x)^{-s} H_2(x)^{-t}$

There is a whole story of “multiple Dirichlet series” which studies functions like

$\sum_{m,n} (\frac{m}{n}) m^{-s} n^{-t}$

where $(\frac{m}{n})$ denotes the Legendre symbol.  These often have interesting analytic properties that you wouldn’t see if you fixed one variable and let the other move; for instance, they sometimes have finite groups of functional equations that commingle the s and the t!

So I just wonder:  are there situations where the multiple height zeta function is an “analytically interesting” multiple Dirichlet series?

Here’s a case to consider:  what if X is the subvariety of P^2 x P^2 cut out by the equation

$x_0 y_0 + x_1 y_1 + x_2 y_2 = 0$?

This has something to do with Eisenstein series on GL_3 but I am a bit confused about what exactly to say.
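
Out of curiosity, here’s a brute-force count of points of X of small bi-height (a sketch; I take the height of a point of P^2(Q) to be the max absolute value of a primitive integral representative):

```python
from math import gcd
from itertools import product

def proj_points(B):
    """One primitive integer representative (first nonzero coordinate
    positive) for each point of P^2(Q) of height at most B."""
    pts = []
    for v in product(range(-B, B + 1), repeat=3):
        if v == (0, 0, 0):
            continue
        if gcd(gcd(abs(v[0]), abs(v[1])), abs(v[2])) != 1:
            continue
        if next(c for c in v if c != 0) < 0:
            continue
        pts.append(v)
    return pts

B = 3
pts = proj_points(B)
# points of X inside P^2 x P^2: pairs (x, y) with x0*y0 + x1*y1 + x2*y2 = 0
pairs = [(x, y) for x in pts for y in pts
         if x[0] * y[0] + x[1] * y[1] + x[2] * y[2] == 0]
print(len(pts), len(pairs))
```

Tabulating the counts as both height bounds grow independently would give a crude numerical picture of the two-variable asymptotics that $\zeta_X(s,t)$ is supposed to encode.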