Category Archives: math

The Lovasz number of the plane is about 3.48287

As seen in this comment on Polymath and explicated further in Fernando de Oliveira Filho’s thesis, section 4.4.

I actually spent much of today thinking about this so let me try to explain it in a down-to-earth way, because it involved me thinking about Bessel functions for the first time ever, surely a life event worthy of recording.

So here’s what we’re going to do.  As I mentioned last week, you can express this problem as follows:  suppose you have a map h: R^2 -> V, for some normed vector space V, which is a unit-distance embedding; that is, if |x-x’|_{R^2} = 1, then |h(x)-h(x’)|_V = 1.  (We don’t ask that h is an isometry, only that it preserves the distance-1 set.)

Then let t be the squared radius of the smallest hypersphere in V containing h(R^2).

Then any graph G embeddable in R^2 with all edges of length 1 is sent to a unit-distance graph in V contained in the hypersphere of squared radius t; this turns out to be equivalent to saying the Lovasz number of G (ok, really I mean the Lovasz number of the complement of G) is at most 1/(1-2t).  So we want to show that t is bounded below 1/2, is the point.  Or rather:  we can find a V and a map from R^2 to V to make this the case.

So here’s one!  Let V be the space of L^2 functions on R^2 with the usual inner product.  Choose a square-integrable function F on R^2 — in fact let’s normalize to make F^2 integrate to 1 — and for each a in R^2 we let h(a) be the function F(x-a).

We want the distance between F(x-a) and F(x-b) to be the same for every pair of points at distance 1 from each other; the easiest way to arrange that is to insist that F(x) be a radially symmetric function F(x) = f(|x|); then it’s easy to see that the distance between F(x-a) and F(x-b) in V is a function G(a-b) which depends only on |a-b|.  Abusing notation slightly (below, r stands for any vector of length r), we write

g(r) = \int_{\mathbf{R}^2} F(x)F(x-r) dx

so that the squared distance between F(x) and F(x-r) is

\int F(x)^2 dx - 2\int F(x)F(x-r) dx + \int F(x-r)^2 dx = 2(1-g(r)).

In particular, if two points in R^2 are at distance 1, the squared distance between their images in V is 2(1-g(1)).  Note also that g(0) is the square integral of F, which is 1.
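
(A quick numerical sanity check, not part of the argument: take F to be a radially symmetric Gaussian, normalized so F^2 integrates to 1; then g(r) works out to e^{-r^2/2}, and a rough grid computation confirms that the squared distance between F and its translate by a unit vector is 2(1-g(1)).  The Gaussian is my choice, and a useless one for the bound below since its g(1) is positive; it just checks the bookkeeping.  Grid size and box are arbitrary.)

```python
import numpy as np

# crude grid approximation to integrals over R^2; box and resolution are arbitrary
xs = np.linspace(-6, 6, 601)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)

F = np.sqrt(2 / np.pi) * np.exp(-(X**2 + Y**2))           # radially symmetric, integral of F^2 equal to 1
print((F**2).sum() * dx**2)                                # ~1.0, i.e. g(0) = 1

Fr = np.sqrt(2 / np.pi) * np.exp(-((X - 1.0)**2 + Y**2))   # F shifted by a vector of length 1
g1 = (F * Fr).sum() * dx**2
print(g1, np.exp(-1 / 2))                                  # both ~0.6065: g(1) = e^{-1/2} for this F
print(((F - Fr)**2).sum() * dx**2, 2 * (1 - g1))           # squared distance equals 2(1 - g(1))
```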

What kind of hypersphere encloses all the points F(x-a) in V?  We can just go ahead and take the “center” of our hypersphere to be 0; since |F| = 1, every point in h(R^2) lies in (indeed, lies on) the sphere of radius 1 around the origin.

Hey but remember:  we want to study a unit-distance embedding of R^2 in V.  Right now, h sends points at unit distance to points at squared distance 2(1-g(1)), whatever that is.  We can fix that by dividing h by the square root of that number.  So now h sends unit distances to unit distances, and its image is enclosed in a hypersphere of squared radius

(2(1-g(1)))^{-1}

The more negative g(1) is, the smaller this sphere is, which means the more we can “fold” R^2 into a small space.  Remember, the relationship between hypersphere number and Lovasz theta is

2t + 1/\theta = 1

and plugging in the above bound for the hypersphere number, we find that the Lovasz theta number of R^2, and thus the Lovasz theta number of any unit-distance graph in R^2, is at most

1-1/g(1).
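
(To spell out that last step: plugging t = (2(1-g(1)))^{-1} into 2t + 1/\theta = 1 gives 1/\theta = 1 - 1/(1-g(1)) = -g(1)/(1-g(1)), and hence \theta \le (1-g(1))/(-g(1)) = 1 - 1/g(1), which is a genuine bound as long as g(1) is negative, which is the case we care about.)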

So the only question is — what is g(1)?

Well, that depends on what g is.

Which depends on what F is.

Which depends on what f is.

And of course we get to choose what f is, in order to make g(1) as negative as possible.

How do we do this?  Well, here’s the trick.  The function G is not arbitrary; if it were, we could make g(1) whatever we wanted.  It’s not hard to see that G is what’s called a positive definite function on R^2.  And moreover, if G is positive definite, there exists some f giving rise to it.  (Roughly speaking, this is the fact that a positive definite symmetric matrix has a square root.)  So we ask:  if G is a positive definite (radially symmetric) function on R^2, and g(0) = 1, how small can g(1) be?
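
(Another check of my own: positive definiteness here means that for any finite set of points a_1, …, a_n in R^2, the matrix with entries G(a_i - a_j) is positive semidefinite.  With the Gaussian g from the check above, that’s easy to see numerically:)

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-3, 3, size=(40, 2))                          # arbitrary points in the plane
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)  # matrix of pairwise distances

g = lambda r: np.exp(-r**2 / 2)          # the Gaussian autocorrelation from the check above
print(np.linalg.eigvalsh(g(D)).min())    # smallest eigenvalue of (g(|a_i - a_j|)) is >= 0, up to rounding
```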

And now there’s an old theorem of (Wisconsin’s own!) Isaac Schoenberg which helpfully classifies the positive definite functions on R^2; they are precisely the functions G(x) = g(|x|) where g is a mixture of scalings of the Bessel function $J_0$:

g(r) = \int_0^\infty J_0(ur) A(u) \, du

for some everywhere nonnegative A(u).  (Actually it’s more correct to say that A is a distribution and we are integrating J_0(ur) against a non-decreasing measure.)

So g(1) can be no smaller than the minimum value of J_0 on [0,infty), and in fact can be essentially that small if you let A become narrowly supported around the minimum argument.  This is basically just taking g to be a rescaled version of J_0 which achieves its minimum at 1.  That minimum value is about -0.4028, and so the Lovasz theta for any unit-distance subgraph of the plane is bounded above by a number that’s about 1 + 1/0.4028 = 3.48287, the number in the title.
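
(If you want to see the number in the title come out of a computer, here is a two-line scipy computation; the bracketing interval [2, 5] is chosen by hand to contain the global minimum of J_0, which sits near 3.83, the first zero of J_1.)

```python
from scipy.optimize import minimize_scalar
from scipy.special import j0

res = minimize_scalar(j0, bounds=(2, 5), method="bounded")   # global minimum of J_0 lies in [2, 5]
print(res.x, res.fun)        # about 3.8317 and -0.402759
print(1 - 1 / res.fun)       # about 3.48287, the bound on the Lovasz number of the plane
```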

To sum up:  I give you a set of points in the plane, I connect every pair that’s at distance 1, and I ask how you can embed that graph in a small hypersphere keeping all the distances 1.  And you say:  “Oh, I know what to do, just assign to each point a the radially symmetric Bessel function J_0(c|x-a|) on R^2, where c is the argument at which J_0 takes its minimum; the embedding of your graph in the finite-dimensional space of functions spanned by those Bessel translates will do the trick!”

That is cool!

Remark: Oliveira’s thesis does this for Euclidean space of every dimension (it gets more complicated.)  And I think (using analysis I haven’t really tried to understand) he doesn’t just give an upper bound for the Lovasz number of the plane as I do in this post, he really computes that number on the nose.

Update:  DeCorte, Oliveira, and Vallentin just posted a relevant paper on the arXiv this morning!


What is the Lovasz number of the plane?

There are lots of interesting invariants of a graph which bound its chromatic number!  Most famous is the Lovász number, which asks, roughly:  I attach vectors v_x to each vertex x such that v_x and v_y are orthogonal whenever x and y are adjacent, and I try to stuff all those vectors into as small a cone as possible; the half-angle of that cone tells you the Lovász number, which gets bigger and bigger as the smallest cone gets closer and closer to a hemisphere.

Here’s an equivalent formulation:  If G is a graph and V(G) its vertex set, I try to find a function f: V(G) -> R^d, for some d, such that

|f(x) - f(y)| = 1 whenever x and y are adjacent.

This is called a unit distance embedding, for obvious reasons.

The hypersphere number t(G) of the graph is the squared radius of the smallest sphere containing a unit distance embedding of G.  Computing t(G) is equivalent to computing the Lovász number, but let’s not worry about that now.  I want to generalize it a bit.  We say a finite sequence (t_1, t_2, t_3, … ,t_d) is big enough for G if there’s a unit-distance embedding of G contained in an ellipsoid with major radii t_1^{1/2}, t_2^{1/2}, …, t_d^{1/2}.  (We could also just consider infinite sequences with all but finitely many terms zero, that would be a little cleaner.)

Physically I think of it like this:  the graph is trying to fold itself into Euclidean space and fit into a small region, with the constraint that the edges are rigid and have to stay length 1.

Sometimes it can fold a lot!  Like if it’s bipartite.  Then the graph can totally fold itself down to a line segment of length 1, with all the black vertices going to one end and the white vertices going to the other.  And the big enough sequences are just those with some entry of size at least 1/4, since a unit segment fits in a sphere of squared radius 1/4 and no smaller.

On the other hand, if G is a complete graph on k vertices, a unit-distance embedding has to be a regular simplex, whose circumsphere has squared radius (1/2)(1-1/k); so certainly anything with k-1 of the t_i of size at least (1/2)(1-1/k) is big enough.   (Is that an if and only if?  To know this I’d have to know whether an ellipse containing an equilateral triangle can have a radius shorter than that of the circumcircle.)

Let’s face it, it’s confusing to think about ellipsoids circumscribing embedded graphs, so what about instead we define t(p,G) to be the minimum value of the L^p norm of (t_1, t_2, …) over ellipsoids enclosing a unit-distance embedding of G.

Then a graph has a unit-distance embedding in the plane iff t(0,G) <= 2.  And t(oo,G) is just the hypersphere number again, right?  If G has a k-clique then t(p,G) >= t(p,K_k) for any p, while if G has a k-coloring (i.e. a map to K_k) then t(p,G) <= t(p,K_k) for any p.  In particular, a regular k-simplex with unit edges fits into a sphere of squared radius (1/2)(1-1/k), so t(oo,G) <= (1/2)(1-1/k) whenever G is k-colorable.
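
(A quick check of that last number, in case you don’t trust my simplex geometry: put the k vertices at e_i/sqrt(2) in R^k, so all pairwise distances are 1, and compute the squared distance from the centroid to a vertex.  This is just a numerical sketch of that computation, nothing more.)

```python
import numpy as np
from itertools import combinations

for k in range(2, 7):
    P = np.eye(k) / np.sqrt(2)           # k points in R^k, all pairwise distances equal to 1
    assert all(abs(np.linalg.norm(P[i] - P[j]) - 1) < 1e-12
               for i, j in combinations(range(k), 2))
    c = P.mean(axis=0)                   # circumcenter of a regular simplex is its centroid
    print(k, np.linalg.norm(P[0] - c)**2, (k - 1) / (2 * k))   # squared circumradius, both ways
```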

So… what’s the relation between these invariants?  Is there a graph with t(0,G) = 2 and t(oo,G) > 2/5?  If so, there would be a non-5-colorable unit distance graph in the plane.  But I guess the relationship between these various “norms” feels interesting to me irrespective of any relation to plane-coloring.  What is the max of t(oo,G) with t(0,G)=2?

The intermediate t(p,G) all give functions which upper-bound clique number and lower-bound chromatic number; are any of them interesting?  Are any of them easily calculable, like the Lovász number?

Remarks:

  1.  I called this post “What is the Lovász number of the plane?” but the question of “how big can t(oo,G) be if t(0,G)=2”? is more a question about finite subgraphs of the plane and their Lovász numbers.  Another way to ask “What is the Lovász number of the plane” would be to adopt the point of view that the Lovász number of a graph has to do with extremizers on the set of positive semidefinite matrices whose (i,j) entry is nonzero only when i and j are adjacent vertices or i=j.  So there must be some question one could ask about the space of positive semidefinite symmetric kernels K(x,y) on R^2  x R^2 which are supported on the locus ||x-y||=1 and the diagonal, which question would rightly be called “What is the Lovász number of the plane?” But I’m not sure what it is.
  2. Having written this, I wonder whether it might be better, rather than thinking about enclosing ellipsoids of a set of points in R^d, just to think of the n points as an n x d matrix X and compute the eigenvalues of X^T X (that is, the squared singular values of X), which would be kind of an “approximating ellipsoid” to the points.  Maybe later I’ll think about what that would measure.  Or you can!


The chromatic number of the plane is at least 5

That is:  any coloring of the plane with four colors has two points at distance 1 from each other.  So says a paper just posted by Aubrey de Grey.

The idea:  given a set S of points in the plane, its unit distance graph G_S is the graph whose vertices are S and where two points are adjacent if they’re at distance 1 in the plane.  If you can find S such that G_S has chromatic number k, then the chromatic number of the plane is at least k.  And de Grey finds a set of 1,581 points whose unit distance graph can’t be 4-colored.
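
(If you want to play with this yourself, here is a small sketch of my own, not anything from de Grey’s paper: build the unit distance graph of a point set and brute-force its colorability.  The points below form the Moser spindle, the classical 7-vertex unit-distance graph with chromatic number 4; de Grey’s graph is the same game played with far more points and far better search tools.)

```python
import numpy as np
from itertools import combinations, product

def rhombus(theta):
    """Two unit equilateral triangles glued along an edge: an acute vertex at the
    origin, long diagonal (length sqrt(3)) pointing in direction theta."""
    A = np.array([0.0, 0.0])
    D = np.sqrt(3) * np.array([np.cos(theta), np.sin(theta)])
    perp = 0.5 * np.array([-np.sin(theta), np.cos(theta)])
    return [A, D / 2 + perp, D / 2 - perp, D]

# hinge two rhombi at the origin, rotated so their far tips end up at distance exactly 1
phi = 2 * np.arcsin(1 / (2 * np.sqrt(3)))
S = rhombus(0.0) + rhombus(phi)[1:]          # 7 points: the Moser spindle

def unit_distance_edges(points, tol=1e-9):
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if abs(np.linalg.norm(points[i] - points[j]) - 1.0) < tol]

def colorable(n, edges, k):
    return any(all(c[i] != c[j] for i, j in edges) for c in product(range(k), repeat=n))

E = unit_distance_edges(S)
print(len(E))                 # 11 edges
print(colorable(7, E, 3))     # False: not 3-colorable
print(colorable(7, E, 4))     # True: so the chromatic number is 4
```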

It’s known that the chromatic number of the plane is at most 7.  Idle question:  is there any chance of a “polynomial method”-style proof that there is no subset S of the plane whose unit distance graph has chromatic number 7?  Such a graph would have a lot of unit distances, and ruling out lots of repetitions of the same distance is something the polynomial method can in principle do.

Though be warned:  as far as I know the polynomial method has generated no improvement so far on older bounds on the unit distance problem (“how many unit distances can there be among pairs drawn from S?”) while it has essentially solved the distinct distance problem (“how few distinct distances can there be among pairs drawn from S?”)


Which pictures do children draw with straight lines?

Edray Goins gave a great colloquium today about his work on dessins d’enfants.  And in this talk there was a picture that surprised me.  It was one of the ones on the lower right of this poster.  Here, I’ll put in a screen shot:

Let me tell you what you’re looking at.  You are looking at elliptic curves E admitting a Belyi map f: E -> P^1, which is to say a map ramified only over 0, 1, and infinity.  For each such map, the blue graph is f^{-1}([0,1]), the preimage of the line segment joining 0 and 1 in P^1(R).

In four of these cases, the graph is piecewise linear!  I didn’t know there were examples like this.  Don’t know if this is easy, but:  for which Belyi maps (of any genus, not just genus 1) is f^{-1}([0,1]) a union of geodesics?


Suriya Gunasekar, optimization geometry, loss minimization as dynamical system

Awesome SILO seminar this week by Suriya Gunasekar of TTI Chicago.  Here’s the idea, as I understand it.  In a classical optimization problem, like linear regression, you are trying to solve a problem which typically has no solution (draw a line that passes through every point in this cloud!) and the challenge is to find the best approximate solution.  Algebraically speaking:  you might be asked to solve

Ax = b

for x; but since b may not be in the image of the linear transformation A, you settle for minimizing

||Ax-b||

in whatever norm you like (L^2 for standard linear regression.)

In many modern optimization problems, on the other hand, the problem you’re trying to solve may have a lot more degrees of freedom.  Maybe you’re setting up an RNN with lots and lots and lots of parameters.  Or maybe, to bring this down to earth, you’re trying to pass a curve through lots of points but the curve is allowed to have very high degree.  This has the advantage that you can definitely find a curve that passes through all the points.  But it also has the disadvantage that you can definitely find a curve that passes through all the points.  You are likely to overfit!  Your wildly wiggly curve, engineered to exactly fit the data you trained on, is unlikely to generalize well to future data.

Everybody knows about this problem, everybody knows to worry about it.  But here’s the thing.  A lot of modern problems are of this form, and yet the optima we find on training data often do generalize pretty well to test data!  Why?

Make this more formal.  Let’s say for the sake of argument you’re trying to learn a real-valued function F, which you hypothesize is drawn from some giant space X.  (Not necessarily a vector space, just any old space.)  You have N training pairs (x_i, y_i), and a good choice for F might be one such that F(x_i) = y_i.  So you might try to find F such that

F(x_i) = y_i

for all i.  But if X is big enough, there will be a whole space of functions F which do the trick!  The solution set to

F(\mathbf{x}) = \mathbf{y}

will be some big subspace F_{x,y} of X.  How do you know which of these F’s to pick?

One popular way is to regularize; you decide that some elements of X are just better than others, and choose the point of F_{x,y} that optimizes that objective.  For instance, if you’re curve-fitting, you might try to find, among those curves passing through your N points, the least wiggly one (e.g. the one with the least total curvature.)  Or you might optimize for some combination of hitting the points and non-wiggliness, arriving at a compromise curve that wiggles only mildly and still passes near most of the points.  (The ultimate version of this strategy would be to retreat all the way back to linear regression.)

But it’s not obvious what regularization objective to choose, and maybe trying to optimize that objective is yet another hard computational problem, and so on and so on.  What’s really surprising is that something much simpler often works pretty well.  Namely:  how would you find F such that F(x) = y in the first place?  You would choose some random F in X, then do some version of gradient descent.  Find the direction in the tangent space to X at F that decreases ||F(\mathbf{x})-\mathbf{y}|| most steeply, perturb F a bit in that direction, lather, rinse, repeat.

If this process converges, it ought to get you somewhere on the solution space F_{x,y}. But where?  And this is really what Gunasekar’s work is about.  Even if your starting F is distributed broadly, the distribution of the spot where gradient descent “lands” on F_{x,y} can be much more sharply focused.  In some cases, it’s concentrated on a single point!  The “likely targets of gradient descent” seem to generalize better to test data, and in some cases Gunasekar et al can prove gradient descent likes to find the points on F_{x,y} which optimize some regularizer.
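
(The toy case I find clarifying, which is standard and not specifically from the talk: gradient descent on an underdetermined least-squares loss ||Ax - b||^2, started at x = 0, lands not at a random point of the solution space but at the minimum-L^2-norm solution, because every gradient step stays in the row space of A.  In other words, plain gradient descent from zero already acts like an L^2 regularizer without being asked to.  A sketch, with arbitrary sizes, seed, and step size:)

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 50))       # 10 equations, 50 unknowns: a 40-dimensional space of exact solutions
b = rng.standard_normal(10)

x = np.zeros(50)                        # start at the origin
for _ in range(20000):
    x -= 0.005 * A.T @ (A @ x - b)      # gradient step for (1/2)||Ax - b||^2

x_star = np.linalg.pinv(A) @ b          # the minimum-norm exact solution
print(np.linalg.norm(A @ x - b))        # ~0: we landed on the solution space
print(np.linalg.norm(x - x_star))       # ~0: and at its minimum-norm point
```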

I was really struck by this outlook.  I have tended to think of function learning as a problem of optimization; how can you effectively minimize the training loss ||F(x) - y||?  But Gunasekar asks us instead to think about the much richer mathematical structure of the dynamical system of gradient descent on X guided by the loss function.  (Or I should say dynamical systems; gradient descent comes in many flavors.)

The dynamical system has a lot more stuff in it!  Think about iterating a function; knowing the fixed points is one thing, but knowing which fixed points are stable and which aren’t, and knowing which stable points have big basins of attraction, tells you way more.

What’s more, the dynamical system formulation is much more natural for learning problems as they are so often encountered in life, with streaming rather than static training data.  If you are constantly observing more pairs (x_i,y_i), you don’t want to have to start over every second and optimize a new loss function!  But if you take the primary object of study to be, not the loss function, but the dynamical system on the hypothesis space X, new data is no problem; your gradient is just a longer and longer sum with each timestep (or you exponentially deweight the older data, whatever you want my friend, the world is yours.)

Anyway.  Loved this talk.  Maybe this dynamical framework is the way other people are already accustomed to think of it but it was news to me.

Slides for a talk of Gunasekar’s similar to the one she gave here

“Characterizing Implicit Bias in terms of Optimization Geometry” (2018)

“Convergence of Gradient Descent on Separable Data” (2018)

A little googling for gradient descent and dynamical systems shows me that, unsurprisingly, Ben Recht is on this train.


Farblandia

The job fell to me of giving an overview talk about Benson Farb’s entire career at his birthday conference last fall.  Hard task!  I didn’t cover nearly everything but I think I gave a decent impression of what Farbisme is all about.

Update:  For WordPress reasons, you can’t watch the video within this page, but if you click through to Vimeo you can watch it!


Wanlin Li, “Vanishing of hyperelliptic L-functions at the central point”

My Ph.D. student Wanlin Li has posted her first paper!  And it’s very cool.  Here’s the idea.  If chi is a real quadratic Dirichlet character, there’s no reason the special value L(1/2,chi) should vanish; the functional equation doesn’t enforce it, there’s no group whose rank is supposed to be the order of vanishing, etc.  And there’s an old conjecture of Chowla which says the special value never vanishes.  On the very useful principle that what needn’t happen doesn’t happen.

Alexandra Florea (last seen on the blog here) gave a great seminar here last year about quadratic L-functions over function fields, which gave Wanlin the idea of thinking about Chowla’s conjecture in that setting.  And something interesting developed — it turns out that Chowla’s conjecture is totally false!  OK, well, maybe not totally false.  Let’s put it this way.  If you count quadratic extensions of F_q(t) up to conductor N, Wanlin shows that at least c N^a of the corresponding L-functions vanish at the center of the critical strip.  The exponent a is either 1/2, 1/3, or 1/5, depending on q.  But it is never 1.  Which is to say that Wanlin’s theorem leaves open the possibility that o(N) of the first N hyperelliptic L-functions vanish at the critical point.  In other words, a density form of Chowla’s conjecture over function fields might still be true — in fact, I’d guess it probably is.

The main idea is to use some algebraic geometry.  To say an L-function vanishes at 1/2 is to say some Frobenius eigenvalue which has to have absolute value q^{1/2} is actually equal to q^{1/2}.  In turn, this is telling you that the hyperelliptic curve over F_q whose L-function you’re studying has a map to some fixed elliptic curve.  Well, that’s something you can make happen by physically writing down equations!  Of course you also need a lower bound for the number of distinct quadratic extensions of F_q(t) that arise this way; this is the most delicate part.

I think it’s very interesting to wonder what the truth of the matter is.  I hope I’ll be back in a few months to tell you what new things Wanlin has discovered about it!


How I won the talent show

So I don’t want to brag or anything but I won the neighborhood talent show tonight.  Look, here’s my trophy:

My act was “The Great Squarerootio.”  I drank beer and computed square roots in my head.  So I thought it might be fun to explain how to do this!  Old hat for my mathematician readers, but this is elementary enough for everyone.

Here’s how it works.  Somebody in the audience said “752.”  First of all, you need to know your squares.  I think I know them all up to 36^2 = 1296.  In this case, I can see that 752 is between 27^2 = 729 and 28^2 = 784, a little closer to 729.  So my first estimate is 27.  This is not too bad:  27^2 = 729 is 23 away from 752, so my error is 23.

Now here’s the rule that makes things go:

new estimate = old estimate + error/(2*old estimate).

In this case, the error is 23 and my old estimate is 27, so I should calculate 23/(2*27) = 23/54; well, 23 is just a bit less than 27, between 10 and 20% less, so 23/54 is a bit less than 27/54 = 1/2, but should be bigger than 0.4.

So our new estimate is “27.4 something.”

Actual answer: 27.4226…

Actual value of 27 + 23/54: 27.4259…

so I probably would have gotten pretty close on the hundredths place if I’d taken more care with estimating the fraction.  Of course, another way to get even closer is to take 27.4 as your “old estimate” and repeat the process!  But then I’d have to know the square of 27.4, which I don’t by heart, and computing it mentally would give me some trouble.
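
(Here is the whole act in a few lines of Python, if you want to check my arithmetic; the only cheat is that the computer looks up the nearest square instead of knowing the squares by heart.)

```python
import math

def squarerootio(n):
    """One step of the Great Squarerootio rule: nearest square, then x + e/(2x)."""
    x = min(range(1, 37), key=lambda k: abs(k * k - n))   # nearest integer square root, up to 36
    e = n - x * x                                          # the error (can be negative)
    return x + e / (2 * x)

print(squarerootio(752))   # 27.4259...
print(math.sqrt(752))      # 27.4226...
```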

Why does this work?  The two-word answer is “Newton’s method.”  But let me give a more elementary answer.

Suppose x is our original estimate, and n is the number we’re trying to find the square root of, and e is the error; that is, n = x^2 + e.

We want to know how much we should change x in order to get a better estimate.  So let’s say we modify x by d.  Then

(x+d)^2 = x^2 + 2xd + d^2

Now let’s say we’ve wisely chosen x to be the closest integer to the square root of n, so d should be pretty small, less than 1 at any rate; then d^2 is really small.  So small I will decide to ignore it, and estimate

(x+d)^2 \approx x^2 + 2xd.

What we want is

(x+d)^2 = n = x^2 + e.

For this to be the case, we need e = 2xd, which tells us we should take d = e/(2x); that’s our rule above!

Here’s another way to think about it.  We’re trying to compute 752^(1/2), right?  And 752 is 729+23.  So what is (729+23)^(1/2)?

If you remember the binomial theorem from Algebra 2, you might remember that it tells you how to compute (x+y)^n for any n.  Well, any positive whole number n, right?  That’s the thing!  No!  Any n!  It works for fractions too!  Your Algebra 2 teacher may have concealed this awesome fact from you!  I will not be so tight-lipped.

The binomial theorem says

(x+y)^n = x^n + {n \choose 1} x^{n-1} y + {n \choose 2} x^{n-2} y^2 + \ldots

Plugging in x=729,y=23,n=1/2 we get

(729+23)^{1/2} = 729^{1/2} + {1/2 \choose 1} 729^{-1/2} \cdot 23 + {1/2 \choose 2} 729^{-3/2} \cdot 23^2 + \ldots

Now 729^{1/2} we know; that’s 27.  What is the binomial coefficient 1/2 choose 1?  Is it the number of ways to choose 1 item out of 1/2 choices?  No, because that makes no sense.  Rather:  by “n choose 1” we just mean n, by “n choose 2” we just mean (1/2)n(n-1), etc.

So we get

(729+23)^{1/2} = 27 + (1/2) \cdot 23 / 27 + (-1/8) \cdot 23^2 / 27^3 + \ldots

And look, that second term is 23/54 again!  So the Great Squarerootio algorithm is really just “use the first two terms in the binomial expansion.”
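
(And the series version: a few lines of Python that compute the generalized binomial coefficients and add up the first few terms of (729+23)^{1/2}.  The series converges because 23 is so much smaller than 729.  Two terms is exactly the Squarerootio estimate; three terms already agrees with the true value to about four decimal places.)

```python
def gen_binom(alpha, k):
    """Generalized binomial coefficient: alpha(alpha-1)...(alpha-k+1)/k!."""
    out = 1.0
    for i in range(k):
        out *= (alpha - i) / (i + 1)
    return out

def binom_sqrt(x, y, terms):
    """Partial sum of (x+y)^(1/2) = sum_k C(1/2, k) x^(1/2-k) y^k."""
    return sum(gen_binom(0.5, k) * x ** (0.5 - k) * y ** k for k in range(terms))

print(binom_sqrt(729, 23, 2))   # 27.425926..., the two-term Squarerootio estimate
print(binom_sqrt(729, 23, 3))   # 27.422566..., already very close
print(752 ** 0.5)               # 27.422618..., the true value
```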

To sum up:  if you know your squares up to 1000, and you can estimate fractions reasonably fast, you can get a pretty good mental approximation to the square root of any three-digit number.  Even while drinking beer!  You might even be able to beat out cute kids in your neighborhood and win a big honking cup!


Trace test

Jose Rodriguez gave a great seminar here yesterday about his work on the trace test, a numerical way of identifying irreducible components of varieties.  In Jose’s world, you do a lot of work with homotopy; if a variety X intersects a linear subspace V in points p1, p2, .. pk, you can move V a little bit and numerically follow those k points around.  If you move V more than a little bit — say in a nice long path in the Grassmannian that loops around and returns to its starting point — you’ll come back to p1, p2, .. pk, but maybe in a different order.  In this way you can compute the monodromy of those points; if it’s transitive, and if you’re careful about avoiding some kind of discriminant locus, you’ve proven that p1,p2…pk are all on the same component of X.

But the trace test is another thing; it’s about short paths, not long paths.  For somebody like me, who never thinks about numerical methods, this means “oh we should work in the local ring.”  And then it says something interesting!  It comes down to this.  Suppose F(x,y) is a polynomial (not necessarily homogeneous) of degree at most d over a field k.  Hitting it with a linear transformation if need be, we can assume the x^d term is nonzero.  Now think of F as an element of k((y))[x]:  namely

F = x^d + a_1(y) x^{d-1} + \ldots + a_d(y).

Letting K be the algebraic closure of k((y)), we can then factor F as (x-r_1) … (x-r_d).  Each of these roots can be computed explicitly to any desired precision by Hensel’s lemma.  While the r_i may be power series in y (or lie in some algebraic extension of k((y))), the sum of the r_i is -a_1(y), which is a linear function A+by.

Suppose you are wondering whether F factors in k[x,y], and whether, for instance, r_1 and r_2 are the roots of an irreducible factor of F.  For that to be true, r_1 + r_2 must be a linear function of y!  (In Jose’s world, you grab a set of points, you homotopy them around, and observe that if they lie on an irreducible component, their centroid moves linearly as you translate the plane V.)

Anyway, you can falsify this easily; it’s enough for e.g. the quadratic term of r_1 + r_2 to be nonzero.  If you want to prove F is irreducible, you just check that every proper subset of the r_i sums to something nonlinear.
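
(Here is a toy numerical version of that test, entirely mine and not Jose’s software; the polynomial is made up for the example.  Compute the roots in x numerically for a range of small y, and fit the sum of each pair of roots by a quadratic in y: the pair coming from a single irreducible factor sums to something linear, so its fitted quadratic coefficient is essentially zero, and the other pairs are nowhere near zero.)

```python
import numpy as np
from itertools import combinations

def roots_at(y):
    # F(x, y) = (x^2 - x - y) * (x - 2 - y - 3y^2), monic of degree 3 in x
    p = np.polymul([1, -1, -y], [1, -(2 + y + 3 * y**2)])
    return np.sort(np.roots(p).real)          # three real, well-separated roots near y = 0

ys = np.linspace(-0.05, 0.05, 11)
R = np.array([roots_at(y) for y in ys])       # column i tracks the i-th root branch

for pair in combinations(range(3), 2):
    s = R[:, list(pair)].sum(axis=1)
    quad = np.polyfit(ys, s, 2)[0]            # quadratic coefficient of the pair's sum
    print(pair, round(quad, 4))
# only the pair of branches coming from x^2 - x - y has (essentially) zero quadratic
# coefficient; that's the trace test flagging them as the roots of one irreducible factor.
```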

  1.  Is this something I already know in another guise?
  2.  Is there a nice way to express the condition (which implies irreducibility) that no proper subset of the r_i sums to something with zero quadratic term?


Good math days

I have good math days and bad math days; we all do.  An outsider might think the good math days are the days when you have good ideas.  That’s not how it works, at least for me.  You have good ideas on the bad math days, too; but one at a time.  You have an idea, you try it, you make some progress, it doesn’t work, your mind says “too bad.”

On the good math days, you have an idea, you try it, it doesn’t work, you click over to the next idea, you get over the obstacle that was blocking you, then you’re stuck again, you ask your mind “What’s the next thing to do?” you get the next idea, you take another step, and you just keep going.

You don’t feel smarter on the good math days.  It’s not even momentum, exactly, because it’s not a feeling of speed.  More like:  the feeling of being a big, heavy, not very fast vehicle, with very large tires, that’s just going to keep on traveling, over a bump, across a ditch, through a river, continually and inexorably moving in a roughly fixed direction.
