## Bobrowski-Kahle-Skraba on the null hypothesis in persistent homology

I really like persistent homology; it’s a very beautiful idea, a way to look for structure in data when you really don’t have any principled way to embed it in Euclidean space (or, even when it does come embedded in Euclidean space, to find the kind of structure that doesn’t depend too much on the embedding).

But because I like it, I want to see it done well, so I have some minor complaints!

Complaint one:  Persistent homology, applied to H_0 only, is clustering, and we know a lot about clustering already.  (Update:  As commenters point out, this is really only so for persistent homology computed on the Vietoris-Rips complex of a point cloud, the “classical case…”!)  Not to say that the ideas of persistence can’t be useful here at all (I have some ideas about directed graphs I want to eventually work out) but my sense is that people are not craving new clustering algorithms.  I really like the work that tries to grapple with the topology of the data in its fullness; I was really charmed, for instance, by Ezra Miller’s piece about the persistent homology of fruit fly wings.  (There’s a lot of nice stuff about geometric probability theory, too — e.g., how do you take the “average” of a bunch of graded modules for k[x,y], which you may think of as noisy measurements of some true module you want to estimate?)

My second complaint is the lack of understanding of the null hypothesis.  You have some point cloud, you make a barcode, you see some bars that look long, you say they’re features — but why are you so sure?  How long would bars be under the null hypothesis that the data has no topological structure at all?  You kind of have to know this in order to do good inference.  Laura Balzano and I did a little numerical investigation of this years ago but now Omer Bobrowski, Matthew Kahle, and Primoz Skraba have proved a theorem!  (Kahle’s cool work in probabilistic topology has appeared several times before on Quomodocumque…)

They show that if you sample points from a uniform Poisson process on the unit cube of intensity n (i.e. you expect n points) the longest bar in the H_k barcode has

(death radius / birth radius) ~ [(log n)/(log log n)]^(1/k).

That is really short!  And it makes me feel like there actually is something going on when you see a long bar in a barcode in practice.
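To get a feel for just how slowly that bound grows, here is a small sketch that evaluates the growth rate $(\log n / \log\log n)^{1/k}$ for a few values of n and k. The function name `max_persistence_scale` is my own, and the theorem only pins the longest bar down up to constant factors, so these numbers indicate scale rather than predict an actual death/birth ratio.

```python
import math

def max_persistence_scale(n: int, k: int) -> float:
    """Growth rate (log n / log log n)^(1/k), ignoring constant factors."""
    if n < 3 or k < 1:
        raise ValueError("need n >= 3 (so that log log n > 0) and k >= 1")
    return (math.log(n) / math.log(math.log(n))) ** (1.0 / k)

# Tabulate the scale for H_1, H_2, H_3 as the expected number of points grows.
for n in (10**3, 10**6, 10**9):
    print(n, [round(max_persistence_scale(n, k), 2) for k in (1, 2, 3)])
```

Even with a million points and k = 1 the scale is only around 5, and it shrinks further as k goes up.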

## Random simplicial complexes

This is a post about Matt Kahle’s cool paper “Sharp vanishing thresholds for cohomology of random flag complexes,” which has just been accepted in the Annals.

The simplest way to make a random graph is to start with n vertices and then, for each pair (i,j) independently, put an edge between vertices i and j with probability p.  That’s called the Erdös-Rényi graph G(n,p), after the two people who first really dug into its properties.  What’s famously true about Erdös-Rényi graphs is that there’s a sharp threshold for connectedness.  Imagine n being some fixed large number and p varying from 0 to 1 along a slider.  When p is very small relative to n, G(n,p) is very likely to be disconnected; in fact, if

$p = (0.9999) \frac{\log n}{n}$

there is very likely to be an isolated vertex, which makes G(n,p) disconnected all by itself.

On the other hand, if

$p = (1.0001) \frac{\log n}{n}$

then G(n,p) is almost surely connected!  In other words, the probability of connectedness “snaps” from 0 to 1 as you cross the barrier p = (log n)/n.  Of course, there are lots of other interesting questions you can ask — what exactly happens very near the “phase transition”?  For p < (log n)/n, what do the components look like?  (Answer:  for some range of p there is, with probability 1, a single “giant component” much larger than all others.  For instance, when p = 1/n the giant component has size around n^{2/3}.)
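The threshold is easy to poke at numerically. The sketch below samples G(n,p) with p = c (log n)/n and reports how often the graph comes out connected; the helper names (`is_connected`, `sample_gnp`, `connected_fraction`) and the choices of n, c, and trial count are all mine. At sizes a laptop can handle the transition is blurrier than the asymptotic statement, so I use c = 0.5 and c = 2 rather than 0.9999 and 1.0001.

```python
import math
import random

def is_connected(n, edges):
    """Union-find connectivity check on vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components == 1

def sample_gnp(n, p, rng):
    """Edge list of one draw from G(n,p): each pair kept independently."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

def connected_fraction(n, c, trials, seed=0):
    """Fraction of sampled graphs that are connected at p = c * log(n) / n."""
    rng = random.Random(seed)
    p = min(1.0, c * math.log(n) / n)
    hits = sum(is_connected(n, sample_gnp(n, p, rng)) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for c in (0.5, 1.0, 2.0):
        print(c, connected_fraction(n=300, c=c, trials=20))
```

One would expect the fraction to sit near 0 for c = 0.5 and near 1 for c = 2, with c = 1 somewhere in between at this size.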

I think it’s safe to say that the Erdös-Rényi graph is the single most-studied object in probabilistic combinatorics.

But Kahle asked a very interesting question about it that was completely new to me.  Namely:  what if you consider the flag complex X(n,p), a simplicial complex whose k-simplices are precisely the (k+1)-cliques in G(n,p)?  X(n,p) is connected precisely when G(n,p) is, so there’s nothing new to say from that point of view.  But, unlike the graph, the complex has lots of interesting higher homology groups!  The connectedness threshold says that dim H_0(X(n,p)) is 1 above some sharp threshold and larger below it.  What Kahle proves is that a similar threshold exists for all the homology.  Namely, for each k there’s a range (bounded approximately by $n^{-1/k}$ and $(\log n / n)^{1/(k+1)}$) such that H_k(X(n,p)) vanishes when p is outside the range, but not when p is inside the range!  So there are two phase transitions; first, H_k appears, then it disappears.  (If I understand correctly, there’s a narrow window where two consecutive Betti numbers are nonzero, but most of the time there’s only one nonzero Betti number.)  Here’s a graph showing the appearance and disappearance of Betti numbers in different ranges of p:
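If you want to watch Betti numbers appear and disappear for yourself, here is a rough sketch that builds the flag complex of a small random graph and computes its Betti numbers. It takes a k-simplex to be a clique on k+1 vertices (the usual convention), enumerates cliques by brute force, and works with coefficients mod 2 because that keeps the linear algebra down to XORs on bitmasks. Kahle's theorem concerns rational cohomology, so mod 2 is a swapped-in simplification that could disagree in the presence of 2-torsion. All function names are mine, and at these tiny values of n nobody should expect sharp thresholds.

```python
import random
from itertools import combinations

def cliques(n, adj, size):
    """All sorted vertex tuples of the given size that are pairwise adjacent."""
    return [s for s in combinations(range(n), size)
            if all(v in adj[u] for u, v in combinations(s, 2))]

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are stored as int bitmasks."""
    basis = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]
    return len(basis)

def betti_numbers(n, edges, max_dim):
    """Mod-2 Betti numbers b_0..b_max_dim of the flag complex of a graph."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # simplices[d] holds the d-simplices, i.e. the cliques on d+1 vertices.
    simplices = [cliques(n, adj, d + 1) for d in range(max_dim + 2)]
    index = [{s: i for i, s in enumerate(level)} for level in simplices]
    # ranks[d] = rank of the boundary map from d-simplices to (d-1)-simplices.
    ranks = [0]
    for d in range(1, max_dim + 2):
        rows = []
        for s in simplices[d]:
            mask = 0
            for face in combinations(s, d):  # drop one vertex at a time
                mask |= 1 << index[d - 1][face]
            rows.append(mask)
        ranks.append(rank_gf2(rows))
    return [len(simplices[d]) - ranks[d] - ranks[d + 1]
            for d in range(max_dim + 1)]

if __name__ == "__main__":
    rng = random.Random(0)
    n = 20
    for p in (0.1, 0.3, 0.5, 0.7):
        edges = [(i, j) for i, j in combinations(range(n), 2)
                 if rng.random() < p]
        print(p, betti_numbers(n, edges, max_dim=2))
```

As a sanity check, the flag complex of the octahedron graph is a 2-sphere, and `betti_numbers` returns `[1, 0, 1]` on it.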

This kind of “higher Erdös-Rényi theorem” is, to me, quite dramatic and unexpected.  (One consequence that I like a lot:  if you condition on the complex having dimension d, i.e. d+1 being the size of the largest clique in G(n,p), then with probability 1 the homology of the complex is supported in middle degree, just as you might want!)  And there’s other stuff there too — like a threshold for the fundamental group of X(n,p) to have property T.

For yet more about this area, see Kahle’s recent survey on the topology of random simplicial complexes.  The probability that a random graph has a spectral gap, the distribution of Betti numbers of X(n,p) in the regime where they’re nonzero, the behavior of torsion, etc., etc……

## The hardest Rush Hour position

It takes 93 moves to solve, per this paper by Collette, Raskin, and Servais.  I tried it and got nowhere.

You can think of the space of all possible configurations of vehicles as, well, a configuration space, not unlike the configuration spaces of disks in a box.  But here there is a bit less topology; the space is just a graph, with two configurations made adjacent if one can be reached from the other by making a single move.  The connected component of configuration space containing the “hardest case” shown here has 24,132 vertices.

I wonder what this graph looks like?   What does the path of the cars look like as you traverse the 93-step path; do most of the cars traverse most of their range?  How many of the possible configurations of the 13 vehicles (constrained to stay in the given rows and columns, and in the same linear order when two share a row or column) are actually contained in this component?  Maybe Matt Kahle knows.  By the way, another Matt Kahle-like fact is that among the list of the hardest configurations are some which are not so dense at all, like this one with only 9 cars.  It looks like it should be easy, but apparently it takes 83 moves to solve!
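Questions like these can be explored by brute force, since the component is small. Here is a minimal breadth-first-search sketch over the configuration graph. It assumes a 6x6 board and that one "move" means sliding a single vehicle any distance along its row or column (my assumption about how moves are counted). The two-vehicle board at the bottom is a made-up toy example, not the position from the paper, and the names `occupied`, `neighbors`, and `explore` are mine.

```python
from collections import deque

SIZE = 6  # assumed standard 6x6 board

# A vehicle is (orientation, fixed_line, length): an "H" vehicle lives in row
# fixed_line and slides along columns; a "V" vehicle lives in column fixed_line.
# A state is a tuple giving each vehicle's smallest occupied coordinate.

def occupied(vehicles, state, skip=None):
    """Set of (row, col) cells covered by every vehicle except index `skip`."""
    cells = set()
    for idx, ((orient, line, length), pos) in enumerate(zip(vehicles, state)):
        if idx == skip:
            continue
        for t in range(pos, pos + length):
            cells.add((line, t) if orient == "H" else (t, line))
    return cells

def neighbors(vehicles, state):
    """States reachable by sliding one vehicle any distance (one move)."""
    for idx, ((orient, line, length), pos) in enumerate(zip(vehicles, state)):
        blocked = occupied(vehicles, state, skip=idx)
        for step in (-1, 1):
            new = pos + step
            while 0 <= new and new + length <= SIZE:
                # Only the newly entered cell needs checking at each step.
                t = new if step == -1 else new + length - 1
                cell = (line, t) if orient == "H" else (t, line)
                if cell in blocked:
                    break
                yield state[:idx] + (new,) + state[idx + 1:]
                new += step

def explore(vehicles, start):
    """BFS over the connected component of `start`; returns {state: moves}."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in neighbors(vehicles, s):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return dist

# Toy board: a length-2 car in row 2 must reach columns 4-5; a length-3 truck
# in column 4 starts on rows 1-3 and blocks the way.
toy_vehicles = [("H", 2, 2), ("V", 4, 3)]
toy_start = (0, 1)
toy_dist = explore(toy_vehicles, toy_start)
solved = [d for s, d in toy_dist.items() if s[0] == SIZE - 2]
print(len(toy_dist), min(solved))
```

The dictionary returned by `explore` contains every configuration in the component along with its distance from the start, so component size, how much of its range each vehicle actually uses, and the length of the shortest solution can all be read off from it.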

## Rush Hour, Jr.

OK, so a black toddler and a Chinese toddler stumble on an international drug-trafficking ring — no, actually, this is a game I just bought for CJ, a kid’s version of Nob Yoshigahara’s classic game Rush Hour.  The object here is to get the small white truck to the edge of the board (the top edge, in the image here).  The trucks in your way can’t move sideways or turn; they just go forward and back.

You play a captivating game like this and naturally you start abstracting out the underlying math problem.  Play Set enough and you can’t avoid thinking about affine caps.  Rush Hour has more to do with the geometry of configuration spaces; it reminds me of the “disk in a box” problems that people like Persi Diaconis and Matt Kahle work on.

So here’s a question — it doesn’t capture all the features of Rush Hour, but let’s start here.  Let X be the unit square, and let c be a parameter between 0 and 1, and let N be a large integer.  Let L be the set of line segments in X which are either horizontal of the form y = i/N or vertical of the form x = i/N, for i = 0, 1, …, N.  A traffic jam is a choice of a length-c interval in each of the 2N+2 line segments in L, where we require that these intervals be pairwise disjoint.  The traffic jams naturally form a topological space, which we call T(N,c).  We say an interval (x,i/N),(x+c,i/N) in a traffic jam t is trapped if no traffic jam in the connected component of t contains the interval (0,i/N),(c,i/N).
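To make the definition concrete, here is one way to test whether a choice of intervals is a traffic jam. This is my own reading of the setup: a[i] is the left endpoint of the interval on the line y = i/N, b[j] is the bottom endpoint of the interval on the line x = j/N, and intervals are closed. Parallel intervals sit on different lines, so they never meet. The only possible collisions are between the horizontal interval on y = i/N and the vertical interval on x = j/N, and those happen exactly when both intervals contain the crossing point (j/N, i/N). The name `is_traffic_jam` is hypothetical.

```python
from fractions import Fraction

def is_traffic_jam(a, b, N, c):
    """True if the offsets a (horizontal) and b (vertical) give disjoint intervals."""
    if len(a) != N + 1 or len(b) != N + 1:
        return False
    # Every length-c interval has to fit inside the unit square.
    if any(t < 0 or t + c > 1 for t in list(a) + list(b)):
        return False
    for i in range(N + 1):
        for j in range(N + 1):
            x, y = Fraction(j, N), Fraction(i, N)  # where line i meets line j
            on_horizontal = a[i] <= x <= a[i] + c
            on_vertical = b[j] <= y <= b[j] + c
            if on_horizontal and on_vertical:
                return False
    return True

N, c = 2, Fraction(1, 4)
# Horizontal intervals [1/8, 3/8] miss every vertical line x = 0, 1/2, 1.
print(is_traffic_jam([Fraction(1, 8)] * 3, [0, 0, 0], N, c))  # True
# Here the intervals on y = 0 and x = 0 both contain the corner (0, 0).
print(is_traffic_jam([0, 0, 0], [0, 0, 0], N, c))  # False
```

In these coordinates T(N,c) is a subset of the cube [0,1-c]^(2N+2), cut out by removing one closed box for each pair (i,j).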

Questions: For which values of (N,c) is T(N,c) connected?  In particular, is it connected almost always once it’s nonempty?  If not, when does T(N,c) have a “giant component”?  If there’s an interesting range of parameters where T(N,c) is not connected, what proportion of intervals do we expect to be trapped?