Sebastien Roch gave a beautiful and inspiring talk here yesterday about the problem of reconstructing an evolutionary tree given genetic data about present-day species. It was generally thought that keeping track of pairwise comparisons between species was not going to be sufficient to determine the tree efficiently; Roch has proven that it’s just the opposite. His talk gave me a lot to think about. I’m going to try to record a probably corrupted, certainly filtered through my own viewpoint account of Roch’s idea.

So let’s say we have n points P_1, … P_n, which we believe are secretly the leaves of a tree. In fact, let’s say that the edges of the tree are assigned lengths. In other words, there is a secret ultrametric on the finite set P_1, … P_n, which we wish to learn. In the phylogenetic case, the points are species, and the ultrametric distance d(P_i, P_j) between P_i and P_j measures how far back in the evolutionary tree we need to go to find a comon ancestor between species i and species j.

One way to estimate d(P_i, P_j) is to study the correlation between various markers on the genomes of the two species. This correlation, in Roch’s model, is going to be on order

exp(-d(P_i,P_j))

which is to say that it is very close to 0 when P_i and P_j are far apart, and close to 1 when the two species have a recent common ancestor. What that means is that short distances are way easier to measure than long distances — you have no chance of telling the difference between a correlation of exp(-10) and exp(-11) unless you have a huge number of measurements at hand. Another way to put it: the error bar around your measurement of d(P_i,P_j) is much greater when your estimate is small than when your estimate is high; in particular, at great enough distance you’ll have no real confidence in any upper bound for the distance.

So the problem of estimating the metric accurately seems impossible except in small neighborhoods. But it isn’t. Because metrics are not just arbitrary symmetric n x n matrices. And ultrametrics are not just arbitrary metrics. They satisfy the ultrametric inequality

d(x,y) <= max(d(x,z),d(y,z)).

And this helps a lot. For instance, suppose the number of measurements I have is sufficient to estimate with high confidence whether or not a distance is less than 1, but totally helpless with distances on order 5. So if my measurements give me an estimate d(P_1, P_2) = 5, I have no real idea whether that distance is actually 5, or maybe 4, or maybe 100 — I can say, though, that it’s that it’s probably not 1.

So am I stuck? I am not stuck! Because the distances are not independent of each other; they are yoked together under the unforgiving harness of the ultrametric inequality. Let’s say, for instance, that I find 10 other points Q_1, …. Q_10 which I can confidently say are within 1 of P_1, and 10 other points R_1, .. , R_10 which are within 1 of P_2. Then the ultrametric inequality tells us that

d(Q_i, R_j) = d(P_1, P_2)

for any one of the 100 ordered pairs (i,j)! So I have 100 times as many measurements as I thought I did — and this might be enough to confidently estimate d(P_1,P_2).

In biological terms: if I look at a bunch of genetic markers in a shrimp and a dog, it may be hard to estimate how far back in time one has to go to find their common ancestor. But the common ancestor of a shrimp and a dog is presumably also the common ancestor of a lobster and a wolf, or a clam and a jackal! So even if we’re only measuring a few markers per species, we can still end up with a reasonable estimate for the age of the proto-shrimp-dog.

What do you need if you want this to work? You need a reasonably large population of points which are close together. In other words, you want small neighborhoods to have a lot of points in them. And what Roch finds is that there’s a threshold effect; if the mutation rate is too fast relative to the amount of measurement per species you do, then you don’t hit “critical mass” and you can’t bootstrap your way up to a full high-confidence reconstruction of the metric.

This leads one to a host of interesting questions — interesting to me, that is, albeit not necessarily interesting for biology. What if you want to estimate a metric from pairwise distances but you don’t know it’s an ultrametric? Maybe instead you have some kind of hyperbolicity constraint; or maybe you have a prior on possible metrics which weights “closer to ultrametric” distances more highly. For that matter, is there a principled way to test the hypothesis that a measured distance is in fact an ultrametric in the first place? All of this is somehow related to this previous post about metric embeddings and the work of Eriksson, Darasathy, Singh, and Nowak.