Tag Archives: ultrametrics

Roch on phylogenetic trees, learning ultrametrics from noisy measurements, and the shrimp-dog

Sebastien Roch gave a beautiful and inspiring talk here yesterday about the problem of reconstructing an evolutionary tree given genetic data about present-day species.  It was generally thought that keeping track of pairwise comparisons between species was not going to be sufficient to determine the tree efficiently; Roch has proven that it’s just the opposite.  His talk gave me a lot to think about.  I’m going to try to record an account of Roch’s idea, probably corrupted, and certainly filtered through my own viewpoint.

So let’s say we have n points P_1, …, P_n, which we believe are secretly the leaves of a tree.  In fact, let’s say that the edges of the tree are assigned lengths.  In other words, there is a secret ultrametric on the finite set P_1, …, P_n, which we wish to learn.  In the phylogenetic case, the points are species, and the ultrametric distance d(P_i, P_j) between P_i and P_j measures how far back in the evolutionary tree we need to go to find a common ancestor of species i and species j.

One way to estimate d(P_i, P_j) is to study the correlation between various markers on the genomes of the two species.  This correlation, in Roch’s model, is going to be on order

exp(-d(P_i,P_j))

which is to say that it is very close to 0 when P_i and P_j are far apart, and close to 1 when the two species have a recent common ancestor.  What that means is that short distances are way easier to measure than long distances — you have no chance of telling the difference between a correlation of exp(-10) and exp(-11) unless you have a huge number of measurements at hand.  Another way to put it:  the error bar around your measurement of d(P_i,P_j) is much greater when your estimate is large than when it is small; in particular, at great enough distance you’ll have no real confidence in any upper bound for the distance.
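
A toy simulation makes the blowup vivid (my numbers, not anything from Roch’s actual model): suppose each of m markers agrees between the two species independently with probability exp(-d), and we estimate d as minus the log of the observed agreement rate.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200  # markers measured per pair of species (a made-up number)

def spread_of_estimate(d_true, trials=2000):
    # each marker "agrees" independently with probability exp(-d_true);
    # estimate d as minus the log of the observed agreement rate
    agree_rate = rng.binomial(m, np.exp(-d_true), size=trials) / m
    agree_rate = np.clip(agree_rate, 1 / m, None)  # dodge log(0)
    return (-np.log(agree_rate)).std()

for d in [1, 2, 3, 5]:
    print(f"d = {d}: std of estimated distance ~ {spread_of_estimate(d):.2f}")
```

The estimate at d = 1 is quite sharp, while at d = 5 the typical number of agreeing markers is near zero and the estimate is essentially useless.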

So the problem of estimating the metric accurately seems impossible except in small neighborhoods.  But it isn’t.  Because metrics are not just arbitrary symmetric n x n matrices.  And ultrametrics are not just arbitrary metrics.  They satisfy the ultrametric inequality

d(x,y) <= max(d(x,z),d(y,z)).

And this helps a lot.  For instance, suppose the number of measurements I have is sufficient to estimate with high confidence whether or not a distance is less than 1, but totally helpless with distances on order 5.  So if my measurements give me an estimate d(P_1, P_2) = 5, I have no real idea whether that distance is actually 5, or maybe 4, or maybe 100 — I can say, though, that it’s probably not 1.

So am I stuck?  I am not stuck!  Because the distances are not independent of each other; they are yoked together under the unforgiving harness of the ultrametric inequality.  Let’s say, for instance, that I find 10 other points Q_1, …, Q_10 which I can confidently say are within 1 of P_1, and 10 other points R_1, …, R_10 which are within 1 of P_2.  Then the ultrametric inequality tells us that

d(Q_i, R_j) = d(P_1, P_2)

for any one of the 100 ordered pairs (i,j)!  (Why equality and not just an inequality?  In an ultrametric space every triangle is isosceles with the two longest sides equal; since d(P_1,P_2) is bigger than both d(Q_i,P_1) and d(R_j,P_2), it has to be tied with d(Q_i,R_j).)  So I have 100 times as many measurements as I thought I did — and this might be enough to confidently estimate d(P_1,P_2).
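
In the same toy model as above, here’s what the pooling buys: if all 100 pairs (Q_i, R_j) really do share the distance d(P_1,P_2) = 5, and we optimistically treat their marker comparisons as independent, combining them is like having 100 times as many markers.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d_true, trials = 200, 5.0, 2000  # same made-up marker count as above

def spread(n_pairs):
    # pool all marker comparisons across n_pairs point-pairs, every one
    # of which has true distance d_true thanks to the ultrametric inequality
    total = m * n_pairs
    rate = rng.binomial(total, np.exp(-d_true), size=trials) / total
    return (-np.log(np.clip(rate, 1 / total, None))).std()

print("1 pair:    std ~ %.2f" % spread(1))
print("100 pairs: std ~ %.2f" % spread(100))
```

(In real life the comparisons aren’t independent, since the Q_i share most of their evolutionary history, so this overstates the gain; but the qualitative point stands.)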

In biological terms:  if I look at a bunch of genetic markers in a shrimp and a dog, it may be hard to estimate how far back in time one has to go to find their common ancestor.  But the common ancestor of a shrimp and a dog is presumably also the common ancestor of a lobster and a wolf, or a clam and a jackal!  So even if we’re only measuring a few markers per species, we can still end up with a reasonable estimate for the age of the proto-shrimp-dog.

What do you need if you want this to work?  You need a reasonably large population of points which are close together.  In other words, you want small neighborhoods to have a lot of points in them.  And what Roch finds is that there’s a threshold effect; if the mutation rate is too fast relative to the amount of measurement you do per species, then you don’t hit “critical mass” and you can’t bootstrap your way up to a full high-confidence reconstruction of the metric.

This leads one to a host of interesting questions — interesting to me, that is, albeit not necessarily interesting for biology.  What if you want to estimate a metric from pairwise distances but you don’t know it’s an ultrametric? Maybe instead you have some kind of hyperbolicity constraint; or maybe you have a prior on possible metrics which weights those “closer to ultrametric” more highly.  For that matter, is there a principled way to test the hypothesis that a measured distance function is in fact an ultrametric in the first place?  All of this is somehow related to this previous post about metric embeddings and the work of Eriksson, Dasarathy, Singh, and Nowak.
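
On that last question, I don’t know what a principled test looks like, but the crudest relevant statistic is easy to compute: the worst violation of the ultrametric inequality over all triples.  A minimal sketch:

```python
import numpy as np

def ultrametric_violation(D):
    """Worst failure of d(x,y) <= max(d(x,z), d(y,z)) over all triples
    (x, y, z); returns 0 exactly when D is an ultrametric."""
    worst = 0.0
    for z in range(len(D)):
        bound = np.maximum.outer(D[:, z], D[:, z])  # max(d(x,z), d(y,z))
        worst = max(worst, float((D - bound).max()))
    return worst
```

What the right hypothesis test is, given an error model like the one above, is exactly the kind of question I mean.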


Compressed sensing, compressed MDS, compressed clustering, and my talk tomorrow

I’m talking tomorrow at the spanking-new SILO seminar (“Systems, Learning, Information, and Optimization”) in the spanking-new Wisconsin Institute for Discovery.  As you may know, I don’t do any research in any of the four subjects appearing in the acronym!  But I’ve been following some work in the area ever since last year’s very successful Topology and Data seminar.  Here’s one semi-baked idea that will appear in the talk.

World’s most hurried introduction to compressed sensing (see Terry’s blog for a more thorough account).

Think of an image on an N x N grid of pixels as a vector in R^(N^2).  We know that images can be compressed — that is, we can specify an image with reasonable accuracy using much less information.  That’s because most real-world images lie on or near a manifold in R^(N^2) of dimension much lower than N^2.

It’s already great that we can get away with fewer than N^2 coordinates to specify our image.  But it gets better still.  Maybe it’s expensive to record all N^2 pixels.  If we want to end up with (say) only N log N numbers, maybe we should try making only N log N measurements of our image in the first place?  If you just record a subset of N log N pixels, you’re probably in trouble; there may be localized features in the image that happen to be far from any of the pixels in your sparse set.  But if you measure N log N random linear functions on R^(N^2), it turns out that you’re very likely to have pinned down a single point on the manifold of interesting images!
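
Here’s the usual toy version of the story, for sparse vectors rather than a manifold of images, and with made-up numbers: a length-n signal with only k nonzero entries, recovered from m << n random linear measurements by L1 minimization (basis pursuit, written as a linear program).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, m, k = 200, 60, 5  # ambient dimension, measurements, sparsity

x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)  # random linear measurements
b = A @ x_true

# basis pursuit: minimize ||x||_1 subject to Ax = b, written as a linear
# program in variables (x, t) with the constraints -t <= x <= t
c = np.concatenate([np.zeros(n), np.ones(n)])
I = np.eye(n)
A_ub = np.vstack([np.hstack([I, -I]), np.hstack([-I, -I])])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
              A_eq=np.hstack([A, np.zeros((m, n))]), b_eq=b,
              bounds=[(None, None)] * n + [(0, None)] * n)
print("recovery error:", np.linalg.norm(res.x[:n] - x_true))
```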

Finite metric spaces

Suppose you’ve got a set S of objects, and a function d: S x S -> R, which you think of as a distance or dissimilarity function.  Could be texts, could be genomes, whatever.  An arbitrary finite metric space is a hard thing to get one’s head around, so it’s desirable to construct an embedding f: S -> R^k such that the Euclidean distance between f(x) and f(y) agrees as closely as possible with d(x,y).  (Think of k as fixed and small.)  The standard algorithm for this is called multidimensional scaling. It works well.  End of story.
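For concreteness, here’s the classical version in a few lines of numpy, using the double-centering trick, together with a sanity check on points that genuinely live in R^2:

```python
import numpy as np

def classical_mds(D, k):
    # double-center the squared distances to get a Gram matrix, then
    # read coordinates off its top k eigenpairs
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# sanity check on points that genuinely live in R^2
X = np.random.default_rng(3).standard_normal((10, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, 2)
D2 = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
print("max distortion:", np.abs(D - D2).max())  # ~1e-13
```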

Except — note that once again, we’ve compressed our data set.  It takes N^2 coordinates to specify a distance function, but only kN to specify an embedding.  So you can think of MDS as a compression algorithm for metric spaces!  And why does it work?  Because the set of N-point metrics that embed in R^k is a very low-dimensional submanifold of the N^2-dimensional space of general metrics.

So the example of compressed sensing leads us to ask — if you are willing to grant that your metric is approximately embeddable in R^k, can you reconstruct it from a sparse set of distances?  The answer is yes — you can do this with “Landmark MDS,” developed by Vin de Silva and Josh Tenenbaum, and, just as one might want, one can reconstruct the metric from a number of distances that’s linear in N.  I want to call this “Compressed MDS” — is that appropriate?  Will a randomly chosen set of cN distances specify the embedding with high probability?  Seems like it to me.
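
My rough sketch of the landmark idea, adapted from the de Silva–Tenenbaum paper (this is my code, not theirs): run classical MDS on the m landmarks alone, then place each remaining point by triangulating from its distances to the landmarks.

```python
import numpy as np

def landmark_mds(D_ll, D_lx, k):
    """Sketch of Landmark MDS.  D_ll: m x m distances among the landmarks;
    D_lx: m x p distances from the landmarks to the other points."""
    m = len(D_ll)
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ (D_ll ** 2) @ J  # classical MDS on the landmarks alone
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    L = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))  # m x k coordinates
    # place every other point by "triangulating" from the landmarks
    mean_sq = (D_ll ** 2).mean(axis=0)
    X = -0.5 * np.linalg.pinv(L) @ (D_lx ** 2 - mean_sq[:, None])
    return L, X.T
```

Note what goes in: the m(m-1)/2 landmark distances plus m distances per remaining point, which is linear in N for fixed m.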

Multidimensional scaling is just one example of a metric embedding problem — you have a finite set S and a fixed metric space M (or, sometimes, a class of metrics from which M is to be drawn) and you want to find an embedding f: S -> M which respects distance as well as possible.  (This survey by Indyk and Matousek is a good resource for the general problem; and here’s a Newton Institute conference on metric embeddings from two weeks ago, with all the talks online for streaming.)  Of special interest is the case where M is required to be an ultrametric; to embed S in an ultrametric space is to identify S with the leaves of a weighted tree, which is more or less to impose a hierarchical clustering on S.  The space of ultrametrics on S is again of dimension much smaller than N^2, so the hierarchical construction can be thought of as a compressed representation of the distance function d.  And again, one asks — can we reconstruct the ultrametric from a sparse set of distances?

Yes — a forthcoming paper of Eriksson, Dasarathy, Singh, and Nowak explains how.  N log N distances are enough.  (Update, Feb 2011:  now on the arXiv.)
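
That paper’s algorithm is its own thing, but the cheap way to see the ultrametric/hierarchy dictionary in code is through scipy: average-linkage clustering builds a weighted tree, and the resulting cophenetic distances form an ultrametric on the leaves.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(4).standard_normal((12, 3))
d = pdist(X)  # condensed vector of pairwise distances

Z = linkage(d, method="average")  # hierarchical clustering = a weighted tree
c, d_ultra = cophenet(Z, d)       # cophenetic distances between the leaves
print("cophenetic correlation:", round(float(c), 3))

# the cophenetic distances satisfy the ultrametric inequality exactly
D = squareform(d_ultra)
worst = max(float((D - np.maximum.outer(D[:, z], D[:, z])).max())
            for z in range(len(D)))
print("max ultrametric violation:", worst)
```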

Can one call this “compressed clustering”?
