## A little more about random configurations

In my previous post about the configuration space of hard discs in a box, I neglected to say anything about the main point of Persi’s article!  It’s the following: even though you don’t know anything about the topology of the space of configurations, you can still do an excellent job of drawing a configuration at random from the natural distribution using Monte Carlo techniques.  And if you’re a physicist trying to model the behavior of a gas or a fluid, you might be more interested in what a random configuration looks like than in whether the space of configurations is connected.  It might not be so relevant that you can get from something that looks like tight packing 1 to something that looks like tight packing 2, if the probability of doing so is vanishingly small.  Or to put it another way: if the space looks like a bunch of big blobs connected by extremely narrow paths, then from the point of view of physics it might as well be disconnected.
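To make "Monte Carlo techniques" concrete, here is a minimal sketch of one standard approach, a Metropolis-style random walk: pick a disc, propose a small random move, and keep the move only if the discs still fit in the box without overlapping.  I'm assuming the unit square as the box and the uniform distribution on valid configurations as the "natural" one; the function names and the step size `delta` are made up for illustration and not taken from the article.

```python
import random


def overlaps(p, q, r):
    """True if two discs of radius r centred at p and q overlap."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < (2 * r) ** 2


def valid(centres, r):
    """True if every disc lies inside the unit square and no two overlap."""
    in_box = all(r <= x <= 1 - r and r <= y <= 1 - r for x, y in centres)
    disjoint = all(
        not overlaps(centres[i], centres[j], r)
        for i in range(len(centres))
        for j in range(i)
    )
    return in_box and disjoint


def metropolis_hard_discs(centres, r, steps, delta, rng):
    """Random walk on valid configurations, starting from `centres`.

    Each step moves one randomly chosen disc by a uniform offset in
    [-delta, delta]^2 and rejects the move if the result is invalid.
    """
    centres = list(centres)
    for _ in range(steps):
        i = rng.randrange(len(centres))
        x, y = centres[i]
        proposal = (x + rng.uniform(-delta, delta), y + rng.uniform(-delta, delta))
        trial = centres[:i] + [proposal] + centres[i + 1:]
        if valid(trial, r):
            centres = trial  # accept; otherwise stay where we are
    return centres
```

Starting from, say, four discs of radius 0.1 on a 2-by-2 grid, `metropolis_hard_discs(start, 0.1, 2000, 0.05, random.Random(1))` returns another valid configuration.  Note that the walk never leaves the connected component it starts in, and it crosses narrow passages only rarely, which is exactly the sense in which blobs joined by thin paths "might as well be disconnected."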

Still, as a topologist, you might ask:  if I can draw a good random sample of points from a mystery manifold M, can I compute topological invariants of M with high confidence?  Can I guess whether M is connected?  More generally, can I guess the homology groups of M?
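For the simplest of these questions, connectedness, here is a rough sketch of one naive way to guess: join any two sample points that lie within distance `eps` of each other and count the connected components of the resulting graph.  The scale `eps` is an assumption you have to supply, and the helper names are hypothetical; this is only meant to show the flavor of the problem, not a method with guarantees.

```python
import math


def count_components(points, eps):
    """Count components of the graph joining points at distance <= eps."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(len(points))})


def circle_points(centre, radius, n):
    """n evenly spaced points on a circle, standing in for a random sample."""
    cx, cy = centre
    return [
        (cx + radius * math.cos(2 * math.pi * k / n),
         cy + radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]
```

With 50 points on each of two unit circles centred at (0, 0) and (5, 0), `count_components` reports two components at `eps = 0.5` but only one at `eps = 3.5`.  So the answer depends on the scale you look at, and choosing that scale well is a large part of the difficulty.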

You might think of this as a massive geometric generalization of the age-old problem of “cluster analysis.”  Let’s say you have a bunch of people, and for each person you measure N variables — let’s say, height, weight, and shoe size.  So you have a bunch of points in R^N — in this case R^3.

Maybe you hope that these points are well-modeled as a sample from some multivariate normal distribution.  But under some circumstances, this is a really bad hope!  For instance, if your sample isn’t segregated by gender, you’re going to see two big clusters: one cluster of women where the mean height is around 5’6″, and one cluster of men where the mean height is around 5’9″.  You’re not really sampling from a single normal distribution; you might be sampling from a mixture (a superposition) of two different normal distributions with different centers.  Or, alternately, you could think of yourself as sampling points from a manifold in R^3 consisting of just two points, “the ideal woman and the ideal man,” where your measurements are subject to some error that’s distributed normally.
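The two-point picture can be written down as a tiny generative model: choose one of the two "ideal" points at random, then add independent normal error to each coordinate.  In the sketch below only the two mean heights (66 and 69 inches) come from the paragraph above; the weights, shoe sizes, error scales, and the 50/50 split are placeholders invented purely for illustration.

```python
import random

# (height in inches, weight in lb, shoe size).  Only the heights follow the
# text; every other number here is a made-up placeholder.
IDEAL_POINTS = [(66.0, 140.0, 8.0), (69.0, 180.0, 10.5)]
NOISE_SD = (2.5, 20.0, 1.0)  # hypothetical measurement-error scales


def sample_person(rng, weights=(0.5, 0.5)):
    """Pick an ideal point, then perturb each coordinate with normal error."""
    centre = rng.choices(IDEAL_POINTS, weights=weights, k=1)[0]
    return tuple(rng.gauss(c, s) for c, s in zip(centre, NOISE_SD))


def sample_population(n, seed=0):
    """Draw n points in R^3 from the two-point model."""
    rng = random.Random(seed)
    return [sample_person(rng) for _ in range(n)]
```

The design choice here is that the "manifold" is just the finite set `IDEAL_POINTS`; replacing the random choice from that list with a random point on a circle or a surface gives the more general situation, where the underlying shape is much harder to read off from the cloud.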

My impression is that statisticians are pretty good at distinguishing between a single normal distribution and a mixture of a small finite number of normal distributions.  But I think it’s much harder to look at a giant cloud of points in R^100 and say “aha, this is actually a random sample from the union of a surface of genus 2 sitting over here and these ten disjoint circles sitting over there, blurred by normally distributed error.”

If you were wondering about this while reading the Diaconis article in the Bulletin, you’d be in luck, because flipping forward a few pages you’d get to Gunnar Carlsson’s long survey article on precisely this genre of problem!  More on “topological statistics” once I’ve read Carlsson’s article, but let me point out now that if you’re a young mathematician interested in these matters you might consider going to the CBMS summer school this August, centered on a lecture series by Ghrist.