## Tantalisingly close to significance

Matthew Hankins and others on Twitter are making fun of scientists who twist themselves up lexically in order to report results that fail the significance test, using phrases like “approached but did not quite achieve significance” and “only just insignificant” and “tantalisingly close to significance.”

But I think this fun-making is somewhat misplaced!  We should instead be jeering at the conventional dichotomy that a result significant at p < .05 is “a real effect” and one that scores at p = .06 is “no effect.”

The lexically twisted scientists are on the side of the angels here, insisting that a statistically insignificant finding is usually much better described as “not enough evidence” than “no evidence,” and should be mentioned, in whatever language the journal allows, not mulched.
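To see how thin the line between the two verdicts is, here is a minimal Python sketch. It assumes a two-sided z-test, which is an assumption made only for illustration (no particular test is named above), and the function name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

# Two test statistics that differ by less than a tenth of a standard error.
for z in (1.96, 1.88):
    print(f"z = {z:.2f}  ->  p = {two_sided_p(z):.3f}")
```

The two statistics land at roughly p = .05 and p = .06, so under the conventional dichotomy nearly identical evidence gets two opposite labels.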


## Math on Trial, by Leila Schneps and Coralie Colmez

The arithmetic geometer Leila Schneps, who taught me most of what I know about Galois actions on fundamental groups of varieties, has a new book out, Math on Trial:  How Numbers Get Used and Abused in the Courtroom, written with her daughter Coralie Colmez.  Each chapter covers a famous case whose resolution, for better or worse, involved a mathematical argument.  Interspersed among the homicide and vice are short chapters that speak directly to some of the mathematical and statistical issues that arise in legal matters.  One of the cases is the much-publicized prosecution of college student Amanda Knox for a murder in Italy; today in the New York Times, Schneps and Colmez write about some of the mathematical pratfalls in their trial.

I am happy to have played some small part in building their book — I was the one who told Leila about the murder of Diana Sylvester, which turned into a whole chapter of Math on Trial; very satisfying to see the case treated with much more rigor, depth, and care than I gave it on the blog!  I hope it is not a spoiler to say that Schneps and Colmez come down on the side of assigning a probability close to 1 that the right man was convicted (though not nearly so close to 1 as the prosecution claimed, and perhaps not close enough for a jury to have rightfully convicted, depending on how you roll re “reasonable doubt”).

## Nate Silver is the Kurt Cobain of statistics

Or so I argue in today’s Boston Globe, where I review Silver’s excellent new book.  I considered trying to wedge a “The Signal and The Noise” / “The Colour and the Shape” joke in there too, but it was too labored.

Concluding graf:

Prediction is a fundamentally human activity. Just as a novel is no less an expression of human feeling for being composed on a laptop, the forecasts Silver studies — at least the good ones — are expressions of human thought and belief, no matter how many theorems and algorithms forecasters bring to their aid. The math serves as a check on our human biases, and our insight serves as a check on the computer’s bugs and blind spots. In Silver’s world, math can’t replace or supersede us. Quite the contrary: It is math that allows us to become our wiser selves.


I am the child of two statisticians, and as a result my childhood reading included the great sourcebook Statistics: A Guide To The Unknown, a collection of essays by some of the great statisticians of the century.  The thing that made a lasting impression on me was the map of adjectives from Joseph Kruskal’s article, “The Meaning of Words.”  Psychologists gathered survey data about pairs of adjectives describing personality traits, asking to what extent the traits were similar or different, until they had enough responses to estimate a “dissimilarity measure” for each pair.  Then they used multidimensional scaling (pretty new in 1968, I think) to map the adjectives onto the plane in such a way that the distances between adjectives matched the measured dissimilarities as well as possible.  That such a thing was possible was a revelation to me — I guess I knew on some level that arithmetic could be translated into geometry, but I didn’t know that meaning could be translated into geometry.
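For the curious, the dissimilarities-to-map step can be sketched in a few lines of Python. This is classical (Torgerson) scaling, chosen here because it is short; it is not necessarily the variant used in the original study. The function name `classical_mds`, the four adjectives, and the dissimilarity numbers are all made up for illustration.

```python
import numpy as np

def classical_mds(dissimilarity, dim=2):
    """Place n items in R^dim so Euclidean distances approximate the dissimilarities."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to get an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the eigenvectors belonging to the largest eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:dim]
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))
    return eigenvectors[:, top] * scale

# Hypothetical dissimilarities between four adjectives (not real survey data).
words = ["warm", "sociable", "cold", "unsociable"]
dissimilarity = np.array([
    [0.0, 1.0, 4.0, 3.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.0],
    [3.0, 4.0, 1.0, 0.0],
])
coords = classical_mds(dissimilarity)
for word, (x, y) in zip(words, coords):
    print(f"{word:>10s}  {x: .2f}  {y: .2f}")
```

When the dissimilarities happen to be exact distances between points in the plane, this recovers the configuration up to rotation and reflection; with survey data it only gives a best approximation.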

Here’s the map, from Rosenberg, Nelson, and Vivekananthan’s original paper:

## Ockham’s Razor Conference

What does it mean to say “All things being equal, believe the simplest theory?”  It sure sounds like good advice, but in practice it can be vexingly hard to understand which theories Ockham’s razor is lopping off and which are to be left behind.  So I’m happy to see this announcement of a conference at CMU this weekend on the topic, where philosophers, statisticians, and machine learning types will get together and hash it out. Speakers include my collaborator Elliott Sober and blog favorite Cosma Shalizi.

## Roch on phylogenetic trees, learning ultrametrics from noisy measurements, and the shrimp-dog

Sebastien Roch gave a beautiful and inspiring talk here yesterday about the problem of reconstructing an evolutionary tree given genetic data about present-day species.  It was generally thought that keeping track of pairwise comparisons between species was not going to be sufficient to determine the tree efficiently; Roch has proven that it’s just the opposite.  His talk gave me a lot to think about.  I’m going to try to record a probably corrupted, certainly filtered-through-my-own-viewpoint account of Roch’s idea.

So let’s say we have n points P_1, …, P_n, which we believe are secretly the leaves of a tree.  In fact, let’s say that the edges of the tree are assigned lengths.  In other words, there is a secret ultrametric on the finite set P_1, …, P_n, which we wish to learn.  In the phylogenetic case, the points are species, and the ultrametric distance d(P_i, P_j) between P_i and P_j measures how far back in the evolutionary tree we need to go to find a common ancestor between species i and species j.

One way to estimate d(P_i, P_j) is to study the correlation between various markers on the genomes of the two species.  This correlation, in Roch’s model, is going to be on order

exp(-d(P_i,P_j))

which is to say that it is very close to 0 when P_i and P_j are far apart, and close to 1 when the two species have a recent common ancestor.  What that means is that short distances are way easier to measure than long distances — you have no chance of telling the difference between a correlation of exp(-10) and exp(-11) unless you have a huge number of measurements at hand.  Another way to put it:  the error bar around your measurement of d(P_i,P_j) is much greater when your estimate is large than when your estimate is small; in particular, at great enough distance you’ll have no real confidence in any upper bound for the distance.
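A back-of-the-envelope Python sketch of how huge "huge" is. The noise model is a crude assumption of mine for illustration, not part of Roch's analysis: the error in an estimated correlation is taken to shrink like 1/sqrt(markers), and two distances count as distinguishable once that error drops below the gap between their correlations. The name `markers_needed` is hypothetical.

```python
import math

def markers_needed(distance):
    """Rough number of markers needed to tell `distance` from `distance + 1`.

    Assumption: the error in an estimated correlation is about
    1 / sqrt(markers), and it must fall below the gap
    exp(-distance) - exp(-(distance + 1)).
    """
    gap = math.exp(-distance) - math.exp(-(distance + 1))
    return math.ceil(1.0 / gap ** 2)

for d in (1, 5, 10):
    print(d, markers_needed(d))
```

Under this model, telling distance 1 from distance 2 takes about twenty markers, while telling 10 from 11 takes on the order of a billion.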

So the problem of estimating the metric accurately seems impossible except in small neighborhoods.  But it isn’t.  Because metrics are not just arbitrary symmetric n x n matrices.  And ultrametrics are not just arbitrary metrics.  They satisfy the ultrametric inequality

d(x,y) <= max(d(x,z),d(y,z)).
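To make the inequality concrete, here is a minimal Python check of whether a matrix of pairwise distances is an ultrametric. The function name `is_ultrametric` and both small matrices are made up for illustration.

```python
from itertools import permutations

def is_ultrametric(d, tol=1e-9):
    """True if d[x][y] <= max(d[x][z], d[y][z]) for every triple of distinct points."""
    n = len(d)
    return all(
        d[x][y] <= max(d[x][z], d[y][z]) + tol
        for x, y, z in permutations(range(n), 3)
    )

# Leaves of a tree: two close relatives and one distant cousin.
tree_like = [
    [0, 1, 5],
    [1, 0, 5],
    [5, 5, 0],
]

# Three points on a line at positions 0, 1, 2: a metric, but not an ultrametric.
on_a_line = [
    [0, 1, 2],
    [1, 0, 1],
    [2, 1, 0],
]

print(is_ultrametric(tree_like), is_ultrametric(on_a_line))
```

The first matrix passes and the second fails, because on the line the distance between the two endpoints exceeds both of their distances to the middle point.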

And this helps a lot.  For instance, suppose the number of measurements I have is sufficient to estimate with high confidence whether or not a distance is less than 1, but totally helpless with distances on order 5.  So if my measurements give me an estimate d(P_1, P_2) = 5, I have no real idea whether that distance is actually 5, or maybe 4, or maybe 100 — I can say, though, that it’s probably not 1.

So am I stuck?  I am not stuck!  Because the distances are not independent of each other; they are yoked together under the unforgiving harness of the ultrametric inequality.  Let’s say, for instance, that I find 10 other points Q_1, …, Q_10 which I can confidently say are within 1 of P_1, and 10 other points R_1, …, R_10 which are within 1 of P_2.  Then, as long as d(P_1, P_2) really is greater than 1, the ultrametric inequality tells us that

d(Q_i, R_j) = d(P_1, P_2)

for any one of the 100 ordered pairs (i,j)!  So I have 100 times as many measurements as I thought I did — and this might be enough to confidently estimate d(P_1,P_2).
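Here is a small simulation of why the extra pairs help, written as a Python sketch under loud assumptions: each pair's estimated correlation is modeled as exp(-d) plus independent Gaussian noise of size 1/sqrt(markers). Real measurements across overlapping pairs would not be independent, so this overstates the gain. All names and numbers are hypothetical.

```python
import math
import random
import statistics

def noisy_correlation(true_distance, markers, rng):
    """One simulated estimate: exp(-d) plus Gaussian noise of size 1/sqrt(markers)."""
    return math.exp(-true_distance) + rng.gauss(0.0, 1.0 / math.sqrt(markers))

def pooled_correlation(true_distance, markers, pairs, rng):
    """Average the estimates from `pairs` pairs that share the same true distance."""
    estimates = [noisy_correlation(true_distance, markers, rng) for _ in range(pairs)]
    return statistics.fmean(estimates)

rng = random.Random(0)
trials = 2000
single = [noisy_correlation(5.0, 1000, rng) for _ in range(trials)]
pooled = [pooled_correlation(5.0, 1000, 100, rng) for _ in range(trials)]

print("spread of one pair:      ", statistics.stdev(single))
print("spread of 100 pooled pairs:", statistics.stdev(pooled))
```

With independent noise, averaging 100 pairs shrinks the spread by about a factor of 10, which is the sense in which 100 times as many measurements might rescue an otherwise hopeless estimate.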

In biological terms:  if I look at a bunch of genetic markers in a shrimp and a dog, it may be hard to estimate how far back in time one has to go to find their common ancestor.  But the common ancestor of a shrimp and a dog is presumably also the common ancestor of a lobster and a wolf, or a clam and a jackal!  So even if we’re only measuring a few markers per species, we can still end up with a reasonable estimate for the age of the proto-shrimp-dog.

What do you need if you want this to work?  You need a reasonably large population of points which are close together.  In other words, you want small neighborhoods to have a lot of points in them.  And what Roch finds is that there’s a threshold effect; if the mutation rate is too fast relative to the amount of measurement per species you do, then you don’t hit “critical mass” and you can’t bootstrap your way up to a full high-confidence reconstruction of the metric.

This leads one to a host of interesting questions — interesting to me, that is, albeit not necessarily interesting for biology.  What if you want to estimate a metric from pairwise distances but you don’t know it’s an ultrametric? Maybe instead you have some kind of hyperbolicity constraint; or maybe you have a prior on possible metrics which weights “closer to ultrametric” distances more highly.  For that matter, is there a principled way to test the hypothesis that a measured distance is in fact an ultrametric in the first place?  All of this is somehow related to this previous post about metric embeddings and the work of Eriksson, Dasarathy, Singh, and Nowak.

## Raw polling data as playground

This is a picture of the American electorate!

More precisely: this is a scatterplot I just made using the dataset recently released by PPP, a major political polling firm.  (They’re the outfit that did the “is your state hot or not” poll I blogged about last week.)  PPP has made available the raw responses from 46 polls with 1000 responses each, conducted more or less weekly over the course of 2011.  Here’s the whole thing as a .zip file.

Analyzing data sets like this is in some sense not hard.  But there’s a learning curve.  Little things, like:  you have to know that the .csv format is beautifully portable and universal — it’s the ASCII of data.  You have to know how to get your .csv file into your math package of choice (in my case, Python, but I think I could easily have done this in R or MATLAB as well) and you have to know where to get a PCA package, if it’s not already installed.  And you have to know how to output a new .csv file and make a graphic from it when you’re done.  (As you can see, I haven’t quite mastered this last part, and have presented you with a cruddy Excel scatterplot.)  In total, this probably took me about three hours to do, and now that I have a data-to-picture path I understand how to use, I think I could do it again in about 30 minutes.  It’s fun and I highly recommend it.  There’s a lot of data out there.
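The in-and-out part of that path can be sketched with nothing but Python's standard library. The column names and values below are a hypothetical stand-in, not the layout of the actual PPP files, and the "analysis" step is a placeholder that just keeps two columns.

```python
import csv
import io

# Hypothetical stand-in for one poll file: a header row, then coded integer answers.
raw = "Q1,Q2,Q14\n1,1,1\n2,2,3\n1,1,2\n"

# Read: each respondent becomes a dict mapping question name to integer code.
respondents = [
    {question: int(code) for question, code in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]

# Placeholder for the real analysis: keep two of the columns unchanged.
points = [(r["Q1"], r["Q14"]) for r in respondents]

# Write: a two-column .csv that a spreadsheet or plotting tool can read.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["x", "y"])
writer.writerows(points)
print(out.getvalue())
```

With a file on disk, `io.StringIO(raw)` would be replaced by `open(path, newline="")`, and likewise for the output.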

So what is this picture?  The scatterplot has 1000 points, one for each person polled in the December 15, 2011 PPP survey.  The respondents answered a bunch of questions, mostly about politics:

Q1: Do you have a favorable or unfavorable opinion of Barack Obama?
Q2: Do you approve or disapprove of Barack Obama’s job performance?
Q3: Do you think Barack Obama is too liberal, too conservative, or about right?
Q4: Do you approve or disapprove of the job Harry Reid is doing?
Q5: Do you approve or disapprove of the job Mitch McConnell is doing?
Q6: Do you have a favorable or unfavorable opinion of the Democratic Party?
Q7: Do you have a favorable or unfavorable opinion of the Republican Party?
Q8: Generally speaking, if there was an election today, would you vote to reelect Barack Obama, or would you vote for his Republican opponent?
Q9: Are you very excited, somewhat excited, or not at all excited about voting in the 2012 elections?
Q10: If passed into law one version of immigration reform that people have discussed would secure the border and crack down on employers who hire illegal immigrants. It would also require illegal immigrants to register for legal immigration status, pay back taxes, and learn English in order to be eligible for U.S. citizenship. Do you favor or oppose Congress passing this version of immigration reform?
Q11: Have you heard about the \$10,000 bet Mitt Romney challenged Rick Perry to in last week’s Republican Presidential debate?
Q12: (Asked only of those who say ‘yes’ to Q11:) Did Romney’s bet make you more or less likely to vote for him next year, or did it not make a difference either way?
Q13: Do you believe that there’s a “War on Christmas” or not?
Q14: Do you consider yourself to be a liberal, moderate, or conservative?
Q15: Do you consider yourself to be a supporter of the Tea Party or not?
Q16: Are you or is anyone in your household a member of a labor union?
Q17: If you are a woman, press 1. If a man, press 2.
Q18: If you are a Democrat, press 1. If a Republican, press 2. If you are an independent or a member of another party, press 3.
Q19: If you are Hispanic, press 1. If white, press 2. If African American, press 3. If Asian, press 4. If you are an American Indian, press 5. If other, press 6.
Q20: (Asked only of people who say American Indian on Q19:) Are you enrolled in a federally recognized tribe?
Q21: If you are 18 to 29 years old, press 1. If 30 to 45, press 2. If 46 to 65, press 3. If you are older than 65, press 4.
Q22: What part of the country do you live in NOW – the Northeast, the Midwest, the South, or the West?
Q23: What is your household’s annual income?

The answers to these questions, which are coded as integers, now give us 1000 points in R^{23}.  Our eyes are not good at looking at point clouds in 23-dimensional space.  So it’s useful to project down to R^2, that most bloggable of Euclidean spaces.  But how?  We could just look at two coordinates and see what we get.  But this requires careful choice.  Suppose I map the voters onto the plane via their answers to Q1 and Q2.  The problem is, almost everyone who has a favorable opinion of Barack Obama approves of his job performance, and vice versa.  Considering these two features is hardly better than considering only one feature.  Better would be to look at Q8 and Q21; these two variables are surely less correlated, and studying both together would give us good information on how support for Obama varies with age.  But still, we’re throwing out a lot.  Principal component analysis is a very popular quick-n-dirty method of dimension reduction; it finds the projection onto R^2 (or a Euclidean space of any desired dimension) which best captures the variance in the original dataset.  In particular, the two axes in the PCA projection have correlation zero with each other.
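A minimal sketch of that projection in Python, assuming NumPy and computing PCA through a singular value decomposition. This is one standard way to do it, not necessarily what the PCA package used for the plot does internally. The data are synthetic stand-ins for the poll answers, and the name `pca_project` is hypothetical.

```python
import numpy as np

def pca_project(data, dim=2):
    """Project the rows of `data` onto the top `dim` principal axes.

    Returns (scores, axes): scores has one row per respondent, and each
    row of axes is a linear combination of the original variables.
    """
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by variance captured.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:dim]
    return centered @ axes.T, axes

# Synthetic stand-in: 1000 respondents, 23 questions, one hidden "lean" plus noise.
rng = np.random.default_rng(0)
lean = rng.normal(size=(1000, 1))
answers = lean @ rng.normal(size=(1, 23)) + rng.normal(size=(1000, 23))

scores, axes = pca_project(answers)
correlation = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]
print(scores.shape, axes.shape, abs(correlation) < 1e-8)
```

The two columns of `scores` are the coordinates of the scatterplot, and the two rows of `axes` are the weight vectors one stares at to guess what each axis means.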

A projection from R^23 to R^2 can be expressed by two vectors, each one of which is some linear combination of the original 23 variables.  The hope is always that, when you stare at the entries of these vectors, the corresponding axis has some “meaning” that jumps out at you.  And that’s just what happens here.

The horizontal axis is “left vs. right.”  It assigns positive weight to approving of Obama, identifying as a liberal, and approving of the Democratic Party, and negative weight to supporting the Tea Party and believing in a “War on Christmas.”  It would be very weird if any analysis of this kind of polling data didn’t pull out political affiliation as the dominant determinant of poll answers.

The second axis is “low-information voter vs. high-information voter,” I think.  It assigns a negative value to all answers of the form “don’t know / won’t answer,” and positive value to saying you are “very excited to vote” and having heard about Mitt Romney’s \$10,000 bet.  (Remember that?)

And now the picture already tells you something interesting.  These two variables are uncorrelated, by definition, but they are not unrelated.  The voters split roughly into two clusters, the Democrats and the Republicans.  But the plot is “heart-shaped” — the farther you go into the low-information voters, the less polarization there is between the two parties, until in the lower third of the graph it is hard to tell there are two parties at all.  This phenomenon is not surprising — but I think it’s pretty cool that it pops right out of a completely automatic process.

(I am less sure about the third-strongest axis, which I didn’t include in the plot.  High scorers here, like low scorers on axis 2, tend to give a lot of “don’t know” answers, except when asked about Harry Reid and Mitch McConnell, whom they dislike.  They are more likely to say they’re “not at all excited to vote” and more likely to be independents.  So I think one might call this the “to hell with all those crooks” axis.)

A few technical notes:  I removed questions, like “region of residence,” that didn’t really map on a linear scale, and others, like “income,” that not everyone answered.  I normalized all the columns to have equal variance.  I made new 0-1-valued columns to record “don’t know” answers.  Yes, I know that many people consider it bad news to run PCA on binary variables, but I decided that since I was just trying to draw pictures and not infer anything, it would be OK.
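Two of those steps, the 0-1 “don’t know” columns and the equal-variance normalization, might look like the following Python sketch. The coding is an assumption for illustration: a single integer code stands for “don’t know” in every column, and the original column keeps that code as-is, which is not a detail the notes above specify. The name `prepare_columns` is hypothetical.

```python
import numpy as np

def prepare_columns(answers, dont_know_code):
    """Append 0-1 "don't know" indicator columns, then scale every column to variance 1."""
    a = np.asarray(answers, dtype=float)
    # One new 0-1 column per question, marking the "don't know" answers.
    dont_know = (a == dont_know_code).astype(float)
    features = np.hstack([a, dont_know])
    # Drop constant columns, which have no variance to normalize.
    std = features.std(axis=0)
    keep = std > 0
    return (features[:, keep] - features[:, keep].mean(axis=0)) / std[keep]

# Hypothetical coded answers: 5 respondents, 2 questions, code 3 = "don't know".
answers = [
    [1, 2],
    [2, 3],
    [3, 1],
    [1, 3],
    [2, 2],
]
prepared = prepare_columns(answers, dont_know_code=3)
print(prepared.shape)
print(prepared.std(axis=0))
```

Every column of the result has mean 0 and variance 1, so no single question dominates the principal components just because of how it was coded.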

## Tukey before he was Tukey

There’s no end to the interesting tidbits to be found in The Princeton Mathematics Community in the 1930s, an oral history project hosted by the Mudd Library.  I liked this, about the great statistician John Tukey, from an interview with Joseph F. Daly and Churchill Eisenhart:

Daly: … Tukey was about as pure a mathematician as you can imagine.

Eisenhart: When he first came.

Daly: All he was interested in was axioms and set theory and stuff like that. But eventually he found out there was life after ultrafilters and things, and he had fun in statistics.


## Why not data science?

Addendum to the previous post: if the goal — surely a worthwhile one — is to promote NSF-DMS funding for data sciences, why not change the name to Division of Mathematical and Data Sciences?  My experience at the very interesting “High-dimensional phenomena” workshop at IMA was that good work in this area is being done not only by self-described statisticians, but by mathematicians, computer scientists, and electrical engineers; it seems reasonable to use a name that doesn’t suggest the field is the property of a single academic department.

Also, a colleague points out to me that DMSS would inevitably be pronounced “Dumbass.”  So there’s that.


## Division of Mathematical and Statistical Sciences

Apparently the NSF is considering changing the name of the DMS (Division of Mathematical Sciences) to DMSS (Division of Mathematical and Statistical Sciences.)  There is some unease — surely at least partially related to the recent decision by the Engineering and Physical Sciences Research Council, the NSF’s rough British analogue, to restrict their math postdoctoral program to cover applied probability and statistics only.  I can attest from personal experience that pure mathematicians are very excited about the rise of data science — but also concerned about it choking out K-theory and functional analysis and geometric group theory and etc and etc.

Here’s the letter from Eric Friedlander, current AMS president:

October 10, 2011

Dear Colleagues,

I write to encourage discussion and comments among members of the AMS about the proposal under consideration by the National Science Foundation (NSF) that NSF’s Division of Mathematical Sciences (DMS) be renamed the Division of Mathematical and Statistical Sciences.  At the request of the NSF, I attach a letter from DMS Division Director Sastry Pantula advocating this name change; I also attach a particularly cogent response from a member of the AMS leadership.

dmsname@ams.org
(The process to summarize comments is described below.)

Many of us strongly oppose this name change.  Such a name change could create an unnecessary and unfortunate divide in the mathematical sciences community.  We question whether this portends a shift within DMS away from support of basic research toward mission-oriented research.  This could bring the less mathematical aspects of Statistics into the same funding pool as basic research in Mathematical Sciences, thereby negatively impacting resources available for basic research in the Mathematical Sciences, including basic research in Statistics.

While waiting for NSF approval to consult the broad mathematical community, I have discussed this personally  with many mathematical scientists, including the leadership of the AMS.  The responses I have received have been near-unanimous in their opposition to such a name change.  It is significant that three previous DMS Division Directors Peter March, William Rundell, and Philippe Tondeur have written to express their opposition to this name change.

Permit me to give some reasons why such a name change is much more important than “just a name.”

1.)  The mission of the NSF is to fund basic research.  Much of mission-oriented Statistics is funded by other federal agencies, hospitals, industry, etc.  This name change suggests a move within DMS to relax its focus on basic research.

2.)  The suggestion of “new resources to all core programs” is far different from any commitment to seek new resources to support the basic research of these programs.

3.)  The current name (Division of Mathematical Sciences) was crafted to be inclusive.  The inclusiveness of DMS has resulted in increased funding for many programs including Statistics.  The Mathematical Sciences should work together, emphasizing commonality and presenting the best case for the importance of the Mathematical Sciences.

4.)  Statistics is only one of 10 programs supported by DMS.  In 2010, of the 2978 proposals submitted to DMS core programs, 242 were submitted to the Statistics program.  It is natural to ask why Statistics appears to be uniquely selected by DMS for special emphasis.

5.)  The analysis of big data is indeed important, and the Mathematical Sciences will play an important role in developing fundamental concepts and approaches to manage the “data deluge” and extract useful content.  That said, National Science Foundation support of the Mathematical Sciences should energetically embrace basic research in all aspects of the Mathematical Sciences to advance fundamental knowledge and initiate unexpected revolutionary applications.