Category Archives: bad statistics

What correlation means

From Maria Konnikova’s New Yorker piece on Randall Munroe and what makes science interesting:

In a meta-analysis of sixty-six studies tracking interests over time (the average study followed subjects for seven years), psychologists from the University of Illinois at Urbana–Champaign found that our interests in adolescence had only a point-five correlation with our interests later in life. This means that if a subject filled out a questionnaire about her interests at the age of, say, thirteen, and again at the age of twenty-one, only half of her answers remained consistent on both.

I think it’s totally OK to not say precisely what correlation means.  It’s sort of subtle!  It would be fine to say the correlation was “moderate,” or something like that.

But I don’t think it’s OK to say “This means that…” and then say something which isn’t what it means.  If the questionnaire was a series of yes-or-no questions, and if exactly half the answers stayed the same between age 13 and 21, the correlation would be zero.  As it should be — 50% agreement is what you’d expect if the two questionnaires had nothing to do with each other.  If the questionnaire was of a different kind, say, “rate your interest in the following subjects on a scale of 1 to 5,” then agreement on 50% of the answers would be more suggestive of a positive relationship; but it wouldn’t in any sense be the same thing as 0.5 correlation.  What does the number 0.5 add to the meaning of the piece?  What does the explanation add?  I think nothing, and I think both should have been taken out.
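To make the yes-or-no case concrete, here is a minimal Python sketch. The eight answers are invented for illustration, and I'm assuming the subject said "yes" to exactly half the questions at each age; with a lopsided yes/no split, 50% agreement can come out with a different correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical answers (1 = yes, 0 = no) to eight questions.
age_13 = [1, 1, 1, 1, 0, 0, 0, 0]
age_21 = [1, 1, 0, 0, 1, 1, 0, 0]

# Fraction of questions answered the same way both times.
agreement = sum(a == b for a, b in zip(age_13, age_21)) / len(age_13)
correlation = pearson(age_13, age_21)
print(agreement, correlation)
```

Half the answers match, and the correlation between the two sets of answers is zero.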

Credit, though: the piece does include a link to the original study, a practice that is sadly not universal!  But demerit: the study is behind a paywall, leaving most readers just as unable as before to figure out what the study actually measured.  If you’re a journal, is the cost of depaywalling one article really so great that it’s worth forgoing thousands of New Yorker readers actually looking at your science?





Where are people buying How Not To Be Wrong?

Amazon Author Central shows you BookScan sales for your books broken down by metropolitan statistical area.  (BookScan tracks most hardcover sales, but not e-book sales.)  This allows me to see which MSAs are buying the most and fewest copies, per capita, of How Not To Be Wrong.  Unsurprisingly, Madison has by far the highest number of copies of HNTBW per person.  But Burlington, VT is not far behind!  Then there’s a big drop, until you get down to DC, SF, Boston, and Seattle, each of which still bought more than twice as many copies per person as the median MSA.

Where do people not want the book?  Lowest sales per capita are in Miami.  They also have little use for me in Los Angeles, Atlanta, and Houston.  Note that for reasons of time I only looked at the 30 MSAs that sold the most copies of the book; going farther down that list, there are more pretty big cities where the book is unpopular, like Tampa, Charlotte, San Antonio, and Orlando.

It would be interesting to compare the sales figures, not to population, but to overall hardcover book sales.  But I couldn’t find this information broken down by city.


How do you share your New York Times?

My op/ed about math teaching and Little League coaching is the most emailed article in the New York Times today.  Very cool!

But here’s something interesting; it’s only the 14th most viewed article, the 6th most tweeted, and the 6th most shared on Facebook.  On the other hand, this article about child refugees from Honduras is

#14 most emailed

#1 most viewed

#1 most shared on Facebook

#1 most tweeted

while Paul Krugman’s column about California is

#4 most emailed

#3 most viewed

#4 most shared on Facebook

#7 most tweeted.

Why are some articles, like mine, much more emailed than tweeted, while others, like the one about refugees, are much more tweeted than emailed, and others still, like Krugman’s, come out about even?  Is it always the case that views track tweets, not emails?  Not necessarily; an article about the commercial success and legal woes of conservative poo-stirrer Dinesh D’Souza is #3 most viewed, but only #13 in tweets (and #9 in emails).  Today’s Gaza story has lots of tweets and views but not so many emails, like the Honduras piece, so maybe this is a pattern for international news?  Presumably people inside newspapers actually study stuff like this; is any of that research public?  Now I’m curious.




Statistical chutzpah in the Indiana school grade-changing scandal

I wrote a piece for Slate yesterday about Tony Bennett, the former Indiana schools czar who intervened in the state’s school-grading system to ensure that a politically connected public charter got an A instead of a C.  (The AP’s Tom LoBianco broke the original story.)  Bennett offered interviewers an explanation for the last-minute grade change which was plainly contradicted by the figures in the internal e-mails LoBianco had obtained and released.  Presumably, Bennett figured nobody would bother to look at the actual numbers.  That is incredibly annoying.

Summary of what actually happened in Indiana, by analogy:

Suppose the syllabus for my math class said that the final grade would be determined by averaging the homework grade and the exam grade, and that the exam grade was itself the average of the grades on the three tests I gave. Now imagine a student gets a B on the homework, gets a D-minus on the first two tests, and misses the third. She then comes to me and says, “Professor, your syllabus says the exam component of the grade is the average of my grade on the three tests, but I only took two tests, so that line of the syllabus doesn’t apply to my special case, and the only fair thing is to drop the entire exam component and give me a B for the course.”

I would laugh her out of the office. Or maybe suggest that she apply for a job as a state superintendent of instruction.





10,000 baby names of Harvard

My 20th Harvard reunion book is in hand, offering a social snapshot of a certain educationally (and mostly financially) elite slice of the US population.

Here is what Harvard alums name their kids.  These are chosen by alphabetical order of surname from one segment of the book.  Most of these children were born between 2003 and the present.  They are grouped by family.

Molly, Danielle

Zachary, Zoe, Alex

Elias, Ella, Irena

Sawyer, Luke

Peyton, Aiden

Richard, Sonya

Grayson, Parker, Saya

Yoomi, Dae-il

Io, Pico, Daphne

Lucine, Mayri

Matthew, Christopher

Richard, Annalise, Ryan


Christopher, Sarah, Zachary, Claire

Shaiann, Zaccary

Alexandra, Victoria, Arianna, Madeline


Grace, Luke, Anna

William, Cecilia, Maya

Bode, Tyler

Daniel, Catherine

Alex, Gretchen

Nathan, Spencer, Benjamin

Ezekiel, Jesse

Matthew, Lauren, Ava, Nathan

Samuel, Katherine, Peter, Sophia

Ameri, Charles


Andrew, Zachary, Nathan

Alexander, Gabriella


Andrew, Nadia

Caroline, Elizabeth

Paul, Andrew

Shania, Tell, Delia

Saxon, Beatrix


Nathan, Lukas, Jacob

Noah, Haydn, Ellyson


Leonidas, Cyrus

Isabelle, Emma

Joseph, Theodore

Asha, Sophie, Tejas

Gabriela, Carlos, Sebastian

Brendan, Katherine


James, Seeger, Arden

Helena, Freya

Alexandra, Matthew


If you saw these names, would you be able to guess roughly what part of the culture they were drawn from?  Are there ways in which the distribution is plainly different from “standard” US naming practice?



Tantalisingly close to significance

Matthew Hankins and others on Twitter are making fun of scientists who twist themselves up lexically in order to report results that fail the significance test, using phrases like “approached but did not quite achieve significance” and “only just insignificant” and “tantalisingly close to significance.”

But I think this fun-making is somewhat misplaced!  We should instead be jeering at the conventional dichotomy that a result significant at p < .05 is “a real effect” and one that scores at p = .06 is “no effect.”

The lexically twisted scientists are on the side of the angels here, insisting that a statistically insignificant finding is usually much better described as “not enough evidence” than “no evidence,” and should be mentioned, in whatever language the journal allows, not mulched.






Math on Trial, by Leila Schneps and Coralie Colmez

The arithmetic geometer Leila Schneps, who taught me most of what I know about Galois actions on fundamental groups of varieties, has a new book out, Math on Trial:  How Numbers Get Used and Abused in the Courtroom, written with her daughter Coralie Colmez.  Each chapter covers a famous case whose resolution, for better or worse, involved a mathematical argument.  Interspersed among the homicide and vice are short chapters that speak directly to some of the mathematical and statistical issues that arise in legal matters.  One of the cases is the much-publicized prosecution of college student Amanda Knox for a murder in Italy; today in the New York Times, Schneps and Colmez write about some of the mathematical pratfalls in her trial.

I am happy to have played some small part in building their book — I was the one who told Leila about the murder of Diana Sylvester, which turned into a whole chapter of Math on Trial; very satisfying to see the case treated with much more rigor, depth, and care than I gave it on the blog!  I hope it is not a spoiler to say that Schneps and Colmez come down on the side of assigning a probability close to 1 that the right man was convicted (though not nearly so close to 1 as the prosecution claimed, and perhaps to close enough for a jury to have rightfully convicted, depending on how you roll re “reasonable doubt.”)

Anyway — great book!  Buy, read, publicize!




Voros McCracken is a wise, wise man

From McCracken’s talk at the MIT Sloan Sports Analytics Conference, reported by Fangraphs:

“Just because everyone knows OBP is important doesn’t mean OBP isn’t important. Just because we learned something a long time ago doesn’t mean we should unlearn it. We should keep it and add to it. There are a lot of people who are itching to do the next new thing. That’s great, it’s just that mindset can cause you to forget some of the basics.

“Not to point fingers at any team, but to a certain extent the Mariners did that. They got so wrapped up in taking advantage of fielding statistics that they forgot they should have a first baseman with an on-base percentage over .280. Maybe that’s unfair. If they were here, they may interrupt me and say no, that’s not the way it happened. But my perception is that sometimes you can forget about the basics when you’re pursuing something new.

“You might say to yourself, ‘I want a stat that can measure this.’ Then video technology comes out and gives you the stat you wanted to measure. There is a tendency to think, ‘Ooh, I’ve been waiting for this, and now I’ve got it, and it’s the greatest stat in the world.’ But you haven’t even looked at it yet. You haven’t looked at what it actually says, what its weaknesses are. There’s a hazard there. You want to know more things than your competition. What you don’t want is to know something your competition doesn’t, and it’s wrong. If everybody is wrong about something it doesn’t hurt you too bad, but if you’re the only one, you have 29 teams taking advantage of your mistake.

“Scouting is still an important smell test. If scouts all say someone is a terrible defender, and a stat says he’s the best defender in the world, the truth is probably somewhere in between. Scouts say things for a reason, and you shouldn’t dismiss that.

“If you come up with a new number, and somebody says they don’t like it, I don’t think it’s helpful to just keep pointing at it, over and over again. ‘Well, that’s the number.’ Every number a guy like me comes up with, you have to be skeptical of. You have to be extremely skeptical. That’s the quickest way to knowledge. If you don’t believe something, figure out if it’s true or not. It’s a basic scientific approach, to a certain extent. Falsifiable hypotheses, that sort of thing.”



More on the end of history: what is a rational prediction?

It’s scrolled off the bottom of the page now, but there’s an amazing comment thread going on under my post on “The End of History Illusion,” the Science paper that got its feet caught in a subtle but critical statistical error.

Commenter Deinst has been especially good, digging into the paper’s dataset (kudos to the authors for making it public!) and finding further reasons to question its conclusions.  In this comment, he makes the following observation:  Quoidbach et al believe there’s a general trend to underestimate future changes in “favorites,” testing this by studying people’s predictions about their favorite movies, food, music, vacation, hobbies, and their best friends, averaging, and finding a slightly negative bias.  What Deinst noticed is that the negative bias is almost entirely driven by people’s unwillingness to predict that they might change their best friend.  On four of the six dimensions, respondents predicted more change than actually occurred.  That sounds much more like “people assign positive moral value to loyalty to friends” than “people have a tendency across domains to underestimate change.”

But here I want to complicate a bit what I wrote in the post.  Neither Quoidbach’s paper nor my post directly addresses the question:  what do we mean by a “rational prediction?”  Precisely:  if there is an outcome which, given the knowledge I have, is a random variable Y, what do I do when asked to “predict” the value of Y?  In my post I took the “rational” answer to be E(Y).  But this is not the only option.  You might think of a rational person as one who makes the prediction most likely to be correct, i.e. the modal value of Y.  Or you might, as Deinst suggests, think that rational people “run a simulation,” taking a random draw from Y and reporting that as the prediction.

Now suppose people do that last thing, exactly on the nose.  Say X is my level of extraversion now, Y is my level of extraversion in 10 years, and Z is my prediction for the value of Y.  In the model described in the first post, the value of Z depends only on the value of X; if X=a, it is E(Y|X=a).  But in the “run a simulation” model, the joint distribution of X and Z is exactly the same as the joint distribution of X and Y; in particular, E(|Z-X|) and E(|Y-X|) agree.
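Here is a rough simulation of the two models side by side. Everything about the dynamics is an assumption made up for illustration (standard normal X, and a Y that drifts halfway back toward zero plus unit Gaussian noise); nothing here comes from the paper's data.

```python
import random

random.seed(0)
N = 100_000

def future(x):
    # Assumed toy dynamics: drift halfway back toward 0, plus unit Gaussian noise.
    return 0.5 * x + random.gauss(0, 1)

actual = mean_model = one_sample = 0.0
for _ in range(N):
    x = random.gauss(0, 1)   # X: trait level now
    y = future(x)            # Y: trait level in 10 years
    z_mean = 0.5 * x         # Z if I report E(Y|X=x)
    z_draw = future(x)       # Z if I "run a simulation": a fresh draw from Y given X
    actual += abs(y - x)
    mean_model += abs(z_mean - x)
    one_sample += abs(z_draw - x)

actual_change = actual / N          # estimate of E(|Y-X|)
mean_model_change = mean_model / N  # estimate of E(|Z-X|), conditional-mean model
one_sample_change = one_sample / N  # estimate of E(|Z-X|), one-sample model
print(actual_change, mean_model_change, one_sample_change)
```

In this toy setup the one-sample predictions show about as much absolute change as really occurs, while the conditional-mean predictions show much less, with no bias anywhere in sight.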

I hasten to emphasize that there’s no evidence Quoidbach et al. have this model of prediction in mind, but it would give some backing to the idea that, absent an “end of history bias,” you could imagine the absolute difference in their predictor condition matching the absolute difference in the reporter condition.

There’s some evidence that people actually do use small samples, or even just one sample, to predict variables with unknown distributions, and moreover that doing so can actually maximize utility, under some hypotheses on the cognitive cost of carrying out a more fully Bayesian estimate.

Does that mean I think Quoidbach’s inference is OK?  Nope — unfortunately, it stays wrong.

It seems very doubtful that we can count on people hewing exactly to the one-sample model.

Example:  suppose one in twenty people radically changes their level of extraversion in a 10-year interval.  What happens if you ask people to predict whether they themselves are going to experience such a change in the next 10 years?  Under the one-sample model, 5% of people would say “yes.”  Is this what would actually happen?  I don’t know.  Is it rational?  Certainly it fails to maximize the likelihood of being right.  In a population of fully rational Bayesians, everyone would recognize shifts like this as events with probability less than 50%, and everyone would say “no” to this question.  Quoidbach et al. would categorize this result as evidence for an “end of history illusion.”  I would not.
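The arithmetic behind "fails to maximize the likelihood of being right" can be sketched in a few lines. I'm assuming the one-sample predictor's "yes" is an independent draw, unrelated to whether that person actually ends up changing; the variable names are mine.

```python
p_change = 0.05  # assumed: one in twenty people changes radically

# One-sample predictor: says "yes" with probability p_change,
# independently of what actually happens to them.
one_sample_yes_rate = p_change
one_sample_accuracy = p_change * p_change + (1 - p_change) * (1 - p_change)

# Modal predictor: always gives the likelier answer, "no".
modal_yes_rate = 0.0
modal_accuracy = 1 - p_change

print(round(one_sample_accuracy, 4), round(modal_accuracy, 4))
```

Under these assumptions the one-sample predictor is right about 90.5% of the time and the always-"no" predictor 95% of the time, even though only the former reproduces the true 5% rate of change in the aggregate.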

Now we’re going to hear from my inner Andrew Gelman.  (Don’t you have one?  They’re great!)  I think the real problem with Quoidbach et al’s analysis is that they think their job is to falsify the null hypothesis.  This makes sense in a classical situation like a randomized clinical trial.  Your null hypothesis is that the drug has no effect.  And your operationalization of the null hypothesis — the thing you literally measure — is that the probability distribution on “outcome for patients who get the drug” is the same as the one on “outcome for patients who don’t get the drug.”  That’s reasonable!  If the drug isn’t doing anything, and if we did our job randomizing, it seems pretty safe to assume those distributions are the same.

What’s the null hypothesis in the “end of history” paper?   It’s that people predict the extent of personality change in an unbiased way, neither underpredicting nor overpredicting it.

But the operationalization is that the absolute difference of predictions, |Z-X|, is drawn from the same distribution as the difference of actual outcomes, |Y-X|, or at least that these distributions have the same means.  As we’ve seen, even without any “end of history illusion”, there’s no good reason for this version of the null hypothesis to be true.  Indeed, we have pretty good reason to believe it’s not true.  A rejection of this null hypothesis tells us nothing about whether there’s an end of history illusion.  It’s not clear to me it tells you anything at all.







Do we really underestimate how much we’ll change? (or: absolute value is not linear!)

Let’s say I present you with a portfolio of five stocks, and ask you to predict each stock’s price six months from now.  You know the current prices, and you know stocks are pretty volatile, but absent any special reason to think these five companies are more likely to have good years than bad ones, you write down the current price as your best prediction for all five slots.

Then I write a paper accusing you of suffering from an “end of financial history illusion.”  After all, on average you predicted that the stock values won’t change at all over six months — but in reality, stock prices change a lot!  If I compute how much each of the five stock prices changed over the last six months, and average those numbers, I get something pretty big.  And yet you, you crazy thing, seem to believe that the stock prices, having arrived at their current values, are to be fixed in place forever more.

Pretty bad argument, right?

And yet the same computation, applied to five personality traits instead of five stocks, got published in Science.  Quoidbach, Gilbert, and Wilson write:

In study 1, we sought to determine whether people underestimate the extent to which their personalities will change in the future. We recruited a sample of 7519 adults ranging in age from 18 to 68 years [mean (M) = 40 years, standard deviation (SD) = 11.3 years, 80% women] through the Web site of a popular television show and asked them to complete the Ten Item Personality Inventory (1), which is a standard measure of the five trait dimensions that underlie human personality (i.e., conscientiousness, agreeableness, emotional stability, openness to experience, and extraversion). Participants were then randomly assigned either to the reporter condition (and were asked to complete the measure as they would have completed it 10 years earlier) or the predictor condition (and were asked to complete the measure as they thought they would complete it 10 years hence). We then computed the absolute value of the difference between participants’ ratings of their current personality and their reported or predicted personality and averaged these across the five traits to create a measure of reported or predicted change in personality.

This study is getting a lot of press:  it was written up in the New York Times (why, oh why, is it always John Tierney?), USA Today, and Time, and even made it to Mathbabe.

Unfortunately, it’s wrong.

The difference in predictions is not the predicted difference 

The error here is just the same as in the story of the stocks.  The two quantities

  • The difference between the predicted future value and the current value
  • The predicted difference between the future value and the current value

sound like the same thing.  But they’re not the same thing.  Life’s noncommutative that way sometimes. Quoidbach et al are measuring the former quantity and referring to it as if it’s the latter.

You can see the difference even in a very simple model.  Let’s say the way a stock works is that, over six months, there’s a 30% chance it goes up a dollar, a 25% chance it goes down a dollar, and a 45% chance it stays the same.  And let’s say you know this.  Then your estimated expected value of the stock price six months from now is “price now + 5 cents,” and the first number, the size of the difference between your predicted value and the current value, is 5 cents.

But what’s the second number?  In your model, the difference between the future price and the current price has a 55% chance of being a dollar and a 45% chance of being zero.  So your prediction for the size of the difference is 55 cents — 11 times as much!

If you measure the first quantity and say you’ve measured the second, you’re gonna have a bad time.

In the “predictor” condition of the paper, a rational respondent quizzed about a bunch of stocks will get a score of about 5 cents.  What about the “reporter” condition?  Then the respondent’s score will be the average value of the difference between the price six months ago and the price now; this difference will be a dollar 55% of the time and zero 45% of the time, so the scores in the reporter condition will average 55 cents.
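If you'd like to check the two numbers in the toy stock model, a minimal sketch follows. The dictionary simply encodes the 30/25/45 model described above; the variable names are mine.

```python
# Price change in dollars -> probability, as in the toy model above.
model = {+1.0: 0.30, -1.0: 0.25, 0.0: 0.45}

# First quantity: size of the expected change, |E(Y-X)|.
expected_change = sum(d * p for d, p in model.items())
size_of_expected_change = abs(expected_change)

# Second quantity: expected size of the change, E(|Y-X|).
expected_size_of_change = sum(abs(d) * p for d, p in model.items())

print(round(size_of_expected_change, 2), round(expected_size_of_change, 2))
```

Up to floating-point rounding, the first number is 5 cents (the predictor score) and the second is 55 cents (the reporter score).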

To sum up:  completely rational respondents with full information ought to display the behavior observed by Quoidbach et al — precisely the behavior the authors adduce as evidence that their subjects are in the grips of a cognitive bias!

To get mathy with it for a minute — if Y is the value of a variable at some future time, and X is the value now, the two quantities are

  • |E(Y-X)|
  • E(|Y-X|)

Those numbers would be the same if absolute value were a linear function.  But absolute value isn’t a linear function.  Unless, that is, you knew a priori that Y-X was always positive, or always negative.  In other words, if people knew for certain that over a decade they’d get less extraverted, but didn’t know to what extent, you might expect to see the same scores appearing in the predictor and reporter conditions.  But this is not, in fact, something people know about themselves.
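For the record, the two quantities always come in the same order. This is the standard inequality for the absolute value of an expectation (Jensen's inequality applied to the absolute value), included as a reminder; it is not taken from the paper:

```latex
\[
  \bigl|\,\mathbb{E}(Y - X)\,\bigr| \;\le\; \mathbb{E}\bigl(|Y - X|\bigr),
\]
```

with equality exactly when Y-X (almost surely) never takes both signs.  In the stock example, that's 5 cents on the left and 55 cents on the right.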

I always think I’m right but I don’t think I’m always right

The study I’ve mentioned isn’t the only one in the paper.  Here’s another:

[In study 3]…we recruited a new sample of 7130 adults ranging from 18 to 68 years old (M = 40.2 years, SD = 11.1 years, 80% women) through the same Web site and asked them to report their favorite type of music, their favorite type of vacation, their favorite type of food, their favorite hobby, and the name of their best friend. Participants were then randomly assigned either to the reporter condition (and were asked to report whether each of their current preferences was the same as or different than it was 10 years ago) or the predictor condition (and were asked to predict whether each of their current preferences would be the same or different 10 years from now). We then counted the number of items on which participants responded “different” and used this as a measure of reported or predicted changes in preference.

Let’s say I tend to change my favorite music (respectively vacation, food, hobby, and friend) about once every 25 years, so that there’s about a 40% chance that in a given ten-year period I’ll make a change.  And let’s say I know this about myself, and I’m free from cognitive biases.  If you ask me to predict whether I’ll have the same or different favorite food in ten years, I’ll say “same” — after all, there’s a 60-40 chance that’s correct!  Ditto for the other four categories.

Once again, Quoidbach et al refer to the number of times I answer “different” as “a measure of predicted changes in preference.”  But it isn’t — or rather, it has nothing to say about the predicted number of changes.  If you ask me “How many of the five categories do you think I’ll change in the next ten years?” I’ll say “two.”  While if you ask me, for each of the five categories in turn, “Do you think you’ll change this in the next ten years?” I’ll say no, five times straight.  This is not a contradiction and it is not a failure of rationality and it is not a cognitive bias.  It is math, done correctly.

(Relevant philosophical maxim about groundedness of belief:  “I always think I’m right, but I don’t think I’m always right.”  We correctly recognize that some subset of things we currently believe are wrong, but each particular belief we take as correct.  Update:  NDE in comments reminds me that WVO Quine is the source of the maxim.)

What kind of behavior would the authors consider rational in this case?  Presumably, one in which the proportion of “different” answers is the same in the prospective and retrospective conditions.  In other words, I’d score as bias-free if I answered

“My best friend and my favorite music will change, but my favorite food, vacation, and hobby will stay the same.”

This answer has a substantially smaller chance of being correct than my original one.  (108/3125 against 243/3125, if you’re keeping score at home.)  The authors’ suggestion that it represents a less biased response is wrong.
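For anyone actually keeping score at home, here is a short sketch of where those fractions come from. It assumes the five favorites change independently, each with the 40% chance from the example above; the independence is my simplifying assumption.

```python
from fractions import Fraction

p_change = Fraction(2, 5)  # assumed 40% chance a given favorite changes in ten years
p_same = 1 - p_change

# Chance that "same" five times straight turns out entirely correct.
all_same = p_same ** 5

# Chance that a prediction naming two specific changes (and three non-changes)
# turns out entirely correct.
two_named_change = p_change ** 2 * p_same ** 3

# Expected number of categories that change.
expected_changes = 5 * p_change

print(all_same, two_named_change, expected_changes)
```

So the expected number of changes is two, and yet the single most likely outcome is still "nothing changes."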

Now you may ask:  why didn’t Quoidbach et al just directly ask people “to what extent do you expect your personality to change over the next ten years?” and compare that with retrospective report?  To their credit, they did just that — and there they did indeed find that people predicted smaller changes than they reported:

Third, is it possible that predictors in study 1 knew that they would change over the next 10 years, but because they did not know exactly how they would change, they did not feel confident predicting specific changes? To investigate this possibility, we replicated study 1 with an independent sample of 1163 adults (M = 38.4 years, SD = 12.1 years, 78% women) recruited through the same Web site. Instead of being asked to report or predict their specific personality traits, these participants were simply asked to report how much they felt they had “changed as a person over the last 10 years” and how much they thought they would “change as a person over the next 10 years.” Because some participants contributed data to both conditions, we performed a multilevel version of the analysis described in study 1. The analysis revealed the expected effect of condition (β = –0.74, P = 0.007), indicating that predictors aged a years predicted that they would change less over the next decade than reporters aged a + 10 years reported having changed over the same decade. This finding suggests that a lack of specific knowledge about how one might change in the future was not the cause of the effects seen in study 1.

This study, unlike the others, addresses the question the paper proposes to consider.  To me, it seems questionable that numerical answers to “how much will you change as a person in the next 10 years?” are directly comparable with numerical answers to “how much did you change as a person over the last 10 years?” but this is a question about empirical social science, not mathematics.  Even if I were a social scientist, I couldn’t really judge this part of the study, because the paragraph I just quoted is all we see of it — how the questions were worded, how they were scored, what the rest of the coefficients in the regression were, etc, are not available, either in the main body of the paper or the supplementary material.

[Update:  Commenter deinst makes the really important point that Quoidbach et al have made their data publicly available at the ICPSR repository, and that things like the exact wording of the questions, the scoring mechanism, are available there.]

Do we actually underestimate the extent to which we’ll change our personalities and preferences over time? It certainly seems plausible:  indeed, other researchers have observed similar effects, and the “changed as a person” study in the present paper is suggestive in this respect.

But much of the paper doesn’t actually address that question.   Let me be clear:  I don’t think the authors are trying to put one over.  This is a mistake — a somewhat subtle mistake, but a bad mistake, and one which kills a big chunk of the paper. Science should not have accepted the article in its current form, and the authors should withdraw it, revise it, and resubmit it.

Yes, I know this isn’t actually going to happen.

