Do we really underestimate how much we’ll change? (or: absolute value is not linear!)

Let’s say I present you with a portfolio of five stocks, and ask you to predict each stock’s price six months from now.  You know the current prices, and you know stocks are pretty volatile, but absent any special reason to think these five companies are more likely to have good years than bad ones, you write down the current price as your best prediction for all five slots.

Then I write a paper accusing you of suffering from an “end of financial history illusion.”  After all, on average you predicted that the stock values won’t change at all over six months — but in reality, stock prices change a lot!  If I compute how much each of the five stock prices changed over the last six months, and average those numbers, I get something pretty big.  And yet you, you crazy thing, seem to believe that the stock prices, having arrived at their current values, are to be fixed in place forever more.

Pretty bad argument, right?

And yet the same computation, applied to five personality traits instead of five stocks, got published in Science.  Quoidbach, Gilbert, and Wilson write:

In study 1, we sought to determine whether people underestimate the extent to which their personalities will change in the future. We recruited a sample of 7519 adults ranging in age from 18 to 68 years [mean (M) = 40 years, standard deviation (SD) = 11.3 years, 80% women] through the Web site of a popular television show and asked them to complete the Ten Item Personality Inventory (1), which is a standard measure of the five trait dimensions that underlie human personality (i.e., conscientiousness, agreeableness, emotional stability, openness to experience, and extraversion). Participants were then randomly assigned either to the reporter condition (and were asked to complete the measure as they would have completed it 10 years earlier) or the predictor condition (and were asked to complete the measure as they thought they would complete it 10 years hence). We then computed the absolute value of the difference between participants’ ratings of their current personality and their reported or predicted personality and averaged these across the five traits to create a measure of reported or predicted change in personality.

This study is getting a lot of press:  it was written up in the New York Times (why, oh why, is it always John Tierney?), USA Today, and Time, and even made it to Mathbabe.

Unfortunately, it’s wrong.

The difference in predictions is not the predicted difference 

The error here is just the same as in the story of the stocks.  The two quantities

  • The difference between the predicted future value and the current value
  • The predicted difference between the future value and the current value

sound like the same thing.  But they’re not the same thing.  Life’s noncommutative that way sometimes. Quoidbach et al are measuring the former quantity and referring to it as if it’s the latter.

You can see the difference even in a very simple model.  Let’s say the way a stock works is that, over six months, there’s a 30% chance it goes up a dollar, a 25% chance it goes down a dollar, and a 45% chance it stays the same.  And let’s say you know this.  Then your estimated expected value of the stock price six months from now is “price now + 5 cents,” and the first number, the size of the difference between your predicted value and the current value, is 5 cents.

But what’s the second number?  In your model, the difference between the future price and the current price has a 55% chance of being a dollar and a 45% chance of being zero.  So your prediction for the size of the difference is 55 cents — 11 times as much!
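The arithmetic is easy to check by machine. Here is a minimal sketch using exact fractions; the names `MODEL`, `size_of_predicted_change`, and `predicted_size_of_change` are made up for this illustration, and the probabilities are just the toy ones from the example, not anything estimated from data.

```python
from fractions import Fraction

# Toy model from the text: price change in dollars -> probability.
MODEL = {
    Fraction(1): Fraction(30, 100),   # up a dollar
    Fraction(-1): Fraction(25, 100),  # down a dollar
    Fraction(0): Fraction(45, 100),   # unchanged
}


def size_of_predicted_change(model):
    """|E(change)|: take the expectation first, then the absolute value."""
    return abs(sum(change * prob for change, prob in model.items()))


def predicted_size_of_change(model):
    """E(|change|): take the absolute value first, then the expectation."""
    return sum(abs(change) * prob for change, prob in model.items())


first = size_of_predicted_change(MODEL)
second = predicted_size_of_change(MODEL)
print(f"size of predicted change: {float(first):.2f} dollars")
print(f"predicted size of change: {float(second):.2f} dollars")
print(f"ratio: {second / first}")
```

The only difference between the two functions is where `abs` sits relative to the sum, which is the whole point.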

If you measure the first quantity and say you’ve measured the second, you’re gonna have a bad time.

In the “predictor” condition of the paper, a rational respondent quizzed about a bunch of stocks will get a score of about 5 cents.  What about the “reporter” condition?  Then the respondent’s score will be the average value of the difference between the price six months ago and the price now; this difference will be a dollar 55% of the time and zero 45% of the time, so the scores in the reporter condition will average 55 cents.

To sum up:  completely rational respondents with full information ought to display the behavior observed by Quoidbach et al — precisely the behavior the authors adduce as evidence that their subjects are in the grips of a cognitive bias!

To get mathy with it for a minute — if Y is the value of a variable at some future time, and X is the value now, the two quantities are

  • |E(Y-X)|
  • E(|Y-X|)

Those numbers would be the same if absolute value were a linear function.  But absolute value isn’t a linear function.  Unless, that is, you know a priori the sign of Y-X.  In other words, if people knew for certain that over a decade they’d get less extraverted, but didn’t know to what extent, you might expect to see the same scores appearing in the predictor and reporter conditions.  But this is not, in fact, something people know about themselves.
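For the record, the general relationship between the two quantities is the triangle inequality for expectations (a special case of Jensen's inequality). The block below only restates that standard fact in the notation above, assuming Y-X has a finite expectation; nothing in it comes from the paper.

```latex
\[
  \bigl|\mathbb{E}(Y - X)\bigr| \;\le\; \mathbb{E}\bigl(|Y - X|\bigr),
\]
with equality exactly when $Y - X \ge 0$ almost surely or $Y - X \le 0$ almost surely.
For the toy stock model: $|0.30 - 0.25| = 0.05 < 0.30 + 0.25 = 0.55$.
```

So the first quantity can never exceed the second, and the two agree only when the direction of change is known in advance.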

I always think I’m right but I don’t think I’m always right

The study I’ve mentioned isn’t the only one in the paper.  Here’s another:

[In study 3]…we recruited a new sample of 7130 adults ranging from 18 to 68 years old (M = 40.2 years, SD = 11.1 years, 80% women) through the same Web site and asked them to report their favorite type of music, their favorite type of vacation, their favorite type of food, their favorite hobby, and the name of their best friend. Participants were then randomly assigned either to the reporter condition (and were asked to report whether each of their current preferences was the same as or different than it was 10 years ago) or the predictor condition (and were asked to predict whether each of their current preferences would be the same or different 10 years from now). We then counted the number of items on which participants responded “different” and used this as a measure of reported or predicted changes in preference.

Let’s say I tend to change my favorite music (respectively vacation, food, hobby, and friend) about once every 25 years, so that there’s about a 40% chance that in a given ten-year period I’ll make a change.  And let’s say I know this about myself, and I’m free from cognitive biases.  If you ask me to predict whether I’ll have the same or different favorite food in ten years, I’ll say “same” — after all, there’s a 60-40 chance that’s correct!  Ditto for the other four categories.

Once again, Quoidbach et al refer to the number of times I answer “different” as “a measure of predicted changes in preference.”  But it isn’t — or rather, it has nothing to say about the predicted number of changes.  If you ask me “How many of the five categories do you think you’ll change in the next ten years?” I’ll say “two.”  While if you ask me, for each of the five categories in turn, “Do you think you’ll change this in the next ten years?” I’ll say no, five times straight.  This is not a contradiction and it is not a failure of rationality and it is not a cognitive bias.  It is math, done correctly.
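A rough sketch of the two questions side by side, under the toy assumption that each of the five items changes with the same known probability. The function names are invented for illustration, and the 60% case is a hypothetical added only to show that the mismatch can run in either direction.

```python
from fractions import Fraction


def expected_changes(p_change, n_items=5):
    """Answer to "how many items will change?": n * p, by linearity of expectation."""
    return n_items * p_change


def count_of_different_answers(p_change, n_items=5):
    """Item-by-item answers: say "different" only when change is the likelier outcome."""
    return n_items if p_change > Fraction(1, 2) else 0


for p in (Fraction(2, 5), Fraction(3, 5)):
    print(f"p = {p}: expect {expected_changes(p)} changes, "
          f"answer 'different' {count_of_different_answers(p)} times")
```

With a 40% chance per item, the count of “different” answers is 0 while the expected number of changes is 2; with a hypothetical 60% chance, the count jumps to 5 while the expectation is only 3.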

(Relevant philosophical maxim about groundedness of belief:  “I always think I’m right, but I don’t think I’m always right.”  We correctly recognize that some subset of things we currently believe are wrong, but each particular belief we take as correct.  Update:  NDE in comments reminds me that WVO Quine is the source of the maxim.)

What kind of behavior would the authors consider rational in this case?  Presumably, one in which the proportion of “different” answers is the same in the prospective and retrospective conditions.  In other words, I’d score as bias-free if I answered

“My best friend and my favorite music will change, but my favorite food, vacation, and hobby will stay the same.”

This answer has a substantially smaller chance of being correct than my original one.  (108/3125 against 243/3125, if you’re keeping score at home.)  The authors’ suggestion that it represents a less biased response is wrong.
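For anyone actually keeping score at home, here is a short sketch of where those fractions come from. It assumes, as the numbers above implicitly do, that the five items change independently, each with probability 2/5; the names `P_CHANGE` and `prob_answer_correct` are made up for this example.

```python
from fractions import Fraction
from math import comb

P_CHANGE = Fraction(2, 5)  # toy per-item chance of changing within ten years
N_ITEMS = 5


def prob_answer_correct(n_marked_different, p_change=P_CHANGE, n_items=N_ITEMS):
    """Chance that one specific answer sheet is right on every item.

    Assumes independent items: each item marked "different" must change,
    and each item marked "same" must stay put.
    """
    n_marked_same = n_items - n_marked_different
    return p_change ** n_marked_different * (1 - p_change) ** n_marked_same


all_same = prob_answer_correct(0)
two_named = prob_answer_correct(2)
# Chance that exactly two items change, without saying which two.
some_two = comb(N_ITEMS, 2) * two_named

print(f"all five 'same' correct:        {all_same}")
print(f"two named items change, right:  {two_named}")
print(f"exactly two change (any two):   {some_two}")
```

Under these assumptions, “exactly two will change” is a good bet about the count, but any particular guess about which two is a worse bet than “none of them.”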

Now you may ask:  why didn’t Quoidbach et al just directly ask people “to what extent do you expect your personality to change over the next ten years?” and compare that with retrospective report?  To their credit, they did just that — and there they did indeed find that people predicted smaller changes than they reported:

Third, is it possible that predictors in study 1 knew that they would change over the next 10 years, but because they did not know exactly how they would change, they did not feel confident predicting specific changes? To investigate this possibility, we replicated study 1 with an independent sample of 1163 adults (M = 38.4 years, SD = 12.1 years, 78% women) recruited through the same Web site. Instead of being asked to report or predict their specific personality traits, these participants were simply asked to report how much they felt they had “changed as a person over the last 10 years” and how much they thought they would “change as a person over the next 10 years.” Because some participants contributed data to both conditions, we performed a multilevel version of the analysis described in study 1. The analysis revealed the expected effect of condition (β = –0.74, P = 0.007), indicating that predictors aged a years predicted that they would change less over the next decade than reporters aged a + 10 years reported having changed over the same decade. This finding suggests that a lack of specific knowledge about how one might change in the future was not the cause of the effects seen in study 1.

This study, unlike the others, addresses the question the paper proposes to consider.  To me, it seems questionable that numerical answers to “how much will you change as a person in the next 10 years?” are directly comparable with numerical answers to “how much did you change as a person over the last 10 years?” but this is a question about empirical social science, not mathematics.  Even if I were a social scientist, I couldn’t really judge this part of the study, because the paragraph I just quoted is all we see of it — how the questions were worded, how they were scored, what the rest of the coefficients in the regression were, etc, are not available, either in the main body of the paper or the supplementary material.

[Update:  Commenter deinst makes the really important point that Quoidbach et al have made their data publicly available at the ICPSR repository, and that things like the exact wording of the questions and the scoring mechanism are available there.]

Do we actually underestimate the extent to which we’ll change our personalities and preferences over time? It certainly seems plausible:  indeed, other researchers have observed similar effects, and the “changed as a person” study in the present paper is suggestive in this respect.

But much of the paper doesn’t actually address that question.   Let me be clear:  I don’t think the authors are trying to put one over.  This is a mistake — a somewhat subtle mistake, but a bad mistake, and one which kills a big chunk of the paper. Science should not have accepted the article in its current form, and the authors should withdraw it, revise it, and resubmit it.

Yes, I know this isn’t actually going to happen.


60 thoughts on “Do we really underestimate how much we’ll change? (or: absolute value is not linear!)”

  1. JSE says:

    And of course I’m well aware of the irony that, after writing several posts in a row on the importance of deferring to experts, here I am, a non-psychologist, disputing a peer-reviewed psychology paper! I can only say that a) I think the mistake is not psychological, but mathematical; b) I did discuss this with some social scientists before posting to make sure I was not 100% full of it; and c) I promise to keep my mind open to psychologists and other researchers in social science who want to convince me that I’m 80% full of it. (That last 20% is always hard for me to concede.)

  2. BCnrd says:

    Will you consider writing a letter to the editor of Science to explain, via succinct examples (that can fit in a “letter to the editor”), how the paper is flawed? Or perhaps contact the authors with these points?

  3. If they are anything like the NYTimes, they will publish plenty of articles which either misrepresent or downright falsify the facts, and then, in a brazen display of undeserved arrogance, will print correction pages consisting solely of a small note indicating the misuse of a preposition.

  4. Tom Leinster says:

    Excellent post, and I think BCnrd’s idea is great.

    Re the new theme: I like the overall design, especially the title, but for me the font for the main text is too small. It’s also grey rather than black, as I see it anyway. The type is OK in small doses, but in this longish post I found myself wanting to skip bits because it was too much effort to read. Quoted text comes out both bigger and blacker. I’m reading on a laptop.

  5. JSE says:

    I completely agree that the default Chunk text is too small. On my MacBook Pro using Chrome, I just hit Command-+ a few times and this zooms the page so that the font is nice and big. It’s “sticky” — the page stays at my preferred zoom when I come back to it later. Let me know if this works!

    I don’t think there’s a “make the font blacker” keyboard shortcut, though. Maybe I should keep writing about things I’m slightly peeved about and this will induce me to use more boldface.

  6. […] over at Quomodocumque has written the post I wish I wrote on the recent “end of history illusion” introduced in this New York Times article. […]

  7. Deinst says:

    The one thing that the paper should be commended for is that they made the data publicly available at (see the last item in the bibliography). This should make writing a publishable rebuttal much easier.

  8. Deinst says:

    The data also addresses your complaints about not knowing how the questions were worded, scored, etc. I downloaded the data to figure out exactly what was going on with figure 1. Reporting standardized values without reporting means and standard deviations is evil. I’m pretty skeptical of standardizing absolute data in general; zero is a much more meaningful landmark than the mean.

  9. JSE says:

    Great find! I added an update to this effect. I wasn’t able to convince the archive to unzip, but it sounds like you did. I’d be very curious to hear more about how the data looks to you, if you feel like messing with it!

    I’m less interested in “rebuttal” than in the question of whether their raw data has enough information to measure the quantity they really want to measure.

  10. Terence Tao says:

    If you have the custom CSS upgrade, then you can change the font parameters. In my own CSS (which consists mostly of code contributed by commenters, actually), I have the snippet

    body {

    which I believe is what makes my blog text black. Presumably one can also fiddle with the font size in this fashion.

  11. Rutger says:

    Do the respondents ask their friends, whether they (the respondents) have changed? You may think you’ve changed a lot, but it might not seem so from the outside.

  12. JSE says:

    I don’t have that upgrade, because I fear I’d spend too much time fiddling!

  13. Deinst says:

    There were 10 questions, each answered on a scale of 1-10 with 1 meaning “not at all” and 10 meaning “extremely”.

    1. I see myself as Extroverted, enthusiastic.
    2. I see myself as Critical, quarrelsome.
    3. I see myself as Dependable, self-disciplined.
    4. I see myself as Anxious, easily upset.
    5. I see myself as Open to new experiences, complex.
    6. I see myself as Reserved, quiet.
    7. I see myself as Sympathetic, warm.
    8. I see myself as Disorganized, careless.
    9. I see myself as Calm, emotionally stable.
    10. I see myself as Conventional, uncreative.

    Note that questions i and i+5 are essentially the same question reversed. Change is measured by the sum of the absolute differences between the two answers to the same questions (Manhattan metric).

    What I found most surprising is the opposite of what the news reports are saying. The remembered change is a change of about 1.46 points per question, while the predicted change is about 1.2 points per question. Yes, the amount of change is underestimated consistently, but not by much. This is implicit, but not obvious in the reported values of beta in the paper. I wish I could predict the magnitude of a change in the price of a stock with something near that accuracy. Calling it “The End Of History” is at best hyperbole.

    The interesting problem is to try to tease out the accuracy of the predictions. There seem to be some correlations between the remembered states and the changes. It seems doubtful that there is enough data to tease out anything significant.

  14. Leila Schneps says:

    Great post, Jordan! I like it when reading a math argument actually gets me laughing. You did put your finger exactly on the weak point, and illustrated it perfectly.

  15. John Rickert says:

    Is it possible that both 1.46 and 1.2 are accurate and that change slows down as people age? When the pro- and retro-pective changes are broken down by age does each average hold steady?

  16. JSE says:

    (reply to John Rickert) — indeed, change does slow down on average as people age; the claim the study makes is that people’s estimated future change is too low even taking that ambient effect into account.

  17. I’m quite content with pressing CTRL-+ so as to read all the text easily. Nevertheless, in the title of the post, “DO WE REALLY UNDERESTIMATE” appears to me to be in grey (not black) (?). I certainly think black on white is more legible than grey on white.

  18. Deinst says:

    @John Rickert Yes, the 1.46 to 1.2 ratio is a slight overestimate of the actual effect. It is easy to compute, though. It is moderately consistent in the personality tests, both the test where they ask the respondents to report their current and time-shifted state and compute the difference, and the test where they ask the respondents to estimate just the magnitude of the change. All I wanted to do is to note that the effect was considerably smaller than the news reports seemed to indicate.

    In replicating the statistical analyses I note that the values of R^2 are around 0.05 or less. I tend to be suspicious of reports of regressions that do not report all three of beta, R^2, and p. I am neither a social psychologist nor a statistician, so take anything I say with a large grain of salt.

  19. Richard Séguin says:

    JSE said “I don’t have that upgrade, because I fear I’d spend too much time fiddling!”

    Oh, just go ahead and do it! A year ago I switched to the LaTeX memoir document class and spent some time figuring out how to customize the styles of running headers, chapter pages, etc. In the end I was extremely pleased with the result and have never regretted the time spent. If you’re wondering how to do something, someone else has probably already figured it out and documented it on the internet somewhere.

    I’m wondering if WordPress lets you choose a theme and play with it before it actually goes live on the internet.

  20. JSE says:

    @Deinst: What’s the effect size in the follow-up to Study 1 where they really did ask people “how much do you expect to change as a person”?

  21. Pedro says:

    Reblogged this on From experience to meaning… and commented:
    I found this post through a tweet from @wduyck and it’s quite revealing. I had read some press about the original research, but sometimes it’s good that someone from another field of expertise takes a look at research…

  22. […] New videos from Coach Jason covering the new moves in the January programming: Lateral lunge with overhead reach / Contralateral jump & reach / Snatch-grip Romanian deadlift Here’s why the brain is so awesome Top tips for your first 2 years of CrossFit How to sit in a chair and drink tea Why you won’t be the person you expect to be (plus a rebuttal) […]

  23. […] Pashler, a psychologist at UCSD, tweeted my post about the “end of history” study, and this led me to his interesting paper, “Is the Replicability Crisis […]

  24. NDE says:

    The 40% example is not quite the same phenomenon, right? It’s nonlinearity of the rounding function (or equivalently of roundoff error), not of absolute value. If you waited 15 years then all the 60%’s would round up and you’d get an overestimate.

    For positive but small summands, Quine wrote about this apparent paradox in _Quiddities_, page 21: “To believe something is to believe that it is true; therefore a reasonable person believes each of his beliefs to be true; yet experience has taught him to expect that some of his beliefs, he knows not which, will turn out to be false. A reasonable person believes, in short, that each of his beliefs is true and that some of them are false. I, for one, had expected better of reasonable persons.”

  25. JSE says:

    @NDE: right — there are two different nonlinear functions whose linearity is implicitly made use of.

    Re Quine — THANK YOU! I knew a famous philosopher said this but I failed to be able to Google it. Will change post.

  26. […] 2013 issue of Science, has gotten a lot of publicity on NPR and the web. For a nice critique, see Do We Really Underestimate How Much We’ll Change? (Or: Absolute Value Is Not Linear!) at the blog […]

  27. Martha Smith says:

    A lot of interesting discussion here. But — after writing the above blog entry (linked back as Comment 26 above) last night, I thought more while taking a walk this morning, and realized there is a flaw in Jordan’s argument. I agree that |E(Y-X)| and E(|Y-X|) are not the same function. But in the application in question, this does not apply, because “personality” is not a numerical variable, nor can it be measured as a numerical variable. However, the “dimensions” of personality can be measured by numerical variables. The personality questionnaire used in the study did not have a single numerical outcome; it had five numerical outcomes, one for each personality “dimension”. The sum of the scores on all ten questions has no meaning. So there were no Y and X available to measure “personality at time t” and “personality at time t + 10”. There were only five Yi’s and Xi’s to measure five “scores on personality dimension i” at time t and t + 10. These five Yi’s and Xi’s were then used to measure “change in personality dimension i” over the interval. But adding the signed changes would not make sense as a measure of personality; it would be comparing apples and oranges. (For example, why should an increase in extroversion cancel out a decrease in conscientiousness?) So the only sensible thing to do to measure “total personality change” would be to add the absolute values on the five “dimensions.”

  28. JSE says:

    Martha — I agree that it would not make sense to add the five signed differences “predicted trait value at t+10 – self-reported trait value at t.” But it also doesn’t make sense to do what they did. I’m not sure how, if at all, you could measure the thing they want to measure using the data they collected.

  29. Martha Smith says:

    Jordan – I think perhaps my last line was unclear. I should have said, “add the absolute values of the differences for the five ‘dimensions’,” (i.e., add the absolute values of the changes in each dimension), which gives (up to a constant factor — i.e., they took the average, not the sum) what the authors did. In other words, their measure of “personality change” is the average of the absolute changes in the five dimensions. This seems to be about as good as one could hope to get.

    I would agree that the data are poor — it would have been better to use a longer, presumably better measure of each of the five personality dimensions, rather than the short form using only two questions for each dimension. (See Deinst’s Jan 6 comment.) But I can see that using a longer questionnaire would likely have reduced response rate — an unfortunate but common problem in collecting data from people, one that requires compromises.

  30. JSE says:

    It may be that it’s about as good as one could hope to get with the study design they used, but that doesn’t change the fact that it’s not good enough to draw the conclusions they want to draw. And more refined personality measures wouldn’t change that. Fully rational people with fully correct knowledge about personality change would still get lower scores in the predictor condition than in the reporter condition, so the result that you really do get lower scores in the predictor condition than in the reporter condition still wouldn’t suggest an “end of history illusion.”

  31. Martha Smith says:

    I wasn’t arguing for Quoidbach et al’s arguments (see below), just pointing out how your absolute value argument didn’t fit.

    Here, for what it’s worth, are (I) my comments on the Quoidbach et al paper, based on reading the paper (but not the supplementary information) before having seen the comments on this blog, followed by (II) some additional comments after looking at the supplemental material.

    I. Initial comments on the paper:

    Positive points:
    1. The authors did not do huge numbers of hypothesis tests on a single data set, instead taking additional samples to test further questions.
    2. The sample sizes seemed large enough for the questions asked.

    Negative points:
    1. Other than being large, the samples had lots of questionable aspects. First, (except possibly for the MIDUS sample) they appear to be volunteer samples – but there was no discussion of how having volunteers might affect the population from which the samples were effectively drawn. Second, the initial sample for study 1, the first “replication” sample for study 1, the second “replication” sample for study 1, the sample for study 2, and the sample for study 3 were all majority female (respectively, 80%, 86.6%, 78%, 82% and 80%), but there was no discussion of whether or not it is known that men and women differ on the questions asked. Third, the MIDUS sample was 55% women, which raises the same question, since it was used for comparison with the study 1 sample. Fourth, the sample for study 4 was recruited from a different website, so may have been drawn from a different population (and also had a somewhat narrower age range), but the issue of comparing samples from different populations was not discussed.
    2. There was no discussion of whether the statistical analysis techniques used were appropriate – e.g., using linear regression with no discussion of whether or not there was reason to believe that the assumptions of that model (linearity, equal variance, independence of observations) were plausibly well-enough satisfied to use the model.
    3. The sweeping-sounding conclusion that people “will overpay for future opportunities to indulge in [their current preferences]” seems too strong for a study asking about one specific type of situation. “Might” would be more suitable than “will.” (But maybe this is just a matter of mathematicians being more precise than most people about use of language?)
    4. A ten-item inventory seems pretty ill-suited to answering a question like this.
    5. As is all too common, the authors did not discuss practical significance of the effect sizes they obtained, only statistical significance. (Both are important, but discussing practical significance is especially important with large sample sizes.)
    6. I would have preferred to see non-smoothed plots in Figure 1 – so I could get some sense of variability rather than having it smoothed away.
    7. So what? Why was this question important enough to warrant a publication in Science?

    II. This morning I looked at the Supplementary Information. Things there that are relevant to my concerns above or others that ought to be of concern:
    1. The “population” from which most of the samples were drawn could be described as “people who went to the website of the France 2 channel program “Leurs Secrets du Bonheur” and clicked on the link to the researchers’ website.” There was some aspect of random assignment: The samples for Study 1, a follow-up to Study 1, and Study 3 were randomly assigned from “a first wave of data collection in November, 2011;” those for “a follow-up to Study 1 or Study 2” were randomly assigned from “a second wave of data collection in January, 2012.” In addition, “participants in Study 1, the follow-ups to Study 1, Study 2, and Study 3 completed numerous other questionnaires for other research projects.”
    2. They do mention (Supplement item 7) that “More than 80% of the participants in Study 1, Study 2, and Study 3 were women, so we also performed regression analyses on men and women separately to ensure that the results were not limited to a single gender.”

  32. Deinst says:

    Sorry for taking so long to respond. Life temporarily intervened.

    As far as the experiment where they asked people to just estimate the magnitude of the change in their personality, on average people reported a change of 4.85, and predicted a change of 4.15, so again they underestimated by a bit less than 20%. In this case, the decade was not significant, so comparing the means is a bit more reasonable (this really only shows that the differences do not vary linearly with decade).

    As far as computing a value for R^2 for that experiment, I do not know how to compute R^2 for multilevel models. The function pamer.fnc from LMERConvenienceFunctions reports that about 4.2% of the variation is due to the condition and decade. Do not trust in that too greatly, as I have only a vague idea of what is going on inside of LMERConvenienceFunctions, and have little idea as to whether the functions are applicable in this case.

    Things also start to get interesting in the values experiment. The major difference seems to be that people assess their values 10 years ago to be much worse than people 10 years younger rate themselves today. This does not surprise me much, and is more a measure of the fact that we fail to live up to our ideals.

    What is problematic with this experiment is that there is a significant difference between the present day values of the two groups. Comparing the sum of the ten responses (a meaningless statistic, but one that should not depend on the condition), and running a t-test on the means of the sums for the two groups, we find that the means differ (p=0.02), with the remembering group rating themselves slightly higher. This suggests that the two sets are possibly not comparable. The difference is small (62.9 vs 62.1), but a significant fraction of the memory penalty (62.9 – 57.9). For comparison, the predicted mean was 63.0. This suggests to me that somehow the random choice has leaked into the reporting of the current state (all the questions are on one web page, so one can go back and change the answers once one sees the future questions). This seems like something that psychologists should expect and account for.

  33. […] week about how we underestimate how much our personalities will change in the future? There is an interesting critique of the methods and interpretation of that study by math professor Jordan Ellenberg on his […]

  34. Deinst says:

    I looked at the data for study 3 (counting changes in favorite music, vacation, food, hobby, and friend) this morning, and found two interesting details. First, the respondents were asked to rate the probability of change on a scale of 1 to 4 as follows:

    1 = Absolutely the same
    2 = Probably the same
    3 = Probably Different
    4 = Absolutely different

    The data was then dichotomized, with 1 and 2 mapping to p=0 and 3 and 4 mapping to 1. I am not sure of the reason for this. If we assign probabilities 0, 1/3, 2/3, and 1 to the four categories, we still get a significant result of slightly smaller magnitude. There will still be quantization error, but I think four levels is enough to dampen your argument a la Quine.

    Second, and considerably more disconcerting, the original survey asked about six things: music, vacation, movie, food, hobby, and friend. The favorite movie question seems to have been dropped because people were more likely to predict that their favorite movie would change than to report that their favorite movie had changed. This strikes me as a definite statistical sin.

  35. Martha Smith says:

    Re Deinst’s second point: Actually, the SOM says that the favorite movie question had been deleted from the analysis “because more than 200 participants failed to complete it.” This might not have been readily apparent from the data provided.

    Re the first point: The SOM says, “Although results using this continuous measure were significant (β condition = -.06, p < .001), we dichotomized the response scale for the sake of clarity.” This reason for dichotomizing seems pretty fuzzy to me, but the phrasing of the question was pretty fuzzy to begin with. Four point scales are not very informative; dichotomies even less so.

  36. […] searing critique of that “you think you won’t change in the future” paper. “This is not a contradiction… not a failure of rationality… not a […]

  37. Deinst says:

    @Martha Thanks, I missed that. I apologize for casting unwarranted aspersions. I should have noticed the missing movie data, and remembered to check the supplementary materials. (Though looking at the data, of the 217 records with the movie change missing, 117 were remembering, and 100 predicting.) The main result of underpredicting change is still significant if the movie is added back in with the 217 records removed.

    Interestingly, if we look at the non-dichotomized data, people significantly overestimate the amount by which their preference in movies, music, food, and vacation changes. They only underestimate their change in hobbies and friends. But they pretty extremely underestimate their change in friends, more than enough to overcome the other overestimations. Things are a bit fuzzier if one uses the dichotomized data: only the movie is significantly overestimated, and the others (music, food, and vacation) are either insignificant or are underestimated by a small amount. I would hope that the fact that we underestimate the probability of changing best friends is well known to psychologists.

  38. Mathnerd314 says:

    Re deferring to the experts: the hierarchy of purity starts from math (most pure) down to sociology (least pure). I think higher levels of purity should always be allowed to call BS on lower levels, essentially because they have a better understanding of the underlying structure of the problem. The standard of evidence for a formal proof is much higher than that of e.g. this report; so disputing the report should be much easier than disputing a proof. Perhaps there is underlying knowledge necessary to understand how the report is conducted, but otherwise it doesn’t need much for someone to argue with it.

  39. JSE says:

    @Mathnerd314: I can’t get behind you here — as a non-psychologist I do NOT have a better understanding of the underlying structure of personality change than Quoidbach et al do — that’s why my criticisms are purely of the mathematical computations in their paper, and I express no opinion of my own as to whether you should believe in the phenomenon they’re trying to study.

  40. JSE says:

    @Deinst: the starkly different behavior in the different domains is really interesting! It makes me even more unsure that one should believe in a general tendency to underpredict change; it sounds like their result obviously is highly non-robust to choosing which domains to ask about and how to weight those domains.

    @Martha: Now I’m het up that they referred to a four-point scale as a “continuous measure!” (By the way, my first name is Jordan, not Jason.)

  41. Deinst says:

    @JSE For every epsilon greater than a quarter, there exists a delta…

  42. Martha Smith says:

    @mathnerd314: I also can’t go with the “hierarchy of purity,” because of Garbage In, Garbage Out: If a mathematician reasons from a false premise, the conclusion says nothing about the real world. I’ve seen plenty of examples of this in a biology discussion group I’ve been going to for several years: I can read a paper and find the reasoning good, but then get to the discussion and find that one of the biologists gives new information that wasn’t in the paper that makes the argument in the paper inapplicable. For example, one paper gave simulations that seemed well-planned and carried out, and that indicated that estimation method 2 for a certain parameter is usually noticeably more accurate than estimation method 1. But then several biologists pointed out that in nature, the value of the parameter typically lies in the small region where the simulations in fact indicate that method 1 is more accurate than method 2.

    @JSE: Apologies for the name mix-up. (I get called Margaret, Mary, Marsha, or Martin quite often.)

  43. […] there’s another serious issue with the new research, which was highlighted in a recent blog post: that is, the predictors might well have believed that their personality will change, but they […]

  44. Deinst says:

    I apologize for continuing to clutter up your blog, but I am too lazy to make my own.

    Your first point, that the difference in predictions is not the predicted difference, is spot on, but possibly not in the way that you imagined. It appears that the respondents, when predicting the future, have a simulation model of themselves that they run once. Only mathematicians and their ilk would run it until the mean converged. (Running a simulation of oneself repeatedly without convergence seems an accurate description of depression.) The major flaw in the simulation model is that it is loath to go in a ‘negative’ direction, and seems to translate negative results into non-negative ones. Using your example of a stock that has a 25% chance of going down a dollar, a 45% chance of staying the same, and a 30% chance of going up a dollar, if you ask the CEO what will happen, she will tell you something like 5% down, 60% remaining the same, 35% up, having consciously or unconsciously moved the negative result to no change, or a small positive change. Although the expected outcome has gone up to 30 cents, the expected magnitude of change has gone down to .40. If we look at the signed change in results of the ten questions, in all but one the magnitude of predicted change is greater than or equal to the magnitude of remembered change (signed), but the magnitude of remembered absolute change is greater than the predicted absolute change. (The odd question is ascertaining whether one is open to change, and I suspect it is unusual due to selection bias caused by selecting people watching the program on happiness.)
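
    The arithmetic in the stock example can be checked with a short Python sketch. The two distributions are the ones given above; the function and variable names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

def expected_change(dist):
    """Expected signed change: sum of p * x over the outcomes."""
    return sum(p * x for x, p in dist.items())

def expected_abs_change(dist):
    """Expected magnitude of change: sum of p * |x| over the outcomes."""
    return sum(p * abs(x) for x, p in dist.items())

# Outcome in dollars -> probability, as stated in the comment.
actual = {-1: Fraction(25, 100), 0: Fraction(45, 100), 1: Fraction(30, 100)}
ceo = {-1: Fraction(5, 100), 0: Fraction(60, 100), 1: Fraction(35, 100)}

for name, dist in (("actual", actual), ("CEO", ceo)):
    # Sanity check: probabilities must sum to one.
    assert sum(dist.values()) == 1
    print(name,
          "expected change:", float(expected_change(dist)),
          "expected |change|:", float(expected_abs_change(dist)))
```

    Under the actual distribution the expected change is 5 cents and the expected magnitude is 55 cents; under the CEO's version the expected change rises to 30 cents while the expected magnitude falls to 40 cents. The mean and the mean absolute value move in opposite directions, which is the sense in which absolute value is not linear.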

    In looking closer at the data I noticed that the distribution of remembered signed changes has a trimodal (Batman) distribution with a large peak at 0 and smaller peaks at ±2. I am not sure of the source of this. Even more puzzling is that the change in remembered values sometimes has peaks at -3, 0, and 3. I will write to the authors to try to understand this, and a couple of other data weirdnesses.

    Your second point I think is a bit less valid. Not mathematically, of course, but psychologically. The way you, Quine, and sometimes I, think of probability is different from the way the rest of us east African plains apes with delusions of grandeur think of probability. Although we live, in fact thrive, in a very uncertain world, mathematical probability is not something that we intuitively understand. There are enough cognitive biases involved in making predictions, e.g. the gambler’s fallacy, that I suspect that the probability that someone predicts that an event occurs is a monotonic function of the actual probability that that event occurs. Of course this function would be almost certainly nonlinear, but I suspect that it may be close enough to extract some information. However, any p-value computed for statistics that purport to compare probabilities is likely quite close to meaningless.

  45. JSE says:

    “It appears that the respondents when predicting the future have a simulation model of themselves that they run once.” That seems strange to me, but it’s an empirical question. Is that actually what people do? If you ask them a prediction question a bunch of times in a row, do they give a lot of different answers reflecting their prior distribution on the predicted variable?

  46. Deinst says:

    We have a built-in caching algorithm, but I am almost certain that if you waited long enough, or distracted them with enough intervening questions for them to forget their answer, they would likely come up with a different answer. One indication of this is the anchoring and adjustment effect (where answers are affected by vaguely related recent information). We see this in the fact that people’s reports of their current state are affected by whether they were asked to predict their future state or remember their past state.

  47. Martha Smith says:

    Deinst raises a good point in his last comment: It is well known that phrasing of questions can influence answers. For examples and references, see

  48. Tim Byron says:


    I too wrote a critique of this study, here:

    When I started writing the critique, I knew that this blog post existed, but had only really very briefly skimmed it; I wanted to form my own opinions on the study from reading the article. When I did so, the complete lack of, say, averages of each condition, or descriptions of the questions asked, in the article bugged me, and I wrote about half of a post about it being frustrating that I couldn’t see those numbers to get a sense of how big the change actually was. I then looked at this post again, and noticed the update referring to deinst in the comments having a link to the actual raw data (which I couldn’t access despite my uni being on the list of academic institutions that should have access – bah!) and deinst’s actually giving proper details about the study.

    Anyway, just wanted to say thanks to deinst for writing that stuff out, which was incredibly useful. And the info there should be read more widely, thus my post. So cheers!

  49. Ben Golub says:

    “If you ask them a prediction question a bunch of times in a row, do they give a lot of different answers reflecting their prior distribution on the predicted variable?”

    A friend who studies cognition says that if you ask people, “I’m going to draw three numbers between 1 and 10, inclusive; what is your best prediction of what they’ll be?” you tend to get answers like “3, 7, 5” rather than “5.5, 5.5, 5.5”. The general heuristic seems to be mean $\pm$ standard deviation for a “typical” triad. Of course, the question is a little ill-formed — it’s not as if “5.5, 5.5, 5.5” is obviously the right answer.
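
    For a sense of scale, here is a minimal Python sketch of the "mean ± standard deviation" heuristic. It assumes the draws are uniform over the integers 1 through 10, which the question as quoted does not actually specify; the variable names are hypothetical.

```python
import statistics

# Assumption: each draw is uniform over the integers 1..10.
values = list(range(1, 11))

mean = statistics.mean(values)   # centre of the distribution
sd = statistics.pstdev(values)   # population standard deviation

# A "typical" triad under the heuristic: mean - sd, mean, mean + sd.
triad = (mean - sd, mean, mean + sd)

print("mean:", mean)
print("sd:", round(sd, 2))
print("heuristic triad:", tuple(round(t, 2) for t in triad))
```

    Under that assumption the heuristic gives a spread-out triad centred on 5.5, which is closer in spirit to an answer like “3, 7, 5” than to “5.5, 5.5, 5.5”.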

  50. Ben Golub says:

    P.S. This is a really wonderful post.

  51. Deinst says:

    @Tim You are quite welcome. I started out looking at the data because I was curious about exactly what the results were, and I needed something to procrastinate with. The more I look at the data, the more I agree with Jordan that the article should never have been published. The claims in the abstract and conclusion are unsupported by either the data or the analysis, and are so wildly inflated as to be, to a first approximation, false.

    I find it disconcerting that both you and Jordan had problems with ICPSR. A data repository should make getting the data easy.

  52. Martha Smith says:

    An addendum to Dienst’s last comment: I have often looked up a paper that is mentioned in the popular press. More often than not, I end up shaking my head, or putting my head in my hands, as with this paper — this applies to papers I have seen in PNAS or Science, as well as ones like this in psych journals. Occasionally I have written the authors pointing out problems with the paper, but usually receive either no reply, or occasionally something like, “Thank you for taking an interest in my work,” occasionally accompanied by something like, “I don’t think the problems you point out make a big difference in the conclusion.” Hence the topic of Jordan’s Jan 7 post, “Is the Replicability Crisis Overblown?”

  53. Deinst says:

    @Martha I am deinst (I am only somewhat dyslexic and only somewhat German)

    I do not think that the problem with this paper is largely one of replicability. I’d be surprised if the experiment were repeated elsewhere and results were considerably different. There will be cultural differences, of course, but I think that the effects that they see would appear if someone performed the study again.

    Some of the effects seen are undoubtedly artifacts of how the study was performed. I suspect the results of study 2, on values, are almost entirely artifact. This, I think, is innate to how science is done, and I find it forgivable.

    What annoys me is that their claim “People, it seems, regard the present as a watershed moment at which they have finally become the person they will be for the rest of their lives.” is completely unsupported by the data or the analysis.

    The other problem is that they do not report effect sizes, or scales of the response variables. Without reported scales, a fact like beta=-7.69 contains at most one bit of information, and approaches Euler’s proof of god in meaning.

  54. Martha Smith says:

    @Deinst Apologies for my poor typing and proofreading skills. (I did get your name correct in a couple of earlier comments).

    Also apologies for not clarifying that I interpret “replicability” to refer not just to being able to get the same result with the exact same methods, but to the wider problem of poor research practices (including poor design, poor implementation, poor analysis, extrapolation beyond what has been shown, and poor reporting) that lead to results that would not stand up under more rigorous research. This includes the types of things you object to.

    Artifacts do sometimes occur because of limitations of what is currently possible (e.g., limitations on data available or possibilities for obtaining data, or on current computing power, etc.), but good research reporting will include discussion of these limitations and how they might affect the results.

  55. […] scrolled off the bottom of the page now, but there’s an amazing comment thread going on under my post on “The End of History Illusion,” the Science paper that got […]

  56. […] Ellenberg, who writes on the blog Quomodocumque. The other day he gave an article from the journal Science a thorough going-over and showed how it holds together neither logically nor mathematically. The article is written by three […]

  57. @Mathnerd314:
    Within sociology, one can imagine inquiring into phenomena such as peer pressure, social rank, cliques, and the different kinds of groups including families, the workplace, neighborhoods, sports teams and hobby clubs. If one agrees to study these kinds of phenomena, one probably has to also accept that happenings or interactions within groups will remain quite nebulous and that it won’t be possible to directly access the thoughts, beliefs, emotions and feelings/sentiments of group members. I submit that what is learnt in sociological inquiries can’t be found solely using the inquiry methods of the mathematician or the physicist. Therefore, I would caution against mathematicians “over-ruling” findings in sociology, say by improvising themselves as sociologists overnight and thinking they understand sociology better than sociologists do. I sense that Jordan’s and Martha’s comments overlap with some of what I’ve written. I wish to convey my view that amongst all the fields of inquiry, each field/discipline has its own methods of inquiry and its own epistemology, or theory of knowledge.

  58. […] I posted about this paper, I should also post this criticism of it. […]

  59. Martha Smith says:


    I agree that “amongst all the fields of inquiry, each field/discipline has its own methods of inquiry and its own epistemology, or theory of knowledge,” and that “I would caution against mathematicians “over-ruling” findings in sociology, say by improvising themselves as sociologists overnight and thinking they understand sociology better than sociologists do.”

    At the same time, I think it can often be valuable for people in one field to question the methods of inquiry in another field – with some caveats, including the following:
    1. The questioning needs to be based on adequate relevant understanding of the field whose methods are being questioned.
    2. People in one field may be better qualified to critique methods of their field used in another field than people in the field whose methods are being critiqued (e.g., a mathematician may be better qualified than a sociologist to critique some uses of mathematics in sociology; a statistician may be better qualified than a biologist to critique some uses of statistics in biology; a physicist may be better qualified than a biologist to critique some uses of physics in biology), but even in these cases, adequate knowledge of the field being critiqued is needed.

    These beliefs are based on working with people in other fields (especially biology, education, and engineering). See my recent post at for more detail on this.

  60. @Martha Smith:
    I think that’s very well put. I’d say the study of the climate with the aim of discovering the future climate is particularly complex, because it relies on so many disciplines, and also I suppose because validating a model with various future rates of greenhouse gas production seems highly non-trivial. And I think Cathy O’Neil mentioned the error bars (uncertainties) that should accompany predictions.
