Tag Archives: google

Messing around with word2vec

Word2vec is a way of representing words and phrases as vectors in medium-dimensional space developed by Tomas Mikolov and his team at Google; you can train it on any corpus you like (see Ben Schmidt’s blog for some great examples) but the version of the embedding you can download was trained on about 100 billion words of Google News, and encodes words as unit vectors in 300-dimensional space.

What really got people’s attention, when this came out, was word2vec’s ability to linearize analogies.  For example:  if x is the vector representing “king,” and y the vector representing “woman,” and z the vector representing “man,” then consider

x + y – z

which you might think of, in semantic space, as being the concept “king” to which “woman” has been added and “man” subtracted — in other words, “king made more female.”  What word lies closest in direction to x+y-z?  Just as you might hope, the answer is “queen.”

I found this really startling.  Does it mean that there’s some hidden linear structure in the space of words?

It turns out it’s not quite that simple.  I played around with word2vec a bunch, using Radim Řehůřek’s gensim package that nicely pulls everything into python; here’s what I learned about what the embedding is and isn’t telling you.

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts — that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’  (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.)  If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity(‘tremendous’,’enormous’)

0.74432902555062841

The notion of similarity used here is just cosine distance (which is to say, dot product of vectors.)  It’s positive when the words are close to each other, negative when the words are far.  For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity(‘tremendous’,’negligible’)

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words.  ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.

You might ask:  well, what words in Word2Vec are farthest from “tremendous?”  You just get trash:

>>> model.most_similar(negative=[‘tremendous’])

[(u’By_DENISE_DICK’, 0.2792186141014099), (u’NAVARRE_CORPORATION’, 0.26894450187683105), (u’By_SEAN_BARRON’, 0.26745346188545227), (u’LEGAL_NOTICES’, 0.25829464197158813), (u’Ky.Busch_##-###’, 0.2564955949783325), (u’desultorily’, 0.2563159763813019), (u’M.Kenseth_###-###’, 0.2562236189842224), (u’J.McMurray_###-###’, 0.25608277320861816), (u’D.Earnhardt_Jr._###-###’, 0.2547803819179535), (u’david.brett_@_thomsonreuters.com’, 0.2520599961280823)]

If 3 million words were distributed randomly in the unit ball in R^300, you’d expect the farthest one from “tremendous” to have dot product about -0.3 from it.  So when you see a list whose largest score is around that size, you should think “there’s no structure there, this is just noise.”

Antonyms

Challenge problem:  Is there a way to accurately generate antonyms using the word2vec embedding?  That seems to me the sort of thing the embedding is not capturing.  Kyle McDonald had a nice go at this, but I think the lesson of his experiment is that asking word2vec to find analogies of the form “word is to antonym as happy is to?” is just going to generate a list of neighbors of “happy.”  McDonald’s results also cast some light on the structure of word2vec analogies:  for instance, he finds that

waste is to economise as happy is to chuffed

First of all, “chuffed” is a synonym of happy, not an antonym.  But more importantly:  The reason “chuffed” is there is because it’s a way that British people say “happy,” just as “economise” is a way British people spell “economize.”  Change the spelling and you get

waste is to economize as happy is to glad

Non-semantic properties of words matter to word2vec.  They matter a lot.  Which brings us to diction.

Word2Vec distance keeps track of diction

Lots of non-semantic stuff is going on in natural language.  Like diction, which can be high or low, formal or informal, flowery or concrete.    Look at the nearest neighbors of “pugnacity”:

>>> model.most_similar(‘pugnacity’)

[(u’pugnaciousness’, 0.6015268564224243), (u’wonkishness’, 0.6014434099197388), (u’pugnacious’, 0.5877301692962646), (u’eloquence’, 0.5875781774520874), (u’sang_froid’, 0.5873805284500122), (u’truculence’, 0.5838015079498291), (u’pithiness’, 0.5773230195045471), (u’irascibility’, 0.5772287845611572), (u’hotheadedness’, 0.5741063356399536), (u’sangfroid’, 0.5715578198432922)]

Some of these are close semantically to pugnacity, but others, like “wonkishness,” “eloquence”, and “sangfroid,” are really just the kind of elevated-diction words the kind of person who says “pugnacity” would also say.

In the other direction:

>>> model.most_similar(‘psyched’)

[(u’geeked’, 0.7244787216186523), (u’excited’, 0.6678282022476196), (u’jazzed’, 0.666187584400177), (u’bummed’, 0.662735104560852), (u’amped’, 0.6473385691642761), (u’pysched’, 0.6245862245559692), (u’exicted’, 0.6116108894348145), (u’awesome’, 0.5838013887405396), (u’enthused’, 0.581687331199646), (u’kinda_bummed’, 0.5701783299446106)]

“geeked” is a pretty good synonym, but “bummed” is an antonym.  You may also note that contexts where “psyched” is common are also fertile ground for “pysched.”  This leads me to one of my favorite classes of examples:

Misspelling analogies

Which words are closest to “teh”?

>>> model.most_similar(‘teh’)

[(u’ther’, 0.6910992860794067), (u’hte’, 0.6501408815383911), (u’fo’, 0.6458913683891296), (u’tha’, 0.6098173260688782), (u’te’, 0.6042138934135437), (u’ot’, 0.595798909664154), (u’thats’, 0.595078706741333), (u’od’, 0.5908242464065552), (u’tho’, 0.58894944190979), (u’oa’, 0.5846965312957764)]

Makes sense:  the contexts where “teh” is common are those contexts where a lot of words are misspelled.

Using the “analogy” gadget, we can ask; which word is to “because” as “teh” is to “the”?

>>> model.most_similar(positive=[‘because’,’teh’],negative=[‘the’])

[(u’becuase’, 0.6815075278282166), (u’becasue’, 0.6744950413703918), (u’cuz’, 0.6165347099304199), (u’becuz’, 0.6027254462242126), (u’coz’, 0.580410361289978), (u’b_c’, 0.5737690925598145), (u’tho’, 0.5647958517074585), (u’beacause’, 0.5630674362182617), (u’thats’, 0.5605655908584595), (u’lol’, 0.5597798228263855)]

Or “like”?

>>> model.most_similar(positive=[‘like’,’teh’],negative=[‘the’])

[(u’liek’, 0.678846001625061), (u’ok’, 0.6136218309402466), (u’hahah’, 0.5887773633003235), (u’lke’, 0.5840467214584351), (u’probly’, 0.5819578170776367), (u’lol’, 0.5802655816078186), (u’becuz’, 0.5771245956420898), (u’wierd’, 0.5759704113006592), (u’dunno’, 0.5709049701690674), (u’tho’, 0.565370500087738)]

Note that this doesn’t always work:

>>> model.most_similar(positive=[‘should’,’teh’],negative=[‘the’])

[(u’wil’, 0.63351970911026), (u’cant’, 0.6080706715583801), (u’wont’, 0.5967696309089661), (u’dont’, 0.5911301970481873), (u’shold’, 0.5908039212226868), (u’shoud’, 0.5776053667068481), (u’shoudl’, 0.5491836071014404), (u”would’nt”, 0.5474458932876587), (u’shld’, 0.5443994402885437), (u’wouldnt’, 0.5413904190063477)]

What are word2vec analogies?

Now let’s come back to the more philosophical question.  Should the effectiveness of word2vec at solving analogy problems make us think that the space of words really has linear structure?

I don’t think so.  Again, I learned something important from the work of Levy and Goldberg.  When word2vec wants to find the word w which is to x as y is to z, it is trying to find w maximizing the dot product

w . (x + y – z)

But this is the same thing as maximizing

w.x + w.y – w.z.

In other words, what word2vec is really doing is saying

“Show me words which are similar to x and y but are dissimilar to z.”

This notion makes sense applied any notion of similarity, whether or not it has anything to do with embedding in a vector space.  For example, Levy and Goldberg experiment with minimizing

log(w.x) + log(w.y) – log(w.z)

instead, and get somewhat superior results on the analogy task.  Optimizing this objective has nothing to do with the linear combination x+y-z.

None of which is to deny that the analogy engine in word2vec works well in many interesting cases!  It has no trouble, for instance, figuring out that Baltimore is to Maryland as Milwaukee is to Wisconsin.  More often than not, the Milwaukee of state X correctly returns the largest city in state X.  And sometimes, when it doesn’t, it gives the right answer anyway:  for instance, the Milwaukee of Ohio is Cleveland, a much better answer than Ohio’s largest city (Columbus — you knew that, right?)  The Milwaukee of Virginia, according to word2vec, is Charlottesville, which seems clearly wrong.  But maybe that’s OK — maybe there really isn’t a Milwaukee of Virginia.  One feels Richmond is a better guess than Charlottesville, but it scores notably lower.  (Note:  Word2Vec’s database doesn’t have Virginia_Beach, the largest city in Virginia.  That one I didn’t know.)

Another interesting case:  what is to state X as Gainesville is to Florida?  The answer should be “the location of the, or at least a, flagship state university, which isn’t the capital or even a major city of the state,” when such a city exists.  But this doesn’t seem to be something word2vec is good at finding.  The Gainesville of Virginia is Charlottesville, as it should be.  But the Gainesville of Georgia is Newnan.  Newnan?  Well, it turns out there’s a Newnan, Georgia, and there’s also a Newnan’s Lake in Gainesville, FL; I think that’s what’s driving the response.  That, and the fact that “Athens”, the right answer, is contextually separated from Georgia by the existence of Athens, Greece.

The Gainesville of Tennessee is Cookeville, though Knoxville, the right answer, comes a close second.

Why?  You can check that Knoxville, according to word2vec, is much closer to Gainesville than Cookeville is.

>>> model.similarity(‘Cookeville’,’Gainesville’)

0.5457580604439547

>>> model.similarity(‘Knoxville’,’Gainesville’)

0.64010456774402158

But Knoxville is placed much closer to Florida!

>>> model.similarity(‘Cookeville’,’Florida’)

0.2044376252927515

>>> model.similarity(‘Knoxville’,’Florida’)

0.36523836770416895

Remember:  what word2vec is really optimizing for here is “words which are close to Gainesville and close to Tennessee, and which are not close to Florida.”  And here you see that phenomenon very clearly.  I don’t think the semantic relationship between ‘Gainesville’ and ‘Florida’ is something word2vec is really capturing.  Similarly:  the Gainesville of Illinois is Edwardsville (though Champaign, Champaign_Urbana, and Urbana are all top 5) and the Gainesville of Indiana is Connersville.  (The top 5 for Indiana are all cities ending in “ville” — is the phonetic similarity playing some role?)

Just for fun, here’s a scatterplot of the 1000 nearest neighbors of ‘Gainesville’, with their similarity to ‘Gainesville’ (x-axis) plotted against their similarity to ‘Tennessee’ (y-axis):

Screen Shot 2016-01-14 at 14 Jan 4.37.PM

The Pareto frontier consists of “Tennessee” (that’s the one whose similarity to “Tennessee” is 1, obviously..) Knoxville, Jacksonville, and Tallahassee.

Bag of contexts

One popular simple linear model of word space is given by representing a word as a “bag of contexts” — perhaps there are several thousand contexts, and each word is given by a sparse vector in the space spanned by contexts:  coefficient 0 if the word is not in that context, 1 if it is.  In that setting, certain kinds of analogies would be linearized and certain kinds would not.  If “major city” is a context, then “Houston” and “Dallas” might have vectors that looked like “Texas” with the coodinate of “major city” flipped from 0 to 1.  Ditto, “Milwaukee” would be “Wisconsin” with the same basis vector added.  So

“Texas” + “Milwaukee” – “Wisconsin”

would be pretty close, in that space, to “Houston” and “Dallas.”

On the other hand, it’s not so easy to see what relations antonyms would have in that space. That’s the kind of relationship the bag of contexts may not make linear.

The word2vec space is only 300-dimensional, and the vectors aren’t sparse at all.  But maybe we should think of it as a random low-dimensional projection of a bag-of-contexts embedding!  By the Johnson-Lindenstrauss lemma, a 300-dimensional projection is plenty big enough to preserve the distances between 3 million points with a small distortion factor; and of course it preserves all linear relationships on the nose.

Perhaps this point of view gives some insight into which kind of word relationships manifest as linear relationships in word2vec.  “flock:birds” is an interesting example.  If you imagine “group of things” as a context, you can maybe imagine word2vec picking this up.  But actually, it doesn’t do well:

>> model.most_similar(positive=[‘fish’,’flock’],negative=[‘birds’])
[(u’crays’, 0.4601619839668274), (u’threadfin_salmon’, 0.4553075134754181), (u’spear_fishers’, 0.44864755868911743), (u’slab_crappies’, 0.4483765661716461), (u’flocked’, 0.44473177194595337), (u’Siltcoos_Lake’, 0.4429660737514496), (u’flounder’, 0.4414420425891876), (u’catfish’, 0.4413948059082031), (u’yellowtail_snappers’, 0.4410281181335449), (u’sockeyes’, 0.4395104944705963)]

>> model.most_similar(positive=[‘dogs’,’flock’],negative=[‘birds’])
[(u’dog’, 0.5390862226486206), (u’pooches’, 0.5000904202461243), (u’Eminem_Darth_Vader’, 0.48777419328689575), (u’Labrador_Retrievers’, 0.4792211949825287), (u’canines’, 0.4766522943973541), (u’barked_incessantly’, 0.4709487557411194), (u’Rottweilers_pit_bulls’, 0.4708423614501953), (u’labradoodles’, 0.47032350301742554), (u’rottweilers’, 0.46935927867889404), (u’forbidding_trespassers’, 0.4649636149406433)]

The answers “school” and “pack” don’t appear here.  Part of this, of course, is that “flock,” “school”, and “pack” all have interfering alternate meanings.  But part of it is that the analogy really rests on information about contexts in which the words “flock” and “birds” both appear.  In particular, in a short text window featuring both words, you are going to see a huge spike of “of” appearing right after flock and right before birds.  A statement of the form “flock is to birds as X is to Y” can’t be true unless “X of Y” actually shows up in the corpus a lot.

Challenge problem:  Can you make word2vec do a good job with relations like “flock:birds”?  As I said above, I wouldn’t have been shocked if this had actually worked out of the box, so maybe there’s some minor tweak that makes it work.

Boys’ names, girls’ names

Back to gender-flipping.  What’s the “male version” of the name “Jennifer”?

There are various ways one can do this.  If you use the analogy engine from word2vec, finding the closest word to “Jennifer” + “he” – “she”, you get as your top 5:

David, Jason, Brian, Kevin, Chris

>>> model.most_similar(positive=[‘Jennifer’,’he’],negative=[‘she’])
[(u’David’, 0.6693146228790283), (u’Jason’, 0.6635637283325195), (u’Brian’, 0.6586753129959106), (u’Kevin’, 0.6520106792449951), (u’Chris’, 0.6505492925643921), (u’Mark’, 0.6491551995277405), (u’Matt’, 0.6386727094650269), (u’Daniel’, 0.6294828057289124), (u’Greg’, 0.6267883777618408), (u’Jeff’, 0.6265031099319458)]

But there’s another way:  you can look at the words closest to “Jennifer” (which are essentially all first names) and pick out the ones which are closer to “he” than to “she”.  This gives

Matthew, Jeffrey, Jason, Jesse, Joshua.

>>> [x[0] for x in model.most_similar(‘Jennifer’,topn=2000) if model.similarity(x[0],’he’) > model.similarity(x[0],’she’)]
[u’Matthew’, u’Jeffrey’, u’Jason’, u’Jesse’, u’Joshua’, u’Evan’, u’Brian’, u’Cory’, u’Justin’, u’Shawn’, u’Darrin’, u’David’, u’Chris’, u’Kevin’, u’3/dh’, u’Christopher’, u’Corey’, u’Derek’, u’Alex’, u’Matt’, u’Jeremy’, u’Jeff’, u’Greg’, u’Timothy’, u’Eric’, u’Daniel’, u’Wyvonne’, u’Joel’, u’Chirstopher’, u’Mark’, u’Jonathon’]

Which is a better list of “male analogues of Jennifer?”  Matthew is certainly closer to Jennifer in word2vec distance:

>>> model.similarity(‘Jennifer’,’Matthew’)

0.61308109388608356

>>> model.similarity(‘Jennifer’,’David’)

0.56257556538528708

But, for whatever, reason, “David” is coded as much more strongly male than “Matthew” is; that is, it is closer to “he” – “she”.  (The same is true for “man” – “woman”.)  So “Matthew” doesn’t score high in the first list, which rates names by a combination of how male-context they are and how Jennifery they are.  A quick visit to NameVoyager shows that Matthew and Jennifer both peaked sharply in the 1970s; David, on the other hand, has a much longer range of popularity and was biggest in the 1950s.

Let’s do it again, for Susan.  The two methods give

David, Robert, Mark, Richard, John

Robert, Jeffrey, Richard, David, Kenneth

And for Edith:

Ernest, Edwin, Alfred, Arthur, Bert

Ernest, Harold, Alfred, Bert, Arthur

Pretty good agreement!  And you can see that, in each case, the selected names are “cultural matches” to the starting name.

Sidenote:  In a way it would be more natural to project wordspace down to the orthocomplement of “he” – “she” and find the nearest neighbor to “Susan” after that projection; that’s like, which word is closest to “Susan” if you ignore the contribution of the “he” – “she” direction.  This is the operation Ben Schmidt calls “vector rejection” in his excellent post about his word2vec model trained on student evaluations.  

If you do that, you get “Deborah.”  In other words, those two names are similar in so many contextual ways that they remain nearest neighbors even after we “remove the contribution of gender.”  A better way to say it is that the orthogonal projection doesn’t really remove the contribution of gender in toto.  It would be interesting to understand what kind of linear projections actually make it hard to distinguish male surnames from female ones.

Google News is a big enough database that this works on non-English names, too.  The male “Sylvie”, depending on which protocol you pick, is

Alain, Philippe, Serge, Andre, Jean-Francois

or

Jean-Francois, Francois, Stephane, Alain, Andre

The male “Kyoko” is

Kenji, Tomohiko, Nobuhiro, Kazuo, Hiroshi

or

Satoshi, Takayuki, Yosuke, Michio, Noboru

French and Japanese speakers are encouraged to weigh in about which list is better!

Update:  Even a little more messing around with “changing the gender of words” in a followup post.

Tagged , , , , , ,

Social media benchmark

Sometimes I share my post to various social media sites.  This morning, as it happened, I shared a post almost simultaneously to Facebook, Twitter, and Google+.  On my hit counter, I now see

  • 80 hits via Facebook
  • 30 hits via Google Plus
  • 23 hits via Twitter

Wouldn’t you have expected G+ to come in a lot lower?  Despite the general sense that it’s a total failure, there really are a fair number of people using it. 

Just to reassure you the Earth is still on its axis, Google Search provided 58 hits and Bing 2.

 

Tagged , , ,

Graph search skepticism

Gary Marcus asks the right questions about Facebook’s new Graph Search:

Most search engines, including Google’s, mainly sort pages to see which come closest to some set of keywords (or their synonyms), but they do relatively little to integrate information across pages. If you want a list of all the books written by members of Congress in 2007, you can do a search, but you’ll end up lost. Unless someone has already compiled that information into a single page, you are likely to be directed to a series of individual pages, many with little relevance. It would be left to you to consolidate information across many, many pages; worse, you would have to start from scratch to get the same data for 2008.

But, in theory at least, Facebook Graph Search consolidates information over time and space (albeit in very limited ways). In effect, each user can now use Facebook as if it were a giant, custom-tailored database, not just a librarian that gives a list of documents that are most relevant to your query. Although the ideas behind Facebook Graph Search aren’t entirely original—Google can do similar things in limited domains, such as shopping, and Wolfram Alpha can do math (and draw graphs) based on data in its archives—it really is likely to change the way many people think about search…

Forget search engines. The real revolution will come when we have research engines, intelligent web helpers that can find out new things, not just what’s already been written. Facebook Graph Search isn’t anywhere near that good, but it’s a nice hint at greater things to come.

A nice hint at greater things to come?  Or, like Wolfram Alpha, another case where some of the best programmers in the world, given massive amounts of resources and time, fail to bring us appreciably closer to the dream of the research engine?  In other words, a hint that maybe there are not greater things to come, at least in this direction?

I’m not part of Facebook Graph Search’s “slow rollout,” but from the coverage I’ve read it sounds like it’s good at handling canned combinations of boolean searches.  That’s no joke, but does it really represent progress towards the goal that Marcus has in mind?

Wolfram Alpha, of course, has no idea what books members of Congress wrote in 2007, but that’s not quite fair, because Wolfram Alpha isn’t supposed to know about books.  What does Wolfram Alpha know about?  Well, a query for “Missouri Senate election 2010” gives you the results of that election, so we know it has state-level results for those elections.  But it can’t put these together to answer “How many Republican senators were elected in 2010?”   “Senators elected in 2010”, which you might think would give you a list, doesn’t – it does, though, tell you that 24 seats went to Republicans and 10 to Democrats, along with the meaningless data of the total votes cast in the US for GOP and Democratic Senate candidates.  “List of senators elected in 2010” gives the same result.  WA obviously has access, state by state, to the names of the Republican senators who won elections in 2010; but it apparently can’t put that information together into a single list.  Given that, I think gathering their book credits is pretty far off.

Meanwhile, a Google search for “Republican senators elected in 2010” gives you the relevant Wikipedia page, with the complete list, state-by-state results, and much more.  And searching for “books written by members of Congress” pulls up as the first result a Roll Call article about books written by members of Congress.  Hard to complain about that.

Were any of those books written in 2007?  Who knows?  More to the point — who cares?  That’s the genius of the Google approach.  You know how they tell you, if you’re confused about something in class and you want to know the answer, you should raise your hand and ask, because probably other people have the same question?  That’s the Google principle, except they take it one step further; if you need an answer, not only do other people have the same question, but one such person has already found the answer and put it on the web.  Google can’t tell you which states that entered the Union after 1875 have public universities with animals as their mascots, or which Congressional district ranks 10th by percentage of area covered by water, which is the kind of thing Wolfram Alpha is ace at; but that’s because no one has ever asked those questions, and no one ever will.

 

 

 

Tagged , , ,

When he finds out about regexps he’s going to totally freak

Thomas Friedman in today’s New York Times:

 To the contrary, there will surely be a new secretary of state visiting you next year with the umpteenth road map for “confidence-building measures” between Israelis and Palestinians. He or she may even tell you that “this is the year of decision.” Be careful. We’ve been there before. If you Google “Year of decision in the Middle East,” you’ll get more than 100,000,000 links.

Can this really be true?  Nope.  In fact if you Google that phrase you get fewer than 12,000 links.

The problem here is that Thomas Friedman apparently doesn’t know that when you search Google for a phrase you need to put quotes around it.  Without the quotes, you do indeed get more than 100,000,000 results.  That’s because a lot of web pages mention years, decisions, and things located either in the middle or to the east.

It seems plausible that long-time New York Times columnists might not know how to use Google, but it’s appalling if the people who edit and fact-check the columns don’t know how to use Google.

So that this post has some content and is not pure snark, here’s a relevant article by my friend Eszter Hargittai, whose research has taught us a lot about how people use search engines in the real world.

Tagged , , ,

In which Google+ is a good place to talk about math

Google+ may not have killed Facebook, but it is developing into a nice place for tearoom style chats about math; less formal than MathOverflow, more characters than FB.  This thread Allen Knutson started about circle packing is a case in point.  If I’m reading the thread and I say to myself “Matt Kahle should be weighing in on this,” I can just type in his name with a + prepended to it — and he’s summoned!  That’s a functionality that really doesn’t exist elsewhere.

Tagged , ,

The Google puzzle and the perils of averaging ratios

The following brain-teaser has been going around, identified as a question from a Google interview (though there’s some controversy about whether Google actually uses questions like this.)

There’s a certain country where everybody wants to have a son. Therefore each couple keeps having children until they have a boy; then they stop. What fraction of the population is female?

Steve Landsburg posted a version of this question on his blog.  “The answer they expect,” he writes, “is simple, definitive, and wrong… Are you smarter than the folks at Google?  What’s the answer?”

Things quickly went blooey.  Google’s purported answer — fiercely argued for by lots of Landsburg’s readers — is 1/2.  Landsburg said the right answer was less.  A huge comment thread and many follow-up posts ensued.  Lubos Motl took time out from his busy schedule of yelling at mathematicians about string theory to yell at Landsburg about probability theory.  Landsburg offered to bet Motl, or anybody else, $15,000 that a computer simulation would demonstrate the correctness of his answer.

What’s going on here?  How could a simple probability question have stirred up such a ruckus?

Here’s Landsburg’s explanation of the question:

What fraction of the population should we expect to be female? That is, in a large number of similar countries, what would be the average proportion of females?

If G is the number of girls, and B the number of boys, Landsburg is asking for the expected value E(G/(G+B)).  And let’s get one thing straight:  Landsburg is absolutely right about this expected value.  For any finite number of families, it is strictly less than 1/2.  (See the related Math Overflow thread for a good explanation.)  Landsburg has very patiently knocked down the many wrong arguments to the contrary in his comments section.  Anybody who bets against him, on his terms, is going to lose.

Nonetheless, I’m about to explain why Landsburg is wrong.

You see, Google’s version of the question doesn’t specify anything about expectation.  They might just as well have meant:  “What is the proportion of the expected number of females in the expected population?”  Which is to say, “What is E(G)/E(G) + E(B)”?  And the answer to that question is 1/2.  Just to emphasize the subtlety involved here:

On average, the number of boys and the number of girls are the same.  Furthermore, the proportion of girls is, on average, less than 1/2.

Weird, right?  E(G)/E(G) + E(B) isn’t what Landsburg was asking for — but, if Google’s answer was 1/2, it’s presumably the question they had in mind.  To accuse them of getting their own question “wrong” is a bit rich.

But let me go all in — I actually think Landsburg’s interpretation of the question is not only different from Google’s, but in some ways inferior!  Because averaging ratios with widely ranging denominators is kind of a weird thing to do.  You can certainly compute the average population density of all the U.S. states — but should you? What meaning or use would the result have?

I had a really pungent example ready to deploy, which illustrates the perils of averaging ratios and explains why Landsburg’s version of the question was a little weird.  Then I went to the Joint Meetings before getting around to writing this post.  And when I got back, I discovered that Landsburg had posted the same example on his own blogin support of his point of view!  Awesome.  Here it is:

There’s a certain country where everybody wants to have a son. Therefore each couple keeps having children until they have a boy; then they stop. In expectation, what is the ratio of boys to girls?

The answer to this question is, of course, infinity; in a finite population there might be no girls, so B/G is infinite with some positive probability, so E(B/G) is infinite as well.

But the correctness of that answer surely tells us this is a terrible question!  Averaging is a terribly cruel thing to do to a bunch of ratios.  One zero denominator and you’ve wiped out your entire dataset.

What if Landsburg had phrased his new question along the lines of Google’s original puzzle?

There’s a certain country where everybody wants to have a son. Therefore each couple keeps having children until they have a boy; then they stop. What is the ratio of boys to girls in this country?

Honest question:  does Landsburg truly think that infinity is the only “right answer” to this question?  Does he think infinity is a good answer?  Would he hire a person who gave that answer?  Would you?

Tagged , , , ,

Ngrams: one more way to win an argument using Google

I thought I’d never see a definitive answer to this one, but thanks to the brand-new Google NGrams Viewer, the facts are clear:

It is “another think coming,” and it has always been “another think coming.”

A lot of words and phrases (though not these) show a dip starting in 2000 or so.  I wonder if the nature of the corpus changes at that point to include more words?  You see the same effects with name frequencies — the frequency of any given name has been decreasing over the last twenty years, just because names are getting more and more widely distributed; the most popular names today take up a smaller share of namespace than much lower-ranked names did in the 1950s.  A quick and dirty thing to check would be the entropy of the word distribution; is it going up with time?

Lots of good ngram examples on Tom Scocca’s blog, here and here.

Oh, and here’s the Four Shortstops:

Ripken, appropriately, is showing great staying power.

Tagged , , , ,

A true-false quiz for the world

Google’s autocomplete suggestions for searches starting with “Is”:

  • Is Lady Gaga a man
  • Is Lady Gaga a hermaphrodite
  • Is the world going to end in 2012
  • Is bronchitis contagious
  • Is Santa real
  • Is pneumonia contagious
  • Is Khloe Kardashian pregnant
  • Is Walmart open on Christmas 2009
  • Is Wendy Williams a man
  • Is Limewire illegal

My answers:  No. No.  No.  Yes.  No.  Yes.  Don’t know.  I’d think so.  Judging from the name, no.  Judging from the name, yes.

Tagged , , ,

The chilling effect of Google on gag-based feature writing

I woke up the other morning thinking to myself, you know what would be funny? To go from Toni Morrison’s depiction of Bill Clinton as the first black president to the observation that Barack Obama, having missed his chance to be the first black president, could still be the first Jewish president: child of immigrants, excels in school, good at basketball, bad at bowling, subject to whispers that his religious commitments might bind him to America’s enemies, etc. etc.

But nowadays you’ve got to Google a gag before you deploy it. And you quickly find that Harold Pollack got there first at Huffington Post, back in January — which didn’t stop Howard Fineman from using the gag in March in Newsweek, or Josh Gerstein from bringing it back in the New York Sun this week.

You have to figure that Google has a certain chilling effect on gag-based feature writing. Of course ten people are going to come up with the same joke. And if that produces ten different columns, then maybe the funniest one has a chance to get popular. Is it really better if the first person to post the gag online salts the field for everybody else?

In this case, it’s for the best — Pollack’s piece is better than its successors, and better than what I would have written. But can we shed a single tear for the gag-based features that, thanks to Google, never tasted life?

Reader challenge: come up with a “Barack Obama is Jewish” gag that doesn’t appear in Google. “Obamulke” and “Baruch Obama” have both been done, but I think I can claim priority on “Barak Mitzvah.” For what that’s worth.

Tagged , , , ,

Read my book free

When it started up, Google Books had spotty coverage for literary fiction. But I’m happy to report that they now offer The Grasshopper King well, not the full text, but all of the first chapter, and enough of the rest to get a sense of the book.

Tagged ,
Follow

Get every new post delivered to your Inbox.

Join 714 other followers

%d bloggers like this: