Tag Archives: names

Double standards in baby names

People love to make fun of George Foreman because he named all his sons George Foreman.  But former Secretary of State Lawrence Eagleburger named all his sons Lawrence Eagleburger and nobody raises a peep!  There’s no justice.


Gendercycle: a dynamical system on words

By the way, here’s another fun word2vec trick.  Following Ben Schmidt, you can try to find “gender-neutralized synonyms” — words which are close to each other except for the fact that they have different gender connotations.   A quick and dirty way to “mascify” a word is to find its nearest neighbor which is closer to “he” than “she”:

def mascify(y): return [x[0] for x in model.most_similar(y,topn=200) if model.similarity(x[0],'she') < model.similarity(x[0],'he')][0]

“femify” is defined similarly.  We could put a threshold away from 0 in there, if we wanted to restrict to more strongly gender-coded words.
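
For concreteness, here is a minimal sketch of both functions with the threshold made explicit. It assumes `model` is any object exposing gensim-style `most_similar(word, topn=...)` and `similarity(a, b)` methods; the `gender_neighbor` helper and the `margin` parameter are names introduced here for illustration, not part of gensim. Unlike the one-liner above, it returns None instead of raising an error when nothing qualifies.

```python
def gender_neighbor(model, word, toward, away, topn=200, margin=0.0):
    """Return the nearest neighbor of `word` that is closer to `toward` than to `away`.

    `margin` is how much closer it has to be; 0.0 matches the one-liner above.
    Returns None if no neighbor among the top `topn` qualifies.
    """
    for candidate, _score in model.most_similar(word, topn=topn):
        gap = model.similarity(candidate, toward) - model.similarity(candidate, away)
        if gap > margin:
            return candidate
    return None


def mascify(model, word, **kwargs):
    # Nearest neighbor that leans toward "he".
    return gender_neighbor(model, word, toward="he", away="she", **kwargs)


def femify(model, word, **kwargs):
    # Nearest neighbor that leans toward "she".
    return gender_neighbor(model, word, toward="she", away="he", **kwargs)
```

Raising `margin` above 0 restricts the search to more strongly gender-coded words, at the cost of sometimes finding no match at all within `topn` neighbors.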

Anyway, if you start with a word and run mascify and femify alternately, you can ask whether you eventually wind up in a 2-cycle:  a pair of words which are each other’s gender counterparts in this loose sense.


gentle -> easygoing -> chatty -> talkative -> chatty -> …..

So “chatty” and “talkative” are a pair, with “chatty” female-coded and “talkative” male-coded.

beautiful -> magnificent -> wonderful -> marvelous -> wonderful -> …

So far, I keep hitting 2-cycles, and pretty quickly, though I don’t see why a longer cycle wouldn’t be possible or likely.  Update:  Kevin in comments explains very nicely why it has to terminate in a 2-cycle!
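
The alternating iteration can be sketched as below. This is an illustration rather than the code I actually ran: `gendercycle` and `max_steps` are names made up here, and `mascify` and `femify` are assumed to be plain one-argument functions from a word to a word.

```python
def gendercycle(word, mascify, femify, max_steps=50):
    """Alternate mascify and femify from `word` until a state repeats.

    A state is (current word, which move comes next). Returns (path, cycle):
    every word visited, and the repeating part of the path.
    """
    steps = [mascify, femify]
    path = [word]
    seen = {}  # state -> position in path where it first occurred
    for i in range(max_steps):
        state = (path[-1], i % 2)
        if state in seen:
            return path, path[seen[state]:-1]
        seen[state] = len(path) - 1
        path.append(steps[i % 2](path[-1]))
    return path, []  # no repeat found within max_steps
```

On the first example above, the path would be gentle, easygoing, chatty, talkative, chatty and the cycle would be the pair chatty, talkative.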

Some other pairs, female-coded word first:

overjoyed / elated

strident / vehement

fearful / worried

furious / livid

distraught / despondent

hilarious / funny

exquisite / sumptuous

thought_provoking / insightful

kick_ass / badass

Sometimes it’s basically giving the same word, in two different forms or with one word misspelled:

intuitive / intuitively

buoyant / bouyant

sad / Sad

You can do this for names, too, though you have to make the “topn” a little longer to find matches.  I found:

Jamie / Chris

Deborah / Jeffrey

Fran / Pat

Mary / Joseph

Pretty good pairs!  Note that you hit a lot of gender-mixed names (Jamie, Chris, Pat), just as you might expect:  the male-biased name word2vec-closest to a female name is likely to be a name at least some women have!  You can deal with this by putting in a threshold:

>>> def mascify(y): return [x[0] for x in model.most_similar(y,topn=1000) if model.similarity(x[0],'she') < model.similarity(x[0],'he') - 0.1][0]

This eliminates “Jamie” and “Pat” (though “Chris” still reads as male.)

Now we get some new pairs:

Jody / Steve (this one seems to have a big basin of attraction; it shows up from a lot of initial conditions)

Kasey / Zach

Peter / Catherine (is this a Russia thing?)

Nicola / Dominic

Alison / Ian







Messing around with word2vec

Word2vec is a way of representing words and phrases as vectors in medium-dimensional space developed by Tomas Mikolov and his team at Google; you can train it on any corpus you like (see Ben Schmidt’s blog for some great examples) but the version of the embedding you can download was trained on about 100 billion words of Google News, and encodes words as unit vectors in 300-dimensional space.

What really got people’s attention, when this came out, was word2vec’s ability to linearize analogies.  For example:  if x is the vector representing “king,” and y the vector representing “woman,” and z the vector representing “man,” then consider

x + y – z

which you might think of, in semantic space, as being the concept “king” to which “woman” has been added and “man” subtracted — in other words, “king made more female.”  What word lies closest in direction to x+y-z?  Just as you might hope, the answer is “queen.”

I found this really startling.  Does it mean that there’s some hidden linear structure in the space of words?

It turns out it’s not quite that simple.  I played around with word2vec a bunch, using Radim Řehůřek’s gensim package that nicely pulls everything into python; here’s what I learned about what the embedding is and isn’t telling you.

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts:  that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’.  (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.)  If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity('tremendous','enormous')


The notion of similarity used here is just cosine similarity (which, since these are unit vectors, is the same as the dot product.)  It’s positive when the words are close to each other, negative when the words are far.  For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity('tremendous','negligible')


Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words.  ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.

You might ask:  well, what words in Word2Vec are farthest from “tremendous?”  You just get trash:

>>> model.most_similar(negative=['tremendous'])

[(u’By_DENISE_DICK’, 0.2792186141014099), (u’NAVARRE_CORPORATION’, 0.26894450187683105), (u’By_SEAN_BARRON’, 0.26745346188545227), (u’LEGAL_NOTICES’, 0.25829464197158813), (u’Ky.Busch_##-###’, 0.2564955949783325), (u’desultorily’, 0.2563159763813019), (u’M.Kenseth_###-###’, 0.2562236189842224), (u’J.McMurray_###-###’, 0.25608277320861816), (u’D.Earnhardt_Jr._###-###’, 0.2547803819179535), (u’david.brett_@_thomsonreuters.com’, 0.2520599961280823)]

If 3 million words were distributed randomly in the unit ball in R^300, you’d expect the farthest one from “tremendous” to have dot product about -0.3 from it.  So when you see a list whose largest score is around that size, you should think “there’s no structure there, this is just noise.”
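
Here is a back-of-the-envelope version of that estimate. It rests on a heuristic I'm assuming rather than proving: the dot product of a fixed unit vector with a random unit vector in R^d is roughly Gaussian with variance 1/d, and the most extreme of n such values sits near sqrt(2 ln n / d). The function name is made up for this sketch.

```python
import math


def expected_extreme_dot(n_points, dim):
    """Rough size of the most extreme dot product between a fixed unit vector
    and n_points random unit vectors in R^dim.

    Heuristic: each dot product is approximately Gaussian with variance 1/dim,
    and the extreme of n Gaussians is near sigma * sqrt(2 * ln(n)).
    """
    return math.sqrt(2.0 * math.log(n_points) / dim)


print(expected_extreme_dot(3_000_000, 300))
```

For 3 million points in 300 dimensions this comes out a little over 0.3, in line with the figure quoted above.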


Challenge problem:  Is there a way to accurately generate antonyms using the word2vec embedding?  That seems to me the sort of thing the embedding is not capturing.  Kyle McDonald had a nice go at this, but I think the lesson of his experiment is that asking word2vec to find analogies of the form “word is to antonym as happy is to?” is just going to generate a list of neighbors of “happy.”  McDonald’s results also cast some light on the structure of word2vec analogies:  for instance, he finds that

waste is to economise as happy is to chuffed

First of all, “chuffed” is a synonym of happy, not an antonym.  But more importantly:  The reason “chuffed” is there is because it’s a way that British people say “happy,” just as “economise” is a way British people spell “economize.”  Change the spelling and you get

waste is to economize as happy is to glad

Non-semantic properties of words matter to word2vec.  They matter a lot.  Which brings us to diction.

Word2Vec distance keeps track of diction

Lots of non-semantic stuff is going on in natural language.  Like diction, which can be high or low, formal or informal, flowery or concrete.    Look at the nearest neighbors of “pugnacity”:

>>> model.most_similar('pugnacity')

[(u’pugnaciousness’, 0.6015268564224243), (u’wonkishness’, 0.6014434099197388), (u’pugnacious’, 0.5877301692962646), (u’eloquence’, 0.5875781774520874), (u’sang_froid’, 0.5873805284500122), (u’truculence’, 0.5838015079498291), (u’pithiness’, 0.5773230195045471), (u’irascibility’, 0.5772287845611572), (u’hotheadedness’, 0.5741063356399536), (u’sangfroid’, 0.5715578198432922)]

Some of these are close semantically to pugnacity, but others, like “wonkishness,” “eloquence”, and “sangfroid,” are really just the kind of elevated-diction words the kind of person who says “pugnacity” would also say.

In the other direction:

>>> model.most_similar('psyched')

[(u’geeked’, 0.7244787216186523), (u’excited’, 0.6678282022476196), (u’jazzed’, 0.666187584400177), (u’bummed’, 0.662735104560852), (u’amped’, 0.6473385691642761), (u’pysched’, 0.6245862245559692), (u’exicted’, 0.6116108894348145), (u’awesome’, 0.5838013887405396), (u’enthused’, 0.581687331199646), (u’kinda_bummed’, 0.5701783299446106)]

“geeked” is a pretty good synonym, but “bummed” is an antonym.  You may also note that contexts where “psyched” is common are also fertile ground for “pysched.”  This leads me to one of my favorite classes of examples:

Misspelling analogies

Which words are closest to “teh”?

>>> model.most_similar('teh')

[(u’ther’, 0.6910992860794067), (u’hte’, 0.6501408815383911), (u’fo’, 0.6458913683891296), (u’tha’, 0.6098173260688782), (u’te’, 0.6042138934135437), (u’ot’, 0.595798909664154), (u’thats’, 0.595078706741333), (u’od’, 0.5908242464065552), (u’tho’, 0.58894944190979), (u’oa’, 0.5846965312957764)]

Makes sense:  the contexts where “teh” is common are those contexts where a lot of words are misspelled.

Using the “analogy” gadget, we can ask:  which word is to “because” as “teh” is to “the”?

>>> model.most_similar(positive=['because','teh'],negative=['the'])

[(u’becuase’, 0.6815075278282166), (u’becasue’, 0.6744950413703918), (u’cuz’, 0.6165347099304199), (u’becuz’, 0.6027254462242126), (u’coz’, 0.580410361289978), (u’b_c’, 0.5737690925598145), (u’tho’, 0.5647958517074585), (u’beacause’, 0.5630674362182617), (u’thats’, 0.5605655908584595), (u’lol’, 0.5597798228263855)]

Or “like”?

>>> model.most_similar(positive=['like','teh'],negative=['the'])

[(u’liek’, 0.678846001625061), (u’ok’, 0.6136218309402466), (u’hahah’, 0.5887773633003235), (u’lke’, 0.5840467214584351), (u’probly’, 0.5819578170776367), (u’lol’, 0.5802655816078186), (u’becuz’, 0.5771245956420898), (u’wierd’, 0.5759704113006592), (u’dunno’, 0.5709049701690674), (u’tho’, 0.565370500087738)]

Note that this doesn’t always work:

>>> model.most_similar(positive=['should','teh'],negative=['the'])

[(u’wil’, 0.63351970911026), (u’cant’, 0.6080706715583801), (u’wont’, 0.5967696309089661), (u’dont’, 0.5911301970481873), (u’shold’, 0.5908039212226868), (u’shoud’, 0.5776053667068481), (u’shoudl’, 0.5491836071014404), (u”would’nt”, 0.5474458932876587), (u’shld’, 0.5443994402885437), (u’wouldnt’, 0.5413904190063477)]

What are word2vec analogies?

Now let’s come back to the more philosophical question.  Should the effectiveness of word2vec at solving analogy problems make us think that the space of words really has linear structure?

I don’t think so.  Again, I learned something important from the work of Levy and Goldberg.  When word2vec wants to find the word w which is to x as y is to z, it is trying to find w maximizing the dot product

w . (x + y – z)

But this is the same thing as maximizing

w.x + w.y – w.z.

In other words, what word2vec is really doing is saying

“Show me words which are similar to x and y but are dissimilar to z.”

This notion makes sense applied to any notion of similarity, whether or not it has anything to do with embedding in a vector space.  For example, Levy and Goldberg experiment with maximizing

log(w.x) + log(w.y) – log(w.z)

instead, and get somewhat superior results on the analogy task.  Optimizing this objective has nothing to do with the linear combination x+y-z.
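
Here is a small sketch of the two scoring rules side by side, written in plain Python over lists of floats. The function names are invented for this illustration. In `log_score`, shifting each similarity from [-1, 1] into positive territory and adding a small `eps` are assumptions of the sketch, made so the logarithms are always defined; check the Levy and Goldberg paper for the exact form they use.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def additive_score(w, x, y, z):
    """w.x + w.y - w.z, which equals w.(x + y - z) by linearity."""
    return dot(w, x) + dot(w, y) - dot(w, z)


def log_score(w, x, y, z, eps=1e-3):
    """log(w.x) + log(w.y) - log(w.z) on shifted, strictly positive similarities.

    The shift (s + 1) / 2 and the eps term are assumptions of this sketch.
    """
    def shifted(u, v):
        return (dot(u, v) + 1.0) / 2.0 + eps

    return math.log(shifted(w, x)) + math.log(shifted(w, y)) - math.log(shifted(w, z))
```

Both functions only ever look at the three similarities w.x, w.y and w.z, which is the point: you could swap in any similarity measure for `dot` and the scores would still make sense.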

None of which is to deny that the analogy engine in word2vec works well in many interesting cases!  It has no trouble, for instance, figuring out that Baltimore is to Maryland as Milwaukee is to Wisconsin.  More often than not, the Milwaukee of state X correctly returns the largest city in state X.  And sometimes, when it doesn’t, it gives the right answer anyway:  for instance, the Milwaukee of Ohio is Cleveland, a much better answer than Ohio’s largest city (Columbus — you knew that, right?)  The Milwaukee of Virginia, according to word2vec, is Charlottesville, which seems clearly wrong.  But maybe that’s OK — maybe there really isn’t a Milwaukee of Virginia.  One feels Richmond is a better guess than Charlottesville, but it scores notably lower.  (Note:  Word2Vec’s database doesn’t have Virginia_Beach, the largest city in Virginia.  That one I didn’t know.)

Another interesting case:  what is to state X as Gainesville is to Florida?  The answer should be “the location of the, or at least a, flagship state university, which isn’t the capital or even a major city of the state,” when such a city exists.  But this doesn’t seem to be something word2vec is good at finding.  The Gainesville of Virginia is Charlottesville, as it should be.  But the Gainesville of Georgia is Newnan.  Newnan?  Well, it turns out there’s a Newnan, Georgia, and there’s also a Newnan’s Lake in Gainesville, FL; I think that’s what’s driving the response.  That, and the fact that “Athens”, the right answer, is contextually separated from Georgia by the existence of Athens, Greece.

The Gainesville of Tennessee is Cookeville, though Knoxville, the right answer, comes a close second.

Why?  You can check that Knoxville, according to word2vec, is much closer to Gainesville than Cookeville is.

>>> model.similarity('Cookeville','Gainesville')


>>> model.similarity('Knoxville','Gainesville')


But Knoxville is placed much closer to Florida!

>>> model.similarity('Cookeville','Florida')


>>> model.similarity('Knoxville','Florida')


Remember:  what word2vec is really optimizing for here is “words which are close to Gainesville and close to Tennessee, and which are not close to Florida.”  And here you see that phenomenon very clearly.  I don’t think the semantic relationship between ‘Gainesville’ and ‘Florida’ is something word2vec is really capturing.  Similarly:  the Gainesville of Illinois is Edwardsville (though Champaign, Champaign_Urbana, and Urbana are all top 5) and the Gainesville of Indiana is Connersville.  (The top 5 for Indiana are all cities ending in “ville” — is the phonetic similarity playing some role?)

Just for fun, here’s a scatterplot of the 1000 nearest neighbors of ‘Gainesville’, with their similarity to ‘Gainesville’ (x-axis) plotted against their similarity to ‘Tennessee’ (y-axis):

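[Figure: scatterplot of the 1000 nearest neighbors of ‘Gainesville’, similarity to ‘Gainesville’ (x-axis) against similarity to ‘Tennessee’ (y-axis)]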

The Pareto frontier consists of “Tennessee” (that’s the one whose similarity to “Tennessee” is 1, obviously..) Knoxville, Jacksonville, and Tallahassee.
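
For anyone who wants to pull out a frontier like this themselves, here is a minimal sketch. It assumes you have already collected, for each word, its pair of similarities into a dict; the function name and input format are choices made for this illustration.

```python
def pareto_frontier(points):
    """Return the names of points that are not dominated in both coordinates.

    `points` maps a name to an (x, y) pair. A point is dominated if some other
    point is at least as large in both coordinates and strictly larger in one.
    """
    frontier = []
    for name, (x, y) in points.items():
        dominated = any(
            ox >= x and oy >= y and (ox > x or oy > y)
            for other, (ox, oy) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

This is the quadratic-time version, which is plenty for 1000 points.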

Bag of contexts

One popular simple linear model of word space is given by representing a word as a “bag of contexts.”  Perhaps there are several thousand contexts, and each word is given by a sparse vector in the space spanned by contexts:  coefficient 0 if the word is not in that context, 1 if it is.  In that setting, certain kinds of analogies would be linearized and certain kinds would not.  If “major city” is a context, then “Houston” and “Dallas” might have vectors that looked like “Texas” with the coordinate of “major city” flipped from 0 to 1.  Ditto, “Milwaukee” would be “Wisconsin” with the same basis vector added.  So

“Texas” + “Milwaukee” – “Wisconsin”

would be pretty close, in that space, to “Houston” and “Dallas.”
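
Here is a toy version of that computation. The three contexts and the 0/1 assignments are invented purely to illustrate the idea; a real bag-of-contexts model would have thousands of contexts and much messier vectors.

```python
CONTEXTS = ["Texas", "Wisconsin", "major city"]


def bag(*contexts):
    """0/1 vector over CONTEXTS: 1 where the word occurs in that context."""
    return [1 if c in contexts else 0 for c in CONTEXTS]


# Toy assignments, made up for this example.
words = {
    "Texas": bag("Texas"),
    "Wisconsin": bag("Wisconsin"),
    "Houston": bag("Texas", "major city"),
    "Dallas": bag("Texas", "major city"),
    "Milwaukee": bag("Wisconsin", "major city"),
}


def combine(plus_a, plus_b, minus_c):
    """Coordinate-wise plus_a + plus_b - minus_c."""
    return [a + b - c for a, b, c in zip(words[plus_a], words[plus_b], words[minus_c])]


analogy = combine("Texas", "Milwaukee", "Wisconsin")
matches = [w for w, vec in words.items() if vec == analogy]
print(matches)  # ['Houston', 'Dallas']
```

In this idealized setting the analogy lands exactly on the vectors for the two Texas cities, because the "Wisconsin" coordinate cancels and the "major city" coordinate survives.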

On the other hand, it’s not so easy to see what relations antonyms would have in that space. That’s the kind of relationship the bag of contexts may not make linear.

The word2vec space is only 300-dimensional, and the vectors aren’t sparse at all.  But maybe we should think of it as a random low-dimensional projection of a bag-of-contexts embedding!  By the Johnson-Lindenstrauss lemma, a 300-dimensional projection is plenty big enough to preserve the distances between 3 million points with a small distortion factor; and of course it preserves all linear relationships on the nose.

Perhaps this point of view gives some insight into which kind of word relationships manifest as linear relationships in word2vec.  “flock:birds” is an interesting example.  If you imagine “group of things” as a context, you can maybe imagine word2vec picking this up.  But actually, it doesn’t do well:

>>> model.most_similar(positive=['fish','flock'],negative=['birds'])
[(u’crays’, 0.4601619839668274), (u’threadfin_salmon’, 0.4553075134754181), (u’spear_fishers’, 0.44864755868911743), (u’slab_crappies’, 0.4483765661716461), (u’flocked’, 0.44473177194595337), (u’Siltcoos_Lake’, 0.4429660737514496), (u’flounder’, 0.4414420425891876), (u’catfish’, 0.4413948059082031), (u’yellowtail_snappers’, 0.4410281181335449), (u’sockeyes’, 0.4395104944705963)]

>>> model.most_similar(positive=['dogs','flock'],negative=['birds'])
[(u’dog’, 0.5390862226486206), (u’pooches’, 0.5000904202461243), (u’Eminem_Darth_Vader’, 0.48777419328689575), (u’Labrador_Retrievers’, 0.4792211949825287), (u’canines’, 0.4766522943973541), (u’barked_incessantly’, 0.4709487557411194), (u’Rottweilers_pit_bulls’, 0.4708423614501953), (u’labradoodles’, 0.47032350301742554), (u’rottweilers’, 0.46935927867889404), (u’forbidding_trespassers’, 0.4649636149406433)]

The answers “school” and “pack” don’t appear here.  Part of this, of course, is that “flock,” “school”, and “pack” all have interfering alternate meanings.  But part of it is that the analogy really rests on information about contexts in which the words “flock” and “birds” both appear.  In particular, in a short text window featuring both words, you are going to see a huge spike of “of” appearing right after flock and right before birds.  A statement of the form “flock is to birds as X is to Y” can’t be true unless “X of Y” actually shows up in the corpus a lot.

Challenge problem:  Can you make word2vec do a good job with relations like “flock:birds”?  As I said above, I wouldn’t have been shocked if this had actually worked out of the box, so maybe there’s some minor tweak that makes it work.

Boys’ names, girls’ names

Back to gender-flipping.  What’s the “male version” of the name “Jennifer”?

There are various ways one can do this.  If you use the analogy engine from word2vec, finding the closest word to “Jennifer” + “he” – “she”, you get as your top 5:

David, Jason, Brian, Kevin, Chris

>>> model.most_similar(positive=['Jennifer','he'],negative=['she'])
[(u’David’, 0.6693146228790283), (u’Jason’, 0.6635637283325195), (u’Brian’, 0.6586753129959106), (u’Kevin’, 0.6520106792449951), (u’Chris’, 0.6505492925643921), (u’Mark’, 0.6491551995277405), (u’Matt’, 0.6386727094650269), (u’Daniel’, 0.6294828057289124), (u’Greg’, 0.6267883777618408), (u’Jeff’, 0.6265031099319458)]

But there’s another way:  you can look at the words closest to “Jennifer” (which are essentially all first names) and pick out the ones which are closer to “he” than to “she”.  This gives

Matthew, Jeffrey, Jason, Jesse, Joshua.

>>> [x[0] for x in model.most_similar('Jennifer',topn=2000) if model.similarity(x[0],'he') > model.similarity(x[0],'she')]
[u’Matthew’, u’Jeffrey’, u’Jason’, u’Jesse’, u’Joshua’, u’Evan’, u’Brian’, u’Cory’, u’Justin’, u’Shawn’, u’Darrin’, u’David’, u’Chris’, u’Kevin’, u’3/dh’, u’Christopher’, u’Corey’, u’Derek’, u’Alex’, u’Matt’, u’Jeremy’, u’Jeff’, u’Greg’, u’Timothy’, u’Eric’, u’Daniel’, u’Wyvonne’, u’Joel’, u’Chirstopher’, u’Mark’, u’Jonathon’]

Which is a better list of “male analogues of Jennifer?”  Matthew is certainly closer to Jennifer in word2vec distance:

>>> model.similarity('Jennifer','Matthew')


>>> model.similarity('Jennifer','David')


But, for whatever reason, “David” is coded as much more strongly male than “Matthew” is; that is, it is closer to “he” – “she”.  (The same is true for “man” – “woman”.)  So “Matthew” doesn’t score high in the first list, which rates names by a combination of how male-context they are and how Jennifery they are.  A quick visit to NameVoyager shows that Matthew and Jennifer both peaked sharply in the 1970s; David, on the other hand, has a much longer range of popularity and was biggest in the 1950s.

Let’s do it again, for Susan.  The two methods give

David, Robert, Mark, Richard, John

Robert, Jeffrey, Richard, David, Kenneth

And for Edith:

Ernest, Edwin, Alfred, Arthur, Bert

Ernest, Harold, Alfred, Bert, Arthur

Pretty good agreement!  And you can see that, in each case, the selected names are “cultural matches” to the starting name.

Sidenote:  In a way it would be more natural to project wordspace down to the orthocomplement of “he” – “she” and find the nearest neighbor to “Susan” after that projection; that’s like, which word is closest to “Susan” if you ignore the contribution of the “he” – “she” direction.  This is the operation Ben Schmidt calls “vector rejection” in his excellent post about his word2vec model trained on student evaluations.  

If you do that, you get “Deborah.”  In other words, those two names are similar in so many contextual ways that they remain nearest neighbors even after we “remove the contribution of gender.”  A better way to say it is that the orthogonal projection doesn’t really remove the contribution of gender in toto.  It would be interesting to understand what kind of linear projections actually make it hard to distinguish male first names from female ones.
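
The projection itself is a one-liner. Here is a minimal sketch on plain Python lists; the function names are mine, and with the real embedding `direction` would be the difference between the vectors for “he” and “she”.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reject(v, direction):
    """Component of v orthogonal to `direction`: v minus its projection onto it."""
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * d for a, d in zip(v, direction)]
```

After rejection the vectors are no longer unit length, so to compare words you would renormalize before taking dot products.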

Google News is a big enough database that this works on non-English names, too.  The male “Sylvie”, depending on which protocol you pick, is

Alain, Philippe, Serge, Andre, Jean-Francois


Jean-Francois, Francois, Stephane, Alain, Andre

The male “Kyoko” is

Kenji, Tomohiko, Nobuhiro, Kazuo, Hiroshi


Satoshi, Takayuki, Yosuke, Michio, Noboru

French and Japanese speakers are encouraged to weigh in about which list is better!

Update:  Even a little more messing around with “changing the gender of words” in a followup post.


Names and words

When you get the copy-edited manuscript of a book back, it comes with a document called “Names and Words.”  This is a list of proper names or unusual words in the book which might admit variant spelling or typography, and the list is there to keep everybody on the production team uniform.

Here’s the A-B section of my list.  I think it gives a pretty good sense of what the book is about.

Niels Henrik Abel

Mahmoud Ahmadinejad

Aish HaTorah

Alcmaeon of Croton

Alhazen (Abu ‘Ali al-Hasan ibn al-Haytham)

Spike Albrecht

Ray Allen

Scott Allen

Akhil and Vikram Amar

Apollonius of Perga

Yasser Arafat

John Arbuthnot

Dan Ariely

Kenneth Arrow

John Ashbery

Daryl Renard Atkins

Yigal Attali

David Bakan

Stefan Banach

Dror Bar-Natan

Joseph-Émile Barbier

Leroy E. Burney

Andrew Beal

Nicholas Beaudrot

Bernd Beber

Gary Becker

Madeleine Beekman

Armando Benitez

Craig Bennett

Jim Bennett

George Berkeley

Joseph Berkson

Daniel Bernoulli

Jakob Bernoulli

Nicholas Bernoulli

Alphonse Bertillon


Joseph Bertrand

best seller


R. H. Bing

Otto Blumenthal

Usain Bolt

Farkas Bolyai

János Bolyai

Jean-Charles de Borda

Bose-Chaudhuri-Hocquenghem code

Nick Bostrom

David Brooks

Derren Brown

Filippo Brunelleschi

Pat Buchanan

Georges-Louis LeClerc, Comte de Buffon

Dylan Byers

Daniel Byman

David Byrne



In which I have a quarter-million friends of friends on Facebook

One of the privacy options Facebook allows is “restrict to friends of friends.”  I was discussing with Tom Scocca the question of how many people this actually amounts to.  FB doesn’t seem to offer an easy way to get a definitive accounting, so I decided to use the new Facebook Graph Search to make a quick and dirty estimate.  If you ask it to show you all the friends of your friends, it just tells you that there are more than 1000, but doesn’t supply an exact number.  If you want a count, you have to ask it something more specific, like “How many friends of my friends are named Constance?”

In my case, the answer is 25.

So what does that mean?  Well, according to the amazing NameVoyager, between 100 and 300 babies per million are named Constance, at least in the birthdate range that contains most of Facebook’s user base and, I expect, most of my friends-of-friends (hereafter, FoFs) as well.  So under the assumption that my FoFs are as likely as the average American to be named Constance, there should be between 85,000 and 250,000 FoFs.

That assumption is massively unlikely, of course; name choices have strong correlations with geography, ethnicity, and socioeconomic thingamabobs.  But you can just do this redundantly to get a sense of what’s going on.  59 of my FoFs are named Marianne, a name whose frequency ranges from 150-300 parts per million; that suggests a FoF range of about 200-400K.

I did this for a few names (50 Geralds, 18 Charitys (Charities??)) and the overlaps of the ranges seemed to hump at around 250,000, so that’s my vague estimate for the number.
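
The arithmetic behind each range is just the name count divided by the name's frequency. A minimal sketch, with a function name invented here, using the two names whose frequencies are quoted above:

```python
def fof_range(count, low_per_million, high_per_million):
    """Range of FoF totals implied by `count` FoFs sharing a name whose
    frequency lies between low_per_million and high_per_million."""
    smallest = count * 1_000_000 / high_per_million
    largest = count * 1_000_000 / low_per_million
    return round(smallest), round(largest)


print(fof_range(25, 100, 300))  # Constance: (83333, 250000)
print(fof_range(59, 150, 300))  # Marianne: (196667, 393333)
```

The overlap of these two ranges runs from about 197,000 to 250,000, which is why the estimate ends up near the top of the Constance range.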

But then I remembered that there was actually a paper about this on the arXiv, “The Anatomy of the Facebook Graph,” by Ugander, Karrer, Backstrom, and Marlow, which studies exactly this question.  They found something which is, to me, rather surprising:  that the number of FoFs grows approximately linearly in the number of friends.  The appropriate coefficients have surely changed since 2011, but they get a good fit with

#FoF = 355(#friends) – 15057.

For me, with 680 friends, that’s 226,343.  Good fit!

This 2012 study from Pew (on which Marlow is also an author) studies a sample in which the respondents had a mean 245 Facebook friends, and finds that the mean number of FoFs was 156,569.  Interestingly, the linear model from the earlier paper gives only 72,000, though to my eye it looks like 245 is well within the range where the fit to the line is very good.
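
Both numbers come straight from plugging into the linear fit quoted above; here it is as a tiny function (the name is mine) so you can try your own friend count.

```python
def predicted_fofs(friends, slope=355, intercept=-15057):
    """Linear fit quoted above: #FoF = 355 * #friends - 15057."""
    return slope * friends + intercept


print(predicted_fofs(680))  # 226343
print(predicted_fofs(245))  # 71918
```

Note that the fit goes negative below about 43 friends, so it can only be meant for people with a reasonably large friend list.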

The math question this suggests:  in the various random-graph models that people like to use to study social networks, what is the mean size of the 2-neighborhood of x (i.e. the number of FoFs) conditional on x having degree k?  Is it ever linear in k, or approximately linear over some large range of k?


10,000 baby names of Harvard

My 20th Harvard reunion book is in hand, offering a social snapshot of a certain educationally (and mostly financially) elite slice of the US population.

Here is what Harvard alums name their kids.  These are chosen by alphabetical order of surname from one segment of the book.  Most of these children were born between 2003 and the present.  They are grouped by family.

Molly, Danielle

Zachary, Zoe, Alex

Elias, Ella, Irena

Sawyer, Luke

Peyton, Aiden

Richard, Sonya

Grayson, Parker, Saya

Yoomi, Dae-il

Io, Pico, Daphne

Lucine, Mayri

Matthew, Christopher

Richard, Annalise, Ryan


Christopher, Sarah, Zachary, Claire

Shaiann, Zaccary

Alexandra, Victoria, Arianna, Madeline


Grace, Luke, Anna

William, Cecilia, Maya

Bode, Tyler

Daniel, Catherine

Alex, Gretchen

Nathan, Spencer, Benjamin

Ezekiel, Jesse

Matthew, Lauren, Ava, Nathan

Samuel, Katherine, Peter, Sophia

Ameri, Charles


Andrew, Zachary, Nathan

Alexander, Gabriella


Andrew, Nadia

Caroline, Elizabeth

Paul, Andrew

Shania, Tell, Delia

Saxon, Beatrix


Nathan, Lukas, Jacob

Noah, Haydn, Ellyson


Leonidas, Cyrus

Isabelle, Emma

Joseph, Theodore

Asha, Sophie, Tejas

Gabriela, Carlos, Sebastian

Brendan, Katherine


James, Seeger, Arden

Helena, Freya

Alexandra, Matthew


If you saw these names, would you be able to guess roughly what part of the culture they were drawn from?  Are there ways in which the distribution is plainly different from “standard” US naming practice?



Humanities Hackathon

Had a great time today talking graph theory with a roomful of students and faculty in the humanities at the Humanities Hackathon.  Here’s a (big .ppt file) link to my slides.  One popular visualization was this graph of baby boys’ names from 2011, where two names are adjacent if their popularity profile across 12 representative states is very similar.  (For example, names similar to “Malachi” on this measure include “Ashton” and “Kaden,” while names similar to “Patrick” include “Thomas,” “John,” “Sean,” and “Ryan.”)


The visualization is by the open-source graph-viz tool gephi.

I came home only to encounter this breathless post from the Science blog about a claim that you can use network invariants (e.g. clustering coefficient, degree distribution, correlation of degree between adjacent nodes) to distinguish factually grounded narratives like the Iliad from entirely fictional ones like Harry Potter.  The paper itself is not so convincing.  For instance, its argument on “assortativity,” the property that high-degree nodes tend to be adjacent to one another, goes something like this:

Real-life social networks tend to be assortative, in the sense that the number of friends I have is positively correlated with the number of friends my friends have.

The social network they write down for the Iliad isn’t assortative, so they remove all the interactions classified as “hostile,” and then it is.

The social network for Beowulf isn’t assortative, so they remove all the interactions classified as “hostile,” and then it still isn’t, so they take out Beowulf himself, and then it is, but just barely.

Conclusion: The social networks of Beowulf and the Iliad are assortative, just like real social networks.
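
To make “assortative” concrete: one common way to measure it is the Pearson correlation between the degrees at the two ends of each edge, positive for assortative networks and negative for disassortative ones. The sketch below computes that for a list of undirected edges. I'm assuming this standard definition for illustration; the paper may use a different estimator, and the function name is mine.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge.

    `edges` is a list of undirected (u, v) pairs. Each edge is counted in both
    directions so the measure is symmetric. Raises ZeroDivisionError when every
    node has the same degree, since the correlation is then undefined.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    pairs = [(degree[u], degree[v]) for u, v in edges]
    pairs += [(b, a) for a, b in pairs]
    n = len(pairs)
    mean = sum(a for a, _ in pairs) / n
    cov = sum((a - mean) * (b - mean) for a, b in pairs) / n
    var = sum((a - mean) ** 2 for a, _ in pairs) / n
    return cov / var
```

A star, with one hub connected to several otherwise friendless nodes, is the extreme disassortative case and scores -1. That shape is roughly what you get when one hero interacts with everybody.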

Digital humanities can be better than this!



Guess the state by its popular names

I’m messing around with some Social Security baby name data, preparing examples for the upcoming Humanities Hackathon at the Wisconsin Institute for Discovery, where I’ll be teaching a one-day course on networks and graphs.

So here’s a quiz.  I give a list of baby names which were strongly overrepresented among babies born in state X in 2011.  You name the state.

  1. [‘Jamison’, ‘Keagan’, ‘Nolan’, ‘Cullen’, ‘Finley’, ‘Dane’, ‘Bennett’, ‘Clay’, ‘Clayton’, ‘Sullivan’]
  2. [‘Santiago’, ‘Roberto’, ‘Iker’, ‘Alberto’, ‘Joe’, ‘Jose’, ‘Arturo’, ‘Gael’, ‘Armando’, ‘Raul’, ‘Gustavo’, ‘Juan’, ‘Mauricio’, ‘Julio’]
  3. [‘Francis’, ‘Nasir’, ‘Semaj’, ‘Shane’]
  4. [‘Ivan’, ‘Ismael’, ‘Edgar’, ‘Uriel’, ‘Francisco’, ‘Ramon’, ‘Damian’, ‘Gerardo’, ‘Emiliano’, ‘Sergio’, ‘Fernando’, ‘Esteban’, ‘Joaquin’, ‘Ernesto’, ‘Cesar’, ‘Moises’, ‘Diego’, ‘Ruben’, ‘Maximiliano’, ‘Johnny’]
  5. [‘Brendan’, ‘Conor’, ‘Seamus’, ‘Ronan’, ‘Theodore’, ‘Jadiel’, ‘Jacoby’]
  6. [‘Lawson’, ‘Khalil’, ‘Jamari’, ‘Chandler’, ‘Brantley’, ‘Cason’, ‘Davis’, ‘Braylen’, ‘Mekhi’]
  7. [‘Trenton’, ‘Remington’, ‘Kale’, ‘Dayton’, ‘Blaine’, ‘Clark’, ‘Karter’, ‘Jase’, ‘Lane’, ‘Gunner’]
  8. [‘Alonzo’, ‘Zaiden’, ‘Ezekiel’, ‘Trace’, ‘Orion’, ‘Cruz’, ‘Asher’, ‘Milo’, ‘Brodie’, ‘Jonas’, ‘Finley’, ‘Soren’, ‘Archer’, ‘Kellan’, ‘Ryker’, ‘Dillon’, ‘Zane’, ‘Kade’, ‘Nash’, ‘Kian’, ‘Cyrus’, ‘River’, ‘Uriah’, ‘Porter’]
  9. [‘Muhammad’, ‘Ali’, ‘Ahmed’, ‘Mohamed’, ‘Moshe’, ‘Aron’, ‘Solomon’, ‘Mohammed’, ‘Justin’, ‘Alvin’, ‘Mohammad’, ‘Wilson’, ‘Abraham’, ‘Ibrahim’]

Hint:  most of these are pretty big states, and one of them is Wisconsin.

Hint 2:  Knowing about football will help you with one of these and knowing about baseball with another.

I think these are pretty hard.

Answers below the fold:



Since someone asked me today: yes, the Stanley Higgs who appears in my novel was named after the Higgs boson. I thought it would add a very slight tinge of cosmic mystery to the character. Not any more, I guess.


And the Drungo Larue Hazewood Award for the year’s best sports name goes to

Nimrod Hilliard IV, of Madison East HS, this year’s “Mr. Basketball Wisconsin.”

If it’s not enough that his name is Nimrod Hilliard IV, the Madison East team is called the Purgolders.

Purgolder Nimrod Hilliard IV.

Drungo approves.
