## Matrices with low rank and root-of-unity entries, in the morning, in the evening

A couple of days ago I started and ended my workday attending very interesting seminars. At 9am I was watching Zeev Dvir talk about his new paper with Manik Dhar proving the Kakeya conjecture over Z/mZ for m squarefree. At 5, as the sun set, Bjorn Poonen was talking at IAS (ok, at his office at MIT, but virtually at IAS) about his work with Kedlaya, Kolpakov and Rubinstein on space vectors forming rational angles. Those two things sound pretty different, but to my surprise, a very similar question appeared at the heart of both papers! Here it is, phrased a bit vaguely:

Describe the N x N matrices with rank much less than N whose entries are all 0 or roots of unity.

Such matrices exist, of course; A could be 0, or A could be the rank 1 matrix $v^T w$ where v and w have root-of-unity entries. Or you could just make a block-diagonal matrix with each block of rank 1 as above. But are there more interesting cases? And what does this have to do with the results above? (Update: it occurs to me that you could pithily phrase the question as: what can we say about matrices of low rank and low height?)
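(If you want to play with this numerically, here is a quick sketch in Python with numpy of the rank-1 and block-diagonal constructions; the particular exponents are arbitrary.)

```python
import numpy as np

# Outer product v^T w of two vectors whose entries are powers of a
# primitive 12th root of unity; every entry of A is again a 12th root
# of unity, and A has rank 1.  (The exponents are arbitrary.)
zeta = np.exp(2j * np.pi / 12)
v = zeta ** np.array([0, 3, 7, 1, 5])
w = zeta ** np.array([2, 2, 9, 4, 0])
A = np.outer(v, w)

# Stacking two such blocks gives a 10 x 10 matrix of rank 2 whose
# entries are all 0 or roots of unity.
B = np.block([[A, np.zeros((5, 5))], [np.zeros((5, 5)), A]])

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B))   # 1 2
```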

Let’s start with Poonen, where the connection is a little more direct. He and his collaborators are interested in tetrahedra whose face angles are all rational. The unit vectors normal to those faces will thus have the property that

$\langle u,v \rangle = \cos ((p/q)\pi)$

for every pair of vectors u,v. Bjorn & co. decide to go further, asking for a classification of all sets of unit vectors in R^3 such that every pair forms a rational angle. If you have a set of k such vectors, you can put them together into a k x 3 matrix U, and the condition above says that the matrix UU^T (the so-called Gram matrix) has all entries of the form $\cos (p/q)\pi$. But more is true: UU^T visibly has rank at most 3! Now UU^T doesn’t have entries that are roots of unity, but it does have entries drawn from a very similar countable subset. Conversely, if you have a matrix with 1’s on the diagonal, all other entries rational cosines, and rank at most r, then you have found a subset of S^{r-1} which forms only rational angles. So this has the same flavor as the question above. And indeed one way you could obtain such a matrix is by finding a matrix A of rank < r/2 with all entries roots of unity and then averaging A with its conjugate.
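(A tiny numerical check of the rank bound, in Python with numpy. The fourth vector below is my own toy example, not one from the paper: it meets e1 and e2 at 60 degrees and e3 at 45 degrees, so all pairwise angles are rational multiples of pi, and the Gram matrix has rank 3 no matter how many such vectors you stack.)

```python
import numpy as np

# Four unit vectors in R^3, pairwise at rational angles: the standard
# basis (pairwise 90 degrees) plus (1/2, 1/2, sqrt(2)/2), which meets
# e1 and e2 at 60 degrees and e3 at 45 degrees.
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, np.sqrt(2) / 2]])

G = U @ U.T   # Gram matrix: 1's on the diagonal, cos((p/q) pi) elsewhere

print(np.round(G, 3))
print(np.linalg.matrix_rank(G))   # 3, since G factors through R^3
```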

Now what about Zeev? He’s working on one of my favorite problems, Kakeya over rings. Of course it was Zeev himself who famously proved the Kakeya conjecture over finite fields, showing that if S is a subset of F_q^n containing a line in every direction, then |S| must be at least c_n q^n. This is meant to be an analogue of the original Kakeya problem from harmonic analysis, which asks whether a subset of R^n containing a unit line segment in every direction has to have full Hausdorff dimension. Dvir’s theorem proves this over finite fields; but actually, it proves more — that the Kakeya set has positive measure (namely, measure at least c_n, normalizing the whole space to have measure 1). That’s a problem if you like analogies, because the analysis analogue of that statement isn’t true! A famous example of Besicovitch shows that a set can have a unit line segment in every direction but have measure 0. (It’s hard to draw. Very spiky.) R and F_p are different!

One reason they’re different is that R has scales and F_p does not. Two distinct real numbers can be very close together or very far apart. In a finite field, two numbers are either the same or they’re not. A better analogue for R (of course I’d think this, I’m a number theorist) is the ring of p-adic integers, Z_p, which metrically looks a lot more like the reals. Even the finite ring Z/p^k Z has scales, though only k, not infinitely many. (For more about this, see this old post.)

Dhar and Dvir don’t prove Kakeya over Z/p^k Z, a long-time goal of mine, but they do prove it over Z/mZ for m squarefree, which is very cool, and they lay out a very appealing strategy for proving it over R = Z/p^k Z, which I’ll now explain. Suppose S is a subset of R^n which contains a line in every direction. (By a “direction” let’s just mean a nonzero vector y in R^n; we could identify vectors that differ by a scalar but I don’t think that makes a real difference.) For each direction y, choose a line L_y in that direction whose p^k points are all contained in S. Now we have a linear map F_S from the C-vector space spanned by directions to the C-vector space spanned by S, defined by

$F_S(y) = \sum_{s \in L_y} s$.

And we have a map G from C[S] to the space spanned by R^n, which is just the discrete Fourier transform

$G(s) = \sum_{v \in R^n} \zeta^{\langle s,v \rangle} v$

with $\zeta$ a primitive p^k-th root of unity in C.

What happens if we compose F with G? The points of L_y are just x+ty for some x in R^n, with t ranging over R. So

$G(F_S(y)) = \sum_{v \in R^n} \sum_{t \in R} \zeta^{\langle x + ty, v \rangle} v$

When $\langle y,v \rangle$ is anything but 0, the sum over t cancels everything and you get a coefficient of 0. When $\langle y,v \rangle = 0$, on the other hand, the dependence on t in the summand disappears and you just get $p^k \zeta^{\langle x,v \rangle}$. Write M for the matrix expressing the linear transformation $p^{-k} G F_S$.
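(Here is the character-sum cancellation checked numerically for small parameters, say p = 3 and k = 2; this is just the standard geometric-series fact, not anything specific to the paper. Note the sum vanishes even when the exponent is a nonzero multiple of p.)

```python
import numpy as np

p, k = 3, 2
q = p ** k                        # the ring R = Z/p^k Z has q elements
zeta = np.exp(2j * np.pi / q)     # primitive p^k-th root of unity

# sum over t in R of zeta^(t*m) is p^k when m = 0 in R and 0 otherwise,
# because zeta^m != 1 for every nonzero m mod p^k.
for m in range(q):
    s = sum(zeta ** (t * m) for t in range(q))
    expected = q if m % q == 0 else 0
    assert abs(s - expected) < 1e-9
```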

To sum up: this composition has rank at most |S|, because it factors through C[S]. On the other hand, expressed as a matrix, M has y,v entry 0 whenever y and v are not orthogonal, and some p^k-th root of unity whenever y and v are orthogonal. Dhar and Dvir ask, in the spirit of the question we started with: how small can the rank of such a matrix actually be? Any lower bound on the rank of M is a lower bound for the size of the Kakeya set S.

Is that in the spirit of the question we started with? The fact that the order of the root of unity is known makes it feel a little different from the question arising for Bjorn, where bounding the order is part of the game. Your allowable entries are drawn from a finite set instead of an infinite one. And of course in this case you’re specifying which matrix entries are roots of unity and which are zero.

Still, I found it an enjoyably thematic day of math!

## Political coordinates test

A popular political quiz on the internet purports to place you on a Cartesian plane with “left-right” on one axis and “libertarian-communitarian” on the other, by presenting you with 36 assertions you’re supposed to agree or disagree with. One of them is

“There are too many wasteful government programs.”

Well, of course there are! For this not to be the case, the government would have to be uniquely unwasteful among all large institutions. The quiz does not ask whether you agree that

“There are too many wasteful private enterprises.”

I would like to agree with both, but the test only allows me to agree with the first while remaining silent about the second, which makes me seem more of a free-market purist than I really am. Which questions you choose to ask affects which answers you’re able to get.

## Lo!

“A naked man in a city street — the track of a horse in volcanic mud — the mystery of reindeer’s ears — a huge, black form, like a whale, in the sky, and it drips red drops as if attacked by celestial swordfishes — an appalling cherub appears in the sea —

Confusions.”

## France

Back from France!  Just there for 2 1/2 days on the way back from Israel.

1.  Cheeses eaten:  Bethmale, Camembert, unidentified Basque sheep’s milk, Chabechou, Crottin de Chavignol, 28-month aged Comté, Vieux Cantal, Brie de Nangis.  The Brie and the Chabechou (both from La Fermette) were the highlights.
2.  On the other hand, Berthillon ice cream not as amazing as I remembered — possibly because American ice cream has gotten a lot better since 2004, the last time I was in Paris?
3. The Louvre is by far the most difficult major world museum to navigate.  Why, for instance, the system where the rooms have numbers but the room numbers aren’t on the map?
4. I wonder whether museums with long entry lines have considered opening a separate, presumably shorter line for people willing to pay double (3x, 5x?) the usual entry fee.
5. In the most Thomas Friedman moment of my life, an Algerian cab driver heard my accent and immediately began telling me how much he loved Trump.  If you want to know the Algerian cab driver conventional wisdom on this topic, it’s that Trump is a businessman and will “calmer” once he becomes President.  I am doubtful.  Tune in later to see who was right, the Algerian cab driver or his skeptical passenger.
6. AB really loved the Rodin sculptures in the Musee d’Orsay.  I think because her height is such that she’s quite close to their feet.  Rodin sculpted a hell of a foot.
7. I learned from Leila Schneps that the masters of the Académie Française have declared that “oignon” is to be spelled “ognon” from now on!  I can’t adjust.

## Are the 2016 Orioles the slowest team in baseball history?

The Orioles are last in the AL in stolen bases, with 17.  They also have the fewest times caught stealing, with 11; they’re so slow they’re not even trying to run.

But here’s the thing that really jumps out at you.  With just 17 games to play, the Orioles have 6 triples on the season.  And this is a team with power, a team that hits the ball to the deep outfield a lot.  Six triples.  You know what the record is for the fewest triples ever recorded by a team?  11.  By the 1998 Orioles.  This year’s team is like the 1998 squad without the speed machine that was 34-year-old Brady Anderson.   They are going to set a fewest-triples record that may never be broken.

## Very Aomby

This song, by Damily, is amazing:

This is tsapiky, a currently dominant style in popular music of southern Madagascar.  There isn’t much tsapiky on Spotify, but what there is is pretty good.  (None of it equals “Very Aomby,” though.)


## Those who leave and those who stay

Just finished the third of Ferrante’s Neapolitan novels.  Greco, the narrator, is constantly yearning for a quiet space, away from competition.  The sense is that you can only make art in such a quiet space.  But it seems there’s no interaction between people without one striving to fuck, thwart, or destroy the other.   So maybe no quiet space exists, though Greco again and again almost seems to find it.  Ferrante puts the football down in front of her, Ferrante pulls it away.  And you’re surprised every time.


## Voices from Chernobyl

Voices from Chernobyl is an oral history of the atomic disaster and its aftermath, by Svetlana Alexievich, the first journalist to win the Nobel Prize for Literature.  (Steinbeck maybe?  But he didn’t win on his journalism.)

Nina Konstantinovna, a literature teacher:

I teach Russian literature to kids who are not like the kids I taught ten years ago.  They are constantly seeing someone or something get buried, get placed underground.  Houses and trees, everything gets buried.  If they stand in line for fifteen, twenty minutes, some of them start fainting, their noses bleed.  You can’t surprise them with anything and you can’t make them happy.  They’re always tired and sleepy.  Their faces are pale and gray.  They don’t play and they don’t fool around.  If they fight or accidentally break a window, the teachers are pleased.  We don’t yell at them, because they’re not like kids.  And they’re growing so slowly.  You ask them to repeat something during a lesson, and the child can’t, it gets to the point where you simply ask him to repeat a sentence, and he can’t.  You want to ask him, “Where are you?  Where?”

Major Oleg Pavlov, a helicopter pilot:

Every April 26 we get together, the guys who were there.  We remember how it was.  You were a soldier, at war, you were necessary.  We forget the bad parts and remember that.  We remember that they couldn’t have made it without us.  Our system, it’s a military system, essentially, and it works great in emergencies.  You’re finally free there, and necessary.  Freedom!  And in those times the Russian shows how great he is.  How unique.  We’ll never be Dutch or German.  And we’ll never have proper asphalt or manicured lawns.  But there’ll always be plenty of heroes.

Translated by Keith Gessen.


## Room to grow

However, Apple CEO Tim Cook pointed out that 60% of people who owned an iPhone before the launch of the iPhone 6 haven’t upgraded to the most recent models, which implies that there is still room to grow, Reuters notes.

Doesn’t it imply that a) people are no longer on contracts incentivizing biannual upgrades; and b) Apple hasn’t figured out a way to make a new phone that’s different enough from the iPhone 5 to make people want to switch?

## Messing around with word2vec

Word2vec is a way of representing words and phrases as vectors in medium-dimensional space developed by Tomas Mikolov and his team at Google; you can train it on any corpus you like (see Ben Schmidt’s blog for some great examples) but the version of the embedding you can download was trained on about 100 billion words of Google News, and encodes words as unit vectors in 300-dimensional space.

What really got people’s attention, when this came out, was word2vec’s ability to linearize analogies.  For example:  if x is the vector representing “king,” and y the vector representing “woman,” and z the vector representing “man,” then consider

x + y – z

which you might think of, in semantic space, as being the concept “king” to which “woman” has been added and “man” subtracted — in other words, “king made more female.”  What word lies closest in direction to x+y-z?  Just as you might hope, the answer is “queen.”

I found this really startling.  Does it mean that there’s some hidden linear structure in the space of words?

It turns out it’s not quite that simple.  I played around with word2vec a bunch, using Radim Řehůřek’s gensim package that nicely pulls everything into python; here’s what I learned about what the embedding is and isn’t telling you.

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts — that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’.  (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.)  If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity('tremendous','enormous')

0.74432902555062841

The notion of similarity used here is just cosine similarity (which is to say, the dot product of the unit vectors).  It’s positive when the words are close to each other, negative when the words are far.  For two completely random words, the similarity is pretty close to 0.
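(Written out in numpy, that similarity is just this; for the unit vectors word2vec uses, the normalization is a no-op.)

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v.  For unit vectors, as in
    the word2vec embedding, this is just the dot product."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 0.0])
print(cosine_similarity(u, u))                           # ~1: same direction
print(cosine_similarity(u, -u))                          # ~-1: opposite
print(cosine_similarity(u, np.array([0.0, 0.0, 3.0])))   # ~0: orthogonal
```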

On the other hand:

>>> model.similarity('tremendous','negligible')

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words.  ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.

You might ask:  well, what words in Word2Vec are farthest from “tremendous?”  You just get trash:

>>> model.most_similar(negative=['tremendous'])

[(u'By_DENISE_DICK', 0.2792186141014099), (u'NAVARRE_CORPORATION', 0.26894450187683105), (u'By_SEAN_BARRON', 0.26745346188545227), (u'LEGAL_NOTICES', 0.25829464197158813), (u'Ky.Busch_##-###', 0.2564955949783325), (u'desultorily', 0.2563159763813019), (u'M.Kenseth_###-###', 0.2562236189842224), (u'J.McMurray_###-###', 0.25608277320861816), (u'D.Earnhardt_Jr._###-###', 0.2547803819179535), (u'david.brett_@_thomsonreuters.com', 0.2520599961280823)]

If 3 million words were distributed randomly in the unit ball in R^300, you’d expect the farthest one from “tremendous” to have dot product with it of about -0.3.  So when you see a list whose largest score is around that size, you should think “there’s no structure there, this is just noise.”
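(The -0.3 comes from a standard concentration heuristic: the dot product of two random unit vectors in R^d is roughly Gaussian with variance 1/d, so the extreme among N of them lands near sqrt(2 log N / d), which is about 0.31 for N = 3 million and d = 300.  A smaller simulation, just to see the effect:)

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 300, 30000
V = rng.standard_normal((N, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # N random unit vectors

dots = V[1:] @ V[0]   # dot products with a fixed vector, roughly N(0, 1/d)
predicted = -np.sqrt(2 * np.log(N) / d)

print(dots.min(), predicted)   # both in the neighborhood of -0.26
```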

Antonyms

Challenge problem:  Is there a way to accurately generate antonyms using the word2vec embedding?  That seems to me the sort of thing the embedding is not capturing.  Kyle McDonald had a nice go at this, but I think the lesson of his experiment is that asking word2vec to find analogies of the form “word is to antonym as happy is to?” is just going to generate a list of neighbors of “happy.”  McDonald’s results also cast some light on the structure of word2vec analogies:  for instance, he finds that

waste is to economise as happy is to chuffed

First of all, “chuffed” is a synonym of happy, not an antonym.  But more importantly:  The reason “chuffed” is there is because it’s a way that British people say “happy,” just as “economise” is a way British people spell “economize.”  Change the spelling and you get

waste is to economize as happy is to glad

Non-semantic properties of words matter to word2vec.  They matter a lot.  Which brings us to diction.

Word2Vec distance keeps track of diction

Lots of non-semantic stuff is going on in natural language.  Like diction, which can be high or low, formal or informal, flowery or concrete.    Look at the nearest neighbors of “pugnacity”:

>>> model.most_similar('pugnacity')

[(u'pugnaciousness', 0.6015268564224243), (u'wonkishness', 0.6014434099197388), (u'pugnacious', 0.5877301692962646), (u'eloquence', 0.5875781774520874), (u'sang_froid', 0.5873805284500122), (u'truculence', 0.5838015079498291), (u'pithiness', 0.5773230195045471), (u'irascibility', 0.5772287845611572), (u'hotheadedness', 0.5741063356399536), (u'sangfroid', 0.5715578198432922)]

Some of these are close semantically to pugnacity, but others, like “wonkishness,” “eloquence”, and “sangfroid,” are really just the kind of elevated-diction words the kind of person who says “pugnacity” would also say.

In the other direction:

>>> model.most_similar('psyched')

[(u'geeked', 0.7244787216186523), (u'excited', 0.6678282022476196), (u'jazzed', 0.666187584400177), (u'bummed', 0.662735104560852), (u'amped', 0.6473385691642761), (u'pysched', 0.6245862245559692), (u'exicted', 0.6116108894348145), (u'awesome', 0.5838013887405396), (u'enthused', 0.581687331199646), (u'kinda_bummed', 0.5701783299446106)]

“geeked” is a pretty good synonym, but “bummed” is an antonym.  You may also note that contexts where “psyched” is common are also fertile ground for “pysched.”  This leads me to one of my favorite classes of examples:

Misspelling analogies

Which words are closest to “teh”?

>>> model.most_similar('teh')

[(u'ther', 0.6910992860794067), (u'hte', 0.6501408815383911), (u'fo', 0.6458913683891296), (u'tha', 0.6098173260688782), (u'te', 0.6042138934135437), (u'ot', 0.595798909664154), (u'thats', 0.595078706741333), (u'od', 0.5908242464065552), (u'tho', 0.58894944190979), (u'oa', 0.5846965312957764)]

Makes sense:  the contexts where “teh” is common are those contexts where a lot of words are misspelled.

Using the “analogy” gadget, we can ask; which word is to “because” as “teh” is to “the”?

>>> model.most_similar(positive=['because','teh'],negative=['the'])

[(u'becuase', 0.6815075278282166), (u'becasue', 0.6744950413703918), (u'cuz', 0.6165347099304199), (u'becuz', 0.6027254462242126), (u'coz', 0.580410361289978), (u'b_c', 0.5737690925598145), (u'tho', 0.5647958517074585), (u'beacause', 0.5630674362182617), (u'thats', 0.5605655908584595), (u'lol', 0.5597798228263855)]

Or “like”?

>>> model.most_similar(positive=['like','teh'],negative=['the'])

[(u'liek', 0.678846001625061), (u'ok', 0.6136218309402466), (u'hahah', 0.5887773633003235), (u'lke', 0.5840467214584351), (u'probly', 0.5819578170776367), (u'lol', 0.5802655816078186), (u'becuz', 0.5771245956420898), (u'wierd', 0.5759704113006592), (u'dunno', 0.5709049701690674), (u'tho', 0.565370500087738)]

Note that this doesn’t always work:

>>> model.most_similar(positive=['should','teh'],negative=['the'])

[(u'wil', 0.63351970911026), (u'cant', 0.6080706715583801), (u'wont', 0.5967696309089661), (u'dont', 0.5911301970481873), (u'shold', 0.5908039212226868), (u'shoud', 0.5776053667068481), (u'shoudl', 0.5491836071014404), (u"would'nt", 0.5474458932876587), (u'shld', 0.5443994402885437), (u'wouldnt', 0.5413904190063477)]

What are word2vec analogies?

Now let’s come back to the more philosophical question.  Should the effectiveness of word2vec at solving analogy problems make us think that the space of words really has linear structure?

I don’t think so.  Again, I learned something important from the work of Levy and Goldberg.  When word2vec wants to find the word w which is to x as y is to z, it is trying to find w maximizing the dot product

w . (x + y – z)

But this is the same thing as maximizing

w.x + w.y – w.z.

In other words, what word2vec is really doing is saying

“Show me words which are similar to x and y but are dissimilar to z.”

This notion makes sense applied to any notion of similarity, whether or not it has anything to do with an embedding in a vector space.  For example, Levy and Goldberg experiment with maximizing

log(w.x) + log(w.y) – log(w.z)

instead, and get somewhat superior results on the analogy task.  Optimizing this objective has nothing to do with the linear combination x+y-z.
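(To make the two objectives concrete, here they are side by side on a made-up toy vocabulary.  The vectors are invented for illustration and nothing here touches the real embedding; the shift of cosines into (0,1] in the multiplicative version follows Levy and Goldberg's trick for keeping the logs defined.)

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# A toy vocabulary of unit vectors, made up for illustration only.
vocab = {w: unit(np.array(v)) for w, v in {
    'king':   [1.0, 1.0, 0.0],
    'queen':  [1.0, 0.0, 1.0],
    'man':    [0.2, 1.0, 0.0],
    'woman':  [0.2, 0.0, 1.0],
    'throne': [1.0, 0.0, 0.0],
    'girl':   [0.1, 0.0, 1.0],
}.items()}

def cos_add(w, x, y, z):
    """word2vec's default objective: w.x + w.y - w.z (3COSADD)."""
    return w @ x + w @ y - w @ z

def cos_mul(w, x, y, z, eps=1e-3):
    """Levy-Goldberg 3COSMUL, i.e. maximize log(w.x) + log(w.y) - log(w.z),
    with cosines shifted into (0,1] so the logs make sense."""
    s = lambda c: (1 + c) / 2 + eps
    return s(w @ x) * s(w @ y) / s(w @ z)

def analogy(score, x, y, z):
    """Best w for 'x is to z as w is to y', excluding the inputs."""
    return max((w for w in vocab if w not in (x, y, z)),
               key=lambda w: score(vocab[w], vocab[x], vocab[y], vocab[z]))

best_add = analogy(cos_add, 'king', 'woman', 'man')
best_mul = analogy(cos_mul, 'king', 'woman', 'man')
print(best_add, best_mul)   # both pick 'queen' on this toy example
```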

None of which is to deny that the analogy engine in word2vec works well in many interesting cases!  It has no trouble, for instance, figuring out that Baltimore is to Maryland as Milwaukee is to Wisconsin.  More often than not, the Milwaukee of state X correctly returns the largest city in state X.  And sometimes, when it doesn’t, it gives the right answer anyway:  for instance, the Milwaukee of Ohio is Cleveland, a much better answer than Ohio’s largest city (Columbus — you knew that, right?)  The Milwaukee of Virginia, according to word2vec, is Charlottesville, which seems clearly wrong.  But maybe that’s OK — maybe there really isn’t a Milwaukee of Virginia.  One feels Richmond is a better guess than Charlottesville, but it scores notably lower.  (Note:  Word2Vec’s database doesn’t have Virginia_Beach, the largest city in Virginia.  That one I didn’t know.)

Another interesting case:  what is to state X as Gainesville is to Florida?  The answer should be “the location of the, or at least a, flagship state university, which isn’t the capital or even a major city of the state,” when such a city exists.  But this doesn’t seem to be something word2vec is good at finding.  The Gainesville of Virginia is Charlottesville, as it should be.  But the Gainesville of Georgia is Newnan.  Newnan?  Well, it turns out there’s a Newnan, Georgia, and there’s also a Newnan’s Lake in Gainesville, FL; I think that’s what’s driving the response.  That, and the fact that “Athens”, the right answer, is contextually separated from Georgia by the existence of Athens, Greece.

The Gainesville of Tennessee is Cookeville, though Knoxville, the right answer, comes a close second.

Why?  You can check that Knoxville, according to word2vec, is much closer to Gainesville than Cookeville is.

>>> model.similarity('Cookeville','Gainesville')

0.5457580604439547

>>> model.similarity('Knoxville','Gainesville')

0.64010456774402158

But Knoxville is placed much closer to Florida!

>>> model.similarity('Cookeville','Florida')

0.2044376252927515

>>> model.similarity('Knoxville','Florida')

0.36523836770416895

Remember:  what word2vec is really optimizing for here is “words which are close to Gainesville and close to Tennessee, and which are not close to Florida.”  And here you see that phenomenon very clearly.  I don’t think the semantic relationship between ‘Gainesville’ and ‘Florida’ is something word2vec is really capturing.  Similarly:  the Gainesville of Illinois is Edwardsville (though Champaign, Champaign_Urbana, and Urbana are all top 5) and the Gainesville of Indiana is Connersville.  (The top 5 for Indiana are all cities ending in “ville” — is the phonetic similarity playing some role?)

Just for fun, here’s a scatterplot of the 1000 nearest neighbors of ‘Gainesville’, with their similarity to ‘Gainesville’ (x-axis) plotted against their similarity to ‘Tennessee’ (y-axis):

The Pareto frontier consists of “Tennessee” (that’s the one whose similarity to “Tennessee” is 1, obviously), Knoxville, Jacksonville, and Tallahassee.

Bag of contexts

One popular simple linear model of word space is given by representing a word as a “bag of contexts” — perhaps there are several thousand contexts, and each word is given by a sparse vector in the space spanned by contexts:  coefficient 0 if the word is not in that context, 1 if it is.  In that setting, certain kinds of analogies would be linearized and certain kinds would not.  If “major city” is a context, then “Houston” and “Dallas” might have vectors that looked like “Texas” with the coordinate of “major city” flipped from 0 to 1.  Ditto, “Milwaukee” would be “Wisconsin” with the same basis vector added.  So

“Texas” + “Milwaukee” – “Wisconsin”

would be pretty close, in that space, to “Houston” and “Dallas.”

On the other hand, it’s not so easy to see what relations antonyms would have in that space. That’s the kind of relationship the bag of contexts may not make linear.

The word2vec space is only 300-dimensional, and the vectors aren’t sparse at all.  But maybe we should think of it as a random low-dimensional projection of a bag-of-contexts embedding!  By the Johnson-Lindenstrauss lemma, a 300-dimensional projection is plenty big enough to preserve the distances between 3 million points with a small distortion factor; and of course it preserves all linear relationships on the nose.
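(A sketch of that point of view in numpy: a couple of hypothetical sparse bag-of-contexts vectors, pushed through a random 300-dimensional projection.  The contexts and words here are invented.  The projection is linear, so relations like “Milwaukee” – “Wisconsin” survive exactly, and by Johnson-Lindenstrauss the distances survive approximately.)

```python
import numpy as np

rng = np.random.default_rng(1)
n_contexts, d = 5000, 300

def bag(idxs):
    """Sparse bag-of-contexts vector: coefficient 1 in each listed slot."""
    v = np.zeros(n_contexts)
    v[list(idxs)] = 1.0
    return v

# Hypothetical contexts: the two words share contexts 100-149, and
# 'Milwaukee' additionally sits in context 0 ('major city').
wisconsin = bag(range(100, 150))
milwaukee = bag([0, *range(100, 150)])

P = rng.standard_normal((d, n_contexts)) / np.sqrt(d)   # random projection
a, b = P @ wisconsin, P @ milwaukee

# Linear relations are preserved on the nose...
assert np.allclose(b - a, P @ (milwaukee - wisconsin))
# ...and distances approximately: the original distance is exactly 1.
proj = np.linalg.norm(b - a)
print(proj)   # close to 1
```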

Perhaps this point of view gives some insight into which kind of word relationships manifest as linear relationships in word2vec.  “flock:birds” is an interesting example.  If you imagine “group of things” as a context, you can maybe imagine word2vec picking this up.  But actually, it doesn’t do well:

>>> model.most_similar(positive=['fish','flock'],negative=['birds'])
[(u'crays', 0.4601619839668274), (u'threadfin_salmon', 0.4553075134754181), (u'spear_fishers', 0.44864755868911743), (u'slab_crappies', 0.4483765661716461), (u'flocked', 0.44473177194595337), (u'Siltcoos_Lake', 0.4429660737514496), (u'flounder', 0.4414420425891876), (u'catfish', 0.4413948059082031), (u'yellowtail_snappers', 0.4410281181335449), (u'sockeyes', 0.4395104944705963)]

>>> model.most_similar(positive=['dogs','flock'],negative=['birds'])
[(u'dog', 0.5390862226486206), (u'pooches', 0.5000904202461243), (u'Eminem_Darth_Vader', 0.48777419328689575), (u'Labrador_Retrievers', 0.4792211949825287), (u'canines', 0.4766522943973541), (u'barked_incessantly', 0.4709487557411194), (u'Rottweilers_pit_bulls', 0.4708423614501953), (u'labradoodles', 0.47032350301742554), (u'rottweilers', 0.46935927867889404), (u'forbidding_trespassers', 0.4649636149406433)]

The answers “school” and “pack” don’t appear here.  Part of this, of course, is that “flock,” “school”, and “pack” all have interfering alternate meanings.  But part of it is that the analogy really rests on information about contexts in which the words “flock” and “birds” both appear.  In particular, in a short text window featuring both words, you are going to see a huge spike of “of” appearing right after flock and right before birds.  A statement of the form “flock is to birds as X is to Y” can’t be true unless “X of Y” actually shows up in the corpus a lot.

Challenge problem:  Can you make word2vec do a good job with relations like “flock:birds”?  As I said above, I wouldn’t have been shocked if this had actually worked out of the box, so maybe there’s some minor tweak that makes it work.

Boys’ names, girls’ names

Back to gender-flipping.  What’s the “male version” of the name “Jennifer”?

There are various ways one can do this.  If you use the analogy engine from word2vec, finding the closest word to “Jennifer” + “he” – “she”, you get as your top 5:

David, Jason, Brian, Kevin, Chris

>>> model.most_similar(positive=['Jennifer','he'],negative=['she'])
[(u'David', 0.6693146228790283), (u'Jason', 0.6635637283325195), (u'Brian', 0.6586753129959106), (u'Kevin', 0.6520106792449951), (u'Chris', 0.6505492925643921), (u'Mark', 0.6491551995277405), (u'Matt', 0.6386727094650269), (u'Daniel', 0.6294828057289124), (u'Greg', 0.6267883777618408), (u'Jeff', 0.6265031099319458)]

But there’s another way:  you can look at the words closest to “Jennifer” (which are essentially all first names) and pick out the ones which are closer to “he” than to “she”.  This gives

Matthew, Jeffrey, Jason, Jesse, Joshua.

>>> [x[0] for x in model.most_similar('Jennifer',topn=2000) if model.similarity(x[0],'he') > model.similarity(x[0],'she')]
[u'Matthew', u'Jeffrey', u'Jason', u'Jesse', u'Joshua', u'Evan', u'Brian', u'Cory', u'Justin', u'Shawn', u'Darrin', u'David', u'Chris', u'Kevin', u'3/dh', u'Christopher', u'Corey', u'Derek', u'Alex', u'Matt', u'Jeremy', u'Jeff', u'Greg', u'Timothy', u'Eric', u'Daniel', u'Wyvonne', u'Joel', u'Chirstopher', u'Mark', u'Jonathon']

Which is a better list of “male analogues of Jennifer?”  Matthew is certainly closer to Jennifer in word2vec distance:

>>> model.similarity('Jennifer','Matthew')

0.61308109388608356

>>> model.similarity('Jennifer','David')

0.56257556538528708

But, for whatever reason, “David” is coded as much more strongly male than “Matthew” is; that is, it is closer to “he” – “she”.  (The same is true for “man” – “woman”.)  So “Matthew” doesn’t score high in the first list, which rates names by a combination of how male-context they are and how Jennifery they are.  A quick visit to NameVoyager shows that Matthew and Jennifer both peaked sharply in the 1970s; David, on the other hand, has a much longer range of popularity and was biggest in the 1950s.

Let’s do it again, for Susan.  The two methods give

David, Robert, Mark, Richard, John

Robert, Jeffrey, Richard, David, Kenneth

And for Edith:

Ernest, Edwin, Alfred, Arthur, Bert

Ernest, Harold, Alfred, Bert, Arthur

Pretty good agreement!  And you can see that, in each case, the selected names are “cultural matches” to the starting name.

Sidenote:  In a way it would be more natural to project wordspace down to the orthocomplement of “he” – “she” and find the nearest neighbor to “Susan” after that projection; that’s like, which word is closest to “Susan” if you ignore the contribution of the “he” – “she” direction.  This is the operation Ben Schmidt calls “vector rejection” in his excellent post about his word2vec model trained on student evaluations.

If you do that, you get “Deborah.”  In other words, those two names are similar in so many contextual ways that they remain nearest neighbors even after we “remove the contribution of gender.”  A better way to say it is that the orthogonal projection doesn’t really remove the contribution of gender in toto.  It would be interesting to understand what kind of linear projections actually make it hard to distinguish male first names from female ones.
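(Vector rejection itself is one line of numpy.  Here is a toy version with made-up three-dimensional stand-ins for the “he” and “she” vectors, just to show the operation.)

```python
import numpy as np

def reject(v, g):
    """Remove from v its component along g, i.e. project v onto the
    orthogonal complement of g (Schmidt's 'vector rejection')."""
    g = g / np.linalg.norm(g)
    return v - (v @ g) * g

# Made-up low-dimensional stand-ins, for illustration only.
he = np.array([1.0, 0.2, 0.0])
she = np.array([0.0, 0.2, 1.0])
g = he - she                      # the 'gender direction'

susan = np.array([0.3, 0.9, 0.8])
r = reject(susan, g)
print(r @ g)   # ~0: nothing left along the he - she direction
```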

Google News is a big enough database that this works on non-English names, too.  The male “Sylvie”, depending on which protocol you pick, is

Alain, Philippe, Serge, Andre, Jean-Francois

or

Jean-Francois, Francois, Stephane, Alain, Andre

The male “Kyoko” is

Kenji, Tomohiko, Nobuhiro, Kazuo, Hiroshi

or

Satoshi, Takayuki, Yosuke, Michio, Noboru

French and Japanese speakers are encouraged to weigh in about which list is better!

Update:  Even a little more messing around with “changing the gender of words” in a followup post.