## Learning a made-up language, II

Following up on yesterday’s post: one could also ask whether GPT-3 can compose in a made-up human language. For instance, you could take English and modify it a bit:

Henderson’s native language is Gravinic, and when he speaks English, it sounds like this:

“I to the store did go and some cabbage did buy. Now I some dinner am making and I some cake will later eat. I you am not very much liking. I for you to go away am politely asking.”

In other words, the new rules are that verbs go to the end of the sentence, all verbs take an auxiliary verb indicating tense, adverbs go between auxiliary verb and main verb. Is this short text enough for GPT-3 to locate context in a language that isn’t one of the languages in its training set but rather “English with modifications inferred from prompt?”

Not yet:

Write more dialogue where Henderson tells you about his day.

GPT-3: “I then went to the park and did some reading. The sun was shining and it made me feel so relaxed. I saw some birds flying about, and it was quite a sight. After that, I went to the beach and enjoyed the sound of the waves. I even went for a swim! Finally, I returned home and watched some television before going to bed.”

It’s totally possible that more prompting would make this work, though – I didn’t have the patience to manually type in a bunch more Gravinic-inflected English. I’m not actually sure what kind of context GPT-3 is placing itself in with this response.

Tagged , , , ,

LLMs seem to be really good at swift, reliable context location — e.g. it takes only a little bit of Python or romance novel or instruction manual for the LLM to recognize a context which requires specific choices of syntax and diction. I wondered though whether GPT-3 was recognizing the context of “Python” or a more general context of “programming language” which is further fine-tuned to a subcontext of Python. It seems like it’s the latter! Here was my way of testing. I gave it the prompt

The following is code in a programming language called Ranger that computes the first n triangular numbers.

ask('Please enter a number: '),n)

#initialize the triangular number

triangular_number : 0

as i ranges to n+1
triangular_number : triangular_number + i
write i then triangular_number


Ranger doesn’t exist: I just made something up where for loops are written “as i ranges to” and assignment of variables is done with a : and you don’t enclose the argument of a print (or “write”) statement in parentheses. To be fair, it’s basically python with some dictionary changes. Still, I thought I would see if GPT-3 can write code in Ranger even though there’s no Ranger code in its training set and so no actual “Ranger context” to locate. And it can!

Write a program in Ranger to compute the 10th prime number.

#initialize the prime number
prime_number : 1

#initialize the count

count : 0

#iterate from 2 to infinite

as i ranges to infinite
is_prime : true

as j ranges to i-1
if i % j is 0
is_prime : false
break

if is_prime is true
count : count + 1

if count is 10
prime_number : i
break

write prime_number


So you can see that GPT-3 correctly infers that the right context is “C/Python-like code” and correctly recognizes from the prompt that : means assignment, “as … ranges” is how you do a for loop, etc. It’s interesting that it’s way of testing primality is to check divisors of i all the way up to i; surely there isn’t a lot of existing code in the training set that does it in this massively inefficient way!

To be fair, this is an easy problem, and if you give it a somewhat harder one, it does this:

Write a program in Ranger to count the number of binary strings of length 10 which have no occurrence of the substring ‘0101’.

# Initialize the number of binary strings
num_binary_strings : 0

# Iterate over all binary strings of length 10
as binary_string ranges to 2^10
# Check if the binary string contains the substring '0101'
if binary_string does not contain '0101'
num_binary_strings : num_binary_strings + 1

# Print the result
write 'Number of binary strings of length 10 which have no occurence of the substring "0101": ', num_binary_strings


I guess this is sort of pseudocode? It doesn’t really know how to iterate over binary strings but knows there are 2^10 of them so it just “ranges” to that. Nor does it know how to check string containment in Ranger (how could it?) so it switches to English. Not a bad answer, really!

It would be interesting to try something like this where the invented language is a little more different from existing languages than “Python with some 1-for-1 word and symbol changes.”

Tagged , , , , ,

## GPT-3 me, by the book

Not gonna lie, it has always been a dream of mine, a marker of being a Real Writer, to do a New York Times Book Review “By the Book” interview, and this summer I did it. Really fun — I got to praise a lot of writers I truly admire (including Janet Malcolm, who I sort of hoped would read it, and who died the day before the interview came out.) and hopefully helped generate some publicity for Shape.

I also recently got beta access to Open AI’s “GPT-3” text-generation engine. So I thought it would be fun to give it part of my interview and see what happened if I asked it to continue.

Here are some results. What’s bold is text I supplied to GPT-3 (along with the whole text of the interview previous to that line.) This uses the davinci engine, with temperature set to 0.7.

What kind of reader were you as a child? Which childhood books and authors stick with you most?

I read “The Little House on the Prairie” series as it came out, and I think I really liked them. I reread “The Hobbit” and “The Lord of the Rings” many times. I loved “The Hobbit” so much I would reread the whole thing and just skip the long chapters. The first adult book I read was “The Heart Is a Lonely Hunter” by Carson McCullers, when I was 10 or 11, because I loved the title so much. It was a little hard to get into because it was so slow, but I was so happy to be reading a book that was all about people being sad, because I was so sad!

“The Go-Between,” by L. P. Hartley. I know it’s a cliché, but it’s so perfect.

If you could require the president to read one book, what would it be?

Note that GPT-3 has enough text to recognize this as a “By The Book” interview, because the two questions it generates are indeed on the standard list of questions that the NYTBR asks authors. “Temperature” is a measure of how constrained GPT-3 feels to pick the word it seems most likely at each stage. One cool thing about the interface is it gives you the probability distribution it drew from for each word. THe word “Heart” above, for instance, is one of the lowest-probability words in the above response; GPT-3 rated it the 11th most likely next word, with a probability of 0.8%. Top three were “Cat” (4.0%), “Aut” (1.7%), and “Fountain” (1.7%). Wait, “Aut”? Yeah — GPT-3 uses “tokens,” not words, and sometimes these are parts of words, or even just initial letters. I imagine “Aut” could have been the beginning of “Autobiography” or “Author.” Presumably running the algorithm at a lower temperature would have made it much more likely I’d have read something starting with “The Cat.” Having settled on “Heart,” the following word “is” came in with an 87% probability, far above the next most likely candidate, “of.”

In fact, it’s kind of cool how the OpenAI interface displays these probabilities, so I’m gonna screenshot and include:

Red means unlikely, green means likely — so you can see very clearly the red “Heart” and the green “Is.” Note that the standard questions from “By the Book” are almost all green once you get past the first word; you’re on a track the algorithm’s very confident about. “The Go-Between,” by L.P. Hartley is a real book, by the way, though not one I’d ever heard of, and not one I think my earlier responses could fairly be said to make you think I’d like. The answer itself, though, while it’s not my style, is very much “By the Book” style.

Here’s what you get at temperature 0:

What kind of reader were you as a child? Which childhood books and authors stick with you most?

I was a voracious reader, but I don’t remember any particular books. I remember the feeling of being in a room with a book. I remember the feeling of being in a room with a book and a window. I remember the feeling of being in a room with a book and a window and a tree outside the window. I remember the feeling of being in a room with a book and a window and a tree outside the window and a friend in the room. I remember the feeling of being in a room with a book and a window and a tree outside the window and a friend in the room and a dog in the room. I remember the feeling of being in a room with a book and a window and a tree outside the window and a friend in the room and a dog in the room and a cat in the room. I remember the feeling of being in a room with a book and a window and a tree outside the window and a friend in the room and a dog in the room and a cat in the room and a hamster in the room. I remember the feeling of being in a room with a book and a window and a tree outside the window and a friend in the room and a dog in the room and a cat in the room and a hamster in the room and a goldfish in the room. I remember the feeling of being in a room with a book and a window and a tree outside the window and a friend in the room and a dog in the room and a cat in the room and a hamster in the room and a goldfish in the room and a bird in the room. I remember the feeling of being in a room with a book and a window and a tree outside the window and a friend in the room and a dog in

This is pretty typical. If you pick the most likely “next word” every time, you get stuck repeating yourself, getting more and more certain that staying in the loop is the right thing to do because after all it’s what you’ve been doing! The only even moderately low-probability words here occur when a new animal is introduced to the room.

What about temperature 1, giving GPT-3 maximal license to draw less-likely next tokens? (Maybe temp 1 means literally draw from the imputed distribution? I’m not sure.)

What kind of reader were you as a child? Which childhood books and authors stick with you most?

Neil Postman recommended “Wind in the Willows,” which my mother read to me and I read at my mother’s knee, many many times, more times than I can count. I come back to it every year or so, and it’s still a great book. I also was nuts about “Blondie”; the single joke in the joke book never loses its appeal for me. I loved “Black Beauty” and several other Victorian kids’ classics with similar emotional power; I haven’t read them again yet because I never have any time. I can’t forget how much I loved it the first time I read it.

Anything you’d recommend to someone new to your books?

“Lovecraft Country,” without a doubt. Also my essay in the Yale series, “Mao II and Then Some.” I also think my Song of Ice and Fire “house style” recaps (here, here, here and piecemeal here); they’re the most comfortable thing I’ve written. Even my husband likes them, and he doesn’t care about the show or the books. And the recent article on swaddling.

Edited to make explicit that in-kind donations by the author also serve as p.r. I should have included them, which I wish I had done.

Well, this is the most fun at any rate. I love the idea that Neil Postman was recommending books to childhood me, and to my mom, I guess. This one actually feels to me the most like my own prose style, even as the things it says don’t quite make sense? Maybe there’s a tradeoff between style and sense. At any rate, there is no actual essay called “Mao II and Then Some,” but that’s 100% a title I would use, and DeLillo really is a writer I admire and didn’t mention in the piece. “Anything you’d recommend to someone new to your books?” is not one of NYTBR’s standard questions for this feature, but doesn’t it sound like one? Oh, here’s the color chart so you can see how much redder and more daring this output is.

Impressive work — and let me emphasize that I just did those three runs and that’s what I showed you, no cherry-picking of the best output. Not something that makes me feel easily impersonable, of course. But I didn’t give it that much of my writing….!

## Suriya Gunasekar, optimization geometry, loss minimization as dynamical system

Awesome SILO seminar this week by Suriya Gunasekar of TTI Chicago.  Here’s the idea, as I understand it.  In a classical optimization problem, like linear regression, you are trying to solve a problem which typically has no solution (draw a line that passes through every point in this cloud!) and the challenge is to find the best approximate solution.  Algebraically speaking:  you might be asked to solve

$Ax = b$

for x; but since x may not be in the image of the linear transformation A, you settle for minimizing

$||Ax-b||$

in whatever norm you like (L^2 for standard linear regression.)

In many modern optimization problems, on the other hand, the problem you’re trying to solve may have a lot more degrees of freedom.  Maybe you’re setting up an RNN with lots and lots and lots of parameters.  Or maybe, to bring this down to earth, you’re trying to pass a curve through lots of points but the curve is allowed to have very high degree.  This has the advantage that you can definitely find a curve that passes through all the points.  But it also has the disadvantage that you can definitely find a curve that passes through all the points.  You are likely to overfit!  Your wildly wiggly curve, engineered to exactly fit the data you trained on, is unlikely to generalize well to future data.

Everybody knows about this problem, everybody knows to worry about it.  But here’s the thing.  A lot of modern problems are of this form, and yet the optima we find on training data often do generalize pretty well to test data!  Why?

Make this more formal.  Let’s say for the sake of argument you’re trying to learn a real-valued function F, which you hypothesize is drawn from some giant space X.  (Not necessarily a vector space, just any old space.)  You have N training pairs (x_i, y_i), and a good choice for F might be one such that F(x_i) = y_i.  So you might try to find F such that

$F(x_i) = y_i$

for all i.  But if X is big enough, there will be a whole space of functions F which do the trick!  The solution set to

$F(\mathbf{x}) = \mathbf{y}$

will be some big subspace F_{x,y} of X.  How do you know which of these F’s to pick?

One popular way is to regularize; you decide that some elements of X are just better than others, and choose the point of F_{x,y} that optimizes that objective.  For instance, if you’re curve-fitting, you might try to find, among those curves passing through your N points, the least wiggly one (e.g. the one with the least total curvature.)  Or you might optimize for some combination of hitting the points and non-wiggliness, arriving at a compromise curve that wiggles only mildly and still passes near most of the points.  (The ultimate version of this strategy would be to retreat all the way back to linear regression.)

But it’s not obvious what regularization objective to choose, and maybe trying to optimize that objective is yet another hard computational problem, and so on and so on.  What’s really surprising is that something much simpler often works pretty well.  Namely:  how would you find F such that F(x) = y in the first place?  You would choose some random F in X, then do some version of gradient descent.  Find the direction in the tangent space to X at F that decreases $||F(\mathbf{x})-\mathbf{y}||$ most steeply, perturb F a bit in that direction, lather, rinse, repeat.

If this process converges, it ought to get you somewhere on the solution space F_{x,y}. But where?  And this is really what Gunasekar’s work is about.  Even if your starting F is distributed broadly, the distribution of the spot where gradient descent “lands” on F_{x,y} can be much more sharply focused.  In some cases, it’s concentrated on a single point!  The “likely targets of gradient descent” seem to generalize better to test data, and in some cases Gunasekar et al can prove gradient descent likes to find the points on F_{x,y} which optimize some regularizer.

I was really struck by this outlook.  I have tended to think of function learning as a problem of optimization; how can you effectively minimize the training loss ||F(x)  – y||?  But Gunasekar asks us instead to think about the much richer mathematical structure of the dynamical system of gradient descent on X guided by the loss function.  (Or I should say dynamical systems; gradient descent comes in many flavors.)

The dynamical system has a lot more stuff in it!  Think about iterating a function; knowing the fixed points is one thing, but knowing which fixed points are stable and which aren’t, and knowing which stable points have big basins of attraction, tells you way more.

What’s more, the dynamical system formulation is much more natural for learning problems as they are so often encountered in life, with streaming rather than static training data.  If you are constantly observing more pairs (x_i,y_i), you don’t want to have to start over every second and optimize a new loss function!  But if you take the primary object of study to be, not the loss function, but the dynamical system on the hypothesis space X, new data is no problem; your gradient is just a longer and longer sum with each timestep (or you exponentially deweight the older data, whatever you want my friend, the world is yours.)

Anyway.  Loved this talk.  Maybe this dynamical framework is the way other people are already accustomed to think of it but it was news to me.

Slides for a talk of Gunasekar’s similar to the one she gave here

“Characterizing Implicit Bias in terms of Optimization Geometry” (2018)

“Convergence of Gradient Descent on Separable Data” (2018)

A little googling for gradient descent and dynamical systems shows me that, unsurprisingly, Ben Recht is on this train.

## Peter Norvig, the meaning of polynomials, debugging as psychotherapy

I saw Peter Norvig give a great general-audience talk on AI at Berkeley when I was there last month.  A few notes from his talk.

• “We have always prioritized fast and cheap over safety and privacy — maybe this time we can make better choices.”
• He briefly showed a demo where, given values of a polynomial, a machine can put together a few lines of code that successfully computes the polynomial.  But the code looks weird to a human eye.  To compute some quadratic, it nests for-loops and adds things up in a funny way that ends up giving the right output.  So has it really ”learned” the polynomial?  I think in computer science, you typically feel you’ve learned a function if you can accurately predict its value on a given input.  For an algebraist like me, a function determines but isn’t determined by the values it takes; to me, there’s something about that quadratic polynomial the machine has failed to grasp.  I don’t think there’s a right or wrong answer here, just a cultural difference to be aware of.  Relevant:  Norvig’s description of “the two cultures” at the end of this long post on natural language processing (which is interesting all the way through!)
• Norvig made the point that traditional computer programs are very modular, leading to a highly successful debugging tradition of zeroing in on the precise part of the program that is doing something wrong, then fixing that part.  An algorithm or process developed by a machine, by contrast, may not have legible “parts”!  If a neural net is screwing up when classifying something, there’s no meaningful way to say “this neuron is the problem, let’s fix it.”  We’re dealing with highly non-modular complex systems which have evolved into a suboptimally functioning state, and you have to find a way to improve function which doesn’t involve taking the thing apart and replacing the broken component.  Of course, we already have a large professional community that works on exactly this problem.  They’re called therapists.  And I wonder whether the future of debugging will look a lot more like clinical psychology than it does like contemporary software engineering.
Tagged , , ,

## Mathematicians becoming data scientists: Should you? How to?

I was talking the other day with a former student at UW, Sarah Rich, who’s done degrees in both math and CS and then went off to Twitter.  I asked her:  so what would you say to a math Ph.D. student who was wondering whether they would like being a data scientist in the tech industry?  How would you know whether you might find that kind of work enjoyable?  And if you did decide to pursue it, what’s the strategy for making yourself a good job candidate?

Sarah exceeded my expectations by miles and wrote the following extremely informative and thorough tip sheet, which she’s given me permission to share.  Take it away, Sarah!

Tagged , , ,

## LMFDB!

Very happy to see that the L-functions and Modular Forms Database is now live!

When I was a kid we looked up our elliptic curves in Cremona’s tables, on paper.  Then William Stein created the Modular Forms Database (you can still go there but it doesn’t really work) and suddenly you could look at the q-expansions of cusp forms in whatever weight and level you wanted, up to the limits of what William had computed.

The LMFDB is a sort of massively souped up version of Cremona and Stein, put together by a team of dozens and dozens of number theorists, including too many friends of mine to name individually.  And it’s a lot more than what the title suggests:  the incredibly useful Jones-Roberts database of local fields is built in; there’s a database of genus 2 curves over Q with small conductor; it even has Maass forms!  I’ve been playing with it all night.  It’s like an adventure playground for number theorists.

One of my first trips through Stein’s database came when I was a postdoc and was thinking about Ljunggren’s equation y^2 + 1 = 2x^4.  This equation has a large solution, (13,239), which has to do with the classical identity

$\pi/4 = 4\arctan(1/5) - \arctan(1/239)$.

It turns out, as I explain in an old survey paper, that the existence of such a large solution is “explained” by the presence of a certain weight-2 cuspform in level 1024 whose mod-5 Galois representation is reducible.

With the LMFDB, you can easily wander around looking for more such examples!  For instance, you can very easily ask the database for non-CM elliptic curves whose mod-7 Galois representation is nonsurjective.  Among those, you can find this handsome curve of conductor 1296, which has very large height relative to its conductor.  Applying the usual Frey curve trick you can turn this curve into the Diophantine oddity

3*48383^2 – (1915)^3 = 2^13.

Huh — I wonder whether people ever thought about this Diophantine problem, when can the difference between a cube and three times a square be a power of 2?  Of course they did!  I just Googled

48383 Diophantine

and found this paper of Stanley Rabinowitz from 1978, which finds all solutions to that problem, of which this one is the largest.

Now whether you can massage this into an arctan identity, that I don’t know!

## AI challop

When I was a kid people thought it would be a long time before computers could adequately translate natural language text, or play Go against a human being, because you’d need some kind of AI to do those things, and AI seemed really hard.

Now we know that you can get pretty decent translation and Go without anything like AI.  But AI still seems really hard.

Tagged , , ,

## Gendercycle: a dynamical system on words

By the way, here’s another fun word2vec trick.  Following Ben Schmidt, you can try to find “gender-neutralized synonyms” — words which are close to each other except for the fact that they have different gender connotations.   A quick and dirty way to “mascify” a word is to find its nearest neighbor which is closer to “he” than “she”:

def mascify(y): return [x[0] for x in model.most_similar(y,topn=200) if model.similarity(x[0],’she’) < model.similarity(x[0],’he’)][0]

“femify” is defined similarly.  We could put a threshold away from 0 in there, if we wanted to restrict to more strongly gender-coded words.

Anyway, if you start with a word and run mascify and femify alternately, you can ask whether you eventually wind up in a 2-cycle:  a pair of words which are each others gender counterparts in this loose sense.

e.g.

gentle -> easygoing -> chatty -> talkative -> chatty -> …..

So “chatty” and “talkative” are a pair, with “chatty” female-coded and “talkative” male-coded.

beautiful -> magnificent -> wonderful -> marvelous -> wonderful -> …

So far, I keep hitting 2-cycles, and pretty quickly, though I don’t see why a longer cycle wouldn’t be possible or likely.  Update:  Kevin in comments explains very nicely why it has to terminate in a 2-cycle!

Some other pairs, female-coded word first:

overjoyed / elated

strident / vehement

fearful / worried

furious / livid

distraught / despondent

hilarious / funny

exquisite / sumptuous

thought_provoking / insightful

Sometimes it’s basically giving the same word, in two different forms or with one word misspelled:

intuitive / intuitively

buoyant / bouyant

You can do this for names, too, though you have to make the “topn” a little longer to find matches.  I found:

Jamie / Chris

Deborah / Jeffrey

Fran / Pat

Mary / Joseph

Pretty good pairs!  Note that you hit a lot of gender-mixed names (Jamie, Chris, Pat), just as you might expect:  the male-biased name word2vec-closest to a female name is likely to be a name at least some women have!  You can deal with this by putting in a threshold:

>> def mascify(y): return [x[0] for x in model.most_similar(y,topn=1000) if model.similarity(x[0],’she’) < model.similarity(x[0],’he’) – 0.1][0]

This eliminates “Jamie” and “Pat” (though “Chris” still reads as male.)

Now we get some new pairs:

Jody / Steve (this one seems to have a big basis of attraction, it shows up from a lot of initial conditions)

Kasey / Zach

Peter / Catherine (is this a Russia thing?)

Nicola / Dominic

Alison / Ian

## Messing around with word2vec

Word2vec is a way of representing words and phrases as vectors in medium-dimensional space developed by Tomas Mikolov and his team at Google; you can train it on any corpus you like (see Ben Schmidt’s blog for some great examples) but the version of the embedding you can download was trained on about 100 billion words of Google News, and encodes words as unit vectors in 300-dimensional space.

What really got people’s attention, when this came out, was word2vec’s ability to linearize analogies.  For example:  if x is the vector representing “king,” and y the vector representing “woman,” and z the vector representing “man,” then consider

x + y – z

which you might think of, in semantic space, as being the concept “king” to which “woman” has been added and “man” subtracted — in other words, “king made more female.”  What word lies closest in direction to x+y-z?  Just as you might hope, the answer is “queen.”

I found this really startling.  Does it mean that there’s some hidden linear structure in the space of words?

It turns out it’s not quite that simple.  I played around with word2vec a bunch, using Radim Řehůřek’s gensim package that nicely pulls everything into python; here’s what I learned about what the embedding is and isn’t telling you.

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts — that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’  (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.)  If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity(‘tremendous’,’enormous’)

0.74432902555062841

The notion of similarity used here is just cosine distance (which is to say, dot product of vectors.)  It’s positive when the words are close to each other, negative when the words are far.  For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity(‘tremendous’,’negligible’)

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words.  ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.

You might ask:  well, what words in Word2Vec are farthest from “tremendous?”  You just get trash:

>>> model.most_similar(negative=[‘tremendous’])

[(u’By_DENISE_DICK’, 0.2792186141014099), (u’NAVARRE_CORPORATION’, 0.26894450187683105), (u’By_SEAN_BARRON’, 0.26745346188545227), (u’LEGAL_NOTICES’, 0.25829464197158813), (u’Ky.Busch_##-###’, 0.2564955949783325), (u’desultorily’, 0.2563159763813019), (u’M.Kenseth_###-###’, 0.2562236189842224), (u’J.McMurray_###-###’, 0.25608277320861816), (u’D.Earnhardt_Jr._###-###’, 0.2547803819179535), (u’david.brett_@_thomsonreuters.com’, 0.2520599961280823)]

If 3 million words were distributed randomly in the unit ball in R^300, you’d expect the farthest one from “tremendous” to have dot product about -0.3 from it.  So when you see a list whose largest score is around that size, you should think “there’s no structure there, this is just noise.”

Antonyms

Challenge problem:  Is there a way to accurately generate antonyms using the word2vec embedding?  That seems to me the sort of thing the embedding is not capturing.  Kyle McDonald had a nice go at this, but I think the lesson of his experiment is that asking word2vec to find analogies of the form “word is to antonym as happy is to?” is just going to generate a list of neighbors of “happy.”  McDonald’s results also cast some light on the structure of word2vec analogies:  for instance, he finds that

waste is to economise as happy is to chuffed

First of all, “chuffed” is a synonym of happy, not an antonym.  But more importantly:  The reason “chuffed” is there is because it’s a way that British people say “happy,” just as “economise” is a way British people spell “economize.”  Change the spelling and you get

waste is to economize as happy is to glad

Non-semantic properties of words matter to word2vec.  They matter a lot.  Which brings us to diction.

Word2Vec distance keeps track of diction

Lots of non-semantic stuff is going on in natural language.  Like diction, which can be high or low, formal or informal, flowery or concrete.    Look at the nearest neighbors of “pugnacity”:

>>> model.most_similar(‘pugnacity’)

[(u’pugnaciousness’, 0.6015268564224243), (u’wonkishness’, 0.6014434099197388), (u’pugnacious’, 0.5877301692962646), (u’eloquence’, 0.5875781774520874), (u’sang_froid’, 0.5873805284500122), (u’truculence’, 0.5838015079498291), (u’pithiness’, 0.5773230195045471), (u’irascibility’, 0.5772287845611572), (u’hotheadedness’, 0.5741063356399536), (u’sangfroid’, 0.5715578198432922)]

Some of these are close semantically to pugnacity, but others, like “wonkishness,” “eloquence”, and “sangfroid,” are really just the kind of elevated-diction words the kind of person who says “pugnacity” would also say.

In the other direction:

>>> model.most_similar(‘psyched’)

[(u’geeked’, 0.7244787216186523), (u’excited’, 0.6678282022476196), (u’jazzed’, 0.666187584400177), (u’bummed’, 0.662735104560852), (u’amped’, 0.6473385691642761), (u’pysched’, 0.6245862245559692), (u’exicted’, 0.6116108894348145), (u’awesome’, 0.5838013887405396), (u’enthused’, 0.581687331199646), (u’kinda_bummed’, 0.5701783299446106)]

“geeked” is a pretty good synonym, but “bummed” is an antonym.  You may also note that contexts where “psyched” is common are also fertile ground for “pysched.”  This leads me to one of my favorite classes of examples:

Misspelling analogies

Which words are closest to “teh”?

>>> model.most_similar(‘teh’)

[(u’ther’, 0.6910992860794067), (u’hte’, 0.6501408815383911), (u’fo’, 0.6458913683891296), (u’tha’, 0.6098173260688782), (u’te’, 0.6042138934135437), (u’ot’, 0.595798909664154), (u’thats’, 0.595078706741333), (u’od’, 0.5908242464065552), (u’tho’, 0.58894944190979), (u’oa’, 0.5846965312957764)]

Makes sense:  the contexts where “teh” is common are those contexts where a lot of words are misspelled.

Using the “analogy” gadget, we can ask; which word is to “because” as “teh” is to “the”?

>>> model.most_similar(positive=[‘because’,’teh’],negative=[‘the’])

[(u’becuase’, 0.6815075278282166), (u’becasue’, 0.6744950413703918), (u’cuz’, 0.6165347099304199), (u’becuz’, 0.6027254462242126), (u’coz’, 0.580410361289978), (u’b_c’, 0.5737690925598145), (u’tho’, 0.5647958517074585), (u’beacause’, 0.5630674362182617), (u’thats’, 0.5605655908584595), (u’lol’, 0.5597798228263855)]

Or “like”?

>>> model.most_similar(positive=[‘like’,’teh’],negative=[‘the’])

[(u’liek’, 0.678846001625061), (u’ok’, 0.6136218309402466), (u’hahah’, 0.5887773633003235), (u’lke’, 0.5840467214584351), (u’probly’, 0.5819578170776367), (u’lol’, 0.5802655816078186), (u’becuz’, 0.5771245956420898), (u’wierd’, 0.5759704113006592), (u’dunno’, 0.5709049701690674), (u’tho’, 0.565370500087738)]

Note that this doesn’t always work:

>>> model.most_similar(positive=[‘should’,’teh’],negative=[‘the’])

[(u’wil’, 0.63351970911026), (u’cant’, 0.6080706715583801), (u’wont’, 0.5967696309089661), (u’dont’, 0.5911301970481873), (u’shold’, 0.5908039212226868), (u’shoud’, 0.5776053667068481), (u’shoudl’, 0.5491836071014404), (u”would’nt”, 0.5474458932876587), (u’shld’, 0.5443994402885437), (u’wouldnt’, 0.5413904190063477)]

What are word2vec analogies?

Now let’s come back to the more philosophical question.  Should the effectiveness of word2vec at solving analogy problems make us think that the space of words really has linear structure?

I don’t think so.  Again, I learned something important from the work of Levy and Goldberg.  When word2vec wants to find the word w which is to x as y is to z, it is trying to find w maximizing the dot product

w . (x + y – z)

But this is the same thing as maximizing

w.x + w.y – w.z.

In other words, what word2vec is really doing is saying

“Show me words which are similar to x and y but are dissimilar to z.”

This notion makes sense applied any notion of similarity, whether or not it has anything to do with embedding in a vector space.  For example, Levy and Goldberg experiment with minimizing

log(w.x) + log(w.y) – log(w.z)

instead, and get somewhat superior results on the analogy task.  Optimizing this objective has nothing to do with the linear combination x+y-z.

None of which is to deny that the analogy engine in word2vec works well in many interesting cases!  It has no trouble, for instance, figuring out that Baltimore is to Maryland as Milwaukee is to Wisconsin.  More often than not, the Milwaukee of state X correctly returns the largest city in state X.  And sometimes, when it doesn’t, it gives the right answer anyway:  for instance, the Milwaukee of Ohio is Cleveland, a much better answer than Ohio’s largest city (Columbus — you knew that, right?)  The Milwaukee of Virginia, according to word2vec, is Charlottesville, which seems clearly wrong.  But maybe that’s OK — maybe there really isn’t a Milwaukee of Virginia.  One feels Richmond is a better guess than Charlottesville, but it scores notably lower.  (Note:  Word2Vec’s database doesn’t have Virginia_Beach, the largest city in Virginia.  That one I didn’t know.)

Another interesting case:  what is to state X as Gainesville is to Florida?  The answer should be “the location of the, or at least a, flagship state university, which isn’t the capital or even a major city of the state,” when such a city exists.  But this doesn’t seem to be something word2vec is good at finding.  The Gainesville of Virginia is Charlottesville, as it should be.  But the Gainesville of Georgia is Newnan.  Newnan?  Well, it turns out there’s a Newnan, Georgia, and there’s also a Newnan’s Lake in Gainesville, FL; I think that’s what’s driving the response.  That, and the fact that “Athens”, the right answer, is contextually separated from Georgia by the existence of Athens, Greece.

The Gainesville of Tennessee is Cookeville, though Knoxville, the right answer, comes a close second.

Why?  You can check that Knoxville, according to word2vec, is much closer to Gainesville than Cookeville is.

>>> model.similarity(‘Cookeville’,’Gainesville’)

0.5457580604439547

>>> model.similarity(‘Knoxville’,’Gainesville’)

0.64010456774402158

But Knoxville is placed much closer to Florida!

>>> model.similarity(‘Cookeville’,’Florida’)

0.2044376252927515

>>> model.similarity(‘Knoxville’,’Florida’)

0.36523836770416895

Remember:  what word2vec is really optimizing for here is “words which are close to Gainesville and close to Tennessee, and which are not close to Florida.”  And here you see that phenomenon very clearly.  I don’t think the semantic relationship between ‘Gainesville’ and ‘Florida’ is something word2vec is really capturing.  Similarly:  the Gainesville of Illinois is Edwardsville (though Champaign, Champaign_Urbana, and Urbana are all top 5) and the Gainesville of Indiana is Connersville.  (The top 5 for Indiana are all cities ending in “ville” — is the phonetic similarity playing some role?)

Just for fun, here’s a scatterplot of the 1000 nearest neighbors of ‘Gainesville’, with their similarity to ‘Gainesville’ (x-axis) plotted against their similarity to ‘Tennessee’ (y-axis):

The Pareto frontier consists of “Tennessee” (that’s the one whose similarity to “Tennessee” is 1, obviously..) Knoxville, Jacksonville, and Tallahassee.

Bag of contexts

One popular simple linear model of word space is given by representing a word as a “bag of contexts” — perhaps there are several thousand contexts, and each word is given by a sparse vector in the space spanned by contexts:  coefficient 0 if the word is not in that context, 1 if it is.  In that setting, certain kinds of analogies would be linearized and certain kinds would not.  If “major city” is a context, then “Houston” and “Dallas” might have vectors that looked like “Texas” with the coodinate of “major city” flipped from 0 to 1.  Ditto, “Milwaukee” would be “Wisconsin” with the same basis vector added.  So

“Texas” + “Milwaukee” – “Wisconsin”

would be pretty close, in that space, to “Houston” and “Dallas.”

On the other hand, it’s not so easy to see what relations antonyms would have in that space. That’s the kind of relationship the bag of contexts may not make linear.

The word2vec space is only 300-dimensional, and the vectors aren’t sparse at all.  But maybe we should think of it as a random low-dimensional projection of a bag-of-contexts embedding!  By the Johnson-Lindenstrauss lemma, a 300-dimensional projection is plenty big enough to preserve the distances between 3 million points with a small distortion factor; and of course it preserves all linear relationships on the nose.

Perhaps this point of view gives some insight into which kind of word relationships manifest as linear relationships in word2vec.  “flock:birds” is an interesting example.  If you imagine “group of things” as a context, you can maybe imagine word2vec picking this up.  But actually, it doesn’t do well:

>> model.most_similar(positive=[‘fish’,’flock’],negative=[‘birds’])
[(u’crays’, 0.4601619839668274), (u’threadfin_salmon’, 0.4553075134754181), (u’spear_fishers’, 0.44864755868911743), (u’slab_crappies’, 0.4483765661716461), (u’flocked’, 0.44473177194595337), (u’Siltcoos_Lake’, 0.4429660737514496), (u’flounder’, 0.4414420425891876), (u’catfish’, 0.4413948059082031), (u’yellowtail_snappers’, 0.4410281181335449), (u’sockeyes’, 0.4395104944705963)]

>> model.most_similar(positive=[‘dogs’,’flock’],negative=[‘birds’])
[(u’dog’, 0.5390862226486206), (u’pooches’, 0.5000904202461243), (u’Eminem_Darth_Vader’, 0.48777419328689575), (u’Labrador_Retrievers’, 0.4792211949825287), (u’canines’, 0.4766522943973541), (u’barked_incessantly’, 0.4709487557411194), (u’Rottweilers_pit_bulls’, 0.4708423614501953), (u’labradoodles’, 0.47032350301742554), (u’rottweilers’, 0.46935927867889404), (u’forbidding_trespassers’, 0.4649636149406433)]

The answers “school” and “pack” don’t appear here.  Part of this, of course, is that “flock,” “school”, and “pack” all have interfering alternate meanings.  But part of it is that the analogy really rests on information about contexts in which the words “flock” and “birds” both appear.  In particular, in a short text window featuring both words, you are going to see a huge spike of “of” appearing right after flock and right before birds.  A statement of the form “flock is to birds as X is to Y” can’t be true unless “X of Y” actually shows up in the corpus a lot.

Challenge problem:  Can you make word2vec do a good job with relations like “flock:birds”?  As I said above, I wouldn’t have been shocked if this had actually worked out of the box, so maybe there’s some minor tweak that makes it work.

Boys’ names, girls’ names

Back to gender-flipping.  What’s the “male version” of the name “Jennifer”?

There are various ways one can do this.  If you use the analogy engine from word2vec, finding the closest word to “Jennifer” + “he” – “she”, you get as your top 5:

David, Jason, Brian, Kevin, Chris

>>> model.most_similar(positive=[‘Jennifer’,’he’],negative=[‘she’])
[(u’David’, 0.6693146228790283), (u’Jason’, 0.6635637283325195), (u’Brian’, 0.6586753129959106), (u’Kevin’, 0.6520106792449951), (u’Chris’, 0.6505492925643921), (u’Mark’, 0.6491551995277405), (u’Matt’, 0.6386727094650269), (u’Daniel’, 0.6294828057289124), (u’Greg’, 0.6267883777618408), (u’Jeff’, 0.6265031099319458)]

But there’s another way:  you can look at the words closest to “Jennifer” (which are essentially all first names) and pick out the ones which are closer to “he” than to “she”.  This gives

Matthew, Jeffrey, Jason, Jesse, Joshua.

>>> [x[0] for x in model.most_similar(‘Jennifer’,topn=2000) if model.similarity(x[0],’he’) > model.similarity(x[0],’she’)]
[u’Matthew’, u’Jeffrey’, u’Jason’, u’Jesse’, u’Joshua’, u’Evan’, u’Brian’, u’Cory’, u’Justin’, u’Shawn’, u’Darrin’, u’David’, u’Chris’, u’Kevin’, u’3/dh’, u’Christopher’, u’Corey’, u’Derek’, u’Alex’, u’Matt’, u’Jeremy’, u’Jeff’, u’Greg’, u’Timothy’, u’Eric’, u’Daniel’, u’Wyvonne’, u’Joel’, u’Chirstopher’, u’Mark’, u’Jonathon’]

Which is a better list of “male analogues of Jennifer?”  Matthew is certainly closer to Jennifer in word2vec distance:

>>> model.similarity(‘Jennifer’,’Matthew’)

0.61308109388608356

>>> model.similarity(‘Jennifer’,’David’)

0.56257556538528708

But, for whatever, reason, “David” is coded as much more strongly male than “Matthew” is; that is, it is closer to “he” – “she”.  (The same is true for “man” – “woman”.)  So “Matthew” doesn’t score high in the first list, which rates names by a combination of how male-context they are and how Jennifery they are.  A quick visit to NameVoyager shows that Matthew and Jennifer both peaked sharply in the 1970s; David, on the other hand, has a much longer range of popularity and was biggest in the 1950s.

Let’s do it again, for Susan.  The two methods give

David, Robert, Mark, Richard, John

Robert, Jeffrey, Richard, David, Kenneth

And for Edith:

Ernest, Edwin, Alfred, Arthur, Bert

Ernest, Harold, Alfred, Bert, Arthur

Pretty good agreement!  And you can see that, in each case, the selected names are “cultural matches” to the starting name.

Sidenote:  In a way it would be more natural to project wordspace down to the orthocomplement of “he” – “she” and find the nearest neighbor to “Susan” after that projection; that’s like, which word is closest to “Susan” if you ignore the contribution of the “he” – “she” direction.  This is the operation Ben Schmidt calls “vector rejection” in his excellent post about his word2vec model trained on student evaluations.

If you do that, you get “Deborah.”  In other words, those two names are similar in so many contextual ways that they remain nearest neighbors even after we “remove the contribution of gender.”  A better way to say it is that the orthogonal projection doesn’t really remove the contribution of gender in toto.  It would be interesting to understand what kind of linear projections actually make it hard to distinguish male surnames from female ones.

Google News is a big enough database that this works on non-English names, too.  The male “Sylvie”, depending on which protocol you pick, is

Alain, Philippe, Serge, Andre, Jean-Francois

or

Jean-Francois, Francois, Stephane, Alain, Andre

The male “Kyoko” is

Kenji, Tomohiko, Nobuhiro, Kazuo, Hiroshi

or

Satoshi, Takayuki, Yosuke, Michio, Noboru

French and Japanese speakers are encouraged to weigh in about which list is better!

Update:  Even a little more messing around with “changing the gender of words” in a followup post.