## Ranking mathematicians by hinge loss

As I mentioned, I’m reading Ph.D. admission files.  Each file is read by two committee members and thus each file has two numerical scores.

How to put all this information together into a preliminary ranking?

The traditional way is to assign to each applicant their mean score.  But there’s a problem: different raters have different scales.  My 7 might be your 5.

You could just normalize the scores by subtracting that rater’s overall mean.  But that’s problematic too.  What if one rater actually happens to have looked at stronger files?  Or even if not:  what if the relation between rater A’s scale and rater B’s scale isn’t linear?  Maybe, for instance, rater A gives everyone she doesn’t think should get in a 0, while rater A uses a range of low scores to express the same opinion, depending on just how unsuitable the candidate seems.

Here’s what I did last year.  If (r,a,a’) is a triple with r is a rater and a and a’ are two applicants, such that r rated a higher than a’, you can think of that as a judgment that a is more admittable than a’.  And you can put all those judgments from all the raters in a big bag, and then see if you can find a ranking of the applicants (or, if you like, a real-valued function f on the applicants) such that, for every judgment a > a’, we have f(a) > f(a’).

Of course, this might not be possible — two raters might disagree!  Or there might be more complicated incompatibilities generated by multiple raters.  Still, you can ask:  what if I tried to minimize the number of “mistakes”, i.e. the number of judgments in your bag that your choice of ranking contradicts?

Well, you can ask that, but you may not get an answer, because that’s a highly non-convex minimization problem, and is as far as we know completely intractable.

But here’s a way out, or at least a way part of the way out — we can use a convex relaxation.  Set it up this way.  Let V be the space of real-valued functions on applicants.  For each judgment j, let mistake_j(f) be the step function

mistake_j(f) = 1 if f(a) < f(a’) + 1

mistake_j(f) = 0 if f(a) >= f(a’) + 1

Then “minimize total number of mistakes” is the problem of minimizing

M = sum_j mistake_j(f)

over V.  And M is terribly nonconvex.  If you try to gradient-descend (e.g. start with a random ranking and then switch two adjacent applicants whenever doing so reduces the total number of mistakes) you are likely to get caught in a local minimum that’s far from optimal.  (Or at least that can happen; whether this typically actually happens in practice, I haven’t checked!)

So here’s the move:  replace mistake_j(f) with a function that’s “close enough,” but is convex.  It acts as a sort of tractable proxy for the optimization you’re actually after.  The customary choice here is the hinge loss:

hinge_j(f) = min(0, f(a)-f(a’) -1).

Then H := sum_j hinge_j(f) is a convex function on f, which you can easily minimize in Matlab or python.  If you can actually find an f with H(f) = 0, you’ve found a ranking which agrees with every judgment in your bag.  Usually you can’t, but that’s OK!  You’ve very quickly found a function H which does a decent job aggregating the committee scores. and which you can use as your starting point.

Now here’s a paper by Nihal Shah and Martin Wainwright commenter Dustin Mixon linked in my last ranking post.  It suggests doing something much simpler:  using a linear function as a proxy for mistake_j.  What this amounts to is:  score each applicant by the number of times they were placed above another applicant.  Should I be doing this instead?  My first instinct is no.  It looks like Shah and Wainwright assume that each pair of applicants is equally likely to be compared; I think I don’t want to assume that, and I think (but correct me if I’m wrong!) the optimality they get may not be robust to that?

Anyway, all thoughts on this question — or suggestions as to something totally different I could be doing — welcome, of course.

## Messing around with word2vec

Word2vec is a way of representing words and phrases as vectors in medium-dimensional space developed by Tomas Mikolov and his team at Google; you can train it on any corpus you like (see Ben Schmidt’s blog for some great examples) but the version of the embedding you can download was trained on about 100 billion words of Google News, and encodes words as unit vectors in 300-dimensional space.

What really got people’s attention, when this came out, was word2vec’s ability to linearize analogies.  For example:  if x is the vector representing “king,” and y the vector representing “woman,” and z the vector representing “man,” then consider

x + y – z

which you might think of, in semantic space, as being the concept “king” to which “woman” has been added and “man” subtracted — in other words, “king made more female.”  What word lies closest in direction to x+y-z?  Just as you might hope, the answer is “queen.”

I found this really startling.  Does it mean that there’s some hidden linear structure in the space of words?

It turns out it’s not quite that simple.  I played around with word2vec a bunch, using Radim Řehůřek’s gensim package that nicely pulls everything into python; here’s what I learned about what the embedding is and isn’t telling you.

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts — that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’  (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.)  If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity(‘tremendous’,’enormous’)

0.74432902555062841

The notion of similarity used here is just cosine distance (which is to say, dot product of vectors.)  It’s positive when the words are close to each other, negative when the words are far.  For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity(‘tremendous’,’negligible’)

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words.  ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.

You might ask:  well, what words in Word2Vec are farthest from “tremendous?”  You just get trash:

>>> model.most_similar(negative=[‘tremendous’])

[(u’By_DENISE_DICK’, 0.2792186141014099), (u’NAVARRE_CORPORATION’, 0.26894450187683105), (u’By_SEAN_BARRON’, 0.26745346188545227), (u’LEGAL_NOTICES’, 0.25829464197158813), (u’Ky.Busch_##-###’, 0.2564955949783325), (u’desultorily’, 0.2563159763813019), (u’M.Kenseth_###-###’, 0.2562236189842224), (u’J.McMurray_###-###’, 0.25608277320861816), (u’D.Earnhardt_Jr._###-###’, 0.2547803819179535), (u’david.brett_@_thomsonreuters.com’, 0.2520599961280823)]

If 3 million words were distributed randomly in the unit ball in R^300, you’d expect the farthest one from “tremendous” to have dot product about -0.3 from it.  So when you see a list whose largest score is around that size, you should think “there’s no structure there, this is just noise.”

Antonyms

Challenge problem:  Is there a way to accurately generate antonyms using the word2vec embedding?  That seems to me the sort of thing the embedding is not capturing.  Kyle McDonald had a nice go at this, but I think the lesson of his experiment is that asking word2vec to find analogies of the form “word is to antonym as happy is to?” is just going to generate a list of neighbors of “happy.”  McDonald’s results also cast some light on the structure of word2vec analogies:  for instance, he finds that

waste is to economise as happy is to chuffed

First of all, “chuffed” is a synonym of happy, not an antonym.  But more importantly:  The reason “chuffed” is there is because it’s a way that British people say “happy,” just as “economise” is a way British people spell “economize.”  Change the spelling and you get

waste is to economize as happy is to glad

Non-semantic properties of words matter to word2vec.  They matter a lot.  Which brings us to diction.

Word2Vec distance keeps track of diction

Lots of non-semantic stuff is going on in natural language.  Like diction, which can be high or low, formal or informal, flowery or concrete.    Look at the nearest neighbors of “pugnacity”:

>>> model.most_similar(‘pugnacity’)

[(u’pugnaciousness’, 0.6015268564224243), (u’wonkishness’, 0.6014434099197388), (u’pugnacious’, 0.5877301692962646), (u’eloquence’, 0.5875781774520874), (u’sang_froid’, 0.5873805284500122), (u’truculence’, 0.5838015079498291), (u’pithiness’, 0.5773230195045471), (u’irascibility’, 0.5772287845611572), (u’hotheadedness’, 0.5741063356399536), (u’sangfroid’, 0.5715578198432922)]

Some of these are close semantically to pugnacity, but others, like “wonkishness,” “eloquence”, and “sangfroid,” are really just the kind of elevated-diction words the kind of person who says “pugnacity” would also say.

In the other direction:

>>> model.most_similar(‘psyched’)

[(u’geeked’, 0.7244787216186523), (u’excited’, 0.6678282022476196), (u’jazzed’, 0.666187584400177), (u’bummed’, 0.662735104560852), (u’amped’, 0.6473385691642761), (u’pysched’, 0.6245862245559692), (u’exicted’, 0.6116108894348145), (u’awesome’, 0.5838013887405396), (u’enthused’, 0.581687331199646), (u’kinda_bummed’, 0.5701783299446106)]

“geeked” is a pretty good synonym, but “bummed” is an antonym.  You may also note that contexts where “psyched” is common are also fertile ground for “pysched.”  This leads me to one of my favorite classes of examples:

Misspelling analogies

Which words are closest to “teh”?

>>> model.most_similar(‘teh’)

[(u’ther’, 0.6910992860794067), (u’hte’, 0.6501408815383911), (u’fo’, 0.6458913683891296), (u’tha’, 0.6098173260688782), (u’te’, 0.6042138934135437), (u’ot’, 0.595798909664154), (u’thats’, 0.595078706741333), (u’od’, 0.5908242464065552), (u’tho’, 0.58894944190979), (u’oa’, 0.5846965312957764)]

Makes sense:  the contexts where “teh” is common are those contexts where a lot of words are misspelled.

Using the “analogy” gadget, we can ask; which word is to “because” as “teh” is to “the”?

>>> model.most_similar(positive=[‘because’,’teh’],negative=[‘the’])

[(u’becuase’, 0.6815075278282166), (u’becasue’, 0.6744950413703918), (u’cuz’, 0.6165347099304199), (u’becuz’, 0.6027254462242126), (u’coz’, 0.580410361289978), (u’b_c’, 0.5737690925598145), (u’tho’, 0.5647958517074585), (u’beacause’, 0.5630674362182617), (u’thats’, 0.5605655908584595), (u’lol’, 0.5597798228263855)]

Or “like”?

>>> model.most_similar(positive=[‘like’,’teh’],negative=[‘the’])

[(u’liek’, 0.678846001625061), (u’ok’, 0.6136218309402466), (u’hahah’, 0.5887773633003235), (u’lke’, 0.5840467214584351), (u’probly’, 0.5819578170776367), (u’lol’, 0.5802655816078186), (u’becuz’, 0.5771245956420898), (u’wierd’, 0.5759704113006592), (u’dunno’, 0.5709049701690674), (u’tho’, 0.565370500087738)]

Note that this doesn’t always work:

>>> model.most_similar(positive=[‘should’,’teh’],negative=[‘the’])

[(u’wil’, 0.63351970911026), (u’cant’, 0.6080706715583801), (u’wont’, 0.5967696309089661), (u’dont’, 0.5911301970481873), (u’shold’, 0.5908039212226868), (u’shoud’, 0.5776053667068481), (u’shoudl’, 0.5491836071014404), (u”would’nt”, 0.5474458932876587), (u’shld’, 0.5443994402885437), (u’wouldnt’, 0.5413904190063477)]

What are word2vec analogies?

Now let’s come back to the more philosophical question.  Should the effectiveness of word2vec at solving analogy problems make us think that the space of words really has linear structure?

I don’t think so.  Again, I learned something important from the work of Levy and Goldberg.  When word2vec wants to find the word w which is to x as y is to z, it is trying to find w maximizing the dot product

w . (x + y – z)

But this is the same thing as maximizing

w.x + w.y – w.z.

In other words, what word2vec is really doing is saying

“Show me words which are similar to x and y but are dissimilar to z.”

This notion makes sense applied any notion of similarity, whether or not it has anything to do with embedding in a vector space.  For example, Levy and Goldberg experiment with minimizing

log(w.x) + log(w.y) – log(w.z)

instead, and get somewhat superior results on the analogy task.  Optimizing this objective has nothing to do with the linear combination x+y-z.

None of which is to deny that the analogy engine in word2vec works well in many interesting cases!  It has no trouble, for instance, figuring out that Baltimore is to Maryland as Milwaukee is to Wisconsin.  More often than not, the Milwaukee of state X correctly returns the largest city in state X.  And sometimes, when it doesn’t, it gives the right answer anyway:  for instance, the Milwaukee of Ohio is Cleveland, a much better answer than Ohio’s largest city (Columbus — you knew that, right?)  The Milwaukee of Virginia, according to word2vec, is Charlottesville, which seems clearly wrong.  But maybe that’s OK — maybe there really isn’t a Milwaukee of Virginia.  One feels Richmond is a better guess than Charlottesville, but it scores notably lower.  (Note:  Word2Vec’s database doesn’t have Virginia_Beach, the largest city in Virginia.  That one I didn’t know.)

Another interesting case:  what is to state X as Gainesville is to Florida?  The answer should be “the location of the, or at least a, flagship state university, which isn’t the capital or even a major city of the state,” when such a city exists.  But this doesn’t seem to be something word2vec is good at finding.  The Gainesville of Virginia is Charlottesville, as it should be.  But the Gainesville of Georgia is Newnan.  Newnan?  Well, it turns out there’s a Newnan, Georgia, and there’s also a Newnan’s Lake in Gainesville, FL; I think that’s what’s driving the response.  That, and the fact that “Athens”, the right answer, is contextually separated from Georgia by the existence of Athens, Greece.

The Gainesville of Tennessee is Cookeville, though Knoxville, the right answer, comes a close second.

Why?  You can check that Knoxville, according to word2vec, is much closer to Gainesville than Cookeville is.

>>> model.similarity(‘Cookeville’,’Gainesville’)

0.5457580604439547

>>> model.similarity(‘Knoxville’,’Gainesville’)

0.64010456774402158

But Knoxville is placed much closer to Florida!

>>> model.similarity(‘Cookeville’,’Florida’)

0.2044376252927515

>>> model.similarity(‘Knoxville’,’Florida’)

0.36523836770416895

Remember:  what word2vec is really optimizing for here is “words which are close to Gainesville and close to Tennessee, and which are not close to Florida.”  And here you see that phenomenon very clearly.  I don’t think the semantic relationship between ‘Gainesville’ and ‘Florida’ is something word2vec is really capturing.  Similarly:  the Gainesville of Illinois is Edwardsville (though Champaign, Champaign_Urbana, and Urbana are all top 5) and the Gainesville of Indiana is Connersville.  (The top 5 for Indiana are all cities ending in “ville” — is the phonetic similarity playing some role?)

Just for fun, here’s a scatterplot of the 1000 nearest neighbors of ‘Gainesville’, with their similarity to ‘Gainesville’ (x-axis) plotted against their similarity to ‘Tennessee’ (y-axis):

The Pareto frontier consists of “Tennessee” (that’s the one whose similarity to “Tennessee” is 1, obviously..) Knoxville, Jacksonville, and Tallahassee.

Bag of contexts

One popular simple linear model of word space is given by representing a word as a “bag of contexts” — perhaps there are several thousand contexts, and each word is given by a sparse vector in the space spanned by contexts:  coefficient 0 if the word is not in that context, 1 if it is.  In that setting, certain kinds of analogies would be linearized and certain kinds would not.  If “major city” is a context, then “Houston” and “Dallas” might have vectors that looked like “Texas” with the coodinate of “major city” flipped from 0 to 1.  Ditto, “Milwaukee” would be “Wisconsin” with the same basis vector added.  So

“Texas” + “Milwaukee” – “Wisconsin”

would be pretty close, in that space, to “Houston” and “Dallas.”

On the other hand, it’s not so easy to see what relations antonyms would have in that space. That’s the kind of relationship the bag of contexts may not make linear.

The word2vec space is only 300-dimensional, and the vectors aren’t sparse at all.  But maybe we should think of it as a random low-dimensional projection of a bag-of-contexts embedding!  By the Johnson-Lindenstrauss lemma, a 300-dimensional projection is plenty big enough to preserve the distances between 3 million points with a small distortion factor; and of course it preserves all linear relationships on the nose.

Perhaps this point of view gives some insight into which kind of word relationships manifest as linear relationships in word2vec.  “flock:birds” is an interesting example.  If you imagine “group of things” as a context, you can maybe imagine word2vec picking this up.  But actually, it doesn’t do well:

>> model.most_similar(positive=[‘fish’,’flock’],negative=[‘birds’])
[(u’crays’, 0.4601619839668274), (u’threadfin_salmon’, 0.4553075134754181), (u’spear_fishers’, 0.44864755868911743), (u’slab_crappies’, 0.4483765661716461), (u’flocked’, 0.44473177194595337), (u’Siltcoos_Lake’, 0.4429660737514496), (u’flounder’, 0.4414420425891876), (u’catfish’, 0.4413948059082031), (u’yellowtail_snappers’, 0.4410281181335449), (u’sockeyes’, 0.4395104944705963)]

>> model.most_similar(positive=[‘dogs’,’flock’],negative=[‘birds’])
[(u’dog’, 0.5390862226486206), (u’pooches’, 0.5000904202461243), (u’Eminem_Darth_Vader’, 0.48777419328689575), (u’Labrador_Retrievers’, 0.4792211949825287), (u’canines’, 0.4766522943973541), (u’barked_incessantly’, 0.4709487557411194), (u’Rottweilers_pit_bulls’, 0.4708423614501953), (u’labradoodles’, 0.47032350301742554), (u’rottweilers’, 0.46935927867889404), (u’forbidding_trespassers’, 0.4649636149406433)]

The answers “school” and “pack” don’t appear here.  Part of this, of course, is that “flock,” “school”, and “pack” all have interfering alternate meanings.  But part of it is that the analogy really rests on information about contexts in which the words “flock” and “birds” both appear.  In particular, in a short text window featuring both words, you are going to see a huge spike of “of” appearing right after flock and right before birds.  A statement of the form “flock is to birds as X is to Y” can’t be true unless “X of Y” actually shows up in the corpus a lot.

Challenge problem:  Can you make word2vec do a good job with relations like “flock:birds”?  As I said above, I wouldn’t have been shocked if this had actually worked out of the box, so maybe there’s some minor tweak that makes it work.

Boys’ names, girls’ names

Back to gender-flipping.  What’s the “male version” of the name “Jennifer”?

There are various ways one can do this.  If you use the analogy engine from word2vec, finding the closest word to “Jennifer” + “he” – “she”, you get as your top 5:

David, Jason, Brian, Kevin, Chris

>>> model.most_similar(positive=[‘Jennifer’,’he’],negative=[‘she’])
[(u’David’, 0.6693146228790283), (u’Jason’, 0.6635637283325195), (u’Brian’, 0.6586753129959106), (u’Kevin’, 0.6520106792449951), (u’Chris’, 0.6505492925643921), (u’Mark’, 0.6491551995277405), (u’Matt’, 0.6386727094650269), (u’Daniel’, 0.6294828057289124), (u’Greg’, 0.6267883777618408), (u’Jeff’, 0.6265031099319458)]

But there’s another way:  you can look at the words closest to “Jennifer” (which are essentially all first names) and pick out the ones which are closer to “he” than to “she”.  This gives

Matthew, Jeffrey, Jason, Jesse, Joshua.

>>> [x[0] for x in model.most_similar(‘Jennifer’,topn=2000) if model.similarity(x[0],’he’) > model.similarity(x[0],’she’)]
[u’Matthew’, u’Jeffrey’, u’Jason’, u’Jesse’, u’Joshua’, u’Evan’, u’Brian’, u’Cory’, u’Justin’, u’Shawn’, u’Darrin’, u’David’, u’Chris’, u’Kevin’, u’3/dh’, u’Christopher’, u’Corey’, u’Derek’, u’Alex’, u’Matt’, u’Jeremy’, u’Jeff’, u’Greg’, u’Timothy’, u’Eric’, u’Daniel’, u’Wyvonne’, u’Joel’, u’Chirstopher’, u’Mark’, u’Jonathon’]

Which is a better list of “male analogues of Jennifer?”  Matthew is certainly closer to Jennifer in word2vec distance:

>>> model.similarity(‘Jennifer’,’Matthew’)

0.61308109388608356

>>> model.similarity(‘Jennifer’,’David’)

0.56257556538528708

But, for whatever, reason, “David” is coded as much more strongly male than “Matthew” is; that is, it is closer to “he” – “she”.  (The same is true for “man” – “woman”.)  So “Matthew” doesn’t score high in the first list, which rates names by a combination of how male-context they are and how Jennifery they are.  A quick visit to NameVoyager shows that Matthew and Jennifer both peaked sharply in the 1970s; David, on the other hand, has a much longer range of popularity and was biggest in the 1950s.

Let’s do it again, for Susan.  The two methods give

David, Robert, Mark, Richard, John

Robert, Jeffrey, Richard, David, Kenneth

And for Edith:

Ernest, Edwin, Alfred, Arthur, Bert

Ernest, Harold, Alfred, Bert, Arthur

Pretty good agreement!  And you can see that, in each case, the selected names are “cultural matches” to the starting name.

Sidenote:  In a way it would be more natural to project wordspace down to the orthocomplement of “he” – “she” and find the nearest neighbor to “Susan” after that projection; that’s like, which word is closest to “Susan” if you ignore the contribution of the “he” – “she” direction.  This is the operation Ben Schmidt calls “vector rejection” in his excellent post about his word2vec model trained on student evaluations.

If you do that, you get “Deborah.”  In other words, those two names are similar in so many contextual ways that they remain nearest neighbors even after we “remove the contribution of gender.”  A better way to say it is that the orthogonal projection doesn’t really remove the contribution of gender in toto.  It would be interesting to understand what kind of linear projections actually make it hard to distinguish male surnames from female ones.

Google News is a big enough database that this works on non-English names, too.  The male “Sylvie”, depending on which protocol you pick, is

Alain, Philippe, Serge, Andre, Jean-Francois

or

Jean-Francois, Francois, Stephane, Alain, Andre

The male “Kyoko” is

Kenji, Tomohiko, Nobuhiro, Kazuo, Hiroshi

or

Satoshi, Takayuki, Yosuke, Michio, Noboru

French and Japanese speakers are encouraged to weigh in about which list is better!

Update:  Even a little more messing around with “changing the gender of words” in a followup post.

## Spec is representable

Saw Matt Baker at the Joint Meetings and he told me about this crazy paper he just posted, “Matroids over Hyperfields.”   A hyperring is just like a ring except addition is multivalued; given elements x and y of R, x+y is a subset of R which you can think of as “the possible outcomes of summing x and y.”  A hyperfield is a hyperring in which every nonzero element has a multiplicative inverse.

Here’s an example familiar to tropical geometers:  let T be the hyperfield whose elements are $\mathbf{R} \bigcup -\infty$, whose multiplication law is real addition, and whose addition law is

a + b = max(a,b) if a <> b

a + b = {c: c < a} if a=b

In other words, each element of T can be thought of as the valuation of an otherwise unspecified element of a field with a non-archimedean valuation, and then the addition law answers the question “what is ord(x+y) if ord(x) = a and ord(y) = b”?

This may sounds at first like an almost aggressively useless generalization, but no!  The main point of Matt’s paper is that it makes sense to talk about a matroid with coefficients in a hyperfield, and that lots of well-studied flavors of matroids can be written as “matroids over F” for a suitable hyperfield F; in this way, a lot of different stories about different matroid theories get unified and cleaned up.

In fact, a matroid itself turns out to be the same thing as a matroid over K, where K is the Krasner hyperfield:  just two elements 0 and 1, with the multiplication law you expect, and addition given by

0 + 0 = 0

0 + 1 = 1

1 + 1 = {0,1}

One thing I like about K is that it repairs the problem (if you see it as a problem) that the category of fields has no terminal object.  K is terminal in the category of hyperfields; any hyperfield (and in particular any field) has a unique map to K which sends 0 to 0 and everything else to 1.

More generally, as Matt observes, if R is a commutative ring, a homomorphism f from R to K is nothing other than a prime ideal of R — namely, f^{-1}(0).  So once you relax a little and accept the category of hyperfield, the functor Spec: Rings -> Sets is representable!  I enjoy that.

Update:  David Goss points out that this observation about Spec and the Krasner hyperfield is due to Connes and Consani in “The hyperring of adèle classes” JNT 131, (2011) 159-194, p.161.  In fact, for any scheme X of finite type over Z, the underlying Zariski set of X is naturally identified with Hom(Spec(K),X); so Spec(K) functions as a kind of generic point that’s agnostic to characteristic.

## Ranking mathematicians

I’m on the hiring committee, I chair the graduate admissions committee, and I’m doing an NSF panel, so basically I’ll be spending much of this month judging and ranking people’s mathematics.  There’s a lot I like about these jobs:  it’s a very efficient way to get a panorama of what’s going on in math and what people think about it.  The actual ranking part I don’t like that much — especially because the nature of hiring, admissions, and grant-making means you’re inevitably putting tons of very worthwhile stuff below the line.  I feel like a researcher when I read the proposals, like a bureaucrat when I put scores on them.

But of course the bureaucratic work needs to be done.  I’d go so far as to say — if mathematicians aren’t willing to rank each other, others will rank us, and that would be worse.

## i, you sneaky bastard

Childhood memory:  I learned that i is formally defined to be the square root of -1.  Well, I thought, that worked well, what about the square root of i?  Surely that must be yet a new kind of number.  I just had to check that (a+bi)^2 can never be i.  But whoa, you can solve that!  (sqrt(2)/2) + (sqrt(2)/2)i does the trick.  I was kind of bowled over by this.  i, you sneaky bastard — you anticipated my next move and got ahead of me!  I had no idea what “algebraically closed” meant, or anything like that.  But it was one of my first experience of the incredible power of the right definition.  Once the definition is right, you can just do everything.

Tagged

## Lipnowski-Tsimerman: How large is A_g(F_p)?

Mike Lipnowski and Jacob Tsimerman have an awesome new preprint up, which dares to ask:  how many principally polarized abelian varieties are there over a finite field?

Well, you say, those are just the rational points of A_g, which has dimension g choose 2, so there should be about p^{(1/2)g^2} points, right?  But if you think a bit more about why you think that, you realize you’re implicitly imagining the cohomology groups in the middle making a negligible contribution to the Grothendieck-Lefchetz trace formula.  But why do you imagine that?  Those Betti numbers in the middle are huge, or at least have a right to be. (The Euler characteristic of A_g is known, and grows superexponentially in dim A_g, so you know at least one Betti number is big, at any rate.)

Well, so I always thought the size of A_g(F_q) really would be around p^{(1/2) g^2}, but that maybe humans couldn’t prove this yet.  But no!  Lipnowski and Tsimerman show there are massively many principally polarized abelian varieties; at least exp(g^2 log g).  This suggests (but doesn’t prove) that there is not a ton of cancellation in the Frobenius eigenvalues.  Which puts a little pressure, I think, on the heuristics about M_g in Achter-Erman-Kedlaya-Wood-Zureick-Brown.

What’s even more interesting is why there are so many principally polarized abelian varieties.  It’s because there are so many principal polarizations!  The number of isomorphism classes of abelian varieties, full stop, they show, is on order exp(g^2).  It’s only once you take the polarizations into account that you get the faster-than-expected-by-me growth.

What’s more, some abelian varieties have more principal polarizations than others.  The reducible ones have a lot.  And that means they dominate the count, especially the ones with a lot of multiplicity in the isogeny factors.  Now get this:  for 99% of all primes, it is the case that, for sufficiently large g:  99% of all points on A_g(F_p) correspond to abelian varieties which are 99% made up of copies of a single elliptic curve!

That is messed up.

## Leibniz on music

Even the pleasures of sense are reducible to intellectual pleasures, known confusedly.  Music charms us, although its beauty consists only in the agreement of numbers and in the counting, which we do not perceive but which the soul nevertheless continues to carry out, of the beats or vibrations of sounding bodies which coincide at certain intervals.

Boy, do I disagree.  Different pleasures are different.

## Bobrowski-Kahle-Skraba on the null hypothesis in persistent homology

I really like persistent homology; it’s a very beautiful idea, a way to look for structure in data when you really don’t have any principled way to embed it in Euclidean space (or, even when it does come embedded in Euclidean space, to find the kind of structure that doesn’t depend too much on the embedding.)

But because I like it, I want to see it done well, so I have some minor complaints!

Complaint one:  Persistent homology, applied to H_0 only, is clustering, and we know a lot about clustering already.  (Update:  As commenters point out, this is really only so for persistent homology computed on the Vietoris-Rips complex of a point cloud, the “classical case…”!)  Not to say that the ideas of persistence can’t be useful here at all (I have some ideas about directed graphs I want to eventually work out) but my sense is that people are not craving new clustering algorithms.  I really like the work that tries to grapple with the topology of the data in its fullness; I was really charmed, for instance, by Ezra Miller’s piece about the persistent homology of fruit fly wings.  (There’s a lot of nice stuff about geometric probability theory, too — e.g., how do you take the “average” of a bunch of graded modules for k[x,y], which you may think of as noisy measurements of some true module you want to estimate?)

My second complaint is the lack of understanding of the null hypothesis.  You have some point cloud, you make a barcode, you see some bars that look long, you say they’re features — but why are you so sure?  How long would bars be under the null hypothesis that the data has no topological structure at all?  You kind of have to know this in order to do good inference.  Laura Balzano and I did a little numerical investigation of this years ago but now Omer Bobrowski, Matthew Kahle, and Primoz Skraba have proved a theorem!  (Kahle’s cool work in probabilistic topology has appeared several times before on Quomodocumque…)

They show that if you sample points from a uniform Poisson process on the unit cube of intensity n (i.e. you expect n points) the longest bar in the H_k barcode has

(death radius / birth radius) ~ [(log n)/(log log n)]^(1/k).

That is really short!  And it makes me feel like there actually is something going on, when you see a long barcode in practice.

## The Coin Game, II

Good answers to the last question! I think I perhaps put my thumb on the scale too much by naming a variable p.

Let me try another version in the form of a dialogue.

ME: Hey in that other room somebody flipped a fair coin. What would you say is the probability that it fell heads?

YOU: I would say it is 1/2.

ME: Now I’m going to give you some more information about the coin. A confederate of mine made a prediction about whether the coin would fall head or tails and he was correct. Now what would you say is the probability that it fell heads?

YOU: Now I have no idea, because I have no information about the propensity of your confederate to predict heads.

(Update: What if what you knew about the coin in advance was that it fell heads 99.99% of the time? Would you still be at ease saying you end up with no knowledge at all about the probability that the coin fell heads?) This is in fact what Joyce thinks you should say. White disagrees. But I think they both agree that it feels weird to say this, whether or not it’s correct.

Why would it not feel weird? I think Qiaochu’s comment in the previous thread gives a clue. He writes:

Re: the update, no, I don’t think that’s strange. You gave me some weird information and I conditioned on it. Conditioning on things changes my subjective probabilities, and conditioning on weird things changes my subjective probabilities in weird ways.

In other words, he takes it for granted that what you are supposed to do is condition on new information. Which is obviously what you should do in any context where you’re dealing with mathematical probability satisfying the usual axioms. Are we in such a context here? I certainly don’t mean “you have no information about Coin 2” to mean “Coin 2 falls heads with probability p where p is drawn from the uniform distribution (or Jeffreys, or any other specified distribution, thanks Ben W.) on [0,1]” — if I meant that, there could be no controversy!

I think as mathematicians we are very used to thinking that probability as we know it is what we mean when we talk about uncertainty. Or, to the extent we think we’re talking about something other than probability, we are wrong to think so. Lots of philosophers take this view. I’m not sure it’s wrong. But I’m also not sure it’s right. And whether it’s wrong or right, I think it’s kind of weird.

Tagged ,

## The coin game

Here is a puzzling example due to Roger White.

There are two coins.  Coin 1 you know is fair.  Coin 2 you know nothing about; it falls heads with some probability p, but you have no information about what p is.

Both coins are flipped by an experimenter in another room, who tells you that the two coins agreed (i.e. both were heads or both tails.)

What do you now know about Pr(Coin 1 landed heads) and Pr(Coin 2 landed heads)?

(Note:  as is usual in analytic philosophy, whether or not this is puzzling is itself somewhat controversial, but I think it’s puzzling!)

Update: Lots of people seem to not find this at all puzzling, so let me add this. If your answer is “I know nothing about the probability that coin 1 landed heads, it’s some unknown quantity p that agrees with the unknown parameter governing coin 2,” you should ask yourself: is it strange that someone flipped a fair coin in another room and you don’t know what the probability is that it landed heads?”

Relevant readings: section 3.1 of the Stanford Encyclopedia of Philosophy article on imprecise probabilities and Joyce’s paper on imprecise credences, pp.13-14.

Tagged , ,