## Mathematicians becoming data scientists: Should you? How to?

I was talking the other day with a former student at UW, Sarah Rich, who’s done degrees in both math and CS and then went off to Twitter.  I asked her:  so what would you say to a math Ph.D. student who was wondering whether they would like being a data scientist in the tech industry?  How would you know whether you might find that kind of work enjoyable?  And if you did decide to pursue it, what’s the strategy for making yourself a good job candidate?

Sarah exceeded my expectations by miles and wrote the following extremely informative and thorough tip sheet, which she’s given me permission to share.  Take it away, Sarah!


## Braid monodromy and the dual curve

Nick Salter gave a great seminar here about this paper; hmm, maybe I should blog about that paper, which is really interesting, but I wanted to make a smaller point here.  Let C be a smooth curve in P^2 of degree n. The lines in P^2 are parametrized by the dual P^2; let U be the open subscheme of the dual P^2 parametrizing those lines which are not tangent to C; in other words, U is the complement of the dual curve C*.  For each point u of U, write L_u for the corresponding line in P^2.

This gives you a fibration X -> U where the fiber over a point u in U is L_u – (L_u intersect C).  Since L_u isn’t tangent to C, this fiber is a line with n distinct points removed.  So the fibration gives you an (outer) action of pi_1(U) on the fundamental group of the fiber preserving the puncture classes; in other words, we have a homomorphism

$\pi_1(U) \rightarrow B_n$

where B_n is the n-strand braid group.

When you restrict to a line L* in U (i.e. a pencil of lines through a point in the original P^2) you get a map from a free group to B_n; this is the braid monodromy of the curve C, as defined by Moishezon.  But somehow it feels more canonical to consider the whole representation of pi_1(U).  Here’s one place I see it:  Proposition 2.4 of this survey by Libgober shows that if C is a rational nodal curve, then pi_1(U) maps isomorphically to B_n.  (OK, C isn’t smooth, so I’d have to be slightly more careful about what I mean by U.)

## Difference sets missing a Hamming sphere

I tend to think of the Croot-Lev-Pach method (as used, for instance, in the cap set problem) as having to do with n-tensors, where n is bigger than 2.  But you can actually also use it in the case of 2-tensors, i.e. matrices, to say things that (as far as I can see) are not totally trivial.

Write m_d for the number of squarefree monomials in x_1, .. x_n of degree at most d; that is,

$m_d = 1 + {n \choose 1} + {n \choose 2} + \ldots + {n \choose d}.$

Claim:  Let P be a polynomial of degree d in F_2[x_1, .. x_n] such that P(0) = 1.  Write S for the set of nonzero vectors x such that P(x) = 1.  Let A be a subset of F_2^n such that no two elements of A have difference lying in S.  Then |A| ≤ 2m_{d/2}.

Proof:  Write M for the A x A matrix whose (a,b) entry is P(a-b).  By the Croot-Lev-Pach lemma, this matrix has rank at most 2m_{d/2}.  By hypothesis on A, M is the identity matrix, so its rank is |A|, and the claim follows.

Remark: I could have said “sum” instead of “difference” since we’re in F_2 but for larger finite fields you really want difference.
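The rank statement is small enough to sanity-check by machine.  Here's a quick brute-force verification (a sketch; the random choice of P, the bitmask encoding, and all names are mine) that for a random P of degree at most d = 4 over F_2, the full 2^n x 2^n matrix P(a-b) has GF(2)-rank at most 2m_{d/2}:

```python
import itertools
import random
from math import comb

random.seed(1)
n, d = 7, 4  # vectors in F_2^7, random polynomial of degree <= 4

# A polynomial over F_2 is a set of squarefree monomials; encode each monomial
# as a bitmask of the variables it contains.
all_monos = [sum(1 << i for i in c)
             for r in range(d + 1) for c in itertools.combinations(range(n), r)]
# Keep the constant term (mask 0) so that P(0) = 1; flip coins on the rest.
poly = [m for m in all_monos if m == 0 or random.random() < 0.5]

def P(x):
    # A monomial m evaluates to 1 at x iff every variable in m is set in x.
    return sum((x & m) == m for m in poly) % 2

# M[a][b] = P(a - b) = P(a XOR b) over F_2; pack each row as an integer bitmask.
rows = [sum(P(a ^ b) << b for b in range(2 ** n)) for a in range(2 ** n)]

def gf2_rank(rows):
    # Gaussian elimination over GF(2): one stored row per pivot bit.
    pivots = {}
    rank = 0
    for row in rows:
        while row:
            h = row.bit_length() - 1
            if h in pivots:
                row ^= pivots[h]  # clear the top bit, keep reducing
            else:
                pivots[h] = row
                rank += 1
                break
    return rank

m_half = sum(comb(n, r) for r in range(d // 2 + 1))  # m_{d/2}
rank = gf2_rank(rows)
print(rank, "<=", 2 * m_half)
assert rank <= 2 * m_half
```

The matrix is 128 x 128, so the bound 2m_2 = 58 is a genuine constraint; a random 0-1 matrix of that size would typically have nearly full rank.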

The most standard context in which you look for large subsets of F_2^n with restricted difference sets is that of error-correcting codes, where you ask that no two distinct elements of A have difference with Hamming weight (that is, number of 1 entries) at most k.

It would be cool if the Croot-Lev-Pach lemma gave great new bounds on error-correcting codes, but I don’t think it’s to be.  You would need to find a polynomial P which vanishes on all nonzero vectors of weight larger than k, but which doesn’t vanish at 0. Moreover, you already know that the balls of size k/2 around the points of A are disjoint, which gives you the “volume bound”

|A| < 2^n / m_{k/2}.

I think that’ll be hard to beat.

If you just take a random polynomial P, the support of P will take up about half of F_2^n; so it’s not very surprising that a set whose difference misses that support has to be small!

Here’s something fun you can do, though.  Let s_i be the i-th symmetric function on x_1, … x_n.  Then

$s_i(x) = {wt(x) \choose i}$

where wt(x) denotes Hamming weight.  Recall also that the binomial coefficient

${k \choose 2^a}$

is odd precisely when the a’th binary digit of k is 1.

Thus,

$(1-s_1(x))(1-s_2(x))(1-s_4(x))\ldots(1-s_{2^{b-1}}(x))$

is a polynomial of degree 2^b-1 which vanishes on x unless the last b digits of wt(x) are 0; that is, it vanishes unless wt(x) is a multiple of 2^b.  Thus we get:

Fact:  Let A be a subset of F_2^n such that the difference of two distinct elements of A never has weight a multiple of 2^b.  Then

$|A| \leq 2m_{2^{b-1} - 1}$.

Note that this is pretty close to sharp!  Because if we take A to be the set of vectors of weight at most 2^{b-1} - 1, then A clearly has the desired property, and already that’s half as big as the upper bound above.  (What’s more, you can throw in all the vectors of weight 2^{b-1} whose first coordinate is 1; no two of these sum to something of weight 2^b.  The Erdős-Ko-Rado theorem says you can do no better with those weight 2^{b-1} vectors.)
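Since s_i(x) depends only on wt(x), the vanishing statement above reduces to a fact about binomial parities, which is easy to machine-check (a throwaway script; the function name is mine):

```python
from math import comb

def poly_at_weight(b, w):
    # Value of (1 - s_1)(1 - s_2)(1 - s_4)...(1 - s_{2^{b-1}}) at any x with
    # wt(x) = w, computed mod 2 via s_i(x) = C(wt(x), i).
    val = 1
    for a in range(b):
        val = val * (1 - comb(w, 2 ** a)) % 2
    return val

for b in (1, 2, 3):
    for w in range(64):
        # The product vanishes exactly when wt(x) is not a multiple of 2^b.
        assert poly_at_weight(b, w) == (1 if w % 2 ** b == 0 else 0)
print("vanishing pattern confirmed for b = 1, 2, 3")
```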

Is there an easier way to prove this?

When b=1, this just says that a set with no differences of even Hamming weight has size at most 2; that’s clear, because two vectors whose Hamming weight has the same parity differ by a vector of even weight.  Even for b=2 this isn’t totally obvious to me.  The result says that a subset of F_2^n with no differences of weight divisible by 4 has size at most 2+2n.  On the other hand, you can get 1+2n by taking 0, all weight-1 vectors, and all weight-2 vectors with first coordinate 1.  So what’s the real answer, is it 1+2n or 2+2n?

Write H(n,k) for the size of the largest subset of F_2^n having no two vectors differing by a vector of Hamming weight exactly k.  Then if 2^b is the largest power of 2 less than n, we have shown above that

$m_{2^{b-1} - 1 } \leq H(n,2^b) \leq 2m_{2^{b-1} - 1}$.

On the other hand, if k is odd, then H(n,k) = 2^{n-1}; we can just take A to be the set of all even-weight vectors!  So perhaps H(n,k) actually depends on k in some modestly interesting 2-adic way.
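Tiny cases of this can be confirmed by brute force; here's a throwaway script for n = 4 (everything here, the bitmask encoding included, is mine, and it doesn't scale past very small n):

```python
def H(n, k):
    # Largest subset of F_2^n with no two elements differing in exactly k
    # coordinates, by exhaustive search over all 2^(2^n) subsets (as bitmasks).
    N = 2 ** n
    # bad[v] = bitmask of vectors at Hamming distance exactly k from v
    bad = [sum(1 << u for u in range(N) if bin(u ^ v).count("1") == k)
           for v in range(N)]
    best = 0
    for mask in range(2 ** N):
        if bin(mask).count("1") <= best:
            continue  # can't beat the current record
        m, ok = mask, True
        while m:
            v = (m & -m).bit_length() - 1  # smallest element of the subset
            if mask & bad[v]:
                ok = False
                break
            m &= m - 1
        if ok:
            best = bin(mask).count("1")
    return best

print(H(4, 1), H(4, 3))  # both 8 = 2^{n-1}, matching the even-weight construction
```

(For any single fixed weight-k vector w, the pairs {v, v+w} partition F_2^n and A meets each pair at most once, so H(n,k) ≤ 2^{n-1} always; the content of the odd-k statement is that this trivial bound is attained.)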

The sharpness argument above can be used to show that H(4m,2m) is at least

$2(1 + 4m + {4m \choose 2} + \ldots + {4m \choose m-1} + {4m-1 \choose m-1})$.  (*)

I was talking to Nigel Boston about this — he did some computations which make it look like H(4m,2m) is exactly equal to (*) for m=1,2,3.  Could that be true for general m?

(You could also ask about sets with no difference of weight a multiple of k; not sure which is the more interesting question…)

Update:  Gil Kalai points out to me that much of this is very close to and indeed in some parts a special case of the Frankl-Wilson theorem…  I will investigate further and report back!

## Prime subset sums

Efrat Bank‘s interesting number theory seminar here before break was about sums of arithmetic functions on short intervals in function fields.  As I was saying when I blogged about Hast and Matei’s paper, a short interval in F_q[t] means:  the set of monic degree-n polynomials P such that

deg(P-P_0) < h

for some monic degree-n P_0 and some small h.  Bank sets this up even more generally, defining an interval in the space V of global sections of a line bundle on an arbitrary curve over F_q.  In Bank’s case, by contrast with the number field case, an interval is an affine linear subspace of some ambient vector space of forms.  This leads one to wonder:  what’s special about these specific affine spaces?  What about general spaces?

And then one wonders:  well, what classical question over Z does this correspond to?  So here it is:  except I’m not sure this is a classical question, though it sort of seems like it must be.

Question:  Let c > 1 be a constant.  Let A be a set of integers with |A| = n and max(A) < c^n.  Let S be the (multi)set of sums of subsets of A, so |S| = 2^n.  What can we say about the number of primes in S?  (Update:  as Terry points out in comments, I need some kind of coprimality assumption; at the very least we should ask that there’s no prime factor common to everything in A.)

I’d like to say that S is kind of a “generalized interval” — if A is the first n powers of 2, it is literally an interval.  One can also ask about other arithmetic functions:  how big can the average of Möbius be over S, for instance?  Note that the condition on max(A) is important:  if you let the elements of A get as big as you want, you can make S have no primes or you can make S be half prime (thanks to Ben Green for pointing this out to me.)  The condition on max(A) can be thought of as analogous to requiring that an interval containing N has size at least some fixed power of N, a good idea if you want to average arithmetic functions.
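No theory here, but the question invites a quick experiment (a sketch; the two choices of A, the trial-division primality test, and all names are mine; for a random A the coprimality caveat above holds generically):

```python
import random

def is_prime(m):
    # simple trial division; fine for the sizes below
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True

def subset_sums(A):
    # all 2^|A| subset sums, as a multiset
    sums = [0]
    for a in A:
        sums = sums + [s + a for s in sums]
    return sums

n = 12
A_powers = [2 ** i for i in range(n)]          # here S is literally [0, 2^n - 1]
random.seed(0)
A_random = random.sample(range(1, 2 ** n), n)  # max(A) < c^n with c = 2

for label, A in (("powers of 2", A_powers), ("random", A_random)):
    S = subset_sums(A)
    print(label, ":", sum(is_prime(s) for s in S), "primes among", len(S), "subset sums")
```

For the powers-of-2 case this just counts primes up to 2^n - 1, so the interesting comparison is how far the random A falls from that.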

## Women in math: accountability

I’ve talked about women in math a lot on this blog and maybe you think of me as someone who is aware of and resistant to sexism in our profession.  But what if we look at some actual numbers?

My Ph.D. students:  2 out of 15 are women.

Coauthors, last 5 years: 2 out of 23 are women.

Letters posted on MathJobs, last 2 years:  3 out of 24 are women.

That is sobering.  I’m hesitant about posting this, but I think it’s a good idea for senior people to look at their own numbers and get some sense of how much they’re actually doing to support early-career women in the profession.

Update:  I removed the numbers for tenure/promotion letters.  A correspondent pointed out that these, unlike the other items, are supposed to be confidential, and given the small numbers are at least partially de-anonymizable.


## Hast and Matei, “Moments of arithmetic functions in short intervals”

Two of my students, Daniel Hast and Vlad Matei, have an awesome new paper, and here I am to tell you about it!

A couple of years ago at AIM I saw Jon Keating talk about this charming paper by him and Ze’ev Rudnick.  Here’s the idea.  Let f be an arithmetic function: in that particular paper, it’s the von Mangoldt function, but you can ask the same question (and they do) for Möbius and many others.

Now we know the von Mangoldt function is 1 on average.  To be more precise: in a suitably long interval ($[X,X+X^{1/2 + \epsilon}]$ is long enough under Riemann) the average of von Mangoldt is always close to 1.  But the average over a short interval can vary.  You can think of the sum of von Mangoldt over $[x,x+H]$, with H = x^d, as a function f(x) which has mean 1 but which for d < 1/2 need not be concentrated at 1.  Can we understand how much it varies?  For a start, can we compute its variance as x ranges from 1 to X?

This is the subject of a conjecture of Goldston and Montgomery.  Keating and Rudnick don’t prove that conjecture in its original form; rather, they study the problem transposed into the context of the polynomial ring F_q[t].  Here, the analogue of archimedean absolute value is the absolute value

$|f| = q^{\deg f}$

so an interval of size q^h is the set of f such that deg(f-f_0) < h for some polynomial f_0.

So you can take the monic polynomials of degree n, split that up into q^{n-h} intervals of size q^h, and sum f over each interval, and take the variance of all these sums.  Call this V_f(n,h); for f the von Mangoldt function, I’ll just write V(n,h).  What Keating and Rudnick show is that

$\lim_{q \rightarrow \infty} q^{-(h+1)} V(n,h) = n - h - 2$.

This is not quite the analogue of the Goldston-Montgomery conjecture; that would be the limit as n,h grow with q fixed.  That, for now, seems out of reach.  Keating and Rudnick’s argument goes through the Katz equidistribution theorems (plus some rather hairy integration over groups) and the nature of those equidistribution theorems — like the Weil bounds from which they ultimately derive — is to give you control as q gets large with everything else fixed (or at least growing very slo-o-o-o-o-wly.)  Generally speaking, a large-q result like this reflects knowledge of the top cohomology group, while getting a fixed-q result requires some control of all the cohomology groups, or at least all the cohomology groups in a large range.
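To make the objects concrete, here's a direct computation of V(n,h) in a tiny case (a sketch; q = 3 is nowhere near the q → ∞ regime, so no numerical match with n - h - 2 is promised, and all names and conventions are mine):

```python
import itertools
import statistics

q, n, h = 3, 4, 1

def polymul(a, b):
    # multiply coefficient tuples (lowest degree first) over F_q
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % q
    return tuple(out)

def monic(d):
    # all monic polynomials of degree d over F_q
    for low in itertools.product(range(q), repeat=d):
        yield low + (1,)

# sieve: every reducible monic polynomial of degree <= n is a product of two
# monic polynomials of positive degree
reducible = set()
for d1 in range(1, n):
    for d2 in range(d1, n - d1 + 1):
        for f in monic(d1):
            for g in monic(d2):
                reducible.add(polymul(f, g))
irred = {d: [f for f in monic(d) if f not in reducible] for d in range(1, n + 1)}

# von Mangoldt: Lambda(P^k) = deg P for P irreducible, 0 otherwise
Lam = {}
for d in range(1, n + 1):
    if n % d == 0:
        for p in irred[d]:
            f = (1,)
            for _ in range(n // d):
                f = polymul(f, p)
            Lam[f] = d
assert sum(Lam.values()) == q ** n  # prime number theorem for F_q[t], on the nose

# an h-interval fixes the coefficients of t^h, ..., t^n; group degree-n monic
# polynomials by those coefficients and sum Lambda over each interval
sums = {}
for f in monic(n):
    sums[f[h:]] = sums.get(f[h:], 0) + Lam.get(f, 0)
V = statistics.pvariance(sums.values())
print("V(%d,%d) = %.3f; Keating-Rudnick large-q prediction q^(h+1)(n-h-2) = %d"
      % (n, h, V, q ** (h + 1) * (n - h - 2)))
```

The one exact check built in is the polynomial prime number theorem: the von Mangoldt function summed over monic degree-n polynomials is exactly q^n.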

Now for Hast and Matei’s paper.  Their observation is that the variance of the von Mangoldt function can actually be studied algebro-geometrically without swinging the Katz hammer.  Namely:  there’s a variety X_{2,n,h} which parametrizes pairs (f_1,f_2) of monic degree-n polynomials whose difference has degree less than h, together with an ordering of the roots of each polynomial.  X_{2,n,h} carries an action of S_n x S_n by permuting the roots.  Write Y_{2,n,h} for the quotient by this action; that’s just the space of pairs of polynomials in the same h-interval.  Now the variance Keating and Rudnick ask about is more or less

$\sum_{(f_1, f_2) \in Y_{2,n,h}(\mathbf{F}_q)} \Lambda(f_1) \Lambda(f_2)$

where $\Lambda$ is the von Mangoldt function.  But note that $\Lambda(f_i)$ is completely determined by the factorization of $f_i$; this being the case, we can use Grothendieck-Lefschetz to express the sum above in terms of the Frobenius traces on the groups

$H^i(X_{2,n,h},\mathbf{Q}_\ell) \otimes_{\mathbf{Q}_\ell[S_n \times S_n]} V_\Lambda$

where $V_\Lambda$ is a representation of $S_n \times S_n$ keeping track of the function $\Lambda$.  (This move is pretty standard and is the kind of thing that happens all over the place in my paper with Church and Farb about point-counting and representation stability, in section 2.2 particularly.)

When the smoke clears, the behavior of the variance V(n,h) as q gets large is controlled by the top “interesting” cohomology group of X_{2,n,h}.  Now X_{2,n,h} is a complete intersection, so you might think its interesting cohomology is all in the middle.  But no — it’s singular, so you have to be more careful.  Hast and Matei carry out a careful analysis of the singular locus of X_{2,n,h}, and use this to show that the cohomology groups vanish in a large range.  Outside that range, Weil bounds give an upper bound on the trace of Frobenius.  In the end they get

$V(n,h) = O(q^{h+1})$.

In other words, they get the order of growth from Keating-Rudnick but not the constant term, and they get it without invoking all the machinery of Katz.  What’s more, their argument has nothing to do with von Mangoldt; it applies to essentially any function of f that only depends on the degrees and multiplicities of the irreducible factors.

What would be really great is to understand that top cohomology group H as an S_n x S_n – representation.  That’s what you’d need in order to get that n-h-2 from Keating-Rudnick; you could just compute it as the inner product of H with $V_\Lambda$.  You want the variance of a different arithmetic function, you pair H with a different representation.  H has all the answers.  But neither they nor I could see how to compute H.

Then came Brad Rodgers.  Two months ago, he posted a preprint which gets the constant term for the variance of any arithmetic function in short intervals.  His argument, like Keating-Rudnick’s, goes through Katz equidistribution.  This is the same information we would have gotten from knowing H.  And it turns out that Hast and Matei can actually provably recover H from Rodgers’ result; the point is that the power of q Rodgers gets can only arise from H, because all the other cohomology groups of high enough weight are the ones Hast and Matei already showed are zero.

So in the end they find

$H = \oplus_\lambda V_\lambda \boxtimes V_\lambda$

where $\lambda$ ranges over all partitions of n whose top row has length at most n-h-2.

I don’t think I’ve ever seen this kind of representation come up before — is it familiar to anyone?

Anyway:  what I like so much about this new development is that it runs contrary to the main current in this subject, in which you prove theorems in topology or algebraic geometry and use them to solve counting problems in arithmetic statistics over function fields.  Here, the arrow goes the other way; from Rodgers’s counting theorem, they get a computation of a cohomology group which I can’t see any way to get at by algebraic geometry.  That’s cool!  The other example I know of the arrow going this direction is this beautiful paper of Browning and Vishe, in which they use the circle method over function fields to prove the irreducibility of spaces of rational curves on low-degree hypersurfaces.  I should blog about that paper too!  But this is already getting long….

## Call for nominations for the Chern Medal

This is a guest post by Caroline Series.

The Chern Medal is a relatively new prize, awarded once every four years jointly by the IMU and the Chern Medal Foundation (CMF) to an individual whose accomplishments warrant the highest level of recognition for outstanding achievements in the field of mathematics. Funded by the CMF, the Medalist receives a cash prize of US$250,000. In addition, each Medalist may nominate one or more organizations to receive funding totalling US$250,000, for the support of research, education, or other outreach programs in the field of mathematics.

Professor Chern devoted his life to mathematics, both in active research and education, and in nurturing the field whenever the opportunity arose. He obtained fundamental results in all the major aspects of modern geometry and founded the area of global differential geometry. Chern exhibited keen aesthetic tastes in his selection of problems, and the breadth of his work deepened the connections of geometry with different areas of mathematics. He was also generous during his lifetime in his personal support of the field.

Nominations should be sent to the Prize Committee Chair: Caroline Series, email: chair(at)chern18.mathunion.org by 31st December 2016. Further details and nomination guidelines for this and the other IMU prizes can be found here.  Note that previous winners of other IMU prizes, such as the Fields Medal, are not eligible for consideration.


## Math!

I really like talking with AB about arithmetic and her strategies for doing problems.  All this Common Core stuff about breaking up into hundreds and tens and ones that people like to make fun of?  That’s how she does things.  She can describe her process a lot more articulately than most grownups can, because it’s less automatic for her.  I learn a lot about how to teach math by watching her learn math.

## Kevin Jamieson, hyperparameter optimization, playoffs

Kevin Jamieson gave a great seminar here on Hyperband, his algorithm for hyperparameter optimization.

Here’s the idea.  Doing machine learning involves making a lot of choices.  You set up your deep learning neural thingamajig but that’s not exactly one size fits all:  How many layers do you want in your net?  How fast do you want your gradient descents to step?  And etc. and etc.  The parameters are the structures your thingamajig learns.  The hyperparameters are the decisions you make about your thingamajig before you start learning.  And it turns out these decisions can actually affect performance a lot.  So how do you know how to make them?

Well, one option is to pick N choices of hyperparameters at random, run your algorithm on your test set with each choice, and see how you do.  The problem is, thingamajigs take a long time to converge.  This is expensive to do, and when N is small, you’re not really seeing very much of hyperparameter space (which might have dozens of dimensions.)

A more popular choice is to place some prior on the function

F:[hyperparameter space] -> [performance on test set]

You make a choice of hyperparameters, you run the thingamajig, based on the output you update your distribution on F, based on your new distribution you choose a likely-to-be-informative hyperparameter and run again, etc.

This is called “Bayesian optimization of hyperparameters” — it works pretty well — but really only about as well as taking twice as many random choices of hyperparameters, in practice.  A 2x speedup is nothing to sneeze at, but it still means you can’t get N large enough to search much of the space.

Kevin thinks you should think of this as a multi-armed bandit problem.  You have a choice of hyperparameters whose performance you’d like to judge.  You could run your thingamajig with that choice until it seems to be converging, and see how well it does.  But that’s expensive.  Alternatively, you could run your thingamajig (1/c) times as long; then you have time to consider Nc values of the hyperparameters, much better.  But of course you have a much less accurate assessment of the performance:  maybe the best performer in that first (1/c) time segment is actually pretty bad, and just got off to a good start!

So you do this instead.  Run the thingamajig for time (1/c) on Nc values.  That costs you N.  Then throw out all values of the hyperparameters that came in below median on performance.  You still have (1/2)Nc values left, so continue running those processes for another time (1/c).  That costs you (1/2)N.  Throw out everything below the median.  And so on.  When you get to the end you’ve spent N log Nc, not bad at all; and instead of looking at only N hyperparameters, you’ve looked at Nc, where c might be pretty big.  And you haven’t wasted lots of processor time following unpromising choices all the way to the end; rather, you’ve mercilessly culled the low performers along the way.
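In code, the culling procedure looks something like this (a toy simulation of the successive-halving core only, not of full Hyperband; the noisy objective and every name here are mine):

```python
import random

random.seed(0)

def successive_halving(configs, evaluate, unit=1):
    """Run every surviving config for one more unit of time, then cull
    everything below the median score; repeat until one config remains."""
    pool = list(configs)
    elapsed = {c: 0 for c in pool}
    cost = 0
    while len(pool) > 1:
        scores = {}
        for c in pool:
            elapsed[c] += unit
            cost += unit
            scores[c] = evaluate(c, elapsed[c])
        pool.sort(key=scores.get, reverse=True)
        pool = pool[: len(pool) // 2]  # keep the top half
    return pool[0], cost

# Toy objective: each config has a true quality, observed through noise that
# shrinks with training time (like a run that is still converging).
quality = {c: random.random() for c in range(64)}
def evaluate(c, t):
    return quality[c] + random.gauss(0, 0.3 / t)

winner, cost = successive_halving(quality, evaluate)
print("picked", winner, "at cost", cost, "; true best is", max(quality, key=quality.get))
```

With 64 configs and unit budgets the loop spends 64 + 32 + 16 + 8 + 4 + 2 = 126 units, and the survivors get progressively more of the budget.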

But how do you choose c?  I insisted to Kevin that he call c a hyperhyperparameter but he wasn’t into it.  No fun!  Maybe the reason Kevin resisted my choice is that he doesn’t actually choose c; he just carries out his procedure once for each c as c ranges over 1, 2, 4, 8, …, N; this costs you only another log N.

In practice, this seems to find hyperparameters just as well as more fancy Bayesian methods, and much faster.  Very cool!  You can imagine doing the same things in simpler situations (e.g. I want to do a gradient descent, where should I start?) and Kevin says this works too.

In some sense this is how a single-elimination tournament works!  In the NCAA men’s basketball tournament, 64 teams each play a game; the teams above the median are 1-0, while the teams below the median, at 0-1, get cut.  Then the 1-0 teams each play one more game:  the teams above the median at 2-0 stay, the teams below the median at 1-1 get cut.

What if the regular season worked like this?  Like if in June, the bottom half of major league baseball just stopped playing, and the remaining 15 teams duked it out until August, then down to 8… It would be impossible to schedule, of course.  But in a way we have some form of it:  at the July 31 trade deadline, teams sufficiently far out of the running can just give up on the season and trade their best players for contending teams’ prospects.  Of course the bad teams keep playing games, but in some sense, the competition has narrowed to a smaller field.

## Luke Pebody on sharp bounds for tri-colored sum-free sets

Quick update on this post, where I listed three variations on the problem of finding large subsets of an abelian group A with no three terms in arithmetic progression.  The quantities I called G_2(F_q^n) and G_3(F_q^n) in that post are both bounded above by M(F_q^n), the number of monomials of total degree at most (q-1)n/3 and degree at most q-1 in each variable.  There’s already been a lot of motion in the few weeks since I wrote that post!  A result of Kleinberg, Sawin, and Speyer shows that G_2(F_q^n) is bounded between c_q^n and c_q^{n-epsilon} for some constant c_q, which they explicitly describe but which is kind of hard to compute.  But it’s kind of a win-win.  Either c_q^n is less than M(F_q^n), in which case, great, improvement over the results of CLP/EG, or not, in which case, great, the bounds on tri-colored sum-free sets in CLP/EG are tight up to subexponential factors!  And now Luke Pebody has posted a preprint showing that the latter is the case.

To sum up:  the quantities G_2(F_q^n) and G_3(F_q^n) which I alluded to in the original post are now bounded above by M(F_q^n) and below by M(F_q^n)^{1-epsilon}.  Wonderful!

This only heightens the interest in the original problem of estimating G_1(F_q^n), the size of the largest subset of F_q^n containing no three-term arithmetic progression.  Is the bound M(F_q^n) essentially sharp?  Or is G_1(F_q^n) much smaller?