As I mentioned, I’m reading Ph.D. admission files. Each file is read by two committee members and thus each file has two numerical scores.
How to put all this information together into a preliminary ranking?
The traditional way is to assign to each applicant their mean score. But there’s a problem: different raters have different scales. My 7 might be your 5.
You could just normalize the scores by subtracting that rater’s overall mean. But that’s problematic too. What if one rater actually happens to have looked at stronger files? Or even if not: what if the relation between rater A’s scale and rater B’s scale isn’t linear? Maybe, for instance, rater A gives everyone she doesn’t think should get in a 0, while rater B uses a range of low scores to express the same opinion, depending on just how unsuitable the candidate seems.
Here’s what I did last year. If (r,a,a’) is a triple where r is a rater and a and a’ are two applicants, such that r rated a higher than a’, you can think of that as a judgment that a is more admittable than a’. And you can put all those judgments from all the raters in a big bag, and then see if you can find a ranking of the applicants (or, if you like, a real-valued function f on the applicants) such that, for every judgment a > a’, we have f(a) > f(a’).
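To make the bag of judgments concrete, here is a minimal Python sketch. The data layout (a dict of dicts called `scores`), the applicant names, and the function name `judgments` are all made up for illustration, and treating a tie as "no judgment" is an assumption rather than something specified above.

```python
from itertools import combinations

# Hypothetical data: scores[rater][applicant] = numerical score.
scores = {
    "rater_A": {"ann": 7, "bob": 5, "cat": 5},
    "rater_B": {"bob": 3, "cat": 6, "dan": 1},
}

def judgments(scores):
    """Return (rater, a, a2) triples where the rater scored a strictly above a2."""
    bag = []
    for rater, by_applicant in scores.items():
        # Every pair of applicants this rater read yields at most one judgment.
        for x, y in combinations(sorted(by_applicant), 2):
            if by_applicant[x] > by_applicant[y]:
                bag.append((rater, x, y))
            elif by_applicant[y] > by_applicant[x]:
                bag.append((rater, y, x))
            # Assumption: a tie expresses no preference, so it adds nothing.
    return bag

bag = judgments(scores)
for triple in bag:
    print(triple)
```

Note that only the order of each rater's scores enters the bag, so the "my 7 might be your 5" problem does not arise at this stage.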
Of course, this might not be possible — two raters might disagree! Or there might be more complicated incompatibilities generated by multiple raters. Still, you can ask: what if I tried to minimize the number of “mistakes”, i.e. the number of judgments in your bag that your choice of ranking contradicts?
Well, you can ask that, but you may not get an answer, because that’s a highly non-convex minimization problem, and is as far as we know completely intractable.
But here’s a way out, or at least a way part of the way out — we can use a convex relaxation. Set it up this way. Let V be the space of real-valued functions on applicants. For each judgment j, let mistake_j(f) be the step function
mistake_j(f) = 1 if f(a) < f(a’) + 1
mistake_j(f) = 0 if f(a) >= f(a’) + 1
Then “minimize total number of mistakes” is the problem of minimizing
M = sum_j mistake_j(f)
over V. And M is terribly nonconvex. If you try to gradient-descend (e.g. start with a random ranking and then switch two adjacent applicants whenever doing so reduces the total number of mistakes) you are likely to get caught in a local minimum that’s far from optimal. (Or at least that can happen; whether this typically actually happens in practice, I haven’t checked!)
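As a rough illustration of M and of the adjacent-swap search just described, here is a small Python sketch. The bag, the applicant names, and the helper names (`total_mistakes`, `ranking_to_f`, `adjacent_swap_search`) are hypothetical, and turning a ranking into scores with gaps of exactly 1 is an assumption made so that the "+ 1" margin in mistake_j is met whenever the order is right.

```python
# Hypothetical bag of judgments: (rater, higher-rated, lower-rated).
bag = [
    ("r1", "ann", "bob"),
    ("r1", "bob", "cat"),
    ("r2", "cat", "ann"),  # together with r1's judgments this forms a cycle
    ("r2", "cat", "dan"),
]

def total_mistakes(f, bag):
    """M(f): the number of judgments a > a2 with f(a) < f(a2) + 1."""
    return sum(1 for _rater, a, a2 in bag if f[a] < f[a2] + 1)

def ranking_to_f(order):
    """Turn a best-first ordering into integer scores spaced 1 apart."""
    n = len(order)
    return {name: n - i for i, name in enumerate(order)}

def adjacent_swap_search(order, bag):
    """Swap adjacent applicants whenever that strictly reduces M; stop when stuck."""
    order = list(order)
    best = total_mistakes(ranking_to_f(order), bag)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            order[i], order[i + 1] = order[i + 1], order[i]
            m = total_mistakes(ranking_to_f(order), bag)
            if m < best:
                best, improved = m, True
            else:
                # The swap did not help, so undo it.
                order[i], order[i + 1] = order[i + 1], order[i]
    return order, best

start = ["dan", "cat", "bob", "ann"]
order, best = adjacent_swap_search(start, bag)
print(order, best)
```

Because of the cycle among ann, bob and cat, no ranking can get M down to 0 on this bag. The search only ever stops at a point where no single adjacent swap helps, which is exactly the kind of place a local minimum can hide.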
So here’s the move: replace mistake_j(f) with a function that’s “close enough,” but is convex. It acts as a sort of tractable proxy for the optimization you’re actually after. The customary choice here is the hinge loss:
hinge_j(f) = max(0, 1 - (f(a) - f(a’))).
Then H := sum_j hinge_j(f) is a convex function of f, which you can easily minimize in MATLAB or Python. If you can actually find an f with H(f) = 0, you’ve found a ranking which agrees with every judgment in your bag. Usually you can’t, but that’s OK! You’ve very quickly found a function f which does a decent job aggregating the committee scores, and which you can use as your starting point.
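For the curious, here is one way the minimization of H could be sketched in plain Python, with the hinge taken as max(0, 1 - (f(a) - f(a’))). This uses a fixed-step subgradient descent, which is a choice made here for the sake of having no dependencies, not something prescribed above; a linear-programming solver would do the job more cleanly. The bag, the names `hinge_total` and `minimize_hinge`, and the step size and iteration count are all assumptions for illustration.

```python
# Hypothetical bag of judgments: (rater, higher-rated, lower-rated).
bag = [
    ("r1", "ann", "bob"),
    ("r1", "bob", "cat"),
    ("r2", "cat", "ann"),  # disagrees with the chain ann > bob > cat
    ("r2", "cat", "dan"),
]

def hinge_total(f, bag):
    """H(f) = sum over judgments a > a2 of max(0, 1 - (f(a) - f(a2)))."""
    return sum(max(0.0, 1.0 - (f[a] - f[a2])) for _rater, a, a2 in bag)

def minimize_hinge(bag, steps=500, lr=0.05):
    """Fixed-step subgradient descent on H, returning the best f seen."""
    names = sorted({x for _rater, a, a2 in bag for x in (a, a2)})
    f = {n: 0.0 for n in names}
    best_f, best_h = dict(f), hinge_total(f, bag)
    for _ in range(steps):
        grad = {n: 0.0 for n in names}
        for _rater, a, a2 in bag:
            if f[a] - f[a2] < 1.0:  # this hinge term is active
                grad[a] -= 1.0
                grad[a2] += 1.0
        for n in names:
            f[n] -= lr * grad[n]
        h = hinge_total(f, bag)
        if h < best_h:  # fixed steps can oscillate, so keep the best iterate
            best_f, best_h = dict(f), h
    return best_f, best_h

f, h = minimize_hinge(bag)
ranking = sorted(f, key=f.get, reverse=True)
print(ranking, round(h, 3))
```

On a bag with no disagreements this procedure drives H all the way to 0; on a bag like the one above it cannot, and the resulting f is only a starting point for a ranking.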
Now here’s a paper by Nihar Shah and Martin Wainwright that commenter Dustin Mixon linked in my last ranking post. It suggests doing something much simpler: using a linear function as a proxy for mistake_j. What this amounts to is: score each applicant by the number of times they were placed above another applicant. Should I be doing this instead? My first instinct is no. It looks like Shah and Wainwright assume that each pair of applicants is equally likely to be compared; I think I don’t want to assume that, and I think (but correct me if I’m wrong!) the optimality they get may not be robust to that?
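The counting rule, as described in the sentence above (this is a reading of that one-line description, not code taken from the paper), can be sketched in a few lines of Python. The bag and the name `win_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical bag of judgments: (rater, higher-rated, lower-rated).
bag = [
    ("r1", "ann", "bob"),
    ("r1", "bob", "cat"),
    ("r2", "cat", "ann"),
    ("r2", "cat", "dan"),
]

def win_counts(bag):
    """Score each applicant by how many times they were placed above someone."""
    wins = Counter()
    for _rater, a, a2 in bag:
        wins[a] += 1
        wins[a2] += 0  # make sure applicants who never win still appear
    return wins

wins = win_counts(bag)
for name, count in wins.most_common():
    print(name, count)
```

An applicant who happens to be compared more often has more chances to collect wins, which is where the worry about unequally likely comparisons comes in.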
Anyway, all thoughts on this question — or suggestions as to something totally different I could be doing — welcome, of course.