Category Archives: language

Not to exceed 25%

Supreme Court will hear a math case!

At issue in Murphy v. Smith:  the amount of a judgment that a court can apply to covering attorney’s fees.  Here’s the relevant statute:

Whenever a monetary judgment is awarded in an action described in paragraph (1), a portion of the judgment (not to exceed 25 percent) shall be applied to satisfy the amount of attorney’s fees awarded against the defendant.

To be clear: there are two amounts of money here.  The first is the amount of attorney’s fees awarded against the defendant; the second is the portion of the judgment which the court applies towards that first amount.  This case concerns the discretion of the court to decide on the second number.

In Murphy’s case, the court decided to apply just 10% of the judgment to attorney’s fees.  Other circuit courts have licensed this practice, interpreting the law to allow the court discretion to apply any portion between 0 and 25% of the judgment to attorney’s fees.  The 7th circuit disagreed, saying that, given that the amount of attorney’s fees awarded exceeded 25% of the judgment, the court was obligated to apply the full 25% maximum.

The cert petition to the Supreme Court hammers this view, which it calls “non-literal”:

The Seventh Circuit is simply wrong in interpreting this language to mean “exactly 25 percent.” “Statutory interpretation, as we always say, begins with the text.” Ross v. Blake, 136 S. Ct. 1850, 1856 (2016). Here, the text is so clear that interpretation should end with the text as well. “Not to exceed” does not mean “exactly.”

This seems pretty clearly correct:  “not to exceed 25%” means what it means, not “exactly 25%.”  So the 7th circuit just blew it, right?

Nope!  The 7th circuit is right, the other circuits and the cert are wrong, and the Supreme Court should affirm.  At least that’s what I say.  Here’s why.

I can imagine at least three interpretations of the statute.

  1.  The court has to apply exactly 25% of the judgment to attorney’s fees.
  2.  The court has to apply the smaller of the following numbers:  the total amount awarded in attorney’s fees, or 25% of the judgment.
  3.  The court has full discretion to apply any nonnegative amount of the judgment to attorney’s fees.

Cert holds that 3 is correct, that the 7th circuit applied 1, and that 1 is absurdly wrong.  In fact, the 7th circuit applied 2, which is correct, and 1 and 3 are both wrong.

1 is wrong:  There are two reasons.  One is pointed out by the cert petition:  “Not to exceed 25%” doesn’t mean “Exactly 25%.”  Another reason is that “Exactly 25%” might be more than the amount awarded in attorney’s fees, in which case it would be ridiculous to apply more money than was actually owed.

7th circuit applied 2, not 1:  The opinion reads:

In Johnson v. Daley, 339 F.3d 582, 585 (7th Cir. 2003) (en banc), we explained that § 1997e(d)(2) required that “attorneys’ compensation come[] first from the damages.” “[O]nly  if 25% of the award is inadequate to compensate counsel fully” does the defendant contribute more to the fees. Id. We continue to believe that is the most natural reading of the statutory text. We do not think the statute contemplated a discretionary decision by the district court. The statute neither uses discretionary language nor provides any guidance for such discretion.

The attorney’s compensation comes first out of the damages, but if that compensation is less than 25% of the damages, then less than 25% of the damages will be applied.  This is interpretation 2.  In the case at hand, 25% of the damages was $76,933.46 , while the attorney’s fees awarded were $108,446.54.   So, in this case, the results of applying 1 and 2 are the same; but the court’s interpretation is clearly 2, not the absurd 1.
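
The difference between interpretations 1 and 2 can be sketched in a few lines of Python.  This is only an illustration under stated assumptions:  the function names are made up, amounts are in whole cents, the judgment is back-computed as four times the $76,933.46 quoted above (the figure in the record may differ by a rounding cent), and the $50,000 award is hypothetical.

```python
def cap_cents(judgment_cents):
    """25 percent of the judgment, in whole cents (rounded down)."""
    return judgment_cents * 25 // 100

def interpretation_1(judgment_cents, fees_cents):
    """Apply exactly 25% of the judgment, whatever the fee award is."""
    return cap_cents(judgment_cents)

def interpretation_2(judgment_cents, fees_cents):
    """Apply the smaller of the fee award and 25% of the judgment."""
    return min(fees_cents, cap_cents(judgment_cents))

# Assumption: judgment back-computed as 4 x $76,933.46; not taken from the record.
JUDGMENT = 30_773_384
MURPHY_FEES = 10_844_654   # $108,446.54, the fee award quoted in the post
SMALL_FEES = 5_000_000     # hypothetical $50,000 fee award

# Fees exceed the cap: both readings apply the same amount.
print(interpretation_1(JUDGMENT, MURPHY_FEES), interpretation_2(JUDGMENT, MURPHY_FEES))
# Fees below the cap: reading 1 would apply more than is owed, reading 2 would not.
print(interpretation_1(JUDGMENT, SMALL_FEES), interpretation_2(JUDGMENT, SMALL_FEES))
```

The two readings only come apart when the fee award is smaller than 25% of the judgment, which is not the situation in Murphy.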

3 is wrong:  Interpretation 3 is at first glance appealing.  Why shouldn’t “a portion of the judgment (not to exceed 25%)” mean any portion satisfying that inequality?  The reason comes later in the statute; that portion is required to “satisfy the amount of attorney’s fees awarded against the defendant.”  To “satisfy” a claim is to pay it in full, not in part.  Circuits that have adopted interpretation 3, as the 8th did in Boesing v. Spiess, are adopting a reading at least as non-literal as the one cert accuses the 7th of.

Of course, in cases like Murphy v. Smith, the two clauses are in conflict:  25% of the judgment is insufficient to satisfy the amount awarded.  In this case, one requirement must bend.  Under interpretation 2, when the two clauses are in conflict, “satisfy” is the one to give way.  The 7th circuit recognizes this, correctly describing the 25% awarded as “toward satisfying the attorney fee the court awarded,” not “satisfying” it.

Under interpretation 3, on the other hand, the requirement to “satisfy” has no force even when it is not in conflict with the first clause.  In other words, they interpret the law as if the word “satisfy” were absent, and the clause read “shall be applied to the amount of attorney’s fees.”

Suppose the attorney’s fees awarded in Murphy had been $60,000.  Under interpretation 3, the court would be free to ignore the requirement to satisfy entirely, and apply only 10% of the judgment to the attorneys, despite the fact that satisfaction was achievable within the statutory 25% limit.

Even worse:  imagine that the statute didn’t have the parenthetical, and said just

Whenever a monetary judgment is awarded in an action described in paragraph (1), a portion of the judgment shall be applied to satisfy the amount of attorney’s fees awarded against the defendant.

It would be crystal clear that the court was required to apply $60,000, the amount necessary to satisfy the award.  On interpretation 3, the further constraint imposed by the statute gives the court more discretion rather than less in a case like this one!  This can’t be right.

You could imagine switching to an interpretation 3′, in which the court is required to satisfy the amount awarded if it can do so without breaking the 25% limit, but is otherwise totally unconstrained.  Under this theory, an increase in award from $60,000 to $100,000 lessens the amount the court is required to contribute — indeed, lessens it to essentially zero.  This also can’t be right.
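
To make the jump concrete, here is a rough sketch of the minimum amount the court would be compelled to apply under interpretation 2 and under interpretation 3′, as a function of the fee award.  The function names are invented for illustration, and the cap is the $76,933.46 figure from this case, in cents.

```python
CAP = 7_693_346  # 25% of the judgment in cents ($76,933.46, as quoted in the post)

def required_under_2(fees_cents, cap_cents=CAP):
    """Interpretation 2: the smaller of the fee award and the cap."""
    return min(fees_cents, cap_cents)

def required_under_3_prime(fees_cents, cap_cents=CAP):
    """Interpretation 3': full satisfaction if it fits under the cap;
    otherwise nothing in particular is compelled."""
    return fees_cents if fees_cents <= cap_cents else 0

for fees in (6_000_000, 10_000_000):  # the $60,000 and $100,000 awards discussed above
    print(fees, required_under_2(fees), required_under_3_prime(fees))
```

Under interpretation 2 the required amount never goes down as the award goes up; under 3′ it drops from $60,000 to nothing once the award crosses the cap.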

 

2 is right:  When two clauses of a statute can’t simultaneously be satisfied, the court’s job is to find some balance which satisfies each requirement to the greatest extent possible in a range of possible cases.  Interpretation 2 seems the most reasonable choice.  The Supreme Court should recognize that, contra the cert petition, this is the interpretation actually adopted by the 7th Circuit, and should go along with it.

 

 


Intersectionality as nonlinearity

I wonder if the idea of intersectionality would be better-understood in STEMmy circles if we called it “nonlinearity” instead.  Put that way, e.g.

“the condition of being queer and disabled isn’t the sum of the condition of being queer and the condition of being disabled, or even some linear combination of those, it’s just its own thing, which draws input from each of those conditions in some more complicated way and which has features of its own particular to the intersection”

it’s something I think most mathematicians would find extremely natural and intuitive.


Peter Norvig, the meaning of polynomials, debugging as psychotherapy

I saw Peter Norvig give a great general-audience talk on AI at Berkeley when I was there last month.  A few notes from his talk.

  • “We have always prioritized fast and cheap over safety and privacy — maybe this time we can make better choices.”
  • He briefly showed a demo where, given values of a polynomial, a machine can put together a few lines of code that successfully computes the polynomial.  But the code looks weird to a human eye.  To compute some quadratic, it nests for-loops and adds things up in a funny way that ends up giving the right output.  So has it really “learned” the polynomial?  I think in computer science, you typically feel you’ve learned a function if you can accurately predict its value on a given input.  For an algebraist like me, a function determines but isn’t determined by the values it takes; to me, there’s something about that quadratic polynomial the machine has failed to grasp.  I don’t think there’s a right or wrong answer here, just a cultural difference to be aware of.  Relevant:  Norvig’s description of “the two cultures” at the end of this long post on natural language processing (which is interesting all the way through!)
  • Norvig made the point that traditional computer programs are very modular, leading to a highly successful debugging tradition of zeroing in on the precise part of the program that is doing something wrong, then fixing that part.  An algorithm or process developed by a machine, by contrast, may not have legible “parts”!  If a neural net is screwing up when classifying something, there’s no meaningful way to say “this neuron is the problem, let’s fix it.”  We’re dealing with highly non-modular complex systems which have evolved into a suboptimally functioning state, and you have to find a way to improve function which doesn’t involve taking the thing apart and replacing the broken component.  Of course, we already have a large professional community that works on exactly this problem.  They’re called therapists.  And I wonder whether the future of debugging will look a lot more like clinical psychology than it does like contemporary software engineering.
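
On the polynomial demo in the second note above:  here is a toy illustration of the two points of view.  It is my own made-up example, not the code Norvig showed.  The loop version computes n^2 + n for nonnegative integers n without ever "knowing" the formula, and the last lines illustrate the algebraist's worry with a standard example:  modulo 2, the polynomial x^2 + x takes the value 0 at every input, yet it is not the zero polynomial.

```python
def quadratic_closed_form(n):
    """The quadratic n^2 + n, written the way a human would write it."""
    return n * n + n

def quadratic_by_loops(n):
    """Same values for integers n >= 0, computed by nested loops.

    Adds 2 once for each pair (i, j) with 0 <= j <= i < n; there are
    n(n+1)/2 such pairs, so the total is n(n+1) = n^2 + n.
    """
    total = 0
    for i in range(n):
        for j in range(i + 1):
            total += 2
    return total

# The two programs agree on every input we try...
agree = all(quadratic_closed_form(n) == quadratic_by_loops(n) for n in range(50))

# ...and modulo 2, x^2 + x has the same values as the zero polynomial,
# even though its coefficients (1, 1, 0) are not all zero.
values_mod_2 = [(x * x + x) % 2 for x in (0, 1)]
print(agree, values_mod_2)
```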

Robin laid a gun

OK here’s a weird piece of kid culture AB brought home:

Jingle bells, Batman smells

Robin laid a gun

Shot a tree and made it pee in 1981

This dates back at least to 2007 apparently.

It scans and rhymes very nicely but makes no sense at all.  What can it mean?

It seems like we are witnessing a kind of cultural hybrid; the “Jingle bells / Batman smells” of my childhood has here combined with a “Jingle bells / shotgun shells” tradition I was unaware of until now, which is actually older than the Batman version.  A lot of the “shotgun shells” versions found online involve Santa meeting his death in a hail of bullets, but “shot a tree and made it pee” is not uncommon.  I wonder how many utterly nonsensical kids’ rhymes we know are actually hybrids of different songs, each of which at some point sort of made sense?

 


Portuguese vs. Portuguese

The Portuguese edition of “How Not To Be Wrong” just arrived at my house.  “Portuguese” as in “from Portugal” and as distinct from the Brazilian edition.  Interesting how two versions of the book in the same language can be rather different!  Here’s the opening paragraph in Portugal:

Agora mesmo, numa sala de aula algures no mundo, uma estudante esta a reclamar com o seu professor de matematica.  Este acabou de lhe pedir para usar uma parte substancial do seu fim de semana a calcular uma lista de trinta integrais definidas.

And in Brazil:

Neste exato momento, numa sala de aula em algum lugar do mundo, uma aluna esta xingando o professor de matematica.  O professor acaba de lhe pedir que passe uma parte substancial de seu fim de semana calculando uma lista de trinta integrais definidas.

Ok, those are not too far off.  Here’s how some lines of John Ashbery’s “Soonest Mended” are translated in Portugal:

E vimos que ambos temos razao, ainda que nada

Tenha resultado em coisa alguma; os avatares

Do nosso conformismo perante as regras,

E ficar sempre por casa, fizeram de nos — bem, em certo sentido — <<bons cidadaos>>

and in Brazil:

Esta vendo, ambos estavamos certos, embora nada

Tenha de algum modo chegado a nada; os avatares

Da nossa conformidade com regras e viver

Em torno de casa fizeram de nos … bem, num sentido, “bons cidadaos”

I don’t know whether Ashbery’s poems have official Portuguese translations.  The only one I could find of “Soonest Mended” was in a book of criticism by V.B. Concagh, where the last two lines were rendered

Deste conformarmo-nos as regras e fazermos a nossa vida

Ca por casa fizeram de nos — bem, num certo sentido, “bons cidadaos”

The line I hit very hard in English is  “For this is action, this not being sure.”  That last phrase is rendered

  • (Portugal) esta incerteza
  • (Brazil) essa falta de certeza
  • (Concagh) este nao esta seguro

I don’t read Portuguese but the last, most literal rendering seems best to me, assuming I’m right that it captures something of the “not the way you’d normally say it”-ness of the Ashbery:  “this uncertainty” or “this lack of certainty” in English don’t have at all the same quality.

Note:  Because I was feeling lazy I have omitted all diacritical marks.  Lusophones are welcome to hassle me about this if it makes the quotes ambiguous or unreadable.

 


Home rule in Wisconsin: the or and the and

The Wisconsin Supreme Court is hearing arguments about a residency requirement for employees of the city of Milwaukee.  Milwaukee has a law requiring city employees to live within Milwaukee’s boundaries.  The state legislature passed a law forbidding cities from making or enforcing such laws.  Last summer, the 1st District Court of Appeals found that law in violation of the Home Rule Amendment to the Wisconsin Constitution.  The constitutional question is:  when can state lawmakers overrule the legislative decisions of cities and villages?

You might think this would be clear.  On November 4, 1924, voters in Wisconsin overwhelmingly approved the Home Rule Amendment, which added to the state Constitution:

Cities and villages organized pursuant to state law may determine their local affairs and government, subject only to this constitution and to such enactments of the legislature of statewide concern as with uniformity shall affect every city or every village. The method of such determination shall be prescribed by the legislature.

It turns out it hasn’t been so simple, in practice, to figure out what those 51 words mean.  In a recent high-profile case, the Wisconsin Supreme Court upheld Act 10, Governor Walker’s signature legislation; among other things, the law forbade Milwaukee from contributing to its employees’ pension funds.  The plaintiffs argued that this provision violated home rule.  Michael Gableman, writing for the court majority, said it was fine.

This raises questions.  First of all:  if a state law needs to affect every city uniformly in order to supersede local government, how can it be OK to specifically target Milwaukee’s pension fund?  Here the exact wording of 62.623 is critical.  The law doesn’t mention Milwaukee:  it applies to “any employee retirement system of a 1st class city.”   The “uniformity” requirement in the Home Rule amendment has generally been understood very liberally, allowing laws which affect cities in different size classes differently as long as the application within each class is uniform.

To construe the amendment as meaning that every act of the Legislature relating to cities is subject to a charter ordinance unless the act of the Legislature affected with uniformity every city from the smallest to the greatest, practically destroys legislative control over municipal affairs, assuming that laws could be drawn which would meet the requirements of the amendment so construed.

That’s from Van Gilder v. City of Madison (1936), one of the first Wisconsin Supreme Court cases to wrestle with the limits of home rule.  I will have more to say about Chief Justice Marvin Rosenberry’s decision in that case, some of it pretty salty.  But for now let’s stick to the point at hand.  The law can be argued to pass the “uniformity” test because it applies equally to all cities of the first class.  There is only one city of the first class in Wisconsin, and there has only ever been one city of the first class in Wisconsin, and it’s Milwaukee.

That’s the argument the Walker administration made in defense of the law.  But the court’s opinion upholding the law rejects that defense, and the uniformity clause as a whole, as irrelevant to the question before it.

In sum, our home rule case law instructs us that, when reviewing a legislative enactment under the home rule amendment, we apply a two-step analysis.  First, as a threshold matter, the court determines whether the statute concerns a matter of primarily statewide or primarily local concern.  If the statute concerns a matter of primarily statewide interest, the home rule amendment is not implicated and our analysis ends.  If, however, the statute concerns a matter of primarily local affairs, the reviewing court then examines whether the statute satisfies the uniformity requirement.  If the statute does not, it violates the home rule amendment.

Thus:

no merit exists in the plaintiffs’ contention that the legislative enactment at issue in a home rule challenge must be a matter of statewide concern and uniformly applied statewide to withstand constitutional scrutiny.

Now this is weird, right?  Because what’s described and rejected as “the plaintiffs’ contention” is what the constitution says.  Gableman replaces the Constitution’s and with an or:  in his analysis, a state law supersedes local powers if it’s either of statewide concern or applied uniformly to all cities.
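
The and-versus-or gap is easy to see as a truth table.  The sketch below is a deliberate simplification with hypothetical function names:  it treats “statewide concern” and “uniform” as plain yes/no inputs, which glosses over the “primarily statewide or primarily local” sorting the court actually does.

```python
from itertools import product

def supersedes_per_text(statewide_concern, uniform):
    """The amendment as written: statewide concern AND uniform application."""
    return statewide_concern and uniform

def supersedes_per_two_step(statewide_concern, uniform):
    """The two-step analysis quoted above: a statewide matter ends the inquiry;
    otherwise the law survives only if it is uniform."""
    if statewide_concern:
        return True
    return uniform

# Every combination of the two inputs where the readings disagree.
disagreements = [
    (s, u)
    for s, u in product([False, True], repeat=2)
    if supersedes_per_text(s, u) != supersedes_per_two_step(s, u)
]
print(disagreements)
```

The two readings agree when both conditions hold or neither does; they part ways exactly when a law meets one condition and not the other, which is where the two-step analysis lets the state law win.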

Is this an act of wanton judicial activism?  Well, not exactly.  The phrase “as home rule case law instructs us” is important here.  The opinion marshals a long line of precedents showing that the Home Rule amendment has typically been read as an or, not an and.  It goes all the way back to Rosenberry’s opinion in Van Gilder v. City of Madison; and the reason there’s such a long list is that all those other cases rely on Van Gilder, which has become the foundation of Wisconsin’s theory of home rule.

Which brings us to the main point.  I’m not a legal scholar, but what the hell, this is blogging, I get to have an opinion, and here’s mine:  Van Gilder v. City of Madison was wrongly decided and has been screwing up home rule jurisprudence for 80 years.

Rosenberry’s first go at explaining home rule goes like this:

The home–rule amendment certainly confers upon cities plenary powers to deal with local affairs and government subject to the limitations contained in the amendment itself and other provisions of the Constitution. The powers of municipalities are subject to the limitation that the municipality cannot by its charter deal with matters which are of state–wide concern and its power to enact an organic law dealing with local affairs and government is subject to such acts of the Legislature relating thereto as are of state–wide concern and affect with uniformity all cities.

The “and” between statewide concern and uniformity is clear here.  But Rosenberry also says that municipalities simply have no power to address matters of statewide concern:  its powers, he says, are restricted to “local affairs and government” as distinct from matters of statewide concern.  So what cases is the second clause (“its power to enact an organic law….”) referring to?  Only those matters which are not of statewide concern, but which are affected by state laws which are of statewide concern.  Rosenberry gives no examples of such a situation, nor can I really imagine one, so I don’t think that’s really the conclusion he means to draw.  Later in the opinion, he settles more clearly on the policy adopted by Gableman in Madison Teachers Inc. v. Walker:

when the Legislature deals with local affairs as distinguished from matters which are primarily of state–wide concern, it can only do so effectually by an act which affects with uniformity every city. It is true that this leaves a rather narrow field in which the home–rule amendment operates freed from legislative restriction, but there is no middle ground.

and

the limitation contained in the section upon the power of the Legislature is a limitation upon its power to deal with the local affairs and government of a city or village. Care must be taken to distinguish between the power of the Legislature to deal with local affairs and its power to deal with matters primarily of state–wide concern. When the Legislature deals with local affairs and government of a city, if its act is not to be subordinate to a charter ordinance, the act must be one which affects with uniformity every city. If in dealing with the local affairs of a city the Legislature classifies cities so that the act does not apply with uniformity to every city, that act is subordinate to a charter ordinance relating to the same matter. A charter ordinance of a city is not subject to an act of the Legislature dealing with local affairs unless the act affects with uniformity every city. State ex rel. Sleeman v. Baxter, supra. When the Legislature deals with matters which are primarily matters of state–wide concern, it may deal with them free from any restriction contained in the home rule amendment.

Now the ground has shifted.  In Rosenberry’s reading, when the home rule amendment refers to “local affairs and government” it specifically intends to exclude any “matters of statewide concern.”  I can accept this as a reading of those four words, but not as a reading of the whole sentence. If Rosenberry is correct, then the phrase “of statewide concern” is never active in the amendment:  a local affair is, by definition, not a matter of statewide concern.  I think when your interpretation of a constitutional passage means that part of the text never applies, you need to think twice about your interpretation.

What’s more, Rosenberry holds that the state has the power to override local officials on purely local matters, of no statewide concern whatsoever, as long as it does so uniformly.  If that is so, what does he think the words “of statewide concern” are doing in the Home Rule amendment at all?

To me, the amendment has a pretty plain meaning.  Something like a residency requirement for city employees or a fiscal decision about a city pension plan is plainly a local affair.  It may also be a matter of statewide concern.  The state legislature can enact a law overriding local legislation if the matter is of statewide concern and the law in question applies uniformly to all cities.  I think Rosenberry just plain got this wrong in Van Gilder and it’s been wrong ever since.

 

 

 

 

 

 


Is a search a search?

(Continued from yesterday’s post.)

Scalia understood, when he needed to, that words changed their meaning, and stretched to accommodate cases that didn’t exist for the founders.  What, in the sense of the Fourth Amendment, counts as a “search”?  Scalia took up this lexical question in Kyllo v. U.S., writing that infrared scanning of a house to detect excess heat (generated, the police correctly inferred, by a marijuana greenhouse within) did indeed constitute a search.  This is not the kind of search the Framers contemplated.  Nonetheless, says Scalia:

When the Fourth Amendment was adopted, as now, to “search” meant “[t]o look over or through for the purpose of finding something; to explore; to examine by inspection; as, to search the house for a book; to search the wood for a thief.” N. Webster, An American Dictionary of the English Language 66 (1828) (reprint 6th ed. 1989)

How to read this?  The written definition can be read to include viewing a house from the outside, and indeed, Scalia brings it up in this context:

One might think that the new validating rationale would be that examining the portion of a house that is in plain public view, while it is a “search” despite the absence of trespass, is not an “unreasonable” one under the Fourth Amendment.

But visual inspection of a house has not been classified as search by the Court — “perhaps,” Scalia says, “in order to preserve somewhat more intact our doctrine that warrantless searches are presumptively unconstitutional.”

In fact, it’s pretty clear from other Scalia opinions that he chooses a meaning for the word “search” which is simultaneously more restrictive than the dictionary definition —  it excludes visual inspection of a house — and more inclusive than the contemporary plain-language meaning.  To push a stereo away from the wall and look at its serial number, as in Arizona v. Hicks, is not to “search” the stereo; it’s not even clear whether, in standard English, a stereo can be searched, unless by pulling open the casing and digging through its insides.  But in Scalia’s majority opinion there, the moving of the stereo is what creates the search:

A truly cursory inspection – one that involves merely looking at what is already exposed to view, without disturbing it – is not a “search” for Fourth Amendment purposes, and therefore does not even require reasonable suspicion… [t]aking action, unrelated to the objectives of the authorized intrusion, which exposed to view concealed portions of the apartment or its contents, did produce a new invasion of respondent’s privacy unjustified by the exigent circumstance that validated the entry. This is why, contrary to JUSTICE POWELL’S suggestion, post, at 333, the “distinction between ‘looking’ at a suspicious object in plain view and ‘moving’ it even a few inches” is much more than trivial for purposes of the Fourth Amendment. It matters not that the search uncovered nothing of any great personal value to respondent – serial numbers rather than (what might conceivably have been hidden behind or under the equipment) letters or photographs. A search is a search, even if it happens to disclose nothing but the bottom of a turntable.

So a “search,” for Scalia, requires observation of something that might reasonably be expected to be private, but doesn’t require looking inside of the thing searched.  I think that’s a pretty good definition; but it’s not what’s in the dictionary, it’s not the way we use the word in plain English, and Scalia makes no claim that it’s what was in the Framers’ minds.  It’s a definition you choose in order to achieve a goal, the goal of a workable evidential rule that suits our, or someone’s, sense of justice.

And that’s why it grates when Scalia says “[a] search is a search.”  So matter-of-fact, so direct; but so utterly opposite to what’s actually happening!  He should have said “A search is what we define a search to be.”

In light of Scalia’s take on statistical sampling, his rejection of Powell’s dissent is interesting:

As for the dissent’s extraordinary assertion that anything learned through “an inference” cannot be a search, see post, at 4-5, that would validate even the “through-the-wall” technologies that the dissent purports to disapprove. Surely the dissent does not believe that the through-the-wall radar or ultrasound technology produces an 8-by-10 Kodak glossy that needs no analysis (i.e., the making of inferences)

To measure radiation emanating from the outside of a house, and to infer by technological means something about the contents of the interior that can’t be directly observed:  this, for Scalia, is a search.  To count all the inhabitants of a city you can find, and to infer by technological means something about the people who couldn’t be directly observed:  that, Scalia says, isn’t counting.  In Kyllo, Scalia is happy to speculate about future technologies that will make his view more obviously correct, as soon as they exist (“The ability to “see” through walls and other opaque barriers is a clear, and scientifically feasible, goal of law enforcement research and development.”)  In Commerce, his vision of technological progress in statistics is decidedly more pessimistic.  Why?


Gendercycle: a dynamical system on words

By the way, here’s another fun word2vec trick.  Following Ben Schmidt, you can try to find “gender-neutralized synonyms” — words which are close to each other except for the fact that they have different gender connotations.   A quick and dirty way to “mascify” a word is to find its nearest neighbor which is closer to “he” than “she”:

def mascify(y):
    # first of y's 200 nearest neighbors that is closer to 'he' than to 'she'
    return [x[0] for x in model.most_similar(y, topn=200) if model.similarity(x[0], 'she') < model.similarity(x[0], 'he')][0]

“femify” is defined similarly.  We could put a threshold away from 0 in there, if we wanted to restrict to more strongly gender-coded words.
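
For concreteness, here is one way “femify” might be written, with the threshold exposed as a parameter.  Treat it as a sketch rather than the code actually used for this post:  the `margin` parameter and the choice to pass `model` explicitly are my additions, and unlike the one-liner above it returns None instead of raising an error when no neighbor qualifies.  It assumes `model` offers the `most_similar` and `similarity` methods used elsewhere in this post.

```python
def femify(model, word, topn=200, margin=0.0):
    """Nearest neighbor of `word` that is closer to 'she' than to 'he'.

    `margin` (an illustrative addition) demands that the neighbor lean
    toward 'she' by at least that much; 0.0 reproduces the plain version.
    Returns None when none of the `topn` neighbors qualifies.
    """
    for neighbor, _score in model.most_similar(word, topn=topn):
        if model.similarity(neighbor, 'she') > model.similarity(neighbor, 'he') + margin:
            return neighbor
    return None

# Usage, assuming `model` is the loaded embedding:
#   femify(model, 'talkative')
#   femify(model, 'talkative', margin=0.1)   # more strongly gender-coded words only
```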

Anyway, if you start with a word and run mascify and femify alternately, you can ask whether you eventually wind up in a 2-cycle:  a pair of words which are each other’s gender counterparts in this loose sense.

e.g.

gentle -> easygoing -> chatty -> talkative -> chatty -> …..

So “chatty” and “talkative” are a pair, with “chatty” female-coded and “talkative” male-coded.

beautiful -> magnificent -> wonderful -> marvelous -> wonderful -> …

So far, I keep hitting 2-cycles, and pretty quickly, though I don’t see why a longer cycle wouldn’t be possible or likely.  Update:  Kevin in comments explains very nicely why it has to terminate in a 2-cycle!
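
The alternating iteration itself is only a few lines.  The sketch below is hedged in the usual way:  `find_two_cycle` is a name I made up, it takes the two maps as arguments so it can be tried without loading the embedding, and the toy dictionaries simply hard-code the “gentle” chain shown above rather than querying word2vec.

```python
def find_two_cycle(word, mascify, femify, max_steps=50):
    """Alternate mascify and femify starting from `word`.

    Returns the path of words visited, ending as soon as a word repeats
    two steps later (a 2-cycle), or None if a map gives no answer or no
    2-cycle shows up within `max_steps` steps.
    """
    path = [word]
    steps = (mascify, femify)
    for i in range(max_steps):
        nxt = steps[i % 2](path[-1])
        if nxt is None:
            return None
        path.append(nxt)
        if len(path) >= 3 and path[-1] == path[-3]:
            return path
    return None

# Toy stand-ins for the real maps, copied from the example chain above.
toy_masc = {"gentle": "easygoing", "chatty": "talkative"}
toy_fem = {"easygoing": "chatty", "talkative": "chatty"}

path = find_two_cycle("gentle", toy_masc.get, toy_fem.get)
print(" -> ".join(path))  # gentle -> easygoing -> chatty -> talkative -> chatty
```

With the real maps you would pass the word2vec-backed mascify and femify in place of the dictionary lookups; the last two distinct words of the returned path are the pair.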

Some other pairs, female-coded word first:

overjoyed / elated

strident / vehement

fearful / worried

furious / livid

distraught / despondent

hilarious / funny

exquisite / sumptuous

thought_provoking / insightful

kick_ass / badass

Sometimes it’s basically giving the same word, in two different forms or with one word misspelled:

intuitive / intuitively

buoyant / bouyant

sad / Sad

You can do this for names, too, though you have to make the “topn” a little longer to find matches.  I found:

Jamie / Chris

Deborah / Jeffrey

Fran / Pat

Mary / Joseph

Pretty good pairs!  Note that you hit a lot of gender-mixed names (Jamie, Chris, Pat), just as you might expect:  the male-biased name word2vec-closest to a female name is likely to be a name at least some women have!  You can deal with this by putting in a threshold:

def mascify(y):
    # as before, but the neighbor must be closer to 'he' than to 'she' by at least 0.1
    return [x[0] for x in model.most_similar(y, topn=1000) if model.similarity(x[0], 'she') < model.similarity(x[0], 'he') - 0.1][0]

This eliminates “Jamie” and “Pat” (though “Chris” still reads as male.)

Now we get some new pairs:

Jody / Steve (this one seems to have a big basin of attraction, it shows up from a lot of initial conditions)

Kasey / Zach

Peter / Catherine (is this a Russia thing?)

Nicola / Dominic

Alison / Ian

 

 

 

 

 


Messing around with word2vec

Word2vec is a way of representing words and phrases as vectors in medium-dimensional space developed by Tomas Mikolov and his team at Google; you can train it on any corpus you like (see Ben Schmidt’s blog for some great examples) but the version of the embedding you can download was trained on about 100 billion words of Google News, and encodes words as unit vectors in 300-dimensional space.

What really got people’s attention, when this came out, was word2vec’s ability to linearize analogies.  For example:  if x is the vector representing “king,” and y the vector representing “woman,” and z the vector representing “man,” then consider

x + y – z

which you might think of, in semantic space, as being the concept “king” to which “woman” has been added and “man” subtracted — in other words, “king made more female.”  What word lies closest in direction to x+y-z?  Just as you might hope, the answer is “queen.”
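
Here is the x + y - z recipe written out on a toy example.  Everything in it is made up for illustration:  the three-dimensional vectors and the `analogy` helper are mine, not word2vec’s.  With the real embedding, the gensim call would be `model.most_similar(positive=['king', 'woman'], negative=['man'])`.  Note that the sketch leaves the three input words out of the candidate list; as far as I know gensim does the same.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, x, y, z):
    """Word whose vector is closest in direction to x + y - z,
    leaving the three input words out of the candidates."""
    target = [a + b - c for a, b, c in zip(vectors[x], vectors[y], vectors[z])]
    candidates = [w for w in vectors if w not in (x, y, z)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical 3-d "embedding"; the axes loosely mean (royal, female, male).
toy = {
    "king":  [0.9, 0.1, 0.8],
    "queen": [0.9, 0.8, 0.1],
    "man":   [0.1, 0.1, 0.9],
    "woman": [0.1, 0.9, 0.1],
    "apple": [0.0, 0.2, 0.2],
}

print(analogy(toy, "king", "woman", "man"))
```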

I found this really startling.  Does it mean that there’s some hidden linear structure in the space of words?

It turns out it’s not quite that simple.  I played around with word2vec a bunch, using Radim Řehůřek’s gensim package that nicely pulls everything into python; here’s what I learned about what the embedding is and isn’t telling you.

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts — that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’.  (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.)  If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity(‘tremendous’,’enormous’)

0.74432902555062841

The notion of similarity used here is just cosine similarity (which, since the vectors are unit vectors, is the same as their dot product).  It’s positive when the words are close to each other, negative when the words are far.  For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity('tremendous','negligible')

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words.  ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.

You might ask:  well, what words in Word2Vec are farthest from “tremendous?”  You just get trash:

>>> model.most_similar(negative=['tremendous'])

[(u'By_DENISE_DICK', 0.2792186141014099), (u'NAVARRE_CORPORATION', 0.26894450187683105), (u'By_SEAN_BARRON', 0.26745346188545227), (u'LEGAL_NOTICES', 0.25829464197158813), (u'Ky.Busch_##-###', 0.2564955949783325), (u'desultorily', 0.2563159763813019), (u'M.Kenseth_###-###', 0.2562236189842224), (u'J.McMurray_###-###', 0.25608277320861816), (u'D.Earnhardt_Jr._###-###', 0.2547803819179535), (u'david.brett_@_thomsonreuters.com', 0.2520599961280823)]

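If 3 million words were distributed randomly on the unit sphere in R^300, you’d expect the farthest one from “tremendous” to have dot product about -0.3 with it.  (The scores in the list above are similarities to the negated vector, so a score of 0.28 there means a dot product of about -0.28 with “tremendous.”)  So when you see a list whose largest score is around that size, you should think “there’s no structure there, this is just noise.”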
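Here’s a rough sanity check on that figure of 0.3.  It’s a back-of-the-envelope sketch, assuming the word vectors behave like independent, uniformly random unit vectors (an idealization, not a property of the real embedding).  Under that assumption, the dot product of a fixed unit vector with a random unit vector in R^d is approximately normal with standard deviation 1/sqrt(d), and the most extreme of N such values sits roughly sqrt(2 ln N) standard deviations out.  The function name is made up for this illustration.

```python
import math

def extreme_dot_estimate(n_words, dim):
    """Leading-order estimate of the most extreme dot product between a fixed
    unit vector and n_words independent random unit vectors in R^dim.

    Assumes each dot product is roughly Gaussian with std 1/sqrt(dim).
    """
    sigma = 1.0 / math.sqrt(dim)
    return math.sqrt(2.0 * math.log(n_words)) * sigma

# 3 million words, 300 dimensions, as in the Google News embedding.
estimate = extreme_dot_estimate(3_000_000, 300)
print(round(estimate, 3))
```

The estimate lands a little above 0.3, the same size as the top scores in the list of "farthest" words.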

Antonyms

Challenge problem:  Is there a way to accurately generate antonyms using the word2vec embedding?  That seems to me the sort of thing the embedding is not capturing.  Kyle McDonald had a nice go at this, but I think the lesson of his experiment is that asking word2vec to find analogies of the form “word is to antonym as happy is to?” is just going to generate a list of neighbors of “happy.”  McDonald’s results also cast some light on the structure of word2vec analogies:  for instance, he finds that

waste is to economise as happy is to chuffed

First of all, “chuffed” is a synonym of happy, not an antonym.  But more importantly:  “chuffed” is there because it’s a way that British people say “happy,” just as “economise” is the way British people spell “economize.”  Change the spelling and you get

waste is to economize as happy is to glad

Non-semantic properties of words matter to word2vec.  They matter a lot.  Which brings us to diction.

Word2Vec distance keeps track of diction

Lots of non-semantic stuff is going on in natural language.  Like diction, which can be high or low, formal or informal, flowery or concrete.    Look at the nearest neighbors of “pugnacity”:

>>> model.most_similar('pugnacity')

[(u'pugnaciousness', 0.6015268564224243), (u'wonkishness', 0.6014434099197388), (u'pugnacious', 0.5877301692962646), (u'eloquence', 0.5875781774520874), (u'sang_froid', 0.5873805284500122), (u'truculence', 0.5838015079498291), (u'pithiness', 0.5773230195045471), (u'irascibility', 0.5772287845611572), (u'hotheadedness', 0.5741063356399536), (u'sangfroid', 0.5715578198432922)]

Some of these are close semantically to pugnacity, but others, like “wonkishness,” “eloquence”, and “sangfroid,” are really just the sort of elevated-diction words that the kind of person who says “pugnacity” would also say.

In the other direction:

>>> model.most_similar('psyched')

[(u'geeked', 0.7244787216186523), (u'excited', 0.6678282022476196), (u'jazzed', 0.666187584400177), (u'bummed', 0.662735104560852), (u'amped', 0.6473385691642761), (u'pysched', 0.6245862245559692), (u'exicted', 0.6116108894348145), (u'awesome', 0.5838013887405396), (u'enthused', 0.581687331199646), (u'kinda_bummed', 0.5701783299446106)]

“geeked” is a pretty good synonym, but “bummed” is an antonym.  You may also note that contexts where “psyched” is common are also fertile ground for “pysched.”  This leads me to one of my favorite classes of examples:

Misspelling analogies

Which words are closest to “teh”?

>>> model.most_similar('teh')

[(u'ther', 0.6910992860794067), (u'hte', 0.6501408815383911), (u'fo', 0.6458913683891296), (u'tha', 0.6098173260688782), (u'te', 0.6042138934135437), (u'ot', 0.595798909664154), (u'thats', 0.595078706741333), (u'od', 0.5908242464065552), (u'tho', 0.58894944190979), (u'oa', 0.5846965312957764)]

Makes sense:  the contexts where “teh” is common are those contexts where a lot of words are misspelled.

Using the “analogy” gadget, we can ask:  which word is to “because” as “teh” is to “the”?

>>> model.most_similar(positive=['because','teh'],negative=['the'])

[(u'becuase', 0.6815075278282166), (u'becasue', 0.6744950413703918), (u'cuz', 0.6165347099304199), (u'becuz', 0.6027254462242126), (u'coz', 0.580410361289978), (u'b_c', 0.5737690925598145), (u'tho', 0.5647958517074585), (u'beacause', 0.5630674362182617), (u'thats', 0.5605655908584595), (u'lol', 0.5597798228263855)]

Or “like”?

>>> model.most_similar(positive=['like','teh'],negative=['the'])

[(u'liek', 0.678846001625061), (u'ok', 0.6136218309402466), (u'hahah', 0.5887773633003235), (u'lke', 0.5840467214584351), (u'probly', 0.5819578170776367), (u'lol', 0.5802655816078186), (u'becuz', 0.5771245956420898), (u'wierd', 0.5759704113006592), (u'dunno', 0.5709049701690674), (u'tho', 0.565370500087738)]

Note that this doesn’t always work:

>>> model.most_similar(positive=['should','teh'],negative=['the'])

[(u'wil', 0.63351970911026), (u'cant', 0.6080706715583801), (u'wont', 0.5967696309089661), (u'dont', 0.5911301970481873), (u'shold', 0.5908039212226868), (u'shoud', 0.5776053667068481), (u'shoudl', 0.5491836071014404), (u"would'nt", 0.5474458932876587), (u'shld', 0.5443994402885437), (u'wouldnt', 0.5413904190063477)]

What are word2vec analogies?

Now let’s come back to the more philosophical question.  Should the effectiveness of word2vec at solving analogy problems make us think that the space of words really has linear structure?

I don’t think so.  Again, I learned something important from the work of Levy and Goldberg.  When word2vec wants to find the word w which is to x as y is to z, it is trying to find w maximizing the dot product

w . (x + y – z)

But this is the same thing as maximizing

w.x + w.y – w.z.

In other words, what word2vec is really doing is saying

“Show me words which are similar to x and y but are dissimilar to z.”

This notion makes sense applied to any notion of similarity, whether or not it has anything to do with embedding in a vector space.  For example, Levy and Goldberg experiment with maximizing

log(w.x) + log(w.y) – log(w.z)

instead, and get somewhat superior results on the analogy task.  Optimizing this objective has nothing to do with the linear combination x+y-z.
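To make the two objectives concrete, here is a minimal sketch on a five-word toy vocabulary.  Everything in it is an assumption for illustration:  the 3-dimensional vectors are invented (they are not word2vec output), the function names are mine, and the multiplicative score shifts cosines into [0, 1] and adds a small epsilon so the logarithms are defined, which is my reading of how such an objective has to be set up rather than a transcription of Levy and Goldberg’s code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

# Invented 3-dimensional vectors for illustration; not taken from word2vec.
vocab = {name: unit(vec) for name, vec in {
    "king":  [0.9, 0.1, 0.4],
    "queen": [0.9, 0.8, 0.4],
    "man":   [0.1, 0.1, 0.9],
    "woman": [0.1, 0.8, 0.9],
    "apple": [-0.5, 0.2, -0.3],
}.items()}

def additive(w, x, y, z):
    # w.x + w.y - w.z: similar to x and y, dissimilar to z
    return dot(w, x) + dot(w, y) - dot(w, z)

def multiplicative(w, x, y, z, eps=1e-3):
    # log-similarity version; cosines are shifted into [0, 1] and padded by
    # eps so the logarithms are defined (an assumption of this sketch)
    def shifted(a, b):
        return (dot(a, b) + 1.0) / 2.0 + eps
    return math.log(shifted(w, x)) + math.log(shifted(w, y)) - math.log(shifted(w, z))

def analogy(score, x, y, z):
    """Word maximizing score(w, x, y, z), excluding the three query words."""
    candidates = [w for w in vocab if w not in (x, y, z)]
    return max(candidates,
               key=lambda w: score(vocab[w], vocab[x], vocab[y], vocab[z]))

x, y, z = "king", "woman", "man"
combined = [a + b - c for a, b, c in zip(vocab[x], vocab[y], vocab[z])]

for w in ("queen", "apple"):
    # The two ways of writing the additive objective agree up to rounding.
    print(w, dot(vocab[w], combined), additive(vocab[w], vocab[x], vocab[y], vocab[z]))

print(analogy(additive, x, y, z), analogy(multiplicative, x, y, z))
```

On this toy data both scores pick “queen,” but only the additive one can be rewritten as a dot product with the single vector x + y – z.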

None of which is to deny that the analogy engine in word2vec works well in many interesting cases!  It has no trouble, for instance, figuring out that Baltimore is to Maryland as Milwaukee is to Wisconsin.  More often than not, the Milwaukee of state X correctly returns the largest city in state X.  And sometimes, when it doesn’t, it gives the right answer anyway:  for instance, the Milwaukee of Ohio is Cleveland, a much better answer than Ohio’s largest city (Columbus — you knew that, right?)  The Milwaukee of Virginia, according to word2vec, is Charlottesville, which seems clearly wrong.  But maybe that’s OK — maybe there really isn’t a Milwaukee of Virginia.  One feels Richmond is a better guess than Charlottesville, but it scores notably lower.  (Note:  Word2Vec’s database doesn’t have Virginia_Beach, the largest city in Virginia.  That one I didn’t know.)

Another interesting case:  what is to state X as Gainesville is to Florida?  The answer should be “the location of the, or at least a, flagship state university, which isn’t the capital or even a major city of the state,” when such a city exists.  But this doesn’t seem to be something word2vec is good at finding.  The Gainesville of Virginia is Charlottesville, as it should be.  But the Gainesville of Georgia is Newnan.  Newnan?  Well, it turns out there’s a Newnan, Georgia, and there’s also a Newnan’s Lake in Gainesville, FL; I think that’s what’s driving the response.  That, and the fact that “Athens”, the right answer, is contextually separated from Georgia by the existence of Athens, Greece.

The Gainesville of Tennessee is Cookeville, though Knoxville, the right answer, comes a close second.

Why?  You can check that Knoxville, according to word2vec, is much closer to Gainesville than Cookeville is.

>>> model.similarity('Cookeville','Gainesville')

0.5457580604439547

>>> model.similarity('Knoxville','Gainesville')

0.64010456774402158

But Knoxville is placed much closer to Florida!

>>> model.similarity('Cookeville','Florida')

0.2044376252927515

>>> model.similarity('Knoxville','Florida')

0.36523836770416895

Remember:  what word2vec is really optimizing for here is “words which are close to Gainesville and close to Tennessee, and which are not close to Florida.”  And here you see that phenomenon very clearly.  I don’t think the semantic relationship between ‘Gainesville’ and ‘Florida’ is something word2vec is really capturing.  Similarly:  the Gainesville of Illinois is Edwardsville (though Champaign, Champaign_Urbana, and Urbana are all top 5) and the Gainesville of Indiana is Connersville.  (The top 5 for Indiana are all cities ending in “ville” — is the phonetic similarity playing some role?)
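To see the Cookeville/Knoxville effect in numbers, here is the arithmetic on the four similarities quoted above.  Only the “close to Gainesville” and “not close to Florida” terms are included, since the similarities to “Tennessee” aren’t reported here; the variable names are just for this illustration.

```python
# Similarities copied from the model.similarity calls above.
sim_gainesville = {"Cookeville": 0.5457580604439547, "Knoxville": 0.64010456774402158}
sim_florida = {"Cookeville": 0.2044376252927515, "Knoxville": 0.36523836770416895}

# Partial analogy score: w.Gainesville - w.Florida.
# The w.Tennessee term is not reported in the text, so it is left out.
partial = {city: sim_gainesville[city] - sim_florida[city] for city in sim_gainesville}

for city, score in sorted(partial.items(), key=lambda kv: -kv[1]):
    print(city, round(score, 4))

# How much closer to "Tennessee" Knoxville would need to be to overtake Cookeville.
gap = partial["Cookeville"] - partial["Knoxville"]
print("gap:", round(gap, 4))
```

On these two terms Cookeville comes out about 0.066 ahead, so Knoxville’s better similarity to “Gainesville” is more than cancelled by its similarity to “Florida.”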

Just for fun, here’s a scatterplot of the 1000 nearest neighbors of ‘Gainesville’, with their similarity to ‘Gainesville’ (x-axis) plotted against their similarity to ‘Tennessee’ (y-axis):

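[Figure: scatterplot of the 1000 nearest neighbors of ‘Gainesville’, similarity to ‘Gainesville’ (x-axis) against similarity to ‘Tennessee’ (y-axis)]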

The Pareto frontier consists of “Tennessee” (that’s the one whose similarity to “Tennessee” is 1, obviously), Knoxville, Jacksonville, and Tallahassee.
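In case “Pareto frontier” is unfamiliar:  a point is on the frontier if no other point is at least as good in both coordinates and strictly better in one.  A minimal sketch follows; the six labelled points are invented placeholders, not values read off the plot, and the function name is my own.

```python
def pareto_frontier(points):
    """Return the names of points not dominated by any other point.

    points maps a name to (similarity_a, similarity_b). A point is dominated
    if some other point is at least as large in both coordinates and strictly
    larger in at least one.
    """
    frontier = []
    for name, (x, y) in points.items():
        dominated = any(
            (ox >= x and oy >= y) and (ox > x or oy > y)
            for other, (ox, oy) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Invented placeholder values, NOT read off the scatterplot.
toy = {
    "a": (0.30, 1.00),
    "b": (0.64, 0.55),
    "c": (0.70, 0.40),
    "d": (0.75, 0.35),
    "e": (0.55, 0.50),
    "f": (0.60, 0.30),
}
print(pareto_frontier(toy))
```

Here “e” and “f” drop out because “b” beats each of them on both axes.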

Bag of contexts

One popular simple linear model of word space is given by representing a word as a “bag of contexts”:  perhaps there are several thousand contexts, and each word is given by a sparse vector in the space spanned by contexts, with coefficient 0 if the word is not in that context, 1 if it is.  In that setting, certain kinds of analogies would be linearized and certain kinds would not.  If “major city” is a context, then “Houston” and “Dallas” might have vectors that looked like “Texas” with the coordinate of “major city” flipped from 0 to 1.  Ditto, “Milwaukee” would be “Wisconsin” with the same basis vector added.  So

“Texas” + “Milwaukee” – “Wisconsin”

would be pretty close, in that space, to “Houston” and “Dallas.”
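A toy version of this picture, as a sketch:  the five contexts and the 0/1 assignments below are made up for illustration (a real bag-of-contexts model would have thousands of contexts estimated from a corpus).

```python
# Hypothetical contexts, chosen only to illustrate the idea.
CONTEXTS = ["in_texas", "in_wisconsin", "major_city", "state_capital", "oil_industry"]

def bag(*active):
    """0/1 vector with a 1 for each listed context."""
    return [1 if c in active else 0 for c in CONTEXTS]

words = {
    "Texas":     bag("in_texas"),
    "Wisconsin": bag("in_wisconsin"),
    "Milwaukee": bag("in_wisconsin", "major_city"),
    "Dallas":    bag("in_texas", "major_city"),
    "Houston":   bag("in_texas", "major_city", "oil_industry"),
    "Austin":    bag("in_texas", "state_capital"),
    "Madison":   bag("in_wisconsin", "state_capital"),
}

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

query = ("Texas", "Milwaukee", "Wisconsin")
# "Texas" + "Milwaukee" - "Wisconsin", coordinate by coordinate
target = [t + m - w for t, m, w in zip(*(words[q] for q in query))]

# Rank the remaining words by distance to the target vector.
ranking = sorted((name for name in words if name not in query),
                 key=lambda name: sq_dist(words[name], target))
print(ranking)
```

With these made-up vectors the target is exactly “in Texas” plus “major city,” so “Dallas” and “Houston” come first and the two state capitals trail behind.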

On the other hand, it’s not so easy to see what relations antonyms would have in that space. That’s the kind of relationship the bag of contexts may not make linear.

The word2vec space is only 300-dimensional, and the vectors aren’t sparse at all.  But maybe we should think of it as a random low-dimensional projection of a bag-of-contexts embedding!  By the Johnson-Lindenstrauss lemma, a 300-dimensional projection is plenty big enough to preserve the distances between 3 million points with a small distortion factor; and of course it preserves all linear relationships on the nose.
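The two claims here (linear relations survive a linear projection exactly, distances survive approximately) are easy to check numerically.  The sketch below uses a Gaussian random matrix, which is one standard way to build a Johnson-Lindenstrauss-style projection; the sizes (1000 contexts down to 300 dimensions), the sparsity level and all the names are assumptions chosen to keep the demo fast, and nothing here comes from the actual word2vec training procedure.

```python
import math
import random

HIGH_DIM, LOW_DIM = 1000, 300  # hypothetical sizes for the demo

rng = random.Random(0)
# Gaussian projection matrix, scaled so squared lengths are preserved on average.
P = [[rng.gauss(0.0, 1.0) / math.sqrt(LOW_DIM) for _ in range(LOW_DIM)]
     for _ in range(HIGH_DIM)]

def sparse_word(n_contexts=30):
    """A made-up bag-of-contexts vector: 1 on a random set of contexts, 0 elsewhere."""
    vec = [0.0] * HIGH_DIM
    for i in rng.sample(range(HIGH_DIM), n_contexts):
        vec[i] = 1.0
    return vec

def project(vec):
    """Apply the random projection, skipping zero coordinates."""
    out = [0.0] * LOW_DIM
    for coeff, row in zip(vec, P):
        if coeff != 0.0:
            for j in range(LOW_DIM):
                out[j] += coeff * row[j]
    return out

def norm(vec):
    return math.sqrt(sum(a * a for a in vec))

x, y, z = sparse_word(), sparse_word(), sparse_word()
combo = [a + b - c for a, b, c in zip(x, y, z)]

# Linearity: projecting x + y - z equals combining the three projections.
lhs = project(combo)
rhs = [a + b - c for a, b, c in zip(project(x), project(y), project(z))]
linearity_error = max(abs(a - b) for a, b in zip(lhs, rhs))

# Distances are only approximately preserved.
diff = [a - b for a, b in zip(x, y)]
distance_ratio = norm(project(diff)) / norm(diff)

print(linearity_error, distance_ratio)
```

The linearity error should be at the level of floating-point rounding, while the distance ratio should be near 1 but not equal to it.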

Perhaps this point of view gives some insight into which kind of word relationships manifest as linear relationships in word2vec.  “flock:birds” is an interesting example.  If you imagine “group of things” as a context, you can maybe imagine word2vec picking this up.  But actually, it doesn’t do well:

>>> model.most_similar(positive=['fish','flock'],negative=['birds'])
[(u'crays', 0.4601619839668274), (u'threadfin_salmon', 0.4553075134754181), (u'spear_fishers', 0.44864755868911743), (u'slab_crappies', 0.4483765661716461), (u'flocked', 0.44473177194595337), (u'Siltcoos_Lake', 0.4429660737514496), (u'flounder', 0.4414420425891876), (u'catfish', 0.4413948059082031), (u'yellowtail_snappers', 0.4410281181335449), (u'sockeyes', 0.4395104944705963)]

>>> model.most_similar(positive=['dogs','flock'],negative=['birds'])
[(u'dog', 0.5390862226486206), (u'pooches', 0.5000904202461243), (u'Eminem_Darth_Vader', 0.48777419328689575), (u'Labrador_Retrievers', 0.4792211949825287), (u'canines', 0.4766522943973541), (u'barked_incessantly', 0.4709487557411194), (u'Rottweilers_pit_bulls', 0.4708423614501953), (u'labradoodles', 0.47032350301742554), (u'rottweilers', 0.46935927867889404), (u'forbidding_trespassers', 0.4649636149406433)]

The answers “school” and “pack” don’t appear here.  Part of this, of course, is that “flock,” “school”, and “pack” all have interfering alternate meanings.  But part of it is that the analogy really rests on information about contexts in which the words “flock” and “birds” both appear.  In particular, in a short text window featuring both words, you are going to see a huge spike of “of” appearing right after flock and right before birds.  A statement of the form “flock is to birds as X is to Y” can’t be true unless “X of Y” actually shows up in the corpus a lot.

Challenge problem:  Can you make word2vec do a good job with relations like “flock:birds”?  As I said above, I wouldn’t have been shocked if this had actually worked out of the box, so maybe there’s some minor tweak that makes it work.

Boys’ names, girls’ names

Back to gender-flipping.  What’s the “male version” of the name “Jennifer”?

There are various ways one can do this.  If you use the analogy engine from word2vec, finding the closest word to “Jennifer” + “he” – “she”, you get as your top 5:

David, Jason, Brian, Kevin, Chris

>>> model.most_similar(positive=['Jennifer','he'],negative=['she'])
[(u'David', 0.6693146228790283), (u'Jason', 0.6635637283325195), (u'Brian', 0.6586753129959106), (u'Kevin', 0.6520106792449951), (u'Chris', 0.6505492925643921), (u'Mark', 0.6491551995277405), (u'Matt', 0.6386727094650269), (u'Daniel', 0.6294828057289124), (u'Greg', 0.6267883777618408), (u'Jeff', 0.6265031099319458)]

But there’s another way:  you can look at the words closest to “Jennifer” (which are essentially all first names) and pick out the ones which are closer to “he” than to “she”.  This gives

Matthew, Jeffrey, Jason, Jesse, Joshua.

>>> [x[0] for x in model.most_similar('Jennifer',topn=2000) if model.similarity(x[0],'he') > model.similarity(x[0],'she')]
[u'Matthew', u'Jeffrey', u'Jason', u'Jesse', u'Joshua', u'Evan', u'Brian', u'Cory', u'Justin', u'Shawn', u'Darrin', u'David', u'Chris', u'Kevin', u'3/dh', u'Christopher', u'Corey', u'Derek', u'Alex', u'Matt', u'Jeremy', u'Jeff', u'Greg', u'Timothy', u'Eric', u'Daniel', u'Wyvonne', u'Joel', u'Chirstopher', u'Mark', u'Jonathon']

Which is a better list of “male analogues of Jennifer?”  Matthew is certainly closer to Jennifer in word2vec distance:

>>> model.similarity('Jennifer','Matthew')

0.61308109388608356

>>> model.similarity('Jennifer','David')

0.56257556538528708

But, for whatever reason, “David” is coded as much more strongly male than “Matthew” is; that is, it is closer to “he” – “she”.  (The same is true for “man” – “woman”.)  So “Matthew” doesn’t score high in the first list, which rates names by a combination of how male-context they are and how Jennifery they are.  A quick visit to NameVoyager shows that Matthew and Jennifer both peaked sharply in the 1970s; David, on the other hand, has a much longer range of popularity and was biggest in the 1950s.
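The difference between the two protocols is easy to reproduce on a toy example.  In the sketch below the 3-dimensional vectors are invented so that one name is close to “Jennifer” but only weakly male-coded and another is farther away but strongly male-coded; “he” and “she” are taken to be exact opposites, which is a drastic simplification.  The function names are mine, not gensim’s.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

# Invented vectors (first coordinate plays the role of "gender"); not word2vec output.
vectors = {name: unit(vec) for name, vec in {
    "he":       [1.0, 0.0, 0.0],
    "she":      [-1.0, 0.0, 0.0],
    "Jennifer": [-0.4, 1.0, 0.0],
    "Matthew":  [0.2, 1.0, 0.0],   # near Jennifer, weakly male-coded
    "David":    [0.8, 0.6, 0.5],   # farther from Jennifer, strongly male-coded
    "Susan":    [-0.5, 0.3, 0.9],
}.items()}

names = ["Jennifer", "Matthew", "David", "Susan"]

def sim(a, b):
    return dot(vectors[a], vectors[b])

def by_analogy(name):
    """Rank other names by sim(n, name) + sim(n, 'he') - sim(n, 'she')."""
    others = [n for n in names if n != name]
    return sorted(others,
                  key=lambda n: sim(n, name) + sim(n, "he") - sim(n, "she"),
                  reverse=True)

def by_filter(name):
    """Nearest neighbors of name, keeping only those closer to 'he' than to 'she'."""
    others = [n for n in names if n != name]
    male_coded = [n for n in others if sim(n, "he") > sim(n, "she")]
    return sorted(male_coded, key=lambda n: sim(n, name), reverse=True)

print(by_analogy("Jennifer"))
print(by_filter("Jennifer"))
```

In this toy setup the analogy protocol puts the strongly male-coded name first and the filter protocol puts the nearer name first, which is the same pattern as David versus Matthew above.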

Let’s do it again, for Susan.  The two methods give

David, Robert, Mark, Richard, John

Robert, Jeffrey, Richard, David, Kenneth

And for Edith:

Ernest, Edwin, Alfred, Arthur, Bert

Ernest, Harold, Alfred, Bert, Arthur

Pretty good agreement!  And you can see that, in each case, the selected names are “cultural matches” to the starting name.

Sidenote:  In a way it would be more natural to project wordspace down to the orthocomplement of “he” – “she” and find the nearest neighbor to “Susan” after that projection; that’s like, which word is closest to “Susan” if you ignore the contribution of the “he” – “she” direction.  This is the operation Ben Schmidt calls “vector rejection” in his excellent post about his word2vec model trained on student evaluations.  

If you do that, you get “Deborah.”  In other words, those two names are similar in so many contextual ways that they remain nearest neighbors even after we “remove the contribution of gender.”  A better way to say it is that the orthogonal projection doesn’t really remove the contribution of gender in toto.  It would be interesting to understand what kind of linear projections actually make it hard to distinguish male first names from female ones.
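For concreteness, projecting onto the orthocomplement of a direction d just means subtracting from each vector its component along d.  Here is a minimal sketch with invented 3-dimensional vectors (not word2vec output), built so that the nearest neighbor of “Susan” after the projection is “Deborah”; it illustrates the operation, not the result on the real embedding.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def reject(v, direction):
    """Component of v orthogonal to direction ("vector rejection")."""
    coeff = dot(v, direction) / dot(direction, direction)
    return [a - coeff * d for a, d in zip(v, direction)]

# Invented vectors for illustration only.
toy = {
    "he":       [1.0, 0.0, 0.2],
    "she":      [-1.0, 0.0, 0.2],
    "Susan":    [-0.5, 0.3, 0.9],
    "Deborah":  [-0.6, 0.35, 0.85],
    "Richard":  [0.6, 0.5, 0.6],
    "Jennifer": [-0.4, 1.0, 0.0],
}

# The "he" - "she" direction that gets projected away.
gender = [a - b for a, b in zip(toy["he"], toy["she"])]

flattened = {name: reject(vec, gender)
             for name, vec in toy.items() if name not in ("he", "she")}

# After the rejection, every name has zero component along he - she.
residual = dot(flattened["Susan"], gender)

nearest = max((n for n in flattened if n != "Susan"),
              key=lambda n: cosine(flattened[n], flattened["Susan"]))
print(residual, nearest)
```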

Google News is a big enough database that this works on non-English names, too.  The male “Sylvie”, depending on which protocol you pick, is

Alain, Philippe, Serge, Andre, Jean-Francois

or

Jean-Francois, Francois, Stephane, Alain, Andre

The male “Kyoko” is

Kenji, Tomohiko, Nobuhiro, Kazuo, Hiroshi

or

Satoshi, Takayuki, Yosuke, Michio, Noboru

French and Japanese speakers are encouraged to weigh in about which list is better!

Update:  Even a little more messing around with “changing the gender of words” in a followup post.


Devil math!

The Chinese edition of How Not To Be Wrong, published by CITAC and translated by Xiaorui Hu, comes out in a couple of weeks.

[Image: cover of the Chinese edition]

The Chinese title is

魔鬼数学

or

“Mo gui shu xue”

which means “Devil mathematics”!  Are they saying I’m evil?  Apparently not.  My Chinese informants tell me that in this context “Mo gui” should be read as “magical/powerful and to some extent to be feared” but not necessarily evil.

One thing I learned from researching this is that the Mogwai from Gremlins are just transliterated “Mo gui”!  So don’t let my book get wet, and definitely don’t read it after midnight.
