Tag Archives: data

Every Noise At Once

Glenn McDonald is the guy who wrote the amazing, obsessive, beautiful music blog The War Against Silence, now mostly dormant.  I admire him for writing tens of thousands of words about Alanis Morissette, whom he, and I, and maybe nobody else, still consider an important cultural figure.  He’s also a pretty hardcore data analyst.  I’ve often fallen down the rabbit hole of his analysis of the Pazz and Jop ballots.

Now he works for Echo Nest, the Greater Boston music startup that sponsored the Music Hack Day I participated in a couple of years ago.  And his latest project, Every Noise At Once, is a map of all music.  Seriously!  A map of all music!  By which I mean: an embedding of the set of genres tracked by EN into the Euclidean plane, and, for each genre, an embedding of bands tagged in that genre.

Play with it here.


Humanities Hackathon

Had a great time today talking graph theory with a roomful of students and faculty in the humanities at the Humanities Hackathon.  Here’s a link to my slides (big .ppt file).  One popular visualization was this graph of baby boys’ names from 2011, where two names are adjacent if their popularity profiles across 12 representative states are very similar.  (For example, names similar to “Malachi” on this measure include “Ashton” and “Kaden,” while names similar to “Patrick” include “Thomas,” “John,” “Sean,” and “Ryan.”)
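If you want to build a graph like this yourself, the recipe is short: treat each name as a vector of popularity across states, and connect two names when the vectors are highly correlated.  A sketch in Python (the data and the cutoff here are made up for illustration, not the ones behind the picture above):

```python
# Build a similarity graph: names are adjacent when their popularity
# profiles across states are very similar.  Threshold is illustrative.
import numpy as np

def similarity_graph(profiles, threshold=0.95):
    """profiles: dict mapping name -> vector of popularity by state.
    Returns the list of edges between highly correlated names."""
    names = sorted(profiles)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = np.corrcoef(profiles[a], profiles[b])[0, 1]
            if r >= threshold:
                edges.append((a, b))
    return edges
```

Feed the edge list to Gephi and you get a picture like the one above.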


The visualization is by the open-source graph-viz tool Gephi.

I came home only to encounter this breathless post from the Science blog about a claim that you can use network invariants (e.g. clustering coefficient, degree distribution, correlation of degree between adjacent nodes) to distinguish factually grounded narratives like the Iliad from entirely fictional ones like Harry Potter.  The paper itself is not so convincing.  For instance, its argument on “assortativity,” the property that high-degree nodes tend to be adjacent to one another, goes something like this:

Real-life social networks tend to be assortative, in the sense that the number of friends I have is positively correlated with the number of friends my friends have.

The social network they write down for the Iliad isn’t assortative, so they remove all the interactions classified as “hostile,” and then it is.

The social network for Beowulf isn’t assortative, so they remove all the interactions classified as “hostile,” and then it still isn’t, so they take out Beowulf himself, and then it is, but just barely.

Conclusion: The social networks of Beowulf and the Iliad are assortative, just like real social networks.

Digital humanities can be better than this!
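For what it’s worth, assortativity isn’t hard to check yourself: it’s just the Pearson correlation between the degrees at the two ends of each edge.  Here’s a sketch (the hub-and-spoke example is mine, not from the paper):

```python
# Degree assortativity is a Pearson correlation: list the degree pairs
# at the two ends of every edge (in both orientations) and correlate.
import numpy as np

def assortativity(edges):
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    xs, ys = [], []
    for a, b in edges:
        xs += [deg[a], deg[b]]   # both orientations of each edge
        ys += [deg[b], deg[a]]
    return np.corrcoef(xs, ys)[0, 1]

# A star (one hub, five spokes) is perfectly disassortative: every edge
# joins the high-degree hero to a degree-one minor character.
star = [(0, i) for i in range(1, 6)]
```

Remove the hub (as the authors remove Beowulf) and you change the answer, which is exactly the problem.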



Raw polling data as playground

This is a picture of the American electorate!

More precisely: this is a scatterplot I just made using the dataset recently released by PPP, a major political polling firm.  (They’re the outfit that did the “is your state hot or not” poll I blogged about last week.)  PPP has made available the raw responses from 46 polls with 1000 responses each, conducted more or less weekly over the course of 2011.  Here’s the whole thing as a .zip file.

Analyzing data sets like this is in some sense not hard.  But there’s a learning curve.  Little things, like: you have to know that the .csv format is beautifully portable and universal — it’s the ASCII of data.  You have to know how to get your .csv file into your math package of choice (in my case, Python, but I think I could easily have done this in R or MATLAB as well) and you have to know where to get a PCA package, if it’s not already installed.  And you have to know how to output a new .csv file and make a graphic from it when you’re done.  (As you can see, I haven’t quite mastered this last part, and have presented you with a cruddy Excel scatterplot.)  In total, this probably took me about three hours to do, and now that I have a data-to-picture path I understand how to use, I think I could do it again in about 30 minutes.  It’s fun and I highly recommend it.  There’s a lot of data out there.
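For the curious, the skeleton of that data-to-picture path looks something like this (the file name and column layout are hypothetical stand-ins; here PCA is rolled by hand via the SVD rather than fetched from a package):

```python
# Sketch of the csv -> PCA -> csv pipeline.  "poll.csv" is hypothetical.
import numpy as np
from numpy.linalg import svd

def load_csv(path):
    # each row is one respondent, each column one coded answer
    return np.genfromtxt(path, delimiter=",", skip_header=1)

def pca_project(X, k=2):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)           # center each column
    U, s, Vt = svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T              # n-by-k matrix of scores

def save_csv(path, Y):
    np.savetxt(path, Y, delimiter=",")
```

Then `save_csv` hands the two score columns to your plotting tool of choice (Excel, in my cruddy case).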

So what is this picture?  The scatterplot has 1000 points, one for each person polled in the December 15, 2011 PPP survey.  The respondents answered a bunch of questions, mostly about politics:

Q1: Do you have a favorable or unfavorable opinion of Barack Obama?
Q2: Do you approve or disapprove of Barack Obama’s job performance?
Q3: Do you think Barack Obama is too liberal, too conservative, or about right?
Q4: Do you approve or disapprove of the job Harry Reid is doing?
Q5: Do you approve or disapprove of the job Mitch McConnell is doing?
Q6: Do you have a favorable or unfavorable opinion of the Democratic Party?
Q7: Do you have a favorable or unfavorable opinion of the Republican Party?
Q8: Generally speaking, if there was an election today, would you vote to reelect Barack Obama, or would you vote for his Republican opponent?
Q9: Are you very excited, somewhat excited, or not at all excited about voting in the 2012 elections?
Q10: If passed into law one version of immigration reform that people have discussed would secure the border and crack down on employers who hire illegal immigrants. It would also require illegal immigrants to register for legal immigration status, pay back taxes, and learn English in order to be eligible for U.S. citizenship. Do you favor or oppose Congress passing this version of immigration reform?
Q11: Have you heard about the $10,000 bet Mitt Romney challenged Rick Perry to in last week’s Republican Presidential debate?
Q12: (Asked only of those who say ‘yes’ to Q11:) Did Romney’s bet make you more or less likely to vote for him next year, or did it not make a difference either way?
Q13: Do you believe that there’s a “War on Christmas” or not?
Q14: Do you consider yourself to be a liberal, moderate, or conservative?
Q15: Do you consider yourself to be a supporter of the Tea Party or not?
Q16: Are you or is anyone in your household a member of a labor union?
Q17: If you are a woman, press 1. If a man, press 2.
Q18: If you are a Democrat, press 1. If a Republican, press 2. If you are an independent or a member of another party, press 3.
Q19: If you are Hispanic, press 1. If white, press 2. If African American, press 3. If Asian, press 4. If you are an American Indian, press 5. If other, press 6.
Q20: (Asked only of people who say American Indian on Q19:) Are you enrolled in a federally recognized tribe?
Q21: If you are 18 to 29 years old, press 1. If 30 to 45, press 2. If 46 to 65, press 3. If you are older than 65, press 4.
Q22: What part of the country do you live in NOW – the Northeast, the Midwest, the South, or the West?
Q23: What is your household’s annual income?

The answers to these questions, which are coded as integers, now give us 1000 points in R^{23}.  Our eyes are not good at looking at point clouds in 23-dimensional space.  So it’s useful to project down to R^2, that most bloggable of Euclidean spaces.  But how?  We could just look at two coordinates and see what we get.  But this requires careful choice.  Suppose I map the voters onto the plane via their answers to Q1 and Q2.  The problem is, almost everyone who has a favorable opinion of Barack Obama approves of his job performance, and vice versa.  Considering these two features is hardly better than considering only one feature.  Better would be to look at Q8 and Q21; these two variables are surely less correlated, and studying both together would give us good information on how support for Obama varies with age.  But still, we’re throwing out a lot.  Principal component analysis is a very popular quick-n-dirty method of dimension reduction; it finds the projection onto R^2 (or a Euclidean space of any desired dimension) which best captures the variance in the original dataset.  In particular, the two axes in the PCA projection have correlation zero with each other.
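Here’s a toy demonstration of the point, with synthetic data standing in for Q1, Q2, and Q21 rather than the actual PPP responses: two nearly redundant answers are strongly correlated, while the two PCA coordinates come out uncorrelated by construction.

```python
# Synthetic demo: raw answers can be nearly redundant, but the two
# PCA coordinates are uncorrelated by construction.
import numpy as np

rng = np.random.default_rng(0)
approve = rng.integers(0, 2, size=500).astype(float)  # stand-in for Q1
favor = approve.copy()
flip = rng.random(500) < 0.05                         # 5% disagree
favor[flip] = 1 - favor[flip]                         # Q2 tracks Q1
age = rng.integers(1, 5, size=500).astype(float)      # stand-in for Q21

X = np.column_stack([approve, favor, age])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                                # PCA coordinates

print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])            # close to 1
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])  # essentially 0
```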

A projection from R^23 to R^2 can be expressed by two vectors, each one of which is some linear combination of the original 23 variables.  The hope is always that, when you stare at the entries of these vectors, the corresponding axis has some “meaning” that jumps out at you.  And that’s just what happens here.

The horizontal axis is “left vs. right.”  It assigns positive weight to approving of Obama, identifying as a liberal, and approving of the Democratic Party, and negative weight to supporting the Tea Party and believing in a “War on Christmas.”  It would be very weird if any analysis of this kind of polling data didn’t pull out political affiliation as the dominant determinant of poll answers.

The second axis is “low-information voter vs. high-information voter,” I think.  It assigns a negative value to all answers of the form “don’t know / won’t answer,” and positive value to saying you are “very excited to vote” and having heard about Mitt Romney’s $10,000 bet.  (Remember that?)

And now the picture already tells you something interesting.  These two variables are uncorrelated, by definition, but they are not unrelated.  The voters split roughly into two clusters, the Democrats and the Republicans.  But the plot is “heart-shaped” — the farther you go into the low-information voters, the less polarization there is between the two parties, until in the lower third of the graph it is hard to tell there are two parties at all.  This phenomenon is not surprising — but I think it’s pretty cool that it pops right out of a completely automatic process.

(I am less sure about the third-strongest axis, which I didn’t include in the plot.  High scorers here, like low scorers on axis 2, tend to give a lot of “don’t know” answers, except when asked about Harry Reid and Mitch McConnell, whom they dislike.  They are more likely to say they’re “not at all excited to vote” and more likely to be independents.  So I think one might call this the “to hell with all those crooks” axis.)

A few technical notes:  I removed questions, like “region of residence,” that didn’t really map onto a linear scale, and others, like “income,” that not everyone answered.  I normalized all the columns to have equal variance.  I made new 0-1-valued columns to record “don’t know” answers.  Yes, I know that many people consider it bad news to run PCA on binary variables, but I decided that since I was just trying to draw pictures and not infer anything, it would be OK.
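In code, that preprocessing might look like the following (the code 9 for “don’t know” is a hypothetical stand-in for whatever the actual codebook uses):

```python
# Sketch of the preprocessing: fill "don't know" answers, record them
# in new 0-1 columns, and normalize everything to equal variance.
import numpy as np

DONT_KNOW = 9  # hypothetical response code

def preprocess(X):
    cols = []
    for j in range(X.shape[1]):
        col = X[:, j].astype(float)
        dk = col == DONT_KNOW
        if dk.any():
            col[dk] = col[~dk].mean()      # neutral fill for missing answers
            cols.append(col)
            cols.append(dk.astype(float))  # new 0-1 "don't know" column
        else:
            cols.append(col)
    Y = np.column_stack(cols)
    Y = Y - Y.mean(axis=0)
    sd = Y.std(axis=0)
    return Y / np.where(sd == 0, 1.0, sd)  # constant columns stay at zero
```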


Music Hack Day

Andrew Bridy, Lalit Jain, Ben Recht and I spent the weekend in Cambridge at Music Hack Day, organized by the Echo Nest and sponsored by just about every company you can think of that cares about both music and technology.  We hacked in a somewhat different spirit than most of the folks there; for us, the Million Song Dataset isn’t a tool for app-building, but a playground where we can test ideas about massive networks and information retrieval.

(Re app-building:  Bohemian Rhapsicord.  Chrome-only.)

We’ve actually been playing with the MSD for a few weeks, and I’ll probably post some of those results later, but let’s start with what we did this weekend.  We wanted to see what aspects of the rules of melody we could find in the dataset.  Which notes like to follow which other notes?  Which chords like to follow which other chords?  If you took piano lessons as a kid you already know the answers to these questions.  Which is kind of the point!  When you start to dig into a giant dataset, the first thing you’d better do is check that it can tell you the things you already know.

We quickly found out that getting a handle on the melodies wasn’t so easy.  The song files in the MSD aren’t transcribed from scores, and they don’t have notes:  there’s pitch data, but it’s in the form of chromata; these keep good track of how the energy of a song segment is distributed across frequency bands, but they don’t necessarily correspond well to notes.  (For instance, what does the chroma of a drum hit sound like?)  We found that only about 2% of the songs in the sample had chromata that were “clean” enough to let us infer notes.
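Concretely, a filter in this spirit might accept a segment only when one pitch class clearly dominates the 12-bin chroma vector.  A sketch (the dominance threshold is arbitrary, not necessarily the one we actually used):

```python
# A plausible "clean chroma" filter: call a segment a note only when a
# single pitch class dominates its 12-bin chroma vector.
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def segment_note(chroma, dominance=2.0):
    """Return the pitch class of a chroma vector, or None if no bin
    carries at least `dominance` times the energy of the runner-up."""
    c = np.asarray(chroma, dtype=float)
    order = np.argsort(c)
    top, second = c[order[-1]], c[order[-2]]
    if second == 0 or top / second >= dominance:
        return PITCH_CLASSES[order[-1]]
    return None
```

A drum hit, with its energy smeared across all twelve bins, fails this test; a cleanly sung note passes.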

But here’s the good thing about a million — 2% of a million is still a lot!  Actually, to save time, we only analyzed about 100,000 songs — but that still gave us a couple of thousand songs’ worth of chroma to work with.  We threw out all the songs Echo Nest thought were in minor keys, and transposed everything to C.  Then we put all the bigrams, or pairs of successive notes, in a big bag, and computed the frequency of each one in the sample.  And this is what we saw:

Pretty nice, right?  The size of the circle represents the frequency of the note.  C (the tonic) and G (the dominant) are the most common notes, just as they should be.  And the notes that are actually in the C-major scale are noticeably more frequent than those that aren’t.  The arrow from note x to note y represents the probability that the note following an x will be y; the thicker and redder the arrow, the greater the transition probability.  These, too, look just as they should.  The biggest red arrow is the one from B to C, which is because a major seventh (correction from commenter: a leading tone) really wants to resolve to tonic.  And the strong “Louie Louie” clique joining C, F, and G is plain to see.
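The bigram bookkeeping itself is simple.  In Python it might look like this (with note names standing in for the chroma-derived encoding we actually used):

```python
# Put every pair of successive notes in a big bag, then normalize each
# row of counts into transition probabilities.
from collections import Counter, defaultdict

def transition_probs(melodies):
    """melodies: iterable of note-name lists, already transposed to C.
    Returns probs[a][b] = P(next note is b | current note is a)."""
    counts = defaultdict(Counter)
    for melody in melodies:
        for a, b in zip(melody, melody[1:]):
            counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: n / total for b, n in row.items()}
    return probs
```

Random-walking on `probs`, as Lalit’s program did, is then only a few more lines.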

Once you have these numbers, you can start to play around.  Lalit wrote a program that generated notes by random-walking along the graph above: the resulting “song” sounds kind of OK!  You can hear it at the end of our 2-minute presentation:

Once you have this computation, you can do all kinds of fun things.  For example, which songs in the database have the most “unusual” melodies from the point of view of this transition matrix?  It turns out that many of the top scorers are indeed songs whose key Echo Nest has misclassified, or which are in keys (like blues scale) that Echo Nest doesn’t recognize.  There’s also a lot of stuff like this:

Not exactly “Louie Louie.”  Low scorers often sound like this Spiritualized song, with big dynamic shifts but not much tonal straying from the old I-IV-V (and in this case, I think it’s mostly the big red I-V).
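One way to make “unusual” precise: score each melody by the average negative log-probability of its bigrams under the transition matrix, so rare transitions cost a lot.  A sketch (illustrative, not exactly the scoring we ran):

```python
# Hypothetical "surprise" score for a melody: average negative
# log-probability of its bigrams under a transition matrix.
import math

def surprise(melody, probs, floor=1e-6):
    """Higher = more unusual.  probs[a][b] is P(next=b | current=a);
    unseen transitions get a small floor probability."""
    pairs = list(zip(melody, melody[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        p = probs.get(a, {}).get(b, floor)
        total -= math.log(max(p, floor))
    return total / len(pairs)
```

Songs in misclassified or unrecognized keys rack up lots of low-probability transitions and float to the top of such a ranking.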

A relevant paper:  “Clustering beat-chroma patterns in a large music database,” by Thierry Bertin-Mahieux, Ron Weiss, and Daniel Ellis.

Here I am talking linear algebra with Vladimir Viro, who built the amazing Music N-gram Viewer.

[Photo: DSC_0179 by thomasbonte, on Flickr]

Note our team slogan, a bit hard to read on a slant:  “DO THE STUPIDEST THING FIRST.”


Active clustering

Just a note — the paper “Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities,” by Eriksson, Dasarathy, Singh, and Nowak, describing the algorithm which I (but not they) refer to as “compressed clustering,” is now on the arXiv.

