I have a piece in Slate today about the classification of personality disorders in the new DSM, and the NRC graduate school rankings. OK, they don’t really let me mention finite metric spaces in Slate. But that’s what’s going on behind the lines, and it’s a problem I’ve been wrestling with. Let’s say you have a finite metric space M; that is, a finite set of points with a distance assigned to each pair, satisfying the usual axioms. Now there’s a whole world of algorithms (multidimensional scaling and its many cousins) to embed M in a Euclidean space of some reasonably small dimension without messing up the metric too much. And there’s a whole world of hierarchical clustering algorithms that embed M in the set of leaves of a tree.
But I don’t really know a principled way to decide which one of these things to do.
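For concreteness, here’s a minimal sketch of the two options, using scikit-learn’s MDS and scipy’s hierarchical clustering; the four-point distance matrix is made up purely for illustration.

```python
# A minimal sketch of the two embeddings, assuming numpy, scipy, and
# scikit-learn are available. The distance matrix below is illustrative only.
import numpy as np
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# A finite metric space on 4 points, given as a symmetric distance matrix.
D = np.array([[0.0, 1.0, 2.0, 2.5],
              [1.0, 0.0, 1.5, 2.0],
              [2.0, 1.5, 0.0, 1.0],
              [2.5, 2.0, 1.0, 0.0]])

# Option 1: embed M in R^2, distorting the metric as little as possible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print("Euclidean coordinates:\n", coords)
print("stress (total distortion):", mds.stress_)

# Option 2: embed M in the leaves of a tree (average-linkage clustering).
Z = linkage(squareform(D), method="average")
print("linkage matrix (encodes the tree):\n", Z)
```

Both runs succeed on any finite metric; the question the post is asking is which output you should believe, and I don’t know a principled answer.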
Stuff there wasn’t room for in the piece — I should have mentioned Ian Hacking’s book Mad Travelers, which gives a very rich humanistic account of the process by which categories of mental illness are generated. And when I talked about the difficulty of crushing a finite metric down to one dimension, I should have linked to Cosma Shalizi’s “g, a statistical myth.”
I’d be interested in seeing Leland Wilkinson’s analysis of the NRC data. Is it available anywhere? I can’t find it on his webpage, and a quick Google search doesn’t turn it up.
He sent it to me by e-mail — I’ll ask him if he wants me to share it on the blog.
> But I don’t really know a principled way to decide which one of these things to do.
It sounds a lot like choosing a data compression algorithm in computational contexts. How much memory do I want to save? How much loss am I willing to incur? Do I want an algorithm that achieves high compression 90% of the time, or not-so-high compression 99% of the time? And since different algorithms work better on different types of data (images, text, etc.), all of this is affected by what kind of data you have.
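For a concrete toy version of that trade-off, using nothing beyond Python’s standard library, zlib’s compression levels trade ratio against time on the same input:

```python
# A toy illustration of the ratio-vs-time trade-off, standard library only.
import time
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 5000

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    ratio = len(compressed) / len(data)
    print(f"level {level}: {len(compressed)} bytes "
          f"({ratio:.1%} of original), {elapsed_ms:.2f} ms")
```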
Do you know of any principled way of choosing a data compression algorithm? I’m not aware that there’s anything other than rules of thumb; then again, I know virtually nothing about it.
I propose a natural generalization of the question you ask:
How do you cluster all the clustering algorithms??
:)
Emoticon all you will, but there are people who do interesting work on this very problem, e.g. our own Michael Coen.
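For a taste of what that could look like, here’s a toy sketch (an illustration with scikit-learn, not Coen’s actual method): run several clustering algorithms on the same data, measure pairwise disagreement between their outputs with the adjusted Rand index, and then hierarchically cluster the algorithms themselves.

```python
# A toy sketch of "clustering the clustering algorithms": compare the outputs
# of several algorithms on one dataset, then build a tree on the algorithms.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

labelings = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "average": AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X),
    "ward": AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X),
    "spectral": SpectralClustering(n_clusters=3, random_state=0).fit_predict(X),
}

names = list(labelings)
n = len(names)
# Disagreement = 1 - adjusted Rand index: a crude "distance" between clusterings.
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - adjusted_rand_score(labelings[names[i]], labelings[names[j]])
        dist[i, j] = dist[j, i] = d

print(names)
print(np.round(dist, 3))
print(linkage(squareform(dist), method="average"))  # a tree on the algorithms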
I’ve been into personality theory for the last 12 years, and thus have some beefs with the dimensional (factor) approach coming to the DSM and common in popular personality systems such as the MBTI and the Big 5.
Factor approaches seem more appropriate for delving into neurology from the standpoint of individual psychology (a top-down approach). Cluster analysis seems more appropriate for personality typing in the first place. Then, once you have the personality type, I like the approach of Enneagram theorists, who go dimensional *within* that personality cluster, speaking of neurotic vs. non-neurotic actions and motivations for actions. Then you can find parallel actions between different personalities, factoring across the personalities at equivalent levels or modes of action to determine what commonalities, if any, exist, and whether those commonalities stem from the same origin.
An ideal factor analysis would pull out the specific origins of the majority of the factors, show the cross-connections between them, and show which factors are basic and which are emergent. Eventually, under unrealistically ideal circumstances, you’d basically have a map of the mind that includes the biological and chemical parts of the brain in their actual relations (to the extent possible, given that not all brains are alike).
An ideal cluster analysis would pull out the basic personality types, and only those, and would show the relations, distances, and orthogonalities between them, so we could have a real map of the personalities that can actually exist in humans (and that can emerge from the neurological factors).
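To make the dimensional-vs-typological contrast concrete, here’s a minimal sketch on synthetic data (scikit-learn, purely illustrative, with made-up “questionnaire” data standing in for real measurements): a factor model assigns each person continuous scores on latent dimensions, while a cluster model assigns each person a discrete type.

```python
# A minimal sketch contrasting the two tools: continuous factors vs. discrete types.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

# Pretend these are questionnaire scores for 300 people on 10 items,
# generated so that three latent "types" actually exist.
X, true_type = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# Dimensional (factor) view: two continuous scores per person.
factors = FactorAnalysis(n_components=2, random_state=0).fit_transform(X)
print("first person's factor scores:", factors[0])

# Typological (cluster) view: one discrete label per person.
types = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("first person's type:", types[0])
```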
Too often it seems like people have a tool they like, and don’t even stop to think if another tool would be more appropriate.
I know almost nothing about the math involved with cluster analysis and factor analysis, so this could be so much moronic babbling.