A few years ago, I posted this article on Hubpages. It argues that most social science research data can’t be trusted because of the inherent difficulties in studying human thought and behavior. Today, just like Herodotus, I am vindicated.
This article on Futurity shows that a major psychological test that has been used for years to screen for potential psychiatric illness is not reliable. The differences between the scores of schizophrenics and neurotypical people are smaller than the differences between less educated and more educated people, and even slightly smaller than the score differences between blacks and whites. In other words, whether you finished college has a greater impact on your score on this test than whether you are an actual schizophrenic. Yet the test supposedly measures our ability to guess what other people are thinking.
This does not mean that the designers of the test were racist or classist. It means that, like nearly every person who dares to undertake social “science,” they were naïve. They didn’t realize that it’s almost impossible to measure anything in human cognition without also measuring a bunch of other stuff, such as vocabulary and cultural norms. And often, you don’t even realize that you are measuring the other stuff instead.
Rating People’s Ratings of Pictures of Eyes
For the test, subjects were shown a series of black-and-white shots of actors’ eyes expressing various feelings. For each shot, they were asked to choose which of four words best described the mental state the eyes were expressing. For example, in one of the samples in the article, you are asked to choose between sarcastic, suspicious, dispirited, and stern.
I am sure you have spotted the problem already. You have to be fairly literate to parse the nuances of those four words. And all the shots were like that. This explains the education disparity the test produced. To make matters worse, in most of the examples the article gives, two or even three of the offered words could plausibly describe what the eyes are expressing. Which word counted as correct on the test was decided “through consensus ratings.”
Finally, some kinds of emotional expression are culturally conditioned. For example, in many cultures people show respect by looking down, avoiding eye contact. This is not true in mainstream American white culture. So, the same eyes that might say “confident” to an educated WASP could say something very different to a Latino: “defiant,” perhaps, or “angry.”
The test designers’ naïveté lay in not realizing how much of emotional expression is culturally conditioned. This is a blind spot all of humanity shares, but in this case it had serious real-life consequences, because this test was being used to identify people who might be at risk for psychiatric disorders and thus might require intervention. Imagine being flagged as possibly schizophrenic because you didn’t understand the cultural norms behind a test.
However, I would argue that the deepest problem was not with the designers’ ethnocentricity but with their assumption that they were in a position to “objectively” measure human thought and predict human behavior.
You can’t do it, people.
Machine Analysis of Word Frequency
Here’s another test. This one is much more recent, better backed by data, and apparently better at predicting what it’s supposed to predict. But there’s still a problem with it.
This test, too, is used to screen for potential psychosis, which, according to the article, usually comes on in a person’s early 20s, with warning signs in the late teens. Apparently there are subtle signs of a predisposition to psychosis in a person’s language (for example, a less rich vocabulary).
For this test, researchers used an algorithm to analyze in detail the speech of 40 individuals during their diagnostic interviews with therapists. Based on these diagnostic interviews, a trained therapist can predict who will later develop psychosis with about 80% accuracy. The participants were then followed for 14 years (!) to discover whether they in fact developed psychosis. (Following subjects long-term like this is called a longitudinal study.)
It turned out that the algorithm could predict psychosis with greater than 90% accuracy. The machine found that, in addition to heavy use of synonyms, another predictor of psychosis was “a higher than normal usage of words related to sound.” The researchers had not anticipated this.
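The article doesn’t publish the algorithm itself, so what follows is only a toy sketch of the two signals it describes: vocabulary variety (a crude stand-in for the synonym finding) and sound-related word usage. The word list, function name, and feature names here are my own illustrative inventions, not the researchers’:

```python
import re

# Hypothetical word list -- the article doesn't say which sound words counted.
SOUND_WORDS = {"hear", "hearing", "heard", "loud", "noise", "voice",
               "voices", "sound", "sounds", "whisper", "echo"}

def language_features(transcript: str) -> dict:
    """Two crude proxies for the language signals described above."""
    words = re.findall(r"[a-z]+", transcript.lower())
    return {
        # Type-token ratio: how varied the vocabulary is overall.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of the transcript drawn from sound-related words.
        "sound_word_rate": sum(w in SOUND_WORDS for w in words) / len(words),
    }

print(language_features("I keep hearing a voice at night, a loud voice"))
```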
I am impressed with the results the test yields, and I am really impressed that the designers actually verified those results with a longitudinal study to find out how many of the subjects went on to develop psychosis. (I think longitudinal studies are almost the only legitimate kind of social science research.) Checking their results this way already puts them light-years ahead of the eyes test.
That said, I think there is a potential problem with the way the machine was trained. To create a baseline for “normal” conversation, the researchers “fed [the] program the online conversations of 30,000 users of the social media platform Reddit.”
Internet conversation defines “normal.” That should raise red flags for all of us.
Then that baseline of “normal,” derived from written conversations, was used to evaluate transcripts of face-to-face interviews. It looks like, in this case, the mismatch did not skew the data, given how well the test predicts psychosis. But I have a huge problem with the principle that we can diagnose people based on word-frequency counts. In the wrong hands, this principle could really escape its glass cage and go rampaging across the countryside, wreaking havoc and destruction.
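To make the worry concrete, here is a minimal sketch of the general principle, with everything about it hypothetical: build a word-frequency baseline from some reference corpus (Reddit, in the study), then flag whatever a subject says “too often” relative to it. Notice that “too often” is defined entirely by whichever corpus you happened to choose as “normal”:

```python
from collections import Counter
import re

def frequencies(text: str) -> dict:
    """Relative frequency of each word in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w: c / len(words) for w, c in Counter(words).items()}

def overused_words(interview: str, baseline: dict, factor: float = 3.0) -> dict:
    """Words the subject uses at least `factor` times as often as the baseline."""
    return {w: f for w, f in frequencies(interview).items()
            if f > factor * baseline.get(w, 1 / 1_000_000)}

# The baseline corpus *is* the definition of "normal" here.
baseline = frequencies("we talked about work and the weather and about "
                       "what to eat for dinner and then about work again")
print(overused_words("the noise at work never stops the noise follows me home",
                     baseline))
# With texts this small, almost every word looks anomalous -- which is
# exactly the problem discussed below.
```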
To take just one example, I’ve heard of a scholar (somewhere) who decided the Apostle Paul had some kind of sexual fixation because his letters so often use the word “flesh” (sarx). Never mind that Paul used the word sarx as shorthand for the deep sin nature of the unredeemed human being. When he used sarx, he was talking about a frustrating natural human inability to do good … and usually, he was talking about this phenomenon in himself.
This demonstrates how easily word-frequency studies can be manipulated to prove whatever we want. And the problem gets bigger the smaller the text being studied.
What if you were analyzing an essay in which the author has to define a term? The term in question, and its synonyms, could come up dozens of times without being something the author is fixated on in everyday life. Using machine learning, you could “prove” that Ben Shapiro is a Nazi, because lately he’s had to spend so much time refuting that very accusation. (Shapiro is an Orthodox Jew.)
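Here is a toy demonstration of that small-sample problem. The passage is invented, but the effect is generic: any short text that defines a term will make that term look like an obsession to a raw frequency count:

```python
from collections import Counter
import re

# A short invented passage in which the author has to define a term.
essay = ("By entropy I do not mean disorder. Entropy, properly defined, "
         "measures uncertainty; entropy rises as possibilities multiply. "
         "So when critics say entropy equals chaos, they misuse entropy.")

words = re.findall(r"[a-z]+", essay.lower())
print(Counter(words).most_common(3))
# 'entropy' appears 5 times in a 27-word passage -- a raw frequency count
# reads this as fixation, when the author is simply defining a term.
```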
Suffice it to say, though this particular study seems well-done, in general I am deeply suspicious of word-frequency tests, especially if they are the only measure being used, because they allow the researcher to ignore the actual content of the text in question.
So, What’s the Takeaway?
I don’t have a big moral of the story to give you here. Read the articles I linked to and decide for yourself. I am just sounding a warning that social science “data” is not nearly as objective as we tend to think it is, and may often be flat-out false.