A few years ago, I posted this article on Hubpages. It argues that most social science research data can’t be trusted because of the inherent difficulties in studying human thought and behavior. Today, just like Herodotus, I am vindicated.
This article on Futurity shows that a major psychological test that has been used for years to screen for potential psychiatric illness is not reliable. The differences between the scores of schizophrenics and neurotypical people are smaller than the differences in scores between less educated and more educated people, and even slightly smaller than the score differences between blacks and whites. In other words, whether you finished college has a greater impact on your score on this test than whether you are an actual schizophrenic. Yet the test supposedly measures our ability to guess what other people are thinking.
This does not mean that the designers of the test were racist or classist. It means that, like nearly every person who dares to undertake social “science,” they were naïve. They didn’t realize that it’s almost impossible to measure anything in human cognition without also measuring a bunch of other stuff, such as vocabulary and cultural norms. And often, you don’t even realize that you are measuring the other stuff instead.
Rating People’s Ratings of Pictures of Eyes
For the test, subjects were shown a series of black-and-white shots of actors’ eyes expressing various feelings. For each eye shot, they were asked to choose between four words that would best describe the eyes’ mental state. For example, in one of the samples in the article, you are asked to choose between sarcastic, suspicious, dispirited, and stern.
I am sure you have spotted the problem already. You have to be fairly literate to parse the nuances of those four words. And all the shots were like that. This explains the education disparity produced by the test. To make matters worse, in most of the examples the article gives, two or even three of the offered words could plausibly describe what the eyes are expressing. Which word was considered right on the test was decided “through consensus ratings.”
Finally, some kinds of emotional expression are culturally conditioned. For example, in many cultures people show respect by looking down, avoiding eye contact. This is not true in mainstream American white culture. So, the same eyes that might say “confident” to an educated WASP could say something very different to a Latino: “defiant,” perhaps, or “angry.”
The test designers’ naivete lay in not realizing how much of emotional expression is culturally conditioned. This is a blind spot that all of humanity shares, but in this case there were serious real-life consequences for it because this test was being used to identify people who might be at risk for psychiatric disorders, and thus might require intervention. Imagine being flagged as possibly schizophrenic because you didn’t understand the cultural norms behind a test.
However, I would argue that the deepest problem was not with the designers’ ethnocentricity but with their assumption that they were in a position to “objectively” measure human thought and predict human behavior.
You can’t do it, people.
Machine Analysis of Word Frequency
Here’s another test. This one is much more recent, better backed by data, and apparently better at predicting what it’s supposed to predict. But there’s still a problem with it.
This test, too, is used to screen for potential psychosis, which, according to the article, usually comes on in a person’s early 20s, with warning signs in the late teens. Apparently there are subtle signs of being predisposed to psychosis in a person’s language (for example, a less rich vocabulary).
For this test, researchers used an algorithm to study in detail the speech of 40 individuals in their diagnostic interviews with therapists. Based on these diagnostic interviews, a trained therapist can predict who will later develop psychosis with about 80% accuracy. The participants were then followed for 14 years (!) to discover whether they in fact developed psychosis. (Following a subject long-term is called a longitudinal study.)
It turned out that the algorithm could predict psychosis with greater than 90% accuracy. The machine found that in addition to using a lot of synonyms, another predictor of psychosis was “a higher than normal usage of words related to sound.” The researchers had not anticipated this.
I am impressed with the results the test yields and I am really impressed that the designers actually tested its results by doing longitudinal studies to find out how many of the subjects actually displayed psychosis. (I think longitudinal studies are almost the only legitimate kind of social science research.) This checking of their results already puts them light years ahead of the eyes test.
That said, I think there is a potential problem with the way the machine was trained. To create a baseline for “normal” conversation, the researchers “fed [the] program the online conversations of 30,000 users of the social media platform Reddit.”
Internet conversation defines “normal.” That should raise red flags for all of us.
Then, that baseline of “normal,” drawn from written conversations, was used to evaluate transcripts of face-to-face interviews. It looks like, in this case, the mismatch did not skew the data, given how well the test predicts psychosis. But I have a huge problem with the principle that we can diagnose people based on word frequency counts. In the wrong hands, this principle could really escape its glass cage and go rampaging across the countryside, wreaking havoc and destruction.
To take just one example, I’ve heard of a scholar (somewhere) who decided the Apostle Paul had some kind of sexual fixation because his letters so often use the word “flesh” (sarx). Never mind that Paul used the word sarx as shorthand for the deep sin nature of the unredeemed human being. When he used sarx, he was talking about a frustrating natural human inability to do good … and usually, he was talking about this phenomenon in himself.
This demonstrates how easily word-frequency studies can be manipulated to prove whatever we want. And the problem gets worse as the text being studied gets shorter, because in a short text a handful of repetitions is enough to dominate the counts.
What if you were analyzing an essay in which the author has to define a term? The term in question, and its synonyms, could come up dozens of times without being something the author is fixated on in everyday life. Using machine learning, you could “prove” that Ben Shapiro is a Nazi, because lately he’s had to spend so much time refuting that very accusation. (Shapiro is an Orthodox Jew.)
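To make the worry concrete, here is a minimal sketch of a naive word-frequency comparison. It is not the method used in the study; the function names, the toy texts, and the `floor` value are all my own inventions for illustration. It simply counts how much more often a word shows up in a short sample than in a big "normal" baseline.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all word tokens in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()}


def overuse_ratio(word, sample, baseline, floor=1e-6):
    """How many times more often `word` occurs in `sample` than in `baseline`.

    `floor` is an arbitrary stand-in rate (an assumption of this sketch)
    for words that never appear in the baseline at all.
    """
    sample_rate = relative_frequencies(sample).get(word, 0.0)
    baseline_rate = relative_frequencies(baseline).get(word, floor)
    return sample_rate / baseline_rate


# Invented toy texts, not real data.
baseline = (
    "the weather was nice today and we talked about dinner plans " * 99
    + "someone mentioned an accusation once "
)
refutation = (
    "The accusation is false. I reject the accusation, "
    "and here is why the accusation fails."
)
endorsement = (
    "The accusation is true. I accept the accusation, "
    "and here is why the accusation holds."
)

refutation_score = overuse_ratio("accusation", refutation, baseline)
endorsement_score = overuse_ratio("accusation", endorsement, baseline)

print(refutation_score, endorsement_score)
```

In this toy setup, the person refuting the accusation and the person endorsing it get exactly the same score, and both look wildly "abnormal" next to the baseline. A pure count has no way to tell them apart, which is the whole problem.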
Suffice it to say, though this particular study seems well-done, in general I am deeply suspicious of word-frequency tests, especially if they are the only measure being used, because they allow the researcher to ignore the actual content of the text in question.
So, What’s the Takeaway?
I don’t have a big moral of the story to give you here. Read the articles I linked to and decide for yourself. I am just sounding a warning that social science “data” is not nearly as objective as we tend to think it is, and may often be flat-out false.
15 thoughts on “Aha! I Knew It!”
The article on machine learning was of interest to me because that is what I am studying right now. I wish the article defined more clearly what was meant by ‘sound’ words. I think a poet or, I don’t know, a blogger with a lot of poetic content, might fall through the cracks in their algorithm because he or she might use a lot of rhyming and alliteration. I also thought it was strange that they used Reddit as a baseline for ‘normal’ human conversation. At least they did not use Twitter!
I know, right? Machine Learning … weird coincidence #9???
Based on the art near the article’s title, I take it that “sound words” means things like “Bang,” “Boom,” “Thwip,” and so on. (“Thwip” … my favorite.)
I agree about the danger if the algorithm was used on a text like a poem or blog post. Those tend to be short and focused on a specific topic, which would skew the results. I think it is designed to be used on oral conversation, and a large chunk of it.
I imagine that the reason they went to Reddit was that they needed sooo many examples of conversation on which to train the machine. It would be hard to get a comparable volume of oral conversation from the general public because you’d have to record it, you’d have to respect ethics restrictions, and transcribing it would take years probably. With Reddit posts, the conversation is already written and it’s already public. So it’s a shortcut to get a good volume of data. This is a great example of how doing really high-quality social science research is pretty much always prohibitively expensive.
Another strange coincidence, for sure.
I enjoyed reading your piece. Good job.
What leaves me feeling unsatisfied when I read the science news (in this case the Machine Learning article) is this: the stories always seem to draw conclusions without giving enough details about their experiment.
There are plenty of sources of transcribed conversations (think courtrooms, government proceedings, interviews), but they chose Reddit. Fair enough. But actual verbal conversations vs. typed conversations by people of varying levels of literacy would be two different animals.
Fortunately machine learning does not care. It takes in data, makes an algorithm, and then plots future data based on that algorithm.
I just wish the article told us more about how the experiment was conducted so I could decide if those sound noises I’m making are cause for concern.
Oh, I’m certain they could not have gotten ahold of transcripts of court cases or interviews in a way that allowed them to use the data legally. Certainly not at that volume. They would have needed to fill out a Freedom of Information Act form for each one, pay a fee I believe, and then wait months and possibly see it denied.
I agree that reporting on conclusions but not on methods is a major problem with social/cognitive science reporting. The conclusions are the part that is most likely to be wrong.
About you and your sound words … if you are not psychotic by now, you are probably safe. 🙂 Also, the fact that you are worrying about it is a good sign. But if you really want to be certain, maybe ask your wife. I hope your wife doesn’t have a cruel sense of humor.
You make a good point about the court documents. They could be tricky to get access to. City council meetings and those types of proceedings are more transparent where I live and are available without FoIA requests; however, plugging them into a computer program wouldn’t be that straightforward.
I don’t know if it was created for the study, or if it existed before, but the Reddit comments dataset is available for anyone to access (preformatted for the purposes of machine learning). That could be another reason the researchers chose it.
As you mentioned, the Reddit dataset probably gave them the volume of records they were looking for. It’s huge.
The real question is, “Am I just a chatbot created by machine learning and the Reddit dataset?”
Apparently that’s a thing and it’s not that hard to do.
As you say.
Keep us updated on chatbots as you take your course. Unless, of course, you are sworn to secrecy.
If you are a chatbot, then whoever wrote your program really outdid themselves.
Jen, Tom says the basic problem is psychologists’ compulsive need to be treated as scientists, in spite of overwhelming evidence (such as yours) to the contrary. In other words, I knew it too!
Thanks for chiming in, Tom. The more people who “knew it,” the better. The Emperor has no clothes and all that.
Also, having dethroned “scientists” as the experts on human society and psychology, we now have to turn for wisdom to wise tribal elders like yourself.
First time I’ve been called a tribal elder. I kind of like it.
Such an interesting piece! And it makes a lot of sense to me and if I’m honest I can’t say I’m surprised. I think that these tests get skewed in many ways- including (but probably not limited to) social class. Often, I think that people are slanting the test to get the answer they want, cos I’ve also seen flaws in many social experiments (maybe an extreme example, but I once watched a video of a social science experiment where they were trying to show babies have an inherent sense of right and wrong- they had babies choose between a well-behaved and badly behaved toy- not only was there no control group, cos goodness knows how they’d figure that out, but I saw the person running the experiment waving the child towards the toy they wanted them to pick… which the babies didn’t pick 100% of the time. Point is, they had a foregone conclusion that you don’t need to teach children right from wrong, then set about “proving” it). A word frequency test doesn’t seem like it would be helpful- there are so many ways it could go wrong. In relation to your Ben Shapiro example, there was a recent article in the NYT about how listening to what he has to say can lead you down a fascistic rabbithole and radicalise you (although the guy in the article became left wing, so I’m not sure how that works)- so I feel like they may just cut out the middle man and consider the search term “Ben Shapiro” radical enough 😉 (which again, is a problem because you don’t know the context in which a person is listening to him- it could be someone that agrees, or someone on the left readying their counter arguments, or even a fascist hate-watching… so yeah the more I think about it, the more I’m disturbed by the idea that some people think you can use tech to mind-read intent).
Thanks for your insightful comments.
As someone who has done actual social science research, my experience has been that like it or not, you tend to definitely want a certain finding – or any clear finding, sometimes – and it’s really hard not to try to manipulate the experiment to get that. Maybe other researchers are more detached than I am, but I kind of doubt it.
About search terms and politics … hoo boy. Yes, I agree with you. The problem is that anyone who wants to can link “Nazi” with your name, even your enemies. And once it’s linked, there seems to be no unlinking it. Unless people were to read the actual context … ha ha …
That is really interesting to hear (and yet not surprising!)
haha unfortunately it’s much easier to just read headlines 😉