Written by Matthew Archer.
The thirteenth stroke of the clock is not only false of itself, but casts grave doubt on the credibility of the preceding twelve.
— Mark Twain (allegedly)
Yet more holes in the idea that human psychology flutters in a breeze of trivial situations, like a tired butterfly.
— Timothy Bates (responding to a bullshit one-shot intervention study)
I do not recommend this book. I do not recommend reading it. I do not recommend believing anything in it. It is one of the most insanely idiotic books I have read, and at no point approaches anything which could be considered clearly not a false positive or grossly overestimated; everyone is dumber for it existing. I award it 0 stars and may God have mercy on its author’s soul.
Every now and then I flick through a modern social science textbook scanning for interesting studies and updates in a particular field. Every so often, though, I find eye-wateringly bad studies and wonder how on God’s green Earth they became textbook worthy! After all, an introductory textbook is really meant to read like a Bible for the field. Unless the study in question is being showcased precisely because it’s a useful guide to what not to do, you don’t expect uncritical acceptance of utter nonsense. Well, today I found a corker.
This study had been hiding in plain sight. I’d been aware of it for years, but never checked out the original data. To introduce it, I shall quote its summary from the British sociology textbook I happened to be reading:
Robert Rosenthal and Lenore Jacobsen (1968) carried out a field experiment on a Californian primary school. Rosenthal and Jacobson informed the school that they had a special test that would reveal which children were likely to be ‘spurters’ or children with particular academic abilities. In reality, this was simply a normal IQ test. Teachers gave the tests to the children and the researchers randomly picked 20 per cent of the sample and reported falsely to the teachers that these students were spurters.
The result was that, a year later, nearly half of those children labelled as ‘spurters’ had made most gains in terms of their abilities.
This study demonstrates the power of experiments in revealing teacher labelling and how students respond to labelling through their self-fulfilling prophecy. This study has been widely criticised, however, for ethical issues. The researchers ignored the potentially long-term damaging effect on students academically and socially.
This became known as the Pygmalion effect or the Rosenthal effect. In the Greek myth, Pygmalion is a king and sculptor who falls so in love with the perfectly beautiful statue he created that the statue comes to life.
I should say that when the study was published in 1968, Rosenthal was professor of social psychology (at Harvard University no less). Of course, there’s more difference between a crêpe and a pancake than between sociology and social psychology. What’s more, the study has become very famous in sociology. It’s probably the case that most sociology undergrads learn about it, but I doubt whether the same can be said for psych majors.
And what about Jacobson? She was an elementary school principal in the South San Francisco Unified School District who contacted Rosenthal after reading a write-up in American Scientist of one of his intriguing rat studies. More on those later. “If you ever ‘graduate’ to classroom children”, Jacobson wrote, “please let me know whether I can be of assistance”. She soon would be.
Another caveat before we delve into the famous experiment: the weak version of the theory behind Rosenthal’s results is really nothing but an uncontroversial truism. Our expectations affect both ourselves and other people, and the size of that effect probably scales with how explicitly the expectation is conveyed and, most importantly according to Rosenthal, the extent to which the person holding it genuinely believes it. Nobody seriously doubts that expectations are powerful, that they can influence outcomes. However, I had never actually looked at Rosenthal’s original data to see just how insane his results were.
In 1968, Robert L. Thorndike (1910-90), son of the famous Edward and father of Robert M. Thorndike (one of the 52 signatories of Mainstream Science on Intelligence, the public statement defending The Bell Curve), wrote a scathing two-page review of the book Rosenthal and Jacobson published the same year, Pygmalion in the Classroom. It bears quoting at length for the comedic elements alone. Thorndike starts with an eerily prescient prediction:
The enterprise which represents the core of this document, and presumably the excuse for its publication, has received widespread advance publicity. In spite of anything I can say, I am sure it will become a classic — widely referred to and rarely examined critically. Alas, it is so defective technically that one can only regret that it ever got beyond the eyes of the original investigators! Though the volume may be an effective addition to educational propagandizing, it does nothing to raise the standards of educational research.
Thorndike then points out that all of the claims regarding “self-fulfilling prophecies” hinge on 19 children in grades 1 and 2:
The tests Rosenthal chose were published by Science Research Associates (SRA for short). These Tests of General Ability (TOGA) had two subtests: vocabulary (35 items) and reasoning (28 items).
Grades 1 and 2 were given the K-2 level of the test. Below is the first set of results for six classrooms before the experimental intervention, that is before Rosenthal lied to the teachers about who the “spurters” were. Do you notice anything? Anything that is absolutely batshit crazy?
The control group in class 1C sticks out like Lia Thomas at an NCAA swimming competition. That class consisted of 21 children, 19 of whom were in the control group. Their mean reasoning “IQ” was 30.79. I can do no better than quote Mr Thorndike’s reaction upon realising the lunacy of the data:
They just barely appear to make the grade as imbeciles! And yet these pretest data were used blithely by the authors without even a reference to these fantastic results!
If these pretest data show anything, they show that the testing was utterly worthless and meaningless.
One can obviously calculate the means for the entire first and second grade by adding the classes together. The entire first grade has a mean reasoning “IQ” of 58. As Thorndike writes:
What kind of a test, or what kind of testing is it that gives a mean "IQ" of 58 for the total entering first grade of a rather run-of-the-mill school?
The first grade consisted of 63 students. An IQ of 58 is, by almost any classification scheme, considered ‘mentally defective’. Of course, you probably noticed that Thorndike has rightly been using inverted commas whenever he writes IQ. Why?
Rather bizarrely, Rosenthal never gave raw scores, nor the ages of the children — ‘so it takes a little impressionistic estimating to try to reconstruct the picture’, Thorndike again. He estimates an average age of 6 for the first graders. This means an “IQ” of 58 would give a “mental age” of 3.5 because:
Mental Age = (IQ × Chronological Age) / 100
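As a quick check of the arithmetic, the ratio formula can be written as a small Python function. This is only a sketch: the function name is my own, and the age of 6 is Thorndike’s estimate, not a figure Rosenthal reported.

```python
def mental_age(iq: float, chronological_age: float) -> float:
    """Ratio-IQ definition rearranged: MA = IQ * CA / 100."""
    return iq * chronological_age / 100


# First grade: mean reasoning "IQ" of 58, average age of 6 (Thorndike's estimate).
first_grade_ma = mental_age(58, 6)
print(round(first_grade_ma, 2))  # 3.48, i.e. roughly 3.5 years
```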
Next step? Go to the norms tables for the TOGA and find the raw score that corresponds to a “mental age” of 3.5. Back to Thorndike’s comedy:
Alas, the norms do not go down that far! It is not possible to tell what the authors did, but finding that a raw score of 8 corresponds to an "M.A." of 5.3, we can take a shot at extrapolating downward. We come out with a raw score of approximately 2! Random marking would give 5 or 6 right!
We can only conclude that the pretest on the so-called Reasoning Test at this age is worthless.
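To see why a raw score of about 2 is so damning, compare it with what blind guessing would produce. The sketch below assumes each of the 28 reasoning items offers five answer options. That number is my assumption (it is consistent with Thorndike’s “5 or 6 right”, but it is not stated here), and the function names are hypothetical.

```python
from math import comb

N_ITEMS = 28    # length of the reasoning subtest, from the text
N_OPTIONS = 5   # ASSUMPTION: answer options per item; not stated in the source


def expected_chance_score(n_items: int, n_options: int) -> float:
    """Mean number correct if every answer is a blind guess."""
    return n_items / n_options


def prob_at_most(k: int, n_items: int, n_options: int) -> float:
    """Binomial probability that one pure guesser gets k or fewer items right."""
    p = 1 / n_options
    return sum(
        comb(n_items, i) * p**i * (1 - p) ** (n_items - i)
        for i in range(k + 1)
    )


print(expected_chance_score(N_ITEMS, N_OPTIONS))  # 5.6
print(prob_at_most(2, N_ITEMS, N_OPTIONS))
```

Under that assumption, a single child guessing blindly would score 2 or lower only about 6% of the time. A grade-wide average that low points to a problem with the testing rather than with the children.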
What about the pupils randomly chosen to be “spurters”? For class 2A, that was 6 pupils. As you saw above, their mean “IQ” before the intervention was 114.33. A year of false expectations from their teacher resulted in a new mean “IQ” of 150.17. In Thorndike’s words, ‘This looks a little high!’
Thorndike estimated an average age for the second graders of 7.5. An “IQ” of 150 implies a “mental age” of 11.25, though. Back to the norms tables we go! But it turns out that the highest entry was 10 years of age, which corresponded to a raw score of 26. If you remember, the test only had 28 items! As Thorndike asks, what of those who fall above the mean? He concludes:
In conclusion, then, the indications are that the basic data upon which this structure has been raised are so untrustworthy that any conclusions based upon them must be suspect. The conclusions may be correct, but if so it must be considered a fortunate coincidence.
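The ceiling problem with the second-grade figures can be checked by inverting the same ratio formula. The sketch below is illustrative only: the function and constant names are mine, the age of 7.5 is Thorndike’s estimate, and the norms ceiling of 10 years is the figure quoted above.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: IQ = 100 * MA / CA."""
    return 100 * mental_age / chronological_age


NORMS_CEILING_MA = 10.0  # highest mental age in the norms table, per the text
SECOND_GRADE_AGE = 7.5   # Thorndike's estimated average age

# Highest "IQ" the table could assign to a child of average second-grade age.
max_tabled_iq = ratio_iq(NORMS_CEILING_MA, SECOND_GRADE_AGE)
print(round(max_tabled_iq, 1))  # 133.3

# Mental age implied by the reported post-test mean of 150.17.
implied_ma = 150.17 * SECOND_GRADE_AGE / 100
print(implied_ma > NORMS_CEILING_MA)  # True
```

On these assumptions, a 7.5-year-old could not be assigned an “IQ” much above 133 from the published table, yet the reported group mean was 150.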
A year later, Arthur Jensen would publish his infamous 124-page paper, How Much Can We Boost IQ and Scholastic Achievement? Jensen also tore Rosenthal’s study to shreds:
Because of the questionable statistical significance of the results of this study, there may actually be no phenomenon that needs to be explained. Other questionable aspects of the conduct of the experiment make it mandatory that its results be replicated under better conditions before any conclusions from the study be taken seriously or used as a basis for educational policy. For example, the same form of the group-administered IQ test was used for each testing, so that specific practice gains were maximized. The teachers themselves administered the tests, which is a faux pas par excellence in research of this type. The dependability of teacher-administered group tests leaves much to be desired. Would any gains beyond those normally expected from general test familiarity have been found if the children’s IQs had been accurately measured in the first place by individual tests administered by qualified psychometrists without knowledge of the purpose of the experiment? These are some of the conditions under which such an experiment must be conducted if it is to inspire any confidence in its results.
In 1971, Baker and Crist reviewed all of the studies attempting to replicate Rosenthal’s work. There were nine that used IQ as the dependent variable in teacher-pupil interaction. None of them showed overall significant effects. The authors concluded that teacher expectations probably do not affect student IQ, a finding:
[…] supported by a background of decades of research suggesting the stability of human intelligence and its resistance to alterations by environmental manipulation.
In 1984, Stephen Raudenbush hammered the final nail into the coffin with a meta-analysis showing that once teachers had known their students for two weeks, the effect of a prior expectancy induction was reduced to virtually zero!
Reviewing Rosenthal’s 470-page 1975 monograph, Experimenter Effects in Behavioral Research, Gwern concluded that the level of Rosenthal’s incompetence was indistinguishable from idiocy or ideology:
How stupid do you have to be to think a child could gain 30 IQ points over a year because their teacher speaks in a warmer tone or pays slightly more attention to them? To think that a kid could go from clinical retardation to one of the brightest in the class? Apparently, Harvard professor stupid. According to one write up, Rosenthal ‘countered that even if the initial test results were faulty, that didn’t invalidate the subsequent increase, as measured by the same test.’
Of course, this was almost Thorndike’s entire point! The increase was likely wholly explained by regression to the mean. However, another possibility is that Rosenthal just manipulated the data. This would explain (a) the rather queer statistics and (b) the lack of raw scores.
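Regression to the mean is easy to demonstrate with a toy simulation. The sketch below is not a reconstruction of Rosenthal’s data: every parameter (sample size, the two standard deviations, the cut-off of 70) is made up for illustration. It only shows that an unreliable test will hand large “gains” to children who scored very low the first time, even when nothing at all is done to them.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

N_CHILDREN = 10_000
TRUE_SD = 15    # spread of stable ability (hypothetical)
NOISE_SD = 15   # measurement error of an unreliable test (hypothetical)


def noisy_score(true_ability: float) -> float:
    """One administration of an unreliable test: ability plus random error."""
    return true_ability + random.gauss(0, NOISE_SD)


abilities = [random.gauss(100, TRUE_SD) for _ in range(N_CHILDREN)]
pretest = [noisy_score(a) for a in abilities]
posttest = [noisy_score(a) for a in abilities]  # no intervention at all

# Pick the children with very low pretest scores and compare their retest.
low = [i for i, score in enumerate(pretest) if score < 70]
pre_mean = mean(pretest[i] for i in low)
post_mean = mean(posttest[i] for i in low)
print(round(pre_mean, 1), round(post_mean, 1))
```

With these made-up parameters, half of the score variance is noise, so the low-scoring group should drift roughly half the distance back toward 100 on retest without any change in underlying ability.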
As Jensen alluded to, there were so many egregious methodological errors that the room for manipulation was near infinite. More likely, however, Rosenthal neither skipped Statistics 101 nor fudged the data; he just desperately wanted the findings to be true. As I’ve written about before, ideology literally makes people illogical:
Although some left-wingers were annoyed at his results (for they implied the teacher was more to blame than something like poverty), the implications would have obviously been enormous. It would have been a major dent in 70-odd years of intelligence research.
Another defence Rosenthal used was that his research design for the experiment won an award from the American Psychological Association. I think this is what the cool kids online call “accidentally posting your Ls”. Or, in this case, the APA’s Ls.
I think the only charitable explanation was that Rosenthal was caught in the dying currents of behaviourism. If you’re interested, I wrote a piece about how dog torture experiments swept the last vestiges of behaviourist dogmas aside:
So, swept up, perhaps, in the idea that humans were input-output machines, maybe Rosenthal didn’t bother to read the intelligence literature. If this is true, it is a great irony, because the chairman of the Harvard Department of Psychology at the time Rosenthal published his nonsense was none other than the leading behaviourist Richard Herrnstein, co-author of The Bell Curve. Perhaps if Rosenthal had walked a few yards down the hallway he would have been told by Herrnstein that his findings were as likely as…well, in the words of Gwern:
It is as if Rosenthal et al had measured the heights of the students, placidly recorded that the student body included a few 1-foot-tall students, instructed the teachers in a memo to think more elevated thoughts, and returned to measure the now-8-foot-tall students — concluding they had discovered a marvellously effective and scalable intervention for literally boosting students up, surely worthy of further in-depth study…
The legend that will not die
The most annoying thing, by far, about Rosenthal’s original study is that it empowered a generation of midwits. Fuelled by truly Olympian levels of entitlement and scientific illiteracy, writers, journalists, teachers, and business execs have, for decades, referenced the importance of Rosenthal’s Pygmalion effect. In fact, just do a Google News search and you’ll see how often the original study is still trotted out to bolster some pre-made argument:
Thorndike was right: nothing one could say about this study could kill it off; it was destined to become a classic. That it exists in a modern sociology textbook without actual criticism is testimony to the fact. More bizarre still, if you look through the entire range of British pre-university textbooks for sociology and psychology (designed for 16-18 year olds), you’ll find barely any mention of intelligence research, probably the most reliable research there is in social science!
Of course, one cannot rely on teachers to correct this type of omission either, because teachers generally know nothing about intelligence or learning. Indeed, Russell Warne has found that 69.5% of American K-12 teachers believe that a good teacher could equalise differences in school success.
Who can blame the teachers, though, when they’re being trained by the likes of Philip Zimbardo? Yes, he of Stanford Prison Experiment fame, another famous study that has questionable scientific validity. Below is a video of Zimbardo praising the Rosenthal study for his Heroic Imagination Project, a non-profit research and education organisation that:
[…] conducts trainings and workshops that tackle bullying, police oversight, racism or sexism for businesses and schools across the United States and other nations around the world.
Even professors of psychiatry at Columbia apparently felt able to opine on a literature they clearly had no expertise in. I speak of Myron A. Hofer’s 1994 New York Times opinion piece on The Bell Curve, in which he rebuked the authors for not citing Rosenthal’s Pygmalion study:
Even in the book's own field, social psychology, it ignores such highly relevant work as the studies of Robert Rosenthal, now of Harvard, and others -- studies that extend over 30 years and are reflected in hundreds of published papers.
Question everything, learn basic statistics, and don’t for a second think that a credential — any credential — means much in the social sciences.
Matthew Archer is the Editor-in-Chief of Aporia Magazine.