The Stroking Community, Part 1

Grade inflation has been a fact of life at American universities for several decades. College grades are measured on a 4-point scale (A = 4, B = 3, C = 2, D = 1, F = 0). Since the 1980s, grades at a large sample of colleges and universities have increased on average by .10 to .15 points per decade. The overall grade point average now stands at about 3.15.

This would seem to imply that students have either gotten smarter or are working harder. However, verbal SAT scores of incoming students have declined sharply during this period, while math scores have remained relatively stable. There has also been a decline in the amount of time students report that they spend studying. On average, college students now claim to study only 12 to 14 hours per week. Assuming 16 hours of class time, that amounts to a work week of less than 30 hours.

More disturbing is the research of Richard Arum and Josipa Roksa. They administered the Collegiate Learning Assessment, a cognitive test measuring critical thinking, complex reasoning and writing, to 2300 students at 24 universities in their first semester and at the end of their sophomore year. They found only limited improvement (.18 of a standard deviation, on average), and no improvement at all among 45% of the students. Of the behaviors they measured, only time spent studying was associated with cognitive gains.

Beginning in the 1980s, colleges and universities entered what is sometimes called the “student-as-consumer” era. Almost all of them began routinely administering student evaluations of teaching (SETs), and basing decisions about tenure and promotion of faculty members in part on their SETs. Social psychologist Wolfgang Stroebe, in an article entitled “Why Good Teaching Evaluations May Reward Bad Teaching,” argues that SETs are responsible for some of the grade inflation. Stroebe has organized the research on SETs around two hypotheses which he calls the bias assumption and the grading leniency assumption.

The Bias Assumption

It has long been known that higher student grades are associated with better evaluations, both within and between classes. That is, within a class, the students with the highest grades give the instructor the most favorable evaluations. When you compare different classes, those with the highest average grades also have the highest average SETs. A recent meta-analysis found that grades account for about 10% of the variability in teaching evaluations.

Since these data are correlational, their meaning is ambiguous. They were initially interpreted to mean that teaching effectiveness influences both grades and evaluations. If so, SETs are a valid measure of instructional quality. Stroebe’s bias assumption states that students give favorable evaluations in appreciation for having less work to do and higher grades, and that this is an important source of bias which undermines the validity of SETs.

Over the years, this debate has been a source of animosity among college faculty. It is probably the case that SET believers receive more favorable evaluations than SET skeptics. SET believers sometimes accuse SET skeptics of making excuses for their poor student evaluations, while skeptics suggest that believers are in denial about the possibility that their high ratings are obtained in part by ingratiating their students.

The obvious—but unethical—way to test the bias hypothesis is to manipulate students’ grades in order to see what effect this has on SETs. Back in the days before ethical review of research with human subjects became routine, there were a few studies that temporarily gave students false feedback about their grades. They found that grades did affect evaluations. For example, in one study, students in two large sections of General Psychology taught by the same instructor were graded on slightly different scales. The instructor received better evaluations in the section with the more generous grading scale.

There are several other research findings that, while correlational, are consistent with the bias hypothesis.

  • In the early 2000s, Wellesley College, concerned about grade inflation, instituted a policy requiring that average grades in introductory courses be no higher than 3.33. This resulted in an immediate decline in grades. Average SETs declined significantly in the affected courses and departments.
  • Greenwald and Gillmore found that the grade a student expected affected not only ratings of teaching effectiveness, but also had significant effects on logically irrelevant factors such as ratings of the instructor’s handwriting, the audibility of his or her voice, and the quality of the classroom. This suggests that there is a general halo effect surrounding lenient instructors.
  • The website Rate My Professors (RMP) contains 15 million ratings of 1.4 million professors at 7000 colleges. Professors are rated on easiness, helpfulness, clarity, “hotness” and overall quality. Easiness—a question that is seldom asked on institutional evaluations—is defined on the website as the ability to get a high grade without working hard. RMP ratings closely match the institutional SET scores of the same professors. The two dimensions most highly correlated with overall quality are easiness (r = .62) and hotness (r = .64). Obviously, the professor’s physical attractiveness is another threat to the validity of student evaluations that is deserving of study.

In my judgment, the best test of whether teachers with high evaluations are really better teachers comes from studies that examine the effects of SETs in one course on performance in a follow-up course. For example, do students who give their Calculus I instructor a high rating do better in a Calculus II course taught by a different instructor? Stroebe found five studies using this research design. Three of them reported that those students who gave their instructors high SETs in the first course did more poorly in the follow-up course, one of them found no difference, and the fifth reported a mixture of negative and null effects depending on the item.

Merely finding no relationship between SETs in Course 1 and grades in Course 2 raises questions about the validity of SETs. The negative relationship found in the majority of these studies has a more radical implication. It implies that students learn less from those teachers to whom they give high evaluations. One of these studies found, however, that ratings of grading strictness in the first course were positively related to performance in the second.

It’s important to note that Stroebe does not claim that SETs are totally invalid as measures of teaching effectiveness, but only that they are strongly biased. Poor student evaluations can serve as a warning that faculty are not meeting their obligations. One recent study found a non-linear relationship between SETs and its measure of student learning. The students learned the most from professors whose SETs were near the middle of the distribution. They learned the least from those whose evaluations were the lowest and the highest.

There are a number of possible explanations for the bias hypothesis. One is simple reciprocity. When a professor does something nice for a student, the student returns the favor with a positive evaluation. SETs give students who are unhappy with their grades an opportunity to exact their revenge. A second explanation for the negative ratings given by students with lower grades is attributional bias. The self-serving attribution bias predicts that we maintain our self-esteem by taking personal credit for our successful behaviors but blaming our failures on external causes, such as poor teaching or unfair grading by the professor.

Please continue reading Part 2.

You may also be interested in reading:

The Stroking Community, Part 2

Asian-American Achievement as a Self-Fulfilling Prophecy

Racial Profiling in Preschool