
The Stroking Community, Part 2

Please read The Stroking Community, Part 1 before continuing.

The Grading Leniency Assumption

The evidence for the bias assumption questions the validity of SETs, but it does not, by itself, explain grade inflation. The grading leniency assumption adds that college teachers try to obtain favorable evaluations by assigning higher grades and by reducing course workloads. Stroebe cites three surveys that show that a majority of faculty believe that higher grades and lower workloads result in higher SETs. One survey published in 1980 found that 38% of faculty respondents admitted lowering the difficulty level of their courses as a result of SETs. (I’m not aware of any more recent survey which asked this question, which is unfortunate.)

It should be noted, of course, that faculty may not be aware of having changed their behavior, or they may think they have done it for other reasons. One common reason given for watering down courses is that contemporary students are unprepared for college-level work. (One former colleague, for example, said, “You have to meet students at a place where they feel comfortable.” Unfortunately, that “place” gets closer to the downtown bars with each passing year.)

Indirect evidence for the grading leniency assumption comes from student behavior. Greenwald and Gillmore note that students would ordinarily be expected to report working harder in courses in which they expect to get a higher grade. However, in a study of over 500 classes, students reported doing less work in those courses in which they expected to get a higher grade, a finding which is readily explained by the grading leniency assumption.

Finally, there are studies of the effects of grades on future course enrollment. Some universities publish average grades by course and instructor on the university's website, and it is possible to determine through computer signatures whether students have accessed this information. In two studies, consulting past grading data predicted future choices of courses and sections, with the higher-graded sections preferred by about 2 to 1. In one of these studies, the preference for easier courses was greater among low-ability students than high-ability students.

It should be noted that lowering the students’ workload not only improves faculty evaluations, it also lowers the faculty’s own workload. There are fewer of those time-consuming term papers and essay exams to grade. Instead, teachers can give the multiple-choice exams that are considerately provided free of charge by the textbook publisher.

The faculty members with the most to lose in the current environment are those who attempt to maintain high academic standards and are punished for their integrity with low student evaluations. If they don't have tenure, they could be fired. And even if they do have tenure, they are likely to be under considerable pressure from administrators to improve their evaluations.

Grade Inflation

Here’s another chart to remind you of how bad grade inflation has gotten. It shows the change over time in the frequency of letter grades.

Grade inflation is an unintended consequence of universities’ reliance on student evaluations. Can it be considered a good thing? Kohn proposes that grades serve three functions: sorting, motivation and feedback. If grades gradually lose their meaning, they become less useful as sorting criteria for employers and graduate schools and less useful as feedback to students. The students most harmed are the hard-working, high ability students who would have gotten A’s in the absence of grade inflation. They are no longer able to distinguish themselves from their more mediocre colleagues. Leading average students to believe they are doing better than they actually are could lead to unpleasant shocks after they graduate.

The motivational function of grading assumes that the rewards and punishments provided by grades induce students to work harder and learn more. But the picture that emerges from the course selection studies is one of students attempting to obtain higher grades without working for them. Stroebe suggests that grade inflation is most likely to demotivate high ability students, who might decide that studying is not worth the effort if they wind up with the same grades as their less deserving classmates.

It’s hard to see how grade inflation can be reversed. The Wellesley solution of mandating lower grades holds some promise, but only if it is adopted by almost all similar universities at about the same time; if some universities attempt to control grade inflation while others do not, their students will be at a competitive disadvantage when applying for jobs or to graduate school. Princeton initiated a similar program, but abandoned it after peer colleges failed to follow suit. There was some concern that controlling grade inflation might cause students not to come to Princeton.

A shorter-term solution is suggested by Greenwald and Gillmore. They propose that SETs be statistically corrected for the average grade in the class. Although their method is complicated, the gist of it is that if the distribution of grades in a class is lenient, SETs are reduced. If the distribution is strict, the instructor receives a bonus. Although this makes good sense to me, it’s hard to imagine a university faculty agreeing to it.
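Greenwald and Gillmore's actual model is more elaborate, but the basic idea of a grade-corrected SET can be sketched as follows. This is an illustrative reconstruction, not their published formula: it removes the portion of each class's mean evaluation that is linearly predictable from its mean grade, so leniently graded classes are revised downward and strictly graded classes receive a bonus. The data are hypothetical.

```python
def grade_adjusted_sets(class_gpas, class_sets):
    """Illustrative sketch (not Greenwald & Gillmore's actual method):
    subtract from each class's mean SET the component linearly
    predictable from its mean grade."""
    n = len(class_gpas)
    mean_gpa = sum(class_gpas) / n
    mean_set = sum(class_sets) / n
    # Slope of the least-squares regression of mean SETs on mean grades.
    sxx = sum((g - mean_gpa) ** 2 for g in class_gpas)
    sxy = sum((g - mean_gpa) * (s - mean_set)
              for g, s in zip(class_gpas, class_sets))
    slope = sxy / sxx
    # Adjusted SET = observed SET minus the grade-predicted component.
    return [s - slope * (g - mean_gpa) for g, s in zip(class_gpas, class_sets)]

# Hypothetical classes: a lenient (3.8), a middling (3.0), and a strict (2.5) grader.
adjusted = grade_adjusted_sets([3.8, 3.0, 2.5], [4.6, 4.0, 3.6])
```

After adjustment, the leniently graded class's rating falls and the strictly graded class's rating rises, while the overall mean is unchanged, which captures the penalty-and-bonus logic described above.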

The implications of this research are depressing. Students and professors are rewarding one another for working less hard. They are caught in a social trap in which short-term positive reinforcement serves to maintain behavior that has long-term negative consequences for themselves, the university and the society. Meanwhile, colleges and universities, already under financial stress, are decaying from the inside out because they are failing to meet their most basic obligation—that of helping and requiring students to learn.

You may also be interested in reading:

The Stroking Community, Part 1

Asian-American Achievement as a Self-Fulfilling Prophecy

Racial Profiling in Preschool

The Stroking Community, Part 1

Grade inflation has been a fact of life at American universities for several decades. College grades are measured on a 4-point scale (A = 4, B = 3, C = 2, D = 1, F = 0). Since the 1980s, grades at a large sample of colleges and universities have increased on average by .10 to .15 points per decade. The overall grade point average now stands at about 3.15.

This would seem to imply that students have either gotten smarter or are working harder. However, verbal SAT scores of incoming students have declined sharply during this period, while math scores have remained relatively stable. There has also been a decline in the amount of time students report that they spend studying. On average, college students now claim to study only 12 to 14 hours per week. Assuming 16 hours of class time, that amounts to a work week of less than 30 hours.

More disturbing is the research of Richard Arum and Josipa Roksa. They administered the Collegiate Learning Assessment, a cognitive test measuring critical thinking, complex reasoning and writing, to 2300 students at 24 universities in their first semester and at the end of their sophomore year. They found only limited improvement (.18 of a standard deviation, on average), and no improvement at all among 45% of the students. Of the behaviors they measured, only time spent studying was associated with cognitive gains.

Beginning in the 1980s, colleges and universities entered what is sometimes called the “student-as-consumer” era. Almost all of them began routinely administering student evaluations of teaching (SETs), and basing decisions about tenure and promotion of faculty members in part on their SETs. Social psychologist Wolfgang Stroebe, in an article entitled “Why Good Teaching Evaluations May Reward Bad Teaching,” argues that SETs are responsible for some of the grade inflation. Stroebe has organized the research on SETs around two hypotheses which he calls the bias assumption and the grading leniency assumption.

The Bias Assumption

It has long been known that higher student grades are associated with better evaluations, both within and between classes. That is, within a class, the students with the highest grades give the instructor the most favorable evaluations. When you compare different classes, those with the highest average grades also have the highest average SETs. A recent meta-analysis found that grades account for about 10% of the variability in teaching evaluations.

Since these data are correlational, their meaning is ambiguous. They were initially interpreted to mean that teaching effectiveness influences both grades and evaluations. If so, SETs are a valid measure of instructional quality. Stroebe’s bias assumption states that students give favorable evaluations in appreciation for having less work to do and higher grades, and that this is an important source of bias which undermines the validity of SETs.

Over the years, this debate has been a source of animosity among college faculty. It is probably the case that SET believers receive more favorable evaluations than SET skeptics. SET believers sometimes accuse SET skeptics of making excuses for their poor student evaluations, while skeptics suggest that believers are in denial about the possibility that their high ratings are obtained in part by ingratiating themselves with their students.

The obvious—but unethical—way to test the bias hypothesis is to manipulate students’ grades in order to see what effect this has on SETs. Back in the days before ethical review of research with human subjects became routine, there were a few studies that temporarily gave students false feedback about their grades. They found that grades did affect evaluations. For example, in one study, students in two large sections of General Psychology taught by the same instructor were graded on slightly different scales. The instructor received better evaluations in the section with the more generous grading scale.

There are several other research findings that, while correlational, are consistent with the bias hypothesis.

  • In the early 2000s, Wellesley College, concerned about grade inflation, instituted a policy requiring that average grades in introductory courses be no higher than 3.33. This resulted in an immediate decline in grades. Average SETs declined significantly in the affected courses and departments.
  • Greenwald and Gillmore found that the grade a student expected affected not only ratings of teaching effectiveness, but also had significant effects on logically irrelevant factors such as ratings of the instructor’s handwriting, the audibility of his or her voice, and the quality of the classroom. This suggests that there is a general halo effect surrounding lenient instructors.
  • The website Rate My Professors (RMP) contains 15 million ratings of 1.4 million professors at 7000 colleges. Professors are rated on easiness, helpfulness, clarity, “hotness” and overall quality. Easiness—a question that is seldom asked on institutional evaluations—is defined on the website as the ability to get a high grade without working hard. RMP ratings closely match the institutional SET scores of the same professors. The two dimensions most highly correlated with overall quality are easiness (r = .62) and hotness (r = .64). Obviously, the professor’s physical attractiveness is another threat to the validity of student evaluations that is deserving of study.

In my judgment, the best test of whether teachers with high evaluations are really better teachers comes from studies that examine the effects of SETs in one course on performance in a follow-up course. For example, do students who give their Calculus I instructor a high rating do better in a Calculus II course taught by a different instructor? Stroebe found five studies using this research design. Three of them reported that students who gave their instructors high SETs in the first course did more poorly in the follow-up course, one found no difference, and the fifth reported a mixture of negative and null effects depending on the item.

Merely finding no relationship between SETs in Course 1 and grades in Course 2 raises questions about the validity of SETs. The negative relationship found in the majority of these studies has a more radical implication. It implies that students learn less from those teachers to whom they give high evaluations. One of these studies found, however, that ratings of grading strictness in the first course were positively related to performance in the second.
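The logic of this two-course design amounts to correlating the ratings students gave their first instructor with the grades they later earned under a different instructor. The sketch below uses invented records (the numbers are made up to show what a negative relationship looks like, not data from any of the cited studies):

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Invented records: (SET rating given in Calculus I, grade earned in Calculus II).
records = [(5, 2.7), (5, 3.0), (4, 3.0), (3, 3.3), (2, 3.7), (2, 3.5)]
r = pearson_r([s for s, _ in records], [g for _, g in records])
```

A negative r in this design means that students who rated the first instructor highly went on to do worse in the second course, the pattern Stroebe reports in three of the five studies.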

It’s important to note that Stroebe does not claim that SETs are totally invalid as measures of teaching effectiveness, but only that they are strongly biased. Poor student evaluations can serve as a warning that faculty are not meeting their obligations. One recent study found a non-linear relationship between SETs and its measure of student learning: students learned the most from professors whose SETs were near the middle of the distribution, and the least from those whose evaluations were the lowest and the highest.

There are a number of possible explanations for the bias. One is simple reciprocity: when a professor does something nice for a student, the student returns the favor with a positive evaluation, and conversely, SETs give students who are unhappy with their grades an opportunity to exact their revenge. A second explanation for the negative ratings given by students with lower grades is attributional bias. The self-serving attribution bias predicts that we maintain our self-esteem by taking personal credit for our successes but blaming our failures on external causes, such as poor teaching or unfair grading by the professor.

Please continue reading Part 2.

You may also be interested in reading:

The Stroking Community, Part 2

Asian-American Achievement as a Self-Fulfilling Prophecy

Racial Profiling in Preschool