Assessing faculty performance: are student ratings the best source of data? (nerdy teaching post)

I wrote something for a blog at work but it was hidden behind a single sign-on so I’m re-publishing it here. It has nothing to do with food or travel or anything lighthearted and fun, but I slogged through a lot of statistics descriptions in the course of producing this and I need to share it far and wide so I feel my pain was worthwhile!

Last autumn, a comprehensive meta-analysis of student evaluation of teaching (SET) ratings received widespread attention after laying bare the many flaws present in previous studies that had sought to relate SET to student learning. Those previous examinations (notably Cohen’s 1981 paper ‘Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies’, Feldman’s 1989 ‘The association between student ratings of specific instructional dimensions and student achievement: Refining and extending the synthesis of data from multisection validity studies’, and Clayson’s 2009 study ‘Student evaluations of teaching: Are they related to what students learn? A meta-analysis and review of the literature’) suffer from a variety of methodological shortcomings. These include inadequate description of literature search techniques and parameters, small sample size effects, an inappropriate admixture of data with and without corrections for certain factors, and ‘voodoo correlations’ (impossibly high correlations that are merely an artefact rather than a reflection of reality).
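To get a feel for the small-sample problem in particular, here is a quick simulation sketch (my own illustration, not drawn from any of the papers above; the study count and sections-per-study are invented numbers). With only a handful of course sections per study, section-level correlations between average SET and average achievement scatter so widely that a fair fraction of studies will report a large correlation even when the true relationship is zero:

```python
# Rough sketch (not from Uttl et al.): how small multisection studies can
# throw up large SET-vs-achievement correlations purely by chance.
import numpy as np

rng = np.random.default_rng(0)
n_studies = 10_000   # simulated multisection validity studies (made-up number)
n_sections = 8       # course sections per study -- deliberately small

observed_r = []
for _ in range(n_studies):
    set_ratings = rng.normal(size=n_sections)   # mean SET rating per section
    achievement = rng.normal(size=n_sections)   # mean exam score per section, independent of SET
    observed_r.append(np.corrcoef(set_ratings, achievement)[0, 1])

observed_r = np.array(observed_r)
print(f"true r = 0, yet {np.mean(np.abs(observed_r) > 0.5):.0%} of studies show |r| > 0.5")
```

Pooling studies like these without accounting for their size is exactly the sort of thing that can make an average correlation look far more impressive than it really is.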

The authors of the new work, Bob Uttl (Mount Royal University), Carmela White (University of British Columbia), and Daniela Gonzalez (University of Windsor), discount each of these studies after repeating them from scratch, and then present their own painstakingly performed meta-analyses to support their hypothesis that ‘students do not learn more from professors who receive higher SET ratings’. They end the paper with the following:

…universities and colleges may need to give appropriate weight to SET ratings when evaluating their professors. Universities and colleges focused on student learning may need to give minimal or no weight to SET ratings. In contrast, universities and colleges focused on students’ perceptions or satisfaction rather than learning may want to evaluate their faculty’s teaching using primarily or exclusively SET ratings, emphasize to their faculty members the need to obtain as high SET ratings as possible (i.e., preferably the perfect ratings), and systematically terminate those faculty members who do not meet the standards. For example, they may need to terminate all faculty members who do not exceed the average SET ratings of the department or the university, the standard of satisfactory teaching used in some departments and universities today despite common sense objections that not every faculty member can be above the average.

This statement is, presumably, intended to be deliberately provocative, but nevertheless highlights the intense pressure that SETs place on the modern academic. This theme is also explored in a qualitative analysis performed by Henry Hornstein (University of Hong Kong) in his review paper ‘Student evaluations of teaching are an inadequate assessment tool for evaluating faculty performance’. Hornstein’s work was published a few months before the paper by Uttl et al., but did not receive similar media attention. This is understandable, since it is a review article, but it’s also a shame because Hornstein’s work acts as an excellent companion piece and provides a more in-depth examination of why we should approach SETs with caution.

As Hornstein points out, SETs were originally used (in the 1970s) in a formative way, to help lecturers understand which aspects of their teaching might require improvement. Because the data were so easy to collect, however, SETs became increasingly popular as a method of providing administrators with the sort of snapshot of activities that could prove useful when making decisions about employment. Hornstein writes, ‘…the persistent practice of using student evaluations as summative measures to determine decisions for retention, promotion, and pay for faculty members is improper and depending on circumstances could be argued to be illegal.’

In particular, Hornstein highlights three main problems with the use of SETs:

  1. Measurement. This includes not only the statistical issues described by Uttl et al. above, but also difficulties associated with the fact that SETs typically involve qualitative data (usually collected by offering a range of categories along a Likert scale, e.g., ‘unacceptable’, ‘satisfactory’, ‘very good’, etc.) that are then converted into quantitative data (e.g., unacceptable = 1, satisfactory = 3, and very good = 5). The numerical distance between ‘unacceptable’ and ‘very good’ has no real-world meaning, for example, so it is difficult to interpret what these selections say about actual teaching performance (a small illustration follows this list). Further, Hornstein writes, ‘it is not possible to interpret average scores of categories [because] the categories are not truly ordinal’, and yet this is the very information being used to make important strategic decisions. Like Uttl et al., Hornstein points out that ‘[administrators’] reasoning seems to be based on the improbable assumption that all of their faculty members should be above average in all categories.’
  2. Validity of student assessment. Hornstein cites a range of studies that provide evidence that students may not be ‘dispassionate evaluators of instructor performance’. For one thing, even the pedagogical literature does not yield agreement on what constitutes ‘effective teaching’, so it seems inappropriate to ask relatively untrained undergraduate students to assess it. Rather, he writes, ‘students can reliably speak about their experience in a course, including factors that ostensibly affect teaching effectiveness such as audibility of the instructor, legibility of instructor notes, and availability of the instructor for consultation outside of class’. This is not the same thing as being able to ‘evaluate outside their experience’, e.g. determining whether instructors are truly knowledgeable within their field, or are well versed in, and demonstrative of, accepted good practice in learning and teaching.
  3. Response rates and satisfaction. As pointed out by a number of previous authors, SETs may reflect student satisfaction more than anything else; as a result, ratings are often given only by the students who feel most excited by or upset about their learning experience. Further, students may fixate on particular attributes (a perception of career preparation, relevance, and innovation, as well as factors such as classroom facilities, which are usually out of the instructors’ hands) that do not directly describe lecturers’ abilities. If students are mainly focused on getting good grades and not having to work too hard, as studies suggest many are, then lecturers who create the most challenging and educational environments may actually receive the lowest SETs. Perhaps worst of all, SETs are known to be biased against particular types of instructor, women especially. Hornstein writes, ‘…gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors’.
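To make the measurement concern in item 1 concrete, here is a small sketch (my own example, not from Hornstein; the category-to-number coding is an assumption for illustration only). Two classes with completely different response patterns can produce identical average ‘scores’ once ordinal labels are coded as if they were evenly spaced numbers:

```python
# Rough sketch (not from Hornstein): averaging coded Likert labels treats
# ordinal categories as interval data and hides the response pattern.
from collections import Counter
from statistics import mean

codes = {"unacceptable": 1, "satisfactory": 3, "very good": 5}  # assumed coding, for illustration

class_a = ["unacceptable"] * 10 + ["very good"] * 10  # sharply divided class
class_b = ["satisfactory"] * 20                       # uniformly middling class

print(mean(codes[r] for r in class_a))  # both averages come out at 3...
print(mean(codes[r] for r in class_b))
print(Counter(class_a))                 # ...despite very different experiences
print(Counter(class_b))
```

Reporting the distribution of responses, rather than a single mean, keeps the information that the average throws away.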

On the basis of these serious issues, Hornstein states that ‘the conservative and more appropriate approach is to question the validity of SET for all summative purposes’; otherwise we risk alienating good lecturers who have been on the receiving end of SET bias, and inappropriately encouraging academics to put on a performance just to make students happy rather than choosing the pedagogical practices best suited to a stimulating and effective educational environment. He also has an answer for how to assess good teaching in the absence of SET:

If one truly wants to understand how well someone teaches, observation is necessary. In order to know what is going on [in] the classroom, observation is necessary. In order to determine the quality of instructors’ materials, observation is necessary. Most of all, if the actual desire is to see improvement in teaching quality, then attention must be paid to the teaching itself, and not to the average of a list of student-reported numbers that bear at best a troubled and murky relationship to actual teaching performance. University faculty benefits most from visiting each other’s classrooms and looking at others’ teaching materials routinely. Learning can occur from one another, exchanging pedagogical ideas and practices.

Again, the strength of the author’s language attests to the passion that many lecturers feel about the evaluation of teaching. Hornstein’s recommendation will hopefully resonate with my University of Exeter colleagues who participate in the institution’s Annual Review of Teaching — a practice that many may find onerous to organize beforehand but beneficial to discuss afterwards. Demonstrating and discussing best practice with colleagues is an essential part of a well-rounded reflective teaching practice (and is one of the ‘four lenses’ advocated by Brookfield); as indicated by the work of Hornstein and Uttl et al., student feedback alone may not always tell the full story, and so it can be helpful and encouraging to also hear what colleagues have to say.

Hornstein notes that SETs can be a useful way of helping students feel engaged in their own education; rather than discounting the importance of the student voice, he questions the way in which that voice is recorded. Universities are increasingly exploring ways of empowering students to work side-by-side with academics in shaping their own learning process, as demonstrated by the growing importance of experiential learning activities such as engagement in research projects and flipped classrooms in which students teach their peers. Work like that by Hornstein and Uttl et al. should encourage institutions to build on these positive advances and find more equitable, accurate, and beneficial tools for measuring the student learning experience. That would also be better for students, as it rewards the best educational practices and supports the development of staff who are not quite up to snuff. Ideally, these data would then be used alongside colleague observations to produce more comprehensive, constructive evaluations, hopefully leading to ever more effective learning environments.