Some thoughts on evaluation framing, based on my academic experience with REF2021


As an academic, you are frequently required to evaluate other people’s work; for obvious reasons, your own work is also constantly being assessed. Having completed quite a few evaluation tasks in the last few months, and having received feedback on my own work, I have developed some thoughts that I will try to organise here.

These thoughts mainly concern a hypothesis, or hunch, I have developed: that framing an evaluation in ways that seek to obtain more information about the reasons for a specific ‘grade’ can diminish the quality of the evaluation, because it allows the evaluator to externalise the uncertainty implicit in the qualitative judgement. Let me try to explain.

Some background

Academic evaluations come in different flavours and colours. There is the rather obvious assessment of students’ work (ie marking). There are also the well-known peer-review assessments of academic papers. The scales used for these types of evaluation (eg 1 to 10 for students, or a four-point scale of rejection / major corrections (aka revise & resubmit) / minor corrections / acceptance for papers) are generally well known in each relevant context, and can be applied with varying degrees of opacity of the reviewers’ and the reviewees’ identities.

There are perhaps less well-known evaluations of colleagues’ work for promotion purposes, as well as the evaluation of funding proposals or the assessment of academic outputs for other (funding-related) purposes, such as the REF2021 in the case of English universities.

The REF2021 provides the framework in which I can structure my thoughts more easily.

Internal REF2021 evaluations

REF2021 is an exercise whereby academic outputs (among other things) are rated on a five-point scale, from unclassified (= 0) to a maximum of 4*. The rating is supposed to be based on a holistic assessment of the three notoriously porous (let’s say) concepts of ‘originality, significance and rigour’, although there are lengthy explanations of their intended interpretation.

The difficult evaluation dynamic that the REF2021 has generated is a guessing game whereby universities try to identify which of the works produced by their academics (during the eligible period) are most likely to be ranked at 4* by the REF panel, as that is where the money is (and perhaps more importantly, the ‘marker of prestige’ that is supposed to follow the evaluation, which in turn feeds into university rankings… etc).

You would think that the relevant issue when asked to assess a colleague’s work for ‘REF purposes’ (whether anonymously or not; let’s leave that aside) would be for you to express your academic judgement in the same way as the experts on the panel will: that is, by giving it a mark from 0 to 4*. That gives you five evaluation steps and you need to place the work in one of them. This is very difficult, and a mix of conflicting loyalties, relative expertise gaps, etc will shape that decision. That is why the evaluation is carried out by (at least) two independent evaluators, with possible intervention of a third (or more) in case of significant discrepancies.
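To make the mechanics concrete, here is a minimal sketch of what such an internal moderation rule might look like. It is purely illustrative: the one-point discrepancy threshold, the averaging rule and the use of the median are my own assumptions, not the REF2021 rules or any particular institution’s procedure.

```python
# Hypothetical sketch of an internal moderation rule: two independent ratings
# on the 0-4* scale, with a third evaluator called in when they diverge.
# The threshold and the tie-breaking rules are illustrative assumptions only.

def moderate(rating_a: int, rating_b: int, call_third_evaluator) -> float:
    """Return an agreed internal rating for one output."""
    if abs(rating_a - rating_b) <= 1:
        # Small gap: settle on a single figure (here, the average).
        return (rating_a + rating_b) / 2
    # Significant discrepancy: escalate to a third, independent evaluator
    # and let the median of the three ratings decide.
    rating_c = call_third_evaluator()
    return sorted([rating_a, rating_b, rating_c])[1]

# Example: the evaluators disagree (4* vs 2*), so a third opinion (3*) settles it.
print(moderate(4, 2, call_third_evaluator=lambda: 3))  # -> 3
```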

Having to choose a specific rating between 0 and 4* forces the evaluator to internalise any uncertainties in its decision. This is a notoriously invidious exercise and the role of internal REF evaluator is unenviable.

It also creates a difficulty for decision-makers tasked with establishing the overall REF submission: in the best-case scenario, they have to choose the ‘best 4*’ outputs from a pool of academic outputs internally assessed at 4* that exceeds the maximum allowed submissions. Decision-makers have nothing but the rating (4*) on which to base that choice. So it is tempting to introduce additional mechanisms to gather more information from the internal assessors in order to perform comparisons.
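The following toy example illustrates the selection problem; the output names and the submission cap are invented, and no real selection process is this mechanical.

```python
# Illustrative only: with nothing but the internal rating to go on, a pool of
# outputs all assessed at 4* cannot be ranked any further, however it is sorted.
pool = [
    {"output": "Article A", "rating": 4},
    {"output": "Article B", "rating": 4},
    {"output": "Article C", "rating": 4},
]
max_submissions = 2  # invented cap for the example

# Sorting by rating changes nothing: every ordering of the ties is as good as
# any other, which is what tempts decision-makers to ask for more detail.
shortlist = sorted(pool, key=lambda o: o["rating"], reverse=True)[:max_submissions]
print([o["output"] for o in shortlist])  # an essentially arbitrary pick
```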

Change in the evaluation framing

Some of the information the decision-makers would want to gather concerns ‘how strong’ the rating given by the evaluator is, with more granularity. One temptation is to transform the 5-point scale (0 to 4*) into a 9- (or even 10-) point scale by halving each step (0, 0.5*, 1* and so on up to 4*, or even 4.5* or 4*+), and there are, of course, possibilities to create more steps. Another temptation is to disaggregate the rating and ask for separate marks for each of the criteria (originality, significance and rigour), with or without an overall rating.

Along the same lines, the decision-makers may also want to know how confident the evaluator is of its rating. This can be captured through narrative comments, or by asking the evaluator to indicate its confidence on a scale (from low to high confidence, with as many intermediate steps as you can imagine). While all of this may create more information about the evaluation process (as well as fuel the indecision or overconfidence of the evaluator, as the case may be), I would argue that it does not result in a better rating for the purposes of the REF2021.

A more complex framing of the decision allows the evaluator to externalise the uncertainty in its decision, in particular by allowing it to avoid hard choices through the use of ‘boundary steps’ in the evaluation scale, as well as by disclosing its level of confidence in the rating. When a 4* that had ‘only just made it’ in the mind of the evaluator morphs into a 3.5* with a moderate to high level of confidence and a qualitative indication that the evaluation could be higher, the uncertainty falls squarely on the decision-maker and not the evaluator.
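A sketch of the shift, under assumed field names and made-up weights: in the forced-commitment frame the evaluator returns a bare figure; in the richer frame they return a record whose interpretation is left open, so the decision-maker has to invent a rule of their own to collapse it back into something comparable.

```python
# Forced-commitment frame: the evaluator swallows the doubt and commits.
committed = 4

# Richer frame: the same doubt is handed over as extra information.
hedged = {
    "rating": 3.5,
    "confidence": "moderate-high",
    "note": "could arguably be a 4*",
}

# To compare outputs, the decision-maker now needs their own collapsing rule;
# the weights below are arbitrary, which is precisely the point.
confidence_weight = {"low": 0.8, "moderate-high": 0.95, "high": 1.0}
collapsed = hedged["rating"] * confidence_weight[hedged["confidence"]]
print(collapsed)  # 3.325 -- a figure the evaluator never actually gave
```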

Leaving aside other important governance reasons that need not worry us now, this is problematic in the specific REF2021 setting because of the need to reconcile more complex internal evaluations with the narrower and more rigid criteria to be applied by the external evaluators. Decision-makers faced with the task of identifying the specific academic outputs to be submitted need to deal with the uncertainty externalised by the evaluators, which creates an additional layer of uncertainty, in particular because not all evaluators will provide homogeneous (additional) information (think, eg, of the self-assessment of the degree of confidence).

I think this also offers broader insights into the different ways in which the framing of the evaluation affects it.

Tell me the purpose and I’ll tell you the frame

I think that one of the insights that can be extracted is that the framing of the evaluation changes the process, and that different frames should be applied depending on the main purpose of the exercise, beyond reaching the best possible evaluation (as that depends on issues of expertise that do not necessarily change with the framing).

Where the purpose of the exercise is to extract the maximum information from the evaluator in a standardised manner, an evaluation frame that forces commitment to one of a limited range of possible outcomes seems preferable, because it internalises the uncertainty in the agent that can best assess it (ie the evaluator).

Conversely, where the purpose of the exercise is to monitor the way the evaluator carries out its assessment, then a frame that generates additional information can enhance oversight, but it also generates fuzziness in the grades. It can also create a different set of incentives for the evaluator, depending on additional circumstances, such as the identification of the evaluator and/or the author of the work being evaluated, whether this is a repeated game (thus triggering reputational issues), etc.

Therefore, where the change of frame alters the dynamics and the outputs of the evaluation process, there is a risk that an evaluation system initially designed to extract expert judgment ends up being perceived as a mechanism to judge the expert. The outcomes cannot be expected to simply improve, despite the system becoming (apparently) more decision-maker friendly.