CODAP Interrater Reliability

Last updated Jan 10, 1997

Interrater reliability is a measure of the agreement among raters. The most commonly rated factors are task learning difficulty and training emphasis. The degree of "true" agreement among raters depends on the amount of error, both systematic and random, in the ratings. These types of error occur, for example, when a rater rates carelessly or inadvertently inverts the rating scale, or when a group of raters generally agrees on the ratings of some tasks but disagrees on others (Goody, 1976).

GRPREL was developed to identify and remove divergent raters or tasks (Staley and Weissmuller, 1981). Removal is optional; when it is exercised, GRPREL recomputes agreement after the divergent raters or items have been removed. The removal option should not be exercised in cases where divergence represents "true" differences among raters. Normally, a rater is considered to be unacceptably divergent when the rater's correlation with the mean task ratings for all raters is not significantly greater than zero at α = .05 (one-tailed test). A task is considered to be divergent if the standard deviation of its ratings is greater than some specified value.
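The two divergence rules described above can be illustrated with a short Python sketch. This is not GRPREL code: the function names, the dictionary layout (rater name mapped to a list of task ratings), the use of the sample standard deviation, and the assumption that every rater rated every task are all choices made for illustration. The one-tailed probability is taken from the usual Student t test for a correlation with n - 2 degrees of freedom, integrated numerically so that only the standard library is needed.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))


def upper_tail_p(r, n, steps=2000):
    """One-tailed P(T >= t) for t = r * sqrt(df / (1 - r^2)), df = n - 2.

    Substituting t = sqrt(df) * tan(theta) turns the Student t integral into
    K * integral of cos(theta)**(df - 1) from 0 to asin(r), which is
    evaluated here with Simpson's rule (steps must be even).
    """
    df = n - 2
    if df < 1:
        raise ValueError("need at least 3 tasks")
    k = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    theta0 = math.asin(r)
    h = theta0 / steps
    total = 1.0 + math.cos(theta0) ** (df - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    return min(1.0, max(0.0, 0.5 - k * total * h / 3))


def task_means(ratings):
    """Mean rating of each task over all raters (complete data assumed)."""
    return [mean(col) for col in zip(*ratings.values())]


def divergent_raters(ratings, alpha=0.05):
    """Raters whose correlation with the task means is not significantly > 0."""
    means = task_means(ratings)
    n = len(means)
    return [name for name, row in ratings.items()
            if upper_tail_p(pearson_r(row, means), n) >= alpha]


def divergent_tasks(ratings, max_sd):
    """Indices of tasks whose rating standard deviation exceeds max_sd."""
    return [i for i, col in enumerate(zip(*ratings.values()))
            if stdev(col) > max_sd]
```

For example, with four hypothetical raters where one has inverted the scale, `divergent_raters({"A": [1, 2, 3, 4, 5, 6], "B": [2, 2, 4, 4, 6, 6], "C": [1, 3, 3, 5, 5, 7], "D": [6, 5, 4, 3, 2, 1]})` flags only rater D.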

GRPREL can adjust to accommodate various kinds of scales. For example, the training emphasis scale is considered to be an absolute scale. Hence, for training emphasis, ratings from 0 through 9 are allowed, and mean values (as well as agreement) are computed using 0 as a valid response. The task learning difficulty scale, however, is considered to be a relative scale with a rating range of 1 to 9. In this case, if raters were following directions, one would expect each rater's "mean" to be 5.0, the mid-point of the scale. Raters, however, do not rate all items, which a mean of 5.0 assumes. In addition, some raters are "hard" (i.e., give below average ratings), and other raters are "easy" (i.e., give above average ratings). The rater mean adjustment to the "SAMPLE" mean compensates for both rater subset effects and rater leniency. The overall rater mean and/or standard deviation can also be changed to a specified value. Similarly, standardized task means and standard deviations can be specified.
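The text does not give the formula GRPREL uses for the rater mean adjustment, so the following Python sketch shows only one plausible reading of it: each rater's ratings are shifted by a constant so that the rater's mean equals the all-rater ("sample") mean of the tasks that rater actually rated. The function name, the data layout, and the use of None for an unrated task are assumptions made for illustration.

```python
from statistics import mean


def adjust_to_sample_mean(ratings):
    """Shift each rater's ratings toward the sample mean of the tasks rated.

    ratings maps a rater name to a list of ratings, one per task, with None
    where the rater skipped the task. This is an illustrative reading of the
    adjustment, not GRPREL's documented formula.
    """
    n_tasks = len(next(iter(ratings.values())))

    # Sample mean of each task over the raters who rated it.
    sample_means = []
    for t in range(n_tasks):
        values = [row[t] for row in ratings.values() if row[t] is not None]
        sample_means.append(mean(values) if values else None)

    adjusted = {}
    for name, row in ratings.items():
        rated = [t for t in range(n_tasks) if row[t] is not None]
        if not rated:
            adjusted[name] = list(row)  # nothing to adjust
            continue
        rater_mean = mean(row[t] for t in rated)
        # Comparing against the same subset of tasks removes the subset effect.
        target = mean(sample_means[t] for t in rated)
        shift = target - rater_mean
        adjusted[name] = [None if v is None else v + shift for v in row]
    return adjusted
```

Under this reading, a "hard" rater who gave 2 and 3 to two tasks whose sample means are 4 and 5 is shifted up by 2 points, and the unrated third task stays unrated.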

The number of iterations performed to remove divergent raters or tasks can be controlled by the user. There is a field that specifies the maximum number of iterations to perform, regardless of any other factors. The user can raise the default values if additional iterations are desired. In general, a minimum reliability coefficient (rkk) of .90 for the rater composite is required for a test (or set of ratings) to be considered a reliable measuring instrument. The rater correlation table lists removal recommendations for raters whose probability or correlation values are too low. The user can change the default minimums for these values to reduce the number of raters that would be deleted.
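The iteration control described above can be sketched in Python as follows. The sketch makes several assumptions that the text does not confirm: rkk is computed with the common two-way analysis-of-variance intraclass formula, (MS tasks - MS error) / MS tasks; raters are flagged only on a minimum correlation with the task means (the parameter min_r is hypothetical); and every rater rated every task. The function names are illustrative and are not GRPREL options.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def composite_reliability(ratings):
    """rkk of the rater composite from a tasks-by-raters ANOVA (assumed form)."""
    rows = list(ratings.values())
    k, n = len(rows), len(rows[0])          # k raters, n tasks
    grand = mean(v for row in rows for v in row)
    t_means = [mean(col) for col in zip(*rows)]
    r_means = [mean(row) for row in rows]
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_tasks = k * sum((m - grand) ** 2 for m in t_means)
    ss_raters = n * sum((m - grand) ** 2 for m in r_means)
    ms_tasks = ss_tasks / (n - 1)
    ms_error = (ss_total - ss_tasks - ss_raters) / ((n - 1) * (k - 1))
    return 0.0 if ms_tasks == 0 else (ms_tasks - ms_error) / ms_tasks


def iterate_removal(ratings, min_r, max_iterations):
    """Remove low-correlation raters, at most max_iterations times.

    Returns the retained ratings and a per-iteration history of
    (rkk before removal, raters flagged in that iteration).
    """
    current = dict(ratings)
    history = []
    for _ in range(max_iterations):
        t_means = [mean(col) for col in zip(*current.values())]
        flagged = [name for name, row in current.items()
                   if pearson_r(row, t_means) < min_r]
        history.append((composite_reliability(current), flagged))
        # Stop when nobody is flagged or fewer than two raters would remain.
        if not flagged or len(current) - len(flagged) < 2:
            break
        for name in flagged:
            del current[name]
    return current, history
```

With three consistent raters and one rater who inverted the scale, the first iteration reports a very low rkk and flags the inverted rater; the second iteration, run on the remaining three, reports an rkk above .90 and flags no one, so the loop stops before the maximum is reached.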

GRPREL also creates task factors that can be used later in the CODAP system. For example, if the ratings are for task learning difficulty, this factor can be printed along with incumbent job description factors to show the relationship of learning difficulty of tasks to performance levels of tasks for any groups of interest. The GRPREL report lists all tasks with their associated means and standard deviations, and prints interrater reliability statistics for each iteration, a summary statistics table, and a rater correlation table with statistics and removal recommendations for each rater.
