Peer Assessment for Massive Open Online Courses (MOOCs)

The teach-learn-assess cycle in education is broken in a typical massive open online course (MOOC). Without formative assessment and feedback, MOOCs amount to an information dump or a broadcasting show, not an educational experience. A number of remedies have been attempted to bring formative assessment back into MOOCs, each with its own limits and problems. The most widely applicable approach for all MOOCs to date is to use peer assessment to provide the necessary feedback. However, unmoderated peer assessment results suffer from a lack of credibility. Several methods are available today to improve the accuracy of peer assessment results, and some combination of these methods may be necessary to make peer assessment results sufficiently accurate to be useful for formative assessment. Such results can also help to facilitate peer learning and online discussion forums, and may possibly augment summative evaluation for credentialing.


Feedback and assessment in open and distance learning are inherently difficult to begin with (Chaudhary & Dey, 2013; Letseka & Pitsoe, 2013; Suen & Parkes, 1996). The reduction of individual feedback from, and interaction with, instructors becomes extreme in MOOCs. Due to the scale of MOOCs, feedback to individual students from the instructor has become virtually impossible. Yet, teaching without assessing whether the student has learned, and without giving students feedback as to whether they have indeed learned the material correctly, amounts to a one-way information dump or broadcast, not education. A MOOC in that form would be essentially no different from the thousands of free how-to YouTube videos on the internet or the various free instructional videos provided by the Khan Academy (http://khanacademy.org), and it cannot be considered a complete teaching-learning experience.

Attempts at Remedies
The teach-learn-assess cycle is essentially broken in a MOOC. Various attempts have been or are being made to re-introduce some degree of formative assessment feedback into the process to prevent it from becoming a one-way information dump or broadcasting show. Additionally, many developments in ICT have enabled feedback and assessment activities analogous to those in a traditional classroom. However, only a limited subset of these methods and technologies is applicable at the scale of MOOCs.
In terms of assessment, some MOOCs offer online multiple-choice quizzes that are machine-scored as progress checks and feedback to students. At the end of each instructional module, a number of multiple-choice questions would be posed to the student. These questions are intended to gauge the student's mastery of the concepts and other content covered in that module. The scores on these tests would indicate whether the student has adequately learned the material, and the scores are given to the student as feedback. Students who do not do well would be encouraged to return to the previous module to review the materials. This approach is basically an online version of the old programmed-learning approach (Bloom, 1971; Skinner, 1968), which was briefly popular in the 1960s and 70s and is quite limited in applicability: it is appropriate only for course content in which the ability to recall or differentiate concepts, or to interpret or extract information from text or graphics related to the subject matter, is the only important instructional objective. It is also challenging for most instructors to develop good-quality multiple-choice test items that measure high-level cognition such as analysis, synthesis, and evaluation in Bloom's taxonomy. Nor is it appropriate for courses in which the desired evidence of learning is to have students demonstrate an ability to produce original work, such as a project, an artifact, or a written report.
To provide feedback to students in general, in some cases instructors would provide answers to a limited number of the most popular questions posted in the MOOC online discussion forum. The popularity of each question is often determined by a system of like/dislike votes similar to that used on Facebook. This, of course, is quite far from providing individual formative feedback and leaves the overwhelming majority of student questions unanswered. For the majority of the students, formative assessment and feedback would still be missing.
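For concreteness, the popularity-based triage just described amounts to sorting forum questions by their net vote count, as in the short sketch below; the data structure and field names are assumptions for illustration and do not come from any particular MOOC platform.

    def rank_questions_by_popularity(questions, top_n=5):
        """Return the top_n forum questions sorted by net votes (likes minus dislikes)."""
        return sorted(questions,
                      key=lambda q: q["likes"] - q["dislikes"],
                      reverse=True)[:top_n]

    # Hypothetical forum data; question texts and vote counts are illustrative only.
    forum = [
        {"text": "How is the rubric applied to essays?", "likes": 41, "dislikes": 3},
        {"text": "Is late submission allowed?", "likes": 12, "dislikes": 1},
        {"text": "Why is my quiz score not updating?", "likes": 2, "dislikes": 0},
    ]

    # Only the most "popular" questions would receive an instructor answer.
    for question in rank_questions_by_popularity(forum, top_n=2):
        print(question["text"])

Whatever the vote threshold, such a scheme answers questions by popularity rather than by the individual learner's need, which is the limitation noted above.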
One solution that has emerged to address both the problem of the lack of formative feedback and that of a lack of revenue stream for the investment of resources in the development of MOOCs is to place MOOCs within a blended learning or flipped learning structure. This approach would have students view the contents of a MOOC on their own and at their own pace. After learning the materials via the MOOC, they would attend local brick-and-mortar classes in which they would do assignments and participate in discussions with local instructors. While the MOOC portion may be free, the face-to-face sessions would be fee-based. Georgia Institute of Technology in the United States, for instance, has initiated a Master of Computer Science degree program, priced at $6,000, that combines MOOCs with a large number of instructional tutors in a blended manner.
Coursera is also attempting to license the content of existing MOOCs to be coupled with local instructional staff for credit-bearing courses at traditional universities.
This blended- or flipped-learning approach appears to be a workable alternative that would solve the central problems of assessment, feedback, and revenue. The flipped or blended learning mode is fundamentally quite similar to many advanced seminars in universities in which students are assigned take-home readings from textbooks or reference materials and then provide reports and participate in instructor-led discussions. Because it depends on local instructors and face-to-face sessions, however, this approach is not as scalable as MOOCs themselves.
Peer assessment, by contrast, scales with enrollment, but peer raters are themselves students with limited time and expertise, and each can realistically be asked to evaluate only a few submissions. These limitations would in turn lead to each assignment being rated by no more than a handful of peers realistically. The resulting assessment score data would then be of a nested design with missing data in most cells. With a large enrollment for the course but only a handful of different peer raters per assignment, the distribution of rater abilities and knowledge for each assignment and between assignments will necessarily be uneven and imbalanced. Some assignments would be rated by excellent and knowledgeable raters while some would be rated by poor and uninspired raters.
In its most basic form, the process of peer assessment within a MOOC would be as follows: A scoring rubric is developed for an assignment, usually in the form of a project, an artifact, or a written report, within an instructional unit in a MOOC. Students are instructed to complete the assigned project and submit it online. Each project is then distributed to several randomly selected fellow students, who are asked to view the project online. Each fellow-student rater rates the quality of the project against the predetermined scoring rubric and is also asked to provide some written comments. The mean or median rating score is taken as the score for the project. The score as well as the written comments are then made available to the original student who submitted the project. Through this process, each project is rated by no more than a handful of peer raters, and each peer rater rates no more than a handful of projects.
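A minimal sketch of this basic process, assuming k randomly assigned raters per submission and the median rubric score as the project score, might look as follows; all names and numbers are illustrative rather than taken from any particular platform.

    import random
    from statistics import median

    def assign_peer_raters(submissions, students, k=4, seed=0):
        """Randomly assign k fellow students to rate each submission,
        excluding the submission's own author."""
        rng = random.Random(seed)
        assignments = {}
        for sub_id, author in submissions.items():
            eligible = [s for s in students if s != author]
            assignments[sub_id] = rng.sample(eligible, k)
        return assignments

    def project_score(rubric_ratings):
        """Aggregate a handful of rubric ratings into a single project score.
        The median is used here; the mean is an equally common choice."""
        return median(rubric_ratings)

    # Example: three submissions, six students, two raters per submission.
    submissions = {"sub1": "ana", "sub2": "ben", "sub3": "chen"}
    students = ["ana", "ben", "chen", "dee", "eli", "fay"]
    print(assign_peer_raters(submissions, students, k=2))
    print(project_score([3, 4, 4]))   # -> 4

The purely random draw in assign_peer_raters is exactly what produces the uneven distribution of rater ability noted above, a point taken up again in the discussion of stratified assignment below.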

Accuracy of Peer Assessment Results and Remedies
Perhaps the most glaring problem with peer assessment is how trustworthy the results are. After all, within peer assessment, the performance of a novice is being judged by other novices. Is it possible that peer raters misjudge the quality of the submission even with the guidance of the predetermined scoring rubric? Is it possible that peer raters judge a submission highly because the raters and the submitter share the same set of misconceptions? Several methods have been proposed to improve the accuracy of peer assessment results, including the Calibrated Peer Review (CPR™) method, in which raters are first calibrated against instructor-scored submissions, and Bayesian statistical approaches that model and adjust for rater bias and reliability (Piech et al., 2013).
Whereas the CPR™ method considers only the inaccuracy of the peer rater, the credibility index (CI) approach takes into consideration the accuracy of the rater, the consistency of the rater, and the transferability of the level of accuracy between contexts and assignments. The approach attempts to garner the needed additional information without adding much more to the rater's burden beyond what is already gathered in the CPR™ method. Theoretically, this approach should improve the accuracy of peer assessment results, and there is preliminary evidence supportive of that claim (Xiong et al., 2014). Additional research is currently underway to confirm its efficacy. If proved effective, the CI can also be used to rank peer answers and comments in online discussion forums based on the CI value of each responder, and is thus potentially capable of moving the system away from ranking comments based on popularity to one based on knowledge.
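The exact formulation of the credibility index in Xiong et al. (2014) is not reproduced here; the sketch below is only a hypothetical illustration of how the three ingredients named above (accuracy, consistency, and transferability across assignments) might be combined into a single value per rater, which could then be used to re-rank forum responses by responder credibility rather than popularity.

    def credibility_index(deviations_by_assignment, max_dev=4.0):
        """Hypothetical illustration of a credibility index (not the published
        formula). deviations_by_assignment maps each past assignment to the
        rater's absolute deviations from reference (e.g., instructor) scores
        on calibration submissions; max_dev is the rubric's maximum possible
        deviation."""
        per_assignment = []
        for devs in deviations_by_assignment.values():
            accuracy = 1 - (sum(devs) / len(devs)) / max_dev   # closeness to the reference
            consistency = 1 - (max(devs) - min(devs)) / max_dev  # within-assignment stability
            per_assignment.append(accuracy * consistency)
        # Transferability: reward raters whose accuracy holds up across
        # assignments by taking the worst per-assignment value, not the best.
        return min(per_assignment)

    def rank_responses_by_credibility(responses):
        """Re-rank forum responses by the responder's credibility index
        instead of by like/dislike popularity."""
        return sorted(responses, key=lambda r: r["ci"], reverse=True)

    rater_history = {"hw1": [0.5, 1.0, 0.5], "hw2": [1.0, 1.5, 1.0]}
    print(round(credibility_index(rater_history), 2))

    answers = [{"text": "Apply criterion 2 of the rubric.", "ci": 0.62},
               {"text": "Just guess.", "ci": 0.15}]
    print(rank_responses_by_credibility(answers)[0]["text"])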

Nature of Peer Assessment Errors
If MOOCs are to be a complete educational experience, and not just a free multimedia version of traditional textbooks, the key seems to be whether there is a viable and scalable built-in formative assessment and feedback process. Among the various options available, peer assessment is the most widely applicable method to date. In spite of the many studies showing the efficacy of peer assessment in promoting learning, skepticism remains as to whether peer assessment results can be trusted.
One source of ambiguity in evaluating the accuracy of peer assessment results seems to be the problem of determining what constitutes the true score. Most studies that attempt to evaluate accuracy have used the instructor's rating as the absolute standard, and the quality of a peer rating is determined by how far it departs from the instructor's rating.

However, Piech et al. (2013) offer a different argument:
For our datasets, we believe that the discrepancy between staff grade and student consensus typically results from ambiguities in the rubric and elect to use the mean of the student consensus on a ground truth submission as the true grade.
In the case of Piech et al.'s situation, the ground truth submission was rated by hundreds of peer raters. Given the large number of peer raters, their decision to use the mean of student ratings as the 'true score' may be a manifestation of the trust in crowdsourcing.
While the majority of studies continue to consider proximity to instructor rating as the gold standard of accuracy, Piech et al.'s reasoning does reflect the complexity of the situation. There are at least six types of discrepancies in a peer assessment situation.
These include: A) the discrepancy between the rating given by a peer rater and the rating given by the instructor on the same piece of work; B) the random situational fluctuations in the ratings given to that same piece of work by the same peer rater under different conditions; C) the inconsistency of ratings given to other pieces of work of similar quality that differ in context or style; D) the random discrepancies between different peer raters rating the same piece of work with the same set of criteria or rubric; E) the systematic discrepancy between different raters on the same piece of work due to differences in rater competence or rater leniency/stringency; and F) the random situational fluctuations in the ratings given to the same piece of work by the same instructor under different conditions. The situation is analogous to a moving archer on horseback shooting at a moving target.
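In the spirit of generalizability theory, one illustrative way to organize these six sources of discrepancy is a simple additive model of an observed rating; this framing is an assumption added here for clarity rather than a model proposed in the studies cited above:

    x_{pro} = \tau_p + \beta_r + (\beta\tau)_{pr} + \varepsilon_{pro}

where x_{pro} is the rating given to product p by rater r on occasion o; \tau_p is the true quality of the product; \beta_r is rater r's systematic leniency/stringency and competence effect (discrepancies A and E); (\beta\tau)_{pr} is the rater-by-product interaction that captures inconsistency across similar pieces of work and random disagreement among raters (C and D); and \varepsilon_{pro} absorbs occasion-to-occasion fluctuation for both peer raters and the instructor (B and F).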
Rater training and a carefully constructed rubric can help reduce some of the errors from all sources. However, in addition to rater training and good rubrics, the different approaches to peer assessment discussed earlier can be viewed as attempts to tackle different combinations of these sources of error. The CPR™ method is designed to minimize errors A and E in general, but the existence of other sources of error can render this effort ineffective for a given assessment. The Bayesian approach is designed to minimize error D. The CI approach is designed to minimize errors A, B, C, and E, but it does require slightly more information from the rater than is otherwise collected by other methods. No method has been developed to minimize error F, except for the desirable practice of developing clear rubrics. The cMOOC approach would not consider these to be errors at all, but part of the diversity of views upon which knowledge is to be gained.
It is theoretically possible to combine these approaches into a single, more effective composite approach in which raters are calibrated after training via the credibility index approach and the resulting ratings are further refined via a Bayesian or empirical Bayes approach.
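As an illustration of what such a composite might look like, the sketch below weights each peer rating by the rater's credibility index and then shrinks the weighted average toward the class-wide mean, in the manner of a simple empirical Bayes estimate; the weighting scheme and the prior_strength parameter are assumptions for illustration, not the method of any of the studies cited above.

    def composite_score(ratings, credibilities, class_mean, prior_strength=2.0):
        """Credibility-weighted peer score shrunk toward the class mean.
        ratings        : rubric scores from the peer raters of one submission
        credibilities  : credibility index of each corresponding rater (0..1)
        class_mean     : mean score across all submissions (the 'prior')
        prior_strength : how strongly the estimate is pulled toward the class
                         mean when credible evidence is scarce (assumed value)."""
        weighted_sum = sum(c * r for c, r in zip(credibilities, ratings))
        total_weight = sum(credibilities)
        return (weighted_sum + prior_strength * class_mean) / (total_weight + prior_strength)

    # A submission rated 4, 2, and 5 by raters of unequal credibility:
    print(round(composite_score([4, 2, 5], [0.9, 0.3, 0.7], class_mean=3.2), 2))

Under this kind of scheme, a submission judged by few or low-credibility raters stays close to the class mean, while one judged by several highly credible raters is dominated by their ratings.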
Finally, one remaining problem with peer assessment in MOOCs is the possibility of an assignment being rated entirely by poor raters. This problem may be minimized if the peer rater distribution algorithm uses a stratified sampling process based on prior knowledge, credibility index values, or performance as a peer rater on previous assignments, instead of the current random assignment process.
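One way such a stratified assignment could work is sketched below: raters are sorted into strata by some prior-performance measure (a credibility index value, for example), and each submission draws one rater from each stratum, so that no submission is judged entirely by weak raters. The stratum count, names, and scores are assumptions for illustration.

    import random

    def stratified_rater_assignment(submissions, rater_scores, n_strata=3, seed=0):
        """Assign each submission one rater from each stratum of prior rating
        performance (e.g., credibility index values), instead of drawing all
        raters purely at random. Returns {submission_id: [raters]}."""
        rng = random.Random(seed)
        ranked = sorted(rater_scores, key=rater_scores.get, reverse=True)
        size = max(1, len(ranked) // n_strata)
        strata = [ranked[i * size:(i + 1) * size] for i in range(n_strata - 1)]
        strata.append(ranked[(n_strata - 1) * size:])  # last stratum takes the remainder
        assignments = {}
        for sub_id, author in submissions.items():
            chosen = []
            for stratum in strata:
                pool = [r for r in stratum if r != author]
                if not pool:  # the author was the only rater in this stratum
                    pool = [r for r in ranked if r != author]
                chosen.append(rng.choice(pool))
            assignments[sub_id] = chosen
        return assignments

    # Six hypothetical raters with prior-performance scores, two submissions.
    raters = {"ana": 0.9, "ben": 0.8, "chen": 0.6, "dee": 0.5, "eli": 0.3, "fay": 0.2}
    subs = {"sub1": "ana", "sub2": "fay"}
    print(stratified_rater_assignment(subs, raters))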
It should be noted that peer assessment, whether the results are accurate or not, is considered valuable as an instructional tool in its own right. Indeed, Topping (2005) folded peer assessment into a larger category of peer learning. However, accurate peer assessment results would further enhance this learning experience, as well as serve the purpose of assessment. Additionally, if they can be made reasonably accurate, peer assessment results can be used for purposes beyond formative assessment. One such potential use is to facilitate online discussion forums by putting more weight on the opinions of student raters whose judgments of peer performances are close to those of the instructor. Another potential use is to use student raters' performance as raters to augment summative evaluation for credentialing. According to the U.S. Supreme Court decision in Owasso Independent School District v. Falvo (2002), peer assessment as formative evaluation does not violate the 1974 U.S. law known as the Family Educational Rights and Privacy Act (FERPA). The key basis of that judgment seems to be restricted to the idea that peer assessment results used for formative purposes do not constitute part of the student's school record. At this time, whether peer assessment results can be used as part of a summative grade, including for credentialing and certification, without violating FERPA is not clear.