International Review of Research in Open and Distributed Learning
Automatic Evaluation for E-Learning Using Latent Semantic Analysis: A Use Case

Abstract
Assessment in education allows for obtaining, organizing, and presenting information about how much and how well the student is learning. This paper analyses and discusses some state-of-the-art assessment systems in education. It then presents a specific use case developed for the Universitat Oberta de Catalunya, an online university. An automatic evaluation tool is proposed that allows students to evaluate themselves at any time and receive instant feedback. The tool is a web-based platform designed for engineering subjects (i.e., with math symbols and formulas) in Catalan and Spanish. The technique used for automatic assessment is latent semantic analysis. Although the experimental framework of the use case is quite challenging, the results are promising.

Introduction
Assessment in education is the process of obtaining, organizing, and presenting information about what and how the student is learning. Assessment uses several techniques during the teaching-learning process, and it is especially useful when evaluating open-answer questions, since they allow teachers to better understand how well the student has assimilated the subject. In some cases, for instance, students with high scores in closed-answer tests reveal underlying conceptual errors when interviewed by a teacher (Tyner, 1999).
In recent years, the use of computers for assessment purposes has increased substantially. The aims of computer-based assessment include achieving and consolidating the advantages of a system with the following characteristics (Brown et al., 1999): first, to reduce the professors' workload by automating part of the student evaluation task; second, to provide the students with detailed information on their learning period more efficiently than traditional evaluation; and, finally, to integrate the assessment culture into the students' daily work in an e-learning environment. In fact, feedback is nowadays one of the most crucial aspects of assessment, so assessment of learning is generally intended to measure learning outcomes and report those outcomes to students (and not only to the system or teacher).
The current paper analyses some state-of-the-art assessment systems in education and presents a specific use case developed for the Universitat Oberta de Catalunya. Some examples of existing e-learning platforms are given. Next, the use of latent semantic analysis as an algorithm for the semantic analysis of related documents is briefly described and explained in the context of assessment tasks. Then the authors present the above-mentioned use case, which uses latent semantic analysis to obtain the evaluation results. Finally, conclusions are presented.

E-learning Assessment Platforms
Some papers in the literature are oriented to automated essay-scoring research. The most relevant ones can be found in Miller (2003), Shermis and Burstein (2003), Hidekatsu et al. (2007), and Hussein (2008). However, to the best of our knowledge, studies covering automatic essay scoring in engineering subjects are limited, though not nonexistent. In Quah et al. (2009), for instance, the authors use a Support Vector Machine to build a prototype system that is able to evaluate equations and short answers. The system extracts textual and mathematical data from input files: distinct words for the text, and equation trees based on the MathTree format for the mathematical equations. The system then learns the evaluation scheme from an initial set of graded scripts and evaluates the subsequent scripts automatically.

Latent Semantic Analysis in E-Learning
The task of evaluating a document in our education context implies judging the semantic content of that document. To this end we use latent semantic analysis (LSA), also known as latent semantic indexing, a technique that analyses the semantic relationships between a set of documents and the terms they contain (Hofmann, 1999). LSA has been successfully applied in multiple natural language processing areas such as cross-language information retrieval (Dumais et al., 1996), cross-language sentence matching (Banchs & Costa-jussà, 2010), and statistical machine translation (Banchs & Costa-jussà, 2011).
The aim of LSA is to analyse documents in order to find their underlying meaning or concepts. The technique arises from the problem of how to compare words to find relevant documents, since what we actually want to compare are the concepts and meanings behind the words, rather than the words themselves. In LSA, both words and documents are mapped into a concept space, and the comparison is performed in this space. The space is created by means of the well-known singular value decomposition (SVD) technique, which is a factorization of a real or complex matrix (Greenacre, 2011).
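As an illustration, the construction of the concept space can be written compactly. The notation here is ours and is only an assumption about the setup: $X$ is an $M \times N$ term-by-document matrix ($M$ terms, $N$ documents), and $k$ is the number of retained concepts.

```latex
X = U \Sigma V^{\top},
\qquad
X_k = U_k \Sigma_k V_k^{\top},
\quad k \ll \min(M, N)
```

Here $\Sigma$ is the diagonal matrix of singular values in decreasing order, and $U_k$, $\Sigma_k$, $V_k$ keep only the $k$ largest singular values and their singular vectors. The rows of $U_k$ locate terms in the $k$-dimensional concept space, and the rows of $V_k$ locate documents.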
In the specific area of essay assessment, LSA has shown promising results in content analysis of essays (Landauer et al., 1997), where LSA-based measures were closely related to human judgments in predicting how much the student will learn from the text (Wolfe et al., 2000; Rehder et al., 2000) and in grading essay answers (Kakkakonen et al., 2005). Other educational applications are intelligent tutoring systems, which provide help for students (Wiemer-Hastings et al., 1999; Foltz et al., 1999b), and assessment of summaries (Steinhart, 2000). Since LSA is language independent, it has been applied to essays written in a variety of languages, such as English (Wiemer-Hastings & Graesser, 2000), French (Lemaire & Dessus, 2001), and Finnish (Kakkakonen et al., 2005). All these studies show that, although it does not take word order into account, LSA is capable of capturing significant portions of the meaning not only of individual words but also of whole passages such as sentences, paragraphs, and short essays. That is why we have chosen LSA to compare the semantic similarity of documents in the concept space (Pérez et al., 2006).
In this work, unlike the previous literature, we investigate whether LSA can be applied to the e-assessment of mathematical essays. Additionally, experiments are performed in both Catalan and Spanish.
LSA is integrated as follows. The documents containing the students' responses are compared with one or more reference documents containing the correct answers created by the teachers. This semantic comparison of the students' documents with the reference documents allows teachers to generate an approximate evaluation of the students.
For document comparison and/or document retrieval, documents are typically transformed into a suitable representation, usually a vector-space model (Salton, 1989). A document is represented as a vector in which each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several ways of computing these values, also known as (term) weights, have been developed. One of the best-known schemes is tf-idf (term frequency, inverse document frequency) weighting. The tf-idf weight is a statistical measure of how important a word is to a document in a collection. Such a representation is known to be noisy and sparse. For this reason, space reduction techniques are applied to obtain more efficient vector-space representations (Deerwester et al., 1990; Hofmann, 1999; Sebastiani, 2002), and the new reduced space is expected to capture semantic relations among the documents in the collection. As a final step, a cosine similarity measure between each exam and its solution is calculated in the reduced space, yielding a score that shows how semantically similar each exam is to its corresponding solution.
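The tf-idf weighting and the cosine measure described above can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the tool's implementation (which was written in Java): the whitespace tokenizer, the `log(N / df)` variant of idf, the function names, and the example sentences are all hypothetical. The comparison is done here in the full term space, without the space reduction step.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_idf(documents):
    """Inverse document frequency for every term in the collection."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # log(N / df): a term that occurs in every document gets weight 0.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf, vocabulary):
    """Dense tf-idf vector of `text` over a fixed, ordered vocabulary.

    Words outside the vocabulary are ignored.
    """
    counts = Counter(tokenize(text))
    return [counts[term] * idf[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical reference answers and one hypothetical student answer.
references = [
    "the current through a resistor is voltage divided by resistance",
    "a capacitor stores energy in an electric field",
]
idf = build_idf(references)
vocabulary = sorted(idf)

reference_vectors = [tfidf_vector(r, idf, vocabulary) for r in references]
student_vector = tfidf_vector(
    "current equals voltage divided by resistance", idf, vocabulary
)
similarities = [cosine_similarity(student_vector, r) for r in reference_vectors]
```

In this toy collection the student answer shares five weighted terms with the first reference and none with the second, so the first similarity is clearly the larger one.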

The UOC's Use Case
This section addresses the creation of a web-based free-text assessment tool that allows the automatic assessment of students of the Universitat Oberta de Catalunya (Open University of Catalonia, UOC). The main characteristics of the university assessment system and the developed tool are described in the following subsections.

The Universitat Oberta de Catalunya
The UOC is an online university based in Barcelona with more than 54,000 students. Over 2,000 tutors and faculty work together, and an administrative staff of around 500 provides services to all these students. The students follow a continuous assessment system, which is carried out online throughout the semester. Although students successfully complete their studies under this system, one of its main problems is the growing number of students each year, which makes the task of marking their continuous assessment tedious and time-consuming. Likewise, more external tutors are needed to carry out this task, which makes it difficult to agree on criteria.

The Assessment Tool
The tool developed at the UOC aims to provide an automatic assessment of assignments in engineering subjects by using the latent semantic analysis technique, following the work carried out by Miller (2003), where the application of LSA to automated essay scoring is examined and compared to earlier statistical methods for assessing essay quality. LSA is implemented in Java.
The web-based free-text assessment tool allows professors to design as many evaluation tests as they want, with as many questions as they consider necessary for the evaluation. For each question, the professor associates several correct-answer models in order to generate enough reference answers for the automatic evaluation system to work correctly. Students, in turn, can take as many evaluation tests as they want; after each test, the platform generates a report that includes the evaluation results for every individual question as well as the overall results. Moreover, the tool lets students compare the reference answers written by the professor with their own answers, which gives them detailed feedback and helps improve their learning process. The platform also includes a text editor that allows formulas to be inserted both in the statements and in the answers by means of the JavaScript plug-in MathEdit (Su et al., 2006).

Evaluation Experiments
In this section we describe the experimental framework of our case study. We include subsections that describe the working framework, the web interface, the assessment experiments, and the results obtained.

Working framework.
The main objective of the tool is to help teachers in their evaluation tasks for a large number of students. These first experiments involve a controlled and relatively small number of students in order to establish the groundwork for further and more extensive experiments. The application framework covers the students in two consecutive semesters (with 54 and 70 registered students, respectively) of a single UOC subject called Circuit Theory, a core first-year subject of the UOC's Telecommunications Engineering degree.
Apart from the single final evaluation that takes place at the end of the semester, the subject's assessment model contains four continuous assessment assignments (CAAs) distributed over the course of the semester and a single practical work that includes computer simulation exercises, structured as follows. The first three CAAs are made up of two different sections: a short question section and an exercises section. The fourth and last CAA contains only an exercises section. More specifically, the short question sections consist of a set of 5-6 questions about very concrete issues. Each of these questions is provided with four possible answers, only one of which is correct, and the students have to specify the correct answer and give reasons for their choice. Due to the technical nature of the subject matter, mathematical equations usually appear in the wording of both questions and answers as well as in the students' corresponding justifications.
Vol 14 | No 1 March/13
Within this context, the short question sections of the first three CAAs were chosen as the specific application framework for the automatic evaluation experiments, due to the suitability of the structure and length of both the questions and answers and to the nature (short text plus a few mathematical equations) of the justifications the students have to provide.

Web interface.
The automatic test assessment system is presented as a web platform that can be accessed from two different profiles: the teacher and the student. The main task of the teacher is to provide questions and correct reference answers. Thus, a teacher can perform two different actions for each subject: create a new test or modify an existing one. In order to create a new test, the teacher must first define the following attributes: the name of the test, the subject to which it belongs, its position within the set of tests of the subject, and a brief description (Figure 2a). Once these attributes have been entered, the teacher can register the empty test in the database. Then teachers can insert as many questions as they wish in the test. For each new question, the following attributes need to be completed: (a) statement, (b) maximum possible mark, (c) minimum mark to pass the question, (d) question difficulty, and (e) language of the statement (Figure 2b). Moreover, a set of reference answers is associated with each question. Additionally, the teacher can consult the results obtained as well as the answers given by the students. Once authenticated, the students can perform the following actions: (1) evaluate themselves by taking a test, (2) check the history of the tests they have taken, and (3) consult the marks obtained as well as the maximum and minimum marks defined by the teacher.
In order to evaluate themselves, students are shown an alphabetically ordered list of subjects; they choose a subject and then select the test they wish to start with and the difficulty level. The statement of each question is presented to the students together with its corresponding mark. The students must answer within a text editor, where they can insert formulas thanks to a JavaScript plug-in called MathEdit (Su et al., 2006), as seen in Figure 3a. Once the answers have been written and the test is finished, the system provides the student with an overall score together with the marks obtained in each of the questions (see Figure 3b). Likewise, the students can check, for each question, the answers they wrote as well as the reference answers written by the teacher. Apart from taking tests, the students can log into the platform in order to evaluate their progress. Thus, every student has access to a history in which they can see a list of completed tests. Once a completed test is chosen, the questions can be seen in detail, including the answer given by the student, the mark obtained, the maximum and minimum marks defined by the teacher, and the reference answers used by the automatic evaluation system to make the corrections.

Assessment experiments.
This section describes the automatic evaluation performed over the continuous assessment assignments of the students. The experiments used the students' CAAs from the two semesters. Once the set of solutions has been processed with LSA, the procedure for each student answer is as follows.
1. Vectorise the answer in terms of tf-idf, using the vocabulary of the set of solutions. This yields a vector of dimension M.
2. Project the vector into the reduced space.
3. Compute the similarity of this reduced-space vector with each solution. We keep the maximum similarity.
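Steps 2 and 3 can be sketched as follows. This is a hedged illustration rather than the tool's Java code: the function names are ours, and the `basis` below is a hand-made stand-in for the k directions that an SVD of the solution matrix would supply (it is not computed here). The score is taken to be the highest similarity over the solutions.

```python
import math


def project(vector, basis):
    """Project an M-dimensional vector onto k basis vectors (each of length M)."""
    return [sum(b * x for b, x in zip(row, vector)) for row in basis]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_answer(answer_vector, reduced_solutions, basis):
    """Highest cosine similarity between the projected answer and any solution."""
    reduced_answer = project(answer_vector, basis)
    return max(cosine_similarity(reduced_answer, s) for s in reduced_solutions)


# Hypothetical basis with M = 4 terms and k = 2 concepts (orthonormal rows).
# Terms 0 and 1 load on the first concept, terms 2 and 3 on the second.
h = 1 / math.sqrt(2)
basis = [
    [h, h, 0.0, 0.0],
    [0.0, 0.0, h, h],
]

# Hypothetical tf-idf vectors of two reference solutions, projected once.
solutions = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 2.0]]
reduced_solutions = [project(s, basis) for s in solutions]

# The answer uses only term 0, while the first solution uses terms 0 and 1.
answer = [3.0, 0.0, 0.0, 0.0]
best = score_answer(answer, reduced_solutions, basis)
full_space = cosine_similarity(answer, solutions[0])
```

In this toy setup the answer and the first solution do not use exactly the same terms, so their similarity in the full term space (`full_space`) is below 1, whereas in the reduced space both fall on the same concept axis and `best` is essentially 1. This is the effect the space reduction is meant to capture.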
The material used in the analysis presented three main problems.
1. File formats. The students' CAAs are delivered in many different formats, although they are mainly in PDF, Word, and Open Office Writer. Some of them are even scanned documents pasted as image files into Word or Writer documents. Therefore, not all the CAAs can be easily transformed into TXT format to be processed properly. Consequently, PDF documents and all documents containing image files were removed from the original set of files.
Table 1 shows, for each semester, the number of registered students, the number of original documents, and the number of documents used after removing PDF documents and documents with pasted images. The table also shows the vocabulary size for each CAA. As can be seen, the vocabulary size is not correlated with the number of CAAs, so the vocabulary content of the CAAs varies widely from set to set.
2. Mathematical formulation. Given that we are using a bag-of-words approach, it matters that the formulation extracted from Open Office documents was coded in MathML (Mathematical Markup Language), while the formulation extracted from Word documents was not; this made a big difference between CAAs regarding the final vocabulary.
3. Language. The students submitted the CAAs in both Catalan and Spanish. In this case, we assumed that the method presented in the current paper is able to take advantage of the vocabulary that is language independent, such as the mathematical variables.

Results.
In order to carry out the preliminary assessment experiments, CAA1 and CAA2 from semester S1 were used as development material, which led us to conclude that the best reduced rank for latent semantic analysis was five.
The results are shown in terms of the correlation obtained between automatic and human evaluations.We define human evaluation as the assessment made by the teacher in a traditional way, while automatic evaluation is defined as a computer-based assessment given by the methodology proposed in the current work (i.e., the quantifications obtained automatically using latent semantic analysis and the cosine distance).
Thus, by using latent semantic analysis, automatic evaluations were obtained for each student, CAA, and semester. Then the correlations between automatic and human evaluations were computed for each semester and CAA collection. The correlation results obtained are reported in Table 2 (correlation column), together with the statistical significance of the correlation results (p column).
As can be seen from the table, in statistically significant results (i.e., where p < 0.05), the correlation varies from 52% to 69% (see CAA1 and CAA2 from semester S2).
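For readers who want to reproduce this kind of comparison, the sketch below computes a correlation between two lists of marks. It assumes the Pearson coefficient, which the text does not state explicitly, and the marks are invented for illustration; they are not the paper's data. The t statistic shown is the usual one for testing whether a Pearson correlation differs from zero; turning it into a p value requires the Student's t distribution with n - 2 degrees of freedom, which is not computed here.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r != 0 (n - 2 degrees of freedom); needs |r| < 1."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Invented marks: teacher's marks (0-10) and automatic similarity scores (0-1).
human_marks = [2.0, 4.0, 5.0, 7.0, 9.0]
automatic_scores = [0.20, 0.35, 0.50, 0.60, 0.90]

r = pearson_correlation(human_marks, automatic_scores)
t = t_statistic(r, len(human_marks))
```

Because the correlation is invariant to linear rescaling, the two lists do not need to be on the same scale.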
Although these results are lower than those presented in Miller (2003), they are promising given that we are dealing not with a purely textual subject, but with a subject that includes mathematical symbols and formulas.
On the one hand, we must take into account that the reference answers were written in Catalan by the teachers, while the students could choose whether to answer the tests in Catalan or Spanish, so the language of the tests was not the same in all the students' CAAs. On the other hand, unlike the students' CAAs, all the reference solutions were available in Writer format. Since only the mathematical formulas of the Writer documents were transformed into MathML, there was also disparity in the formulas in each CAA collection.
In order to see how these disparities could have affected the results, we computed the percentage of CAAs in each set that satisfied the following two requirements at the same time (i.e., the same two characteristics satisfied by the reference solutions).
1. The formulas were coded in MathML.
2. The students answered in the Catalan language.
The percentage of CAAs satisfying both characteristics is shown in Table 2 in the third column of every CAA result. It can be seen that the two statistically significant results with a correlation over 50% (i.e., CAA1 and CAA2 from semester S2) correspond to the sets in which the codification and the language used are the same as in the reference solutions in more than 25% of the cases. Therefore, the results suggest that the correlation between human and automatic evaluations depends on the coherence of both the mathematical codification and the language used in the tests.
[...] beginning. However, we are aware that this is not the best method to collect the data, and both file types (PDF and image files) will be dealt with in future research.
Nevertheless, despite the difficulties in the material used, the preliminary experiments have shown some interesting results. After computing the correlation between the automatic and the human assessments, only two of the six evaluation tests showed a statistically significant correlation greater than 50%. These two sets correspond to the sets of CAAs that are most similar to the reference solutions: the mathematical formulas are coded in MathML, and the students' answers were mostly written in the same language.
In automatic essay assessment we would expect a higher correlation. However, we are dealing with challenging material, since it includes mathematical symbols and formulas, which makes the analysis more difficult. Therefore, although for the time being the correlation results are not satisfactory, they set a starting point that allows us to work with this kind of material in engineering subjects. Thus, future work will focus on improving the format of the materials to make them coherent (i.e., by using the same formulation and dealing with the language issue). Additionally, we plan to experiment with non-linear space reduction such as multidimensional scaling in order to find further semantic similarities.

Figure 1. Schematic representation of the use of latent semantic analysis for automatic essay scoring.

Figure 2. Creation page of a new test (a) and creation form of a new question (b).

Figure 3. Question and text editor with MathEdit (a) and mark of the test once it is finished (b).
Each semester included a set of three different CAAs (CAA1, CAA2, and CAA3). The data were tokenized and lowercased. The 20 most frequent words were discarded. In what follows, we describe the procedure for treating the set of solutions with LSA:
1. Compute N solutions in terms of tf-idf:
Many portals can currently be found online. Examples include, for instance, the Online Learning and Collaboration Services (OLCS, http://www.olcs.lt.vt.edu) from

Table 2
Correlation Results (corr.) and Statistical Significance (p) between Automatic and Human Evaluation, and Percentage of CAAs Satisfying the Same Characteristics as the Reference Solutions (same charact.)