International Review of Research in Open and Distributed Learning

A Learning Analytics Approach Using Social Network Analysis and Binary Classifiers on Virtual Resource Interactions for Learner Performance Prediction

The COVID-19 pandemic induced a digital transformation of education and inspired both instructors and learners to adopt and leverage technology for learning. This led to online learning becoming an important component of the new normal, with home-based virtual learning an essential aspect for learners on various levels. This, in turn, has caused learners of varying levels to interact more frequently with virtual resources to supplement their learning. Even though virtual learning environments provide basic resources to help monitor the learners’ online behaviour, there is room for more insights to be derived concerning individual learner performance. In this study, we propose a framework for visualising learners’ online behaviour and use the data obtained to predict whether the learners would clear a course. We explored a variety of binary classifiers from which we achieved an overall accuracy of 80%–85%, thereby indicating the effectiveness of our approach and that learners’ online behaviour had a significant effect on their academic performance. Further analysis showed that common patterns of behaviour among learners and/or anomalies in online behaviour could cause incorrect interpretations of a learner’s performance, which gave us a better understanding of how our approach could be modified in the future.

Virtual learning environments (VLE) have replaced physical classrooms in various institutions and have been widely adopted by instructors and learners of various levels worldwide due to the COVID-19 pandemic.
The resources provisioned by course providers and/or instructors in these environments are used to supplement learners' learning and/or assess their understanding of the course material they have been taught up to a certain point in time. Virtual resources contribute to a learner's academic performance in educational institutions worldwide. Therefore, learners' behaviour within an online environment, for example, how they interact with such virtual resources, could help us study whether any correlation exists between their online activity and their performance in a course. Learning behaviour can be examined with the application of big data techniques to promote learner success (Khor and Looi, 2019). Hence, the purpose of this study was to model the behaviours of learners in a VLE and explore whether we could leverage big data techniques and use learners' interactions with virtual learning resources to predict whether a learner is successful in clearing a course, thereby helping us to understand how their behaviour in a virtual environment affects their academic achievement.
Social network analysis (SNA) is a means to examine social structures with the aid of networks and graph theory (Grunspan et al., 2014; Otte & Rousseau, 2002). The structures, commonly referred to as sociograms, are analysed with nodes as points, representing people or other entities of interest in the network, and ties or edges as lines, which are usually relationships or interactions between the entities. This method of visualising social structures allows us to quantitatively and qualitatively analyse them to derive valuable insights. SNA has been applied in education research to examine how learners form relationships through learning and how these relationships affect their learning outcomes. For example, SNA has been applied to understand the structure of the study networks formed between learners in an undergraduate class and how it could ultimately influence individual learners' academic performance (Grunspan et al., 2014). On the other hand, Rabbany et al. (2012) have applied SNA to understand learner interactions where social mining and other SNA techniques were exploited to discover structures in the network graphs generated from the manner and content of interactions between the learners. The study provides instructors with insight to better visualise the interactions among the learners, have a better understanding of the main influencers, and give them a better understanding of learner participation, especially with courses that rely on virtual resources. Chung and Paredes (2015) have developed a social network model for online learning and performance. The authors analysed social learning in an e-learning environment and used SNA to demonstrate how the properties of a learner's social network, along with the learner's contribution towards the learning of others and the content of their contribution, impact the learning and performance of others. Meanwhile, Dragulescu et al. (2015) make use of various tools to query the social interactions from various data sources and run SNA on the queries to show it can be used to model the interactions using data from a pool of resources.

A study conducted by Saqr et al. (2018) shows how online interaction data can be collated and processed in order to use SNA to study how collaboration between peers in a course influences performance. Furthermore, Rakic et al. (2018) explore how SNA can be used to study and analyse the use of virtual resources on an e-learning platform to find vital indicators of learner performance in different courses. Rakic et al. (2019, 2020) further developed this study by using SNA with other machine learning methods. In all these studies, the use of resources on the e-learning platform was found to contribute significantly to a learner's performance.
Learning analytics, along with machine learning, has been applied to an assortment of institutions' VLE data to help predict learner performance and understand factors affecting it. Koza et al. (1996) and Mitchell (1997) describe machine learning as the construction and usage of algorithms that leverage data to make predictions or decisions and improve performance in the automation of specific tasks.
The most basic approaches in machine learning make use of either supervised or unsupervised learning.
For predictive tasks or data classification, supervised learning makes use of labelled data to train algorithms to learn how to predict outcomes or classify the data, while unsupervised learning finds patterns within unlabelled data to learn how to cluster or split the data into different groups. Ensemble techniques in either supervised or unsupervised learning use various learning algorithms to improve predictive or classification performance compared with that obtained with just a single supervised or unsupervised learning algorithm.
The machine learning approach to be used in a task is based on the objective and the data that has been made available. Wolff et al. (2013) have leveraged learning analytics methods to develop models to predict at-risk learners based on their behaviour within VLEs and their demographic data. Al-Azawei and Al-Masoudy (2020) have also developed a predictive model that used behavioural data from VLEs along with assessment scores and demographic data to predict academic performance. Clickstream data from the VLE can also be used to predict at-risk learners with the application of deep learning techniques (Waheed et al., 2020).

Research Methods
The main research processes that were undertaken in this study are summarised in Figure 1. Data exploration and visualisation were conducted to obtain information about the learners, the courses they enrolled in, and the virtual learning resources they accessed for each course they were enrolled in. Social network graphs were constructed depicting the online behaviour of each learner before computing the centrality values chosen for each node in the graph. Data preprocessing was then conducted to assign the binary labels graduated and did not graduate to each learner based on their final grade (Table 1). Finally, we trained and tested binary classifiers with the data we had prepared using supervised and ensemble learning algorithms to predict whether a learner was able to successfully clear a course. The machine learning technique was used not only to compare the performance across all classifiers but also to gain more insights via the analysis of each classifier's performance in the prediction task to find out the common behavioural aspects of learners that adversely affected prediction performance across all classifiers.

Description of Data Set
The publicly available anonymised Open University Learning Analytics Dataset (OULAD) was used in this study; its structure is illustrated in Figure 2. It primarily contains information about the learners from seven different courses, their activity within the VLE, and the assessments they completed for each course in 2013 and 2014. The data also contain their achieved results for the course. There are two semesters or presentations for each year, which commenced in February and October, and are labelled B and J, respectively. Some courses offered in B may not be offered in J, and vice versa. Details about the data contained in each of the seven records found in the data set are displayed in Figure 3. Besides identification code, domain, and course length, no further information regarding the structure or contents of the course was provided in the data set. Furthermore, as our study is based on how a learner behaved within a VLE, our analysis of the data set showed that the only behavioural features recorded were the virtual resources a learner accessed and the number of clicks made within that resource on a given date.
As the latter was not indicative of how the learner was interacting with the resource, and no data were given about the content of the resource, we did not include the number of clicks as a behavioural feature to avoid interpreting it incorrectly.
Each course in each presentation will have a separate VLE, in which a variety of virtual resources will be made available to supplement learners' learning and also to assess their understanding of the course

Data Preprocessing and Visualisation
Data preprocessing was carried out to ensure the data set had binary labels to learn from and predict before we trained a binary classifier. This was conducted by labelling learners who achieved a final result of distinction or pass as graduated and those who got a final result of fail or withdrawn as did not graduate. The virtual resource node's centrality values were then computed and used to predict learners who did and did not graduate.
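The labelling rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' preprocessing code: the function name `binary_label`, the encoding of graduated as 1 and did not graduate as 0, the exact spelling of the result strings, and the toy records are all assumptions made for the example.

```python
# Hypothetical mapping from a final-result string to the binary label.
# The four result names follow the text above; their exact spelling in the
# data set is an assumption.
GRADUATED = {"distinction", "pass"}
DID_NOT_GRADUATE = {"fail", "withdrawn"}


def binary_label(final_result: str) -> int:
    """Return 1 for 'graduated' and 0 for 'did not graduate'."""
    key = final_result.strip().lower()
    if key in GRADUATED:
        return 1
    if key in DID_NOT_GRADUATE:
        return 0
    raise ValueError(f"unexpected final result: {final_result!r}")


# Toy records (made up for illustration): (learner id, final result).
records = [(101, "Pass"), (102, "Withdrawn"), (103, "Distinction"), (104, "Fail")]
labels = {learner: binary_label(result) for learner, result in records}
print(labels)
```

Raising an error on an unknown result, rather than silently defaulting to one class, keeps mislabelled rows from leaking into the training data.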

Figure 6
Graduating Versus Nongraduating Learners for Each Course in Each Presentation

Data Analysis and Results
In this study, SNA was used to analyse learners' online behaviour via their interactions with virtual resources in the course they were enrolled in. Due to course CCC's influence on the nongraduating population in both semesters in 2014, and because course GGG had more learners who graduated in these semesters, we chose to focus on predicting the performance of learners enrolled in these courses (CCC and GGG) in 2014 to observe whether they would differ in terms of prediction performance.
Each learner has an undirected social network graph that depicts the unique resources a learner accessed for the entirety of a course. A black node is used for the learner while resource identification numbers are used to indicate which virtual resources the learner accessed. The edges in the network represent the learner accessing the virtual resource at least once in the course and are weighted based on the number of times the learner accessed that particular virtual resource for the duration of the course.
The social networks were constructed in this manner as we opted to visualise the learners and virtual resources as entities of the same type, among which there is an exchange of information from both sides. Nevertheless, we could not draw any edges between any two resources in the social networks we had constructed, and we also could not construct a social network dedicated to the resources as no information was provided on whether the resources interacted with each other. Furthermore, we could not construct social networks to visualise and observe interactions between learners, as the data set contained no records of such interactions or of forum discussions.
A summary of the learner population, the number of virtual resources in the course's VLE for the semester, and data related to the social networks for each course we focused on are presented in Table 3. After we constructed the social networks for each learner in courses CCC and GGG, of which some examples are shown in Figure 7, we computed the degree centralities for each network. Golbeck (2013) defines degree centrality as the number of edges a node has or the number of nodes a node is linked to. Degree centrality was employed based on the aim of our study, which was to analyse how a learner's interaction with virtual resources in a VLE affects their performance.
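As a rough sketch of the per-learner network described above, the graph can be held as a dictionary of weighted edges between one learner node and each resource the learner accessed. The helper `build_learner_network`, the node name `"learner"`, and the resource identifiers in the toy log are hypothetical; the study's own tooling is not specified in the text.

```python
from collections import Counter

LEARNER = "learner"  # hypothetical name for the single learner node


def build_learner_network(access_log):
    """Build one learner's undirected, weighted star graph.

    access_log: iterable of resource ids, one entry per recorded access.
    Returns {(LEARNER, resource_id): weight}, where weight is the number
    of times the learner accessed that resource.
    """
    counts = Counter(access_log)
    return {(LEARNER, resource): n for resource, n in counts.items()}


# Toy access log (made-up ids), in the order the resources were accessed.
log = [546614, 546652, 546614, 546876, 546614]
edges = build_learner_network(log)
for (u, v), w in sorted(edges.items(), key=lambda item: item[0][1]):
    print(u, "--", v, "weight", w)
```

Because every edge touches the learner node, the result is a star: there are no resource-to-resource edges, which matches the limitation the text goes on to describe.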

Samples of Constructed Social Networks Depicting Resources a Learner Interacted With
Based on the computed degree centralities of each network, we observed that all learners had a constant centrality value as they only have edges linking their nodes to the course nodes. Depending on the number of nodes in the social network graph, all virtual resources a learner accesses will have approximately the same centrality value, as there is only one edge linking each virtual resource node with a learner and no edges between any virtual resource nodes. The overall distribution of the virtual resource degree centrality values for the entire course cohort is shown in Figure 8, and the distributions of the virtual resource degree centrality values for the learners who graduated and did not graduate for courses CCC and GGG are displayed in Figures 9 and 10, respectively.
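The observation above follows directly from the star shape of each network. The sketch below assumes the normalised form of degree centrality (degree divided by the number of nodes minus one), which is the convention in common graph libraries; the text does not state which normalisation was used, so this is an assumption, as are the helper names and the made-up resource labels.

```python
def degree_centrality(adjacency):
    """Normalised degree centrality: degree / (number of nodes - 1)."""
    n = len(adjacency)
    if n <= 1:
        return {node: 0.0 for node in adjacency}
    return {node: len(neigh) / (n - 1) for node, neigh in adjacency.items()}


def star_graph(resources):
    """Adjacency sets for one learner linked to each accessed resource."""
    adjacency = {"learner": set(resources)}
    for r in resources:
        adjacency[r] = {"learner"}
    return adjacency


# A learner who accessed 16 distinct (made-up) resources.
resources = [f"r{i}" for i in range(16)]
centrality = degree_centrality(star_graph(resources))
print(centrality["learner"])  # 1.0
print(centrality["r0"])       # 0.0625
```

Under this assumption the learner node always scores 1.0, and every resource node scores 1/k for a learner who accessed k distinct resources, so the resource centrality is effectively a compact encoding of how many resources the learner used.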

Distribution of Degree Centrality for Graduating and Nongraduating Cohorts in Course GGG in 2014
The distributions in Figures 9 and 10 indicate that virtual resource access does not seem to be a key contributor to the differences between the graduating proportions of students in courses CCC and GGG. Furthermore, Figure 8 shows that the distributions of the virtual resource degree centrality values for both courses in both semesters are similar, with the mean degree centrality values ranging between 0.058 and 0.065.
By comparing the range of values in each distribution shown in Figures 9 and 10, we found that the virtual resource degree centrality values for learners who graduated had a narrower range compared with those of learners who did not graduate, regardless of the course they were enrolled in. This could be due to differences in the total virtual resources accessed by the learners in either group for the duration of the courses, which can be observed in the differences between the distributions of the total number of virtual resources accessed (Figures 11 and 12). To further analyse which label predictions had impacted the accuracy of the classifiers, we analysed the precision, recall, and F1-score of each classifier for each label in courses CCC and GGG, which are displayed in Tables 5 and 6, respectively. These performance metrics further support the fact that the degree centrality features perform well in predicting a learner's performance. Nevertheless, most of the classifiers reflected poor recall for learners who graduated from course CCC (63%-75%) and learners who did not graduate from course GGG (60%-71%). To identify what contributed to poor recall for the groups in courses CCC and GGG, we compared the distribution of the virtual resource degree centrality values of the wrongly classified learners in each group and also the distributions of the total number of virtual resources these learners accessed.
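For readers less familiar with the per-label metrics mentioned above, the following sketch computes precision, recall, and F1-score for each class from true and predicted labels. The label encoding (1 for graduated, 0 for did not graduate) and the toy label vectors are assumptions made for illustration and are not data from the study.

```python
def classification_report(y_true, y_pred, labels=(0, 1)):
    """Per-label precision, recall, and F1-score."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy labels (made up): 1 = graduated, 0 = did not graduate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
report = classification_report(y_true, y_pred)
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

Recall for a label is the share of learners truly in that group whom the classifier recovered, so a low recall for one group means many of its members were assigned to the other label even when overall accuracy looks acceptable.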

Using Degree Centrality Values to Predict Learner Performance
By comparing these distributions together with those in Figures 9-12, we found that most of the learners who were misclassified had virtual resource degree centrality values that were clustered around a particular range for both the graduating and nongraduating cohorts. Furthermore, the findings revealed that some learners who were wrongly classified had virtual resource degree centrality values that rarely occurred. All clusters and rare values of centrality values we observed are shown in Table 7.

Discussion
We trained and tested the chosen binary classifiers on the virtual resource degree centrality values and obtained an accuracy of 80%-85% on the test set of both courses across most of the binary classifiers. The overall level of performance we observed appears to be promising and encouraging.
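To make the train-and-test procedure concrete, here is a deliberately small stand-in: a one-feature threshold classifier (a decision stump) fitted on synthetic data. It is not one of the classifiers used in the study, which are not named in this section. The feature (the reciprocal of the number of distinct resources accessed, i.e. a resource node's centrality under a normalised star-graph assumption), the synthetic resource counts, the 80/20 split, and all function names are assumptions for illustration only.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def predict_one(x, threshold, polarity):
    # polarity 1: predict "graduated" (1) when x <= threshold; -1: the reverse.
    below = x <= threshold
    return int(below) if polarity == 1 else int(not below)


def fit_stump(xs, ys):
    """Pick the threshold and polarity that maximise training accuracy."""
    best = (0.0, None, 1)  # (accuracy, threshold, polarity)
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [predict_one(x, threshold, polarity) for x in xs]
            acc = accuracy(ys, preds)
            if acc > best[0]:
                best = (acc, threshold, polarity)
    return best[1], best[2]


# Synthetic data (made up): graduates tend to access more resources,
# so their centrality feature 1/k tends to be smaller.
rng = random.Random(0)
data = []
for _ in range(100):
    data.append((1.0 / rng.randint(15, 40), 1))  # graduated
    data.append((1.0 / rng.randint(2, 20), 0))   # did not graduate
rng.shuffle(data)

split = int(0.8 * len(data))
train, test = data[:split], data[split:]
threshold, polarity = fit_stump([x for x, _ in train], [y for _, y in train])

train_acc = accuracy([y for _, y in train],
                     [predict_one(x, threshold, polarity) for x, _ in train])
test_acc = accuracy([y for _, y in test],
                    [predict_one(x, threshold, polarity) for x, _ in test])
print(f"threshold={threshold:.4f} train={train_acc:.2f} test={test_acc:.2f}")
```

The two synthetic groups overlap on purpose: learners from both cohorts who share the same feature value cannot be separated by any threshold, which mirrors the clustering problem the discussion attributes to common behaviours across graduating and nongraduating learners.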
Our analysis of the classifiers' performance shows that the accuracy for the test set of CCC_2014 was poorest for most of the ensemble learning methods compared with the other course cohorts, and when the metrics for each classifier were compared and analysed, most of the binary classifiers performed rather poorly in recall for learners who graduated from course CCC and learners who did not graduate from course GGG.
Our initial assessment was that the classifiers would have been unable to classify some learners in the graduating and nongraduating cohorts because of inconsistencies in their virtual resource degree centralities.
We derived insight from analysing the distribution of the virtual resource degree centralities together with the distribution of the total resources accessed by learners who were wrongly classified. The binary classifiers we employed in this study were not able to separate clusters due to common behaviours among the graduating and nongraduating cohorts into the different groups of learners they consisted of during the training of each classifier, and there were very few instances of the rare centrality values to be trained upon. Therefore, the binary classifiers, regardless of whether they were based on supervised or ensemble learning, faced difficulties in classifying learners who had such virtual resource degree centrality values.
A limitation of our study is that the machine learning classifiers were unable to correctly predict whether some learners would or would not graduate due to anomalies in their online behaviour. If more data were available related to how a learner had interacted with a virtual resource and/or the amount of time spent with/on it, as well as more data indicative of the actual content in a resource, we would have been able to incorporate them into the construction of the social networks for a more accurate depiction of a learner's

Conclusion
In this study, we attempted to predict learners' academic performance based on their interactions with the resources provided in a course's virtual learning environment. SNA was performed to obtain the degree centrality of the virtual resources that learners interacted with. Using the virtual resource degree centrality value for each learner, supervised learning and ensemble learning binary classifiers were leveraged to predict whether a learner would graduate from a course.
The overall accuracy we obtained with the chosen binary classifiers on degree centrality values is promising.
The performance metrics for all classifiers revealed that the virtual resource degree centrality value is a viable feature to predict learner performance, which further implies that learners' interactions with virtual resources have a significant effect on their performance. This was true for both the courses we focused on in this study despite the differences in the proportions of graduating and nongraduating cohorts.
Instructors and course facilitators may make use of our framework to monitor a learner's learning based on their interactions with virtual resources at any point in the course, especially the SNA component, to help them visualise and understand a learner's online behaviour. The social network depicting the online behaviour of each learner is straightforward to understand, and instructors will be able to detect learners who are falling behind based on the size of their social network compared with other learners. This, coupled with data related to the learner's session activity with each virtual resource they interacted with, would greatly assist in promptly providing early intervention and support to learners who are performing below average. With more comprehensive data about resources and how learners interacted with them, the perspective of social networks may be shifted to better understand how learners are interacting with each resource and whether a learner is facing any difficulty with a resource based on their behaviour with it (e.g., less or more time spent on a resource compared with other learners, unusual interaction with a resource).
In addition, by leveraging machine learning to predict learner performance, instructors and course facilitators will be able to analyse and gain in-depth insights about common learner behavioural traits or anomalous behaviour in the past and how these affected learners' performance. At-risk learners could be identified early, and in-time intervention provided. Eventually, the success of a course could be improved, and subsequently, dropout risk could be reduced.
Our study primarily focused on implementing a framework for visualising the online behaviour of learners in courses that heavily rely on virtual environments to disseminate knowledge and assess the understanding of each learner. We also demonstrated that data derived from analysing learners' online behaviour can be used to predict whether learners can successfully complete a course. With the performance scores we obtained with our framework, along with the insights we gained into the gaps in our predictive models, we have gained a good idea of what works and what can be done to improve the predictive models.

In the future, we seek to improve the prediction models' performance by using a range of final result values instead of categorical labels to overcome the clustering in the virtual resource node centrality values among graduating and nongraduating learners. Further work may also include examining which virtual resources contribute most to a learner's performance. Finally, we aim to apply the proposed framework to other education-related data sets with more data on how students interacted with each other to better understand how this interaction affects their learning.

Figure 2
Structure of Data Set

Figure 5
Figure 5 illustrates the percentage of graduated and nongraduated learners for four different presentations: 2013B, 2013J, 2014B, and 2014J. More learners were able to graduate in 2013B compared with 2014B, and more learners were able to graduate in 2013J compared with 2014J.

Figure 8
Distribution of Degree Centrality for Entire Learner Cohort in Courses CCC and GGG in 2014

Figure 11
Distribution of the Total Number of Virtual Resources Accessed for Graduating and Nongraduating Cohorts in Course CCC in 2014

Figure 13
Distribution of Virtual Resource Degree Centralities for Wrongly Classified Learners in Course CCC in 2014

Figure 1
Main Research Processes

Table 1
Assignment of Class Labels to the Final Result

material. These resources can be in the form of Hyper Text Markup Language (HTML) pages, Portable Document Format (PDF) files, or some other form of media. A basic summary of the courses, the learners in each course, and virtual resources in the VLE for each course across all presentations are presented in Table 2. The courses were not offered if they had zero learners and resources in a particular presentation.

Table 2
Number of Learners and Virtual Resources per Course (in Each Semester or Presentation)
Note. STEM = science, technology, engineering, and mathematics.

Table 3
Summary of Data for Social Networks of Each Learner in Courses CCC and GGG in 2014

Table 4 summarises the accuracy obtained with the training and test data sets for each course. The binary classifiers we had trained performed well with the virtual resource degree centrality values as features; the accuracy we obtained with them was primarily at least 80% for both the training and test sets. No disparity existed in accuracy between the supervised learning classifiers and the ensemble learning classifiers. However, the accuracy for the CCC_2014J test set was the poorest (around 70%-72%) for most of the ensemble learning methods.

Table 5
Classification Report for Predicting Performance in Course CCC with Virtual Resource Degree Centralities

Table 7
Clusters and Rare Values Observed in Wrongly Classified Data

online behaviour, and the classifiers could have distinguished and predicted learners who graduated from those who did not. Such data could also provide more insights into factors affecting learners.