Using Educational Data Mining Techniques to Identify Profiles in Self-Regulated Learning: An Empirical Evaluation

With the increased emphasis on the benefits of self-regulated learning (SRL), it is important to make use of the huge amounts of educational data generated from online learning environments to identify the appropriate educational data mining (EDM) techniques that can help explore and understand online learners’ behavioral patterns. Understanding learner behaviors helps us gain more insights into the right types of interventions that can be offered to online learners who currently receive limited support from instructors as compared to their counterparts in traditional face-to-face classrooms. In view of this, our study first identified an optimal EDM algorithm by empirically evaluating the potential of three clustering algorithms (expectation-maximization, agglomerative hierarchical, and k-means) to identify SRL profiles using trace data collected from the Open University of the UK. Results revealed that agglomerative hierarchical was the optimal algorithm, with four clusters. From the four clusters, four SRL profiles were identified: poor self-regulators, intermediate self-regulators, good self-regulators, and exemplary self-regulators. Second, through correlation analysis, our study established that there is a significant relationship between the SRL profiles and students’ final results. Based on our findings, we recommend agglomerative hierarchical as the optimal algorithm to identify SRL profiles in online learning environments. Furthermore, these profiles could provide insights on how to design a learning management system which could promote SRL, based on learner behaviors.


Introduction
The increased adoption of technology to enhance learning along a continuum that ranges from physical classrooms to online learning has opened valuable opportunities for decision makers in institutions of learning. The current coronavirus pandemic has also forced many institutions of higher learning to adopt online teaching and learning, resulting in many new datasets being generated. These datasets can be used to understand how to enhance learning pedagogies such as self-regulated learning (SRL) (Coman et al., 2020). Machine learning offers the potential to explore educational data to detect learner profiles that can be used to provide targeted interventions to online students. The behavior of students in online learning environments can be measured from log data that contains page views, access to learning materials, frequency and duration of logins, assignment submission deadlines, number of clicks on learning materials, number of forum posts by students, and quiz and assignment scores (Aljohani et al., 2019; Alshabandar et al., 2018; Barnard et al., 2010; Kuzilek et al., 2017; Lodge & Corrin, 2017).
Over the last three decades since the recognition of SRL, there has been emphasis on the importance of SRL skills in relation to academic achievement. SRL is a process through which students manage their learning while being guided by their own motivation, behavior, and metacognition. Students with high levels of SRL skills are able to play an active role in achieving their academic goals (Klug et al., 2011; Pintrich, 2004). Learners who employ SRL strategies such as time management, help-seeking, and self-monitoring perform better than those who do not (Broadbent & Poon, 2015). The identification of SRL profiles in online learning has been based mostly on data collected using student self-report tools (Barnard et al., 2010; Broadbent & Fuller-Tyszkiewicz, 2018; Valle et al., 2008; Yot-Domínguez & Marcelo, 2017). These self-report tools include the Online Self-Regulated Learning Questionnaire (Barnard et al., 2010), the Motivated Strategies Learning Questionnaire (Broadbent & Fuller-Tyszkiewicz, 2018; Valle et al., 2008), and the Survey of Self-Regulated Learning with Technology at the University (Yot-Domínguez & Marcelo, 2017). Although self-report tools are easy to implement when measuring SRL, students tend to overestimate their skills, and hence the tools may fail to capture the actual learning behaviors exhibited during an online course (Araka et al., 2020; Gašević et al., 2017). Learners may also fail to recall the strategies they use during learning, as self-report tools are employed before or after the learning process (Broadbent & Fuller-Tyszkiewicz, 2018; Elsayed et al., 2019). Literature reveals that researchers rely on both trace data collected from educational systems such as learning management systems (LMSs) and self-report data (Ainscough et al., 2019; Çebi & Güyer, 2020; Gašević et al., 2017; Kim et al., 2018). Using trace data to measure SRL strategies has been viewed as unobtrusive since the tools are deployed without learners being aware and, therefore, they do not affect learners' engagement behavior and performance (Schraw, 2010). Educational data mining (EDM) techniques therefore are likely to measure and profile learners more accurately as compared to self-report tools, as they use actual datasets collected from online learning environments. EDM is becoming extremely valuable for educators and decision makers, especially in higher education institutions, as it provides great opportunities for exploring huge datasets already stored in many learning environments. EDM has made it possible to detect students' online learning behavior (Khanna et al., 2016; Siemens & Baker, 2012; Winne & Baker, 2013). With EDM techniques being part of machine learning algorithms, there is a need for an empirical analysis to establish the optimal values of parameters and the best algorithm to use with educational data.

Recent studies have investigated the measurement and promotion of SRL on massive open online courses (MOOCs) (Kizilcec et al., 2017; Maldonado-Mahauad et al., 2018; Wong et al., 2020). However, there is little evidence to show how university and college students self-regulate when engaging in open and distributed learning (ODL) using LMSs, which are commonly used to facilitate distance learning in higher education (Araka et al., 2020). In view of this, the current study investigates SRL profiles using a dataset collected from the Open University, UK so as to allow for more research on the relationship between students' learning behaviors and academic performance. Moreover, the study seeks to inform researchers, educators, and designers of online learning environments about the optimal EDM techniques that can be used to design and provide targeted interventions for ODL students.
The profiling of learners into groups based on students' SRL skills has been done using step-wise cluster analysis (Ainscough et al., 2019; Çebi & Güyer, 2020; Valle et al., 2008; Yot-Domínguez & Marcelo, 2017), a K-means clustering algorithm (Li et al., 2018), latent class analysis (Barnard et al., 2010), and an agglomerative hierarchical clustering algorithm (Gašević et al., 2017). Our review of literature revealed that different data mining techniques vary in their performance depending on the source of the dataset and type of e-learning environment. For example, EDM techniques used to measure and promote SRL for MOOCs are different from those used in LMSs. Moreover, there is a lack of evidence showing which algorithm performs better in identifying SRL profiles from data collected from an LMS. In view of this, the current study explored the appropriate EDM algorithm that could be used to profile online learners and group them into appropriate clusters so as to allow for the provision of interventions geared towards supporting SRL.
Specifically, the study was guided by the following research questions:
1. What EDM techniques are currently being used to identify SRL profiles in online learning environments?
2. What EDM algorithm is optimal in identifying SRL profiles in online learning environments?
3. What SRL profiles can be identified from students who engage in online learning?
4. How are the SRL profiles identified from an online learning dataset associated with students' final results?
In this paper, the literature review section discusses previous studies on the profiling of learners according to their SRL strategies. Next, the methodology used to address the research questions is outlined. A review of the current EDM techniques being used to identify SRL profiles follows. Then, experimental evaluation of the EDM algorithms identified from the review is presented. The results section offers the findings of the experimental evaluation. Finally, the conclusions and future implications of the study are discussed.

Literature Review
Current research has shown that data mining techniques can be used to enrich decision making in different domains such as finance, healthcare, and e-commerce by transforming raw data into information (Madni et al., 2017). Educational data mining is also critical in analyzing data to improve pedagogical aspects of teaching and learning (Coman et al., 2020). Open and distributed learning has grown tremendously and been adopted by institutions of higher learning (Saadati et al., 2021). Students who engage in online learning, especially in higher education, are supposed to play an active role in the learning process. However, literature reveals that students, individually or collectively, do not regulate their own learning (Cerezo et al., 2016; Dabbagh & Kitsantas, 2005). Additionally, online learners are not directly supported by instructors as compared to their counterparts in traditional face-to-face learning. Consequently, there is a need to provide support for SRL and student engagement that is geared towards enhancing self-regulatory skills (Silvola et al., 2021). In view of this, there is a need to examine how learners behave and engage in online learning so as to establish the right interventions to be provided to the learners. Student engagement in online learning, especially its behavioral and cognitive aspects, comprises observable elements that indicate how students participate and get involved in their learning activities (Silvola et al., 2021). SRL, on the other hand, is concerned with learners being proactive in their learning, taking their own initiative to control their learning by setting academic goals and identifying strategies to achieve those goals (Azevedo, 2009; Zimmerman, 1990). In the current study, student level engagement behavior in online learning activities is therefore an indicator of SRL level. For instance, a highly active student, identified through the number of resources accessed and the learning activities engaged in, is in control of the learning process and therefore exhibits a high level of self-regulatory behavior. Students' engagement behaviors and learning patterns in online learning environments, such as an LMS, can be measured using trace data. The dataset features may include content or page views, frequency of logins, access to learning materials, forum posts by students, and quiz and assignment scores (Araka et al., 2020).
Previous studies indicate that distinct profiles of SRL exist among students who engage in online and blended learning. The profiles can be identified using EDM methods applied to self-report data, trace data, or both. For example, Barnard et al. (2010) used latent class analysis to identify five profiles of self-regulators: super self-regulators, competent self-regulators, forethought-endorsing self-regulators, performance/reflection self-regulators, and non/minimal self-regulators. The algorithm was applied to data collected using a self-report online questionnaire known as the Online Self-Regulated Learning Questionnaire (OSLQ) (Barnard et al., 2010).
In another study, Li et al. (2018) analyzed trace data that comprised logs related to access of learning materials, completion of quizzes, and answer logs to develop profiles in SRL. From the data, various behaviors were measured, including number of completed quizzes, total access time, reviewing time, scores of completed quizzes, anti-procrastination and irregularity of study interval, and pacing (Li et al., 2018). The k-means clustering algorithm was applied to the data and four distinct clusters were identified: early completers, late completers, early dropouts, and late dropouts. However, the data only comprised assessment data, which did not indicate student interactions with the course. The students' activities were limited to listening and reading, and this may not reflect actual learner behaviors in an online learning environment. Ainscough et al. (2019) used a mixed approach that included both trace data and self-report data to profile online learners into three clusters: high self-regulators, medium self-regulators, and low self-regulators. While trace data was used during the analysis, the SRL skills that were identified were based on self-report data collected from learners in various stages during the study period. A two-step cluster analysis was used to group the learners. The first step was the pre-cluster formation. In the second, the hierarchical clustering algorithm was used to merge the pre-clusters, leading to the three distinct groups (Ainscough et al., 2019). The trace data used in the study comprised average word count for each meta-learning question, submission time for the meta-learning tasks, and completion rate of the tasks.
Finally, Çebi and Güyer (2020) presented various learning activities to students using the Moodle LMS. The learning activities included tutorials, video, concept maps, exercises, and summary, highlight, and forum activities. The data were collected from three sources: self-report data, trace data, and assessment data. Cluster analysis involving hierarchical clustering and k-means was used to identify three clusters. The study, however, was limited to only three weeks and a single course and, therefore, the researchers may not have had the opportunity to properly observe behavior change in learners as far as SRL is concerned.
Table 1 presents a summary of the rest of the studies that we reviewed.
Note. SR = self-report. TD = trace data. AD = assessment data.

Methodology
To address the research questions in the current study, we used a mixed-methods approach. First, a systematic review of the literature on current EDM techniques used to profile SRL was carried out. The review followed five steps of systematic review methodology (Khan et al., 2003). The review stages included (a) framing the research questions, (b) identifying relevant literature, (c) setting the articles' assessment criteria, (d) presenting review results, and (e) discussing the results. This review formed the foundation for the second study, which involved experimental evaluation of EDM algorithms in order to establish the optimal algorithm to identify SRL profiles from a dataset obtained from the Open University in the UK. Finally, correlation analysis was used to identify the association between the SRL profiles and students' academic performance.

Review of Educational Data Mining Techniques Used in Profiling SRL
The reviewed articles in this study were iteratively searched from international journals and databases which included Google Scholar, SCOPUS, Science Direct, Elsevier, ERIC, IEEE Xplore, and ACM digital libraries. The articles were searched using keywords: "educational data mining techniques" AND "learner analytics" AND "measurement of self-regulated learning" AND "assessment of self-regulated learning" AND "clickstream data" AND "student behaviors" AND "online learning" AND "self-regulated learning profiles." A total of 72 papers were identified. After reading the full text of each article and applying the inclusion criteria described in Khan et al. (2003), 48 papers were removed. A total of 24 papers, 12 journal articles and 12 conference articles, met the inclusion criteria. A summary is presented in Table 2.

Inclusion Criteria
There were four inclusion criteria used to obtain relevant literature for the systematic review: (a) articles that used EDM or learning analytics (LA) techniques for measuring SRL in online learning environments; (b) articles that described machine learning experiments using trace data obtained from higher institutions of learning; (c) articles that described experiments using self-report data integrated with trace data to construct models for measuring SRL; and (d) articles that described software application(s) that implemented EDM algorithm(s) for SRL measurement.

Systematic Review Results
In this section, we present a review of the literature on current EDM techniques used to group learners into various SRL profiles according to their behavioral interactions in online learning environments. Table 2 presents a summary. The EDM algorithms identified from the review can be categorized into clustering algorithms, temporal data mining, and other techniques that include natural language processing (NLP) and classification. These EDM categories are discussed in this section.

Clustering Algorithms
Clustering algorithms represent the class of unsupervised machine learning techniques that group learners based on similar interaction behaviors inferred from log data. Several clustering algorithms have been identified in this study, including expectation-maximization, K-means, and agglomerative hierarchical.
Expectation-maximization (EM) has been used to identify SRL behaviors and profile learners into various groups based on interaction behaviors. For example, Bouchet et al. (2013) [...] Matcha et al. (2019) investigated how EM can cluster students based on learning sequences, which were also used to identify the SRL strategies based on the indicators learners used. The agglomerative hierarchical algorithm was utilized to identify learning patterns from the SRL strategies identified from the clusters (Matcha et al., 2019). In this study, various learning behaviors were identified: reading-oriented students, exercise-oriented students, and students oriented toward MCQs and video. Other students exhibited diverse behaviors such as the use of exercises, video views, and MCQs in learning. Three groups of learners were identified: high-, moderate-, and low-level SRL engagers.
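The alternation between the expectation and maximization steps can be illustrated with a minimal sketch of a one-dimensional, two-component Gaussian mixture. The click counts, the initialization at the data extremes, and the function names `gaussian_pdf` and `fit_em` are hypothetical and chosen only for illustration; they are not taken from any of the reviewed studies.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_em(values, n_iter=50, min_var=1e-3):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    # Assumed initialization: components start at the extremes of the data.
    means = [min(values), max(values)]
    overall_mean = sum(values) / len(values)
    overall_var = sum((v - overall_mean) ** 2 for v in values) / len(values)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for v in values:
            dens = [weights[k] * gaussian_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, min_var)  # guard against a collapsing component

    # Hard labels: assign each observation to its most responsible component.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, labels


clicks = [10, 12, 11, 9, 13, 100, 105, 98, 102, 95]  # hypothetical click counts
means, labels = fit_em(clicks)
print(means, labels)
```

Unlike k-means, EM produces soft responsibilities before the final hard assignment, which is why it can represent learners whose behavior sits between two groups.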
The K-means clustering algorithm was used in a number of studies. The K-means algorithm iteratively divides a given dataset into a given number of distinct clusters. The value of k therefore represents the number of dissimilar clusters identified from a dataset. The data points in each cluster are similar to each other and dissimilar from data points in other clusters (Nuankaew et al., 2019). In their study, Zheng et al. (2020) employed the K-means clustering algorithm to identify profiles in SRL for learners taking STEM courses in engineering design. In this study, principal component analysis was used to reduce the high dimensionality of the data (Zheng et al., 2020). Because K-means requires the number of clusters to be specified in advance, the ball statistic was used to establish the optimal number of clusters. The clusters identified in that study included competent self-regulated learners, minimally self-regulated learners, cognitive-oriented self-regulated learners, and reflective self-regulated learners. However, the study had limitations. For one, the indicators of the SRL were based on an Energy 3D learning environment that is specifically used by engineering students. The study therefore may not be applicable across other non-engineering courses and programs. Similarly, Valdiviezo et al. (2013) used the k-means algorithm to identify three clusters: high, medium, and minimal access and usage levels, based on students' online interaction behaviors from virtual learning interaction (VLI) data from the Moodle LMS. The highest level of self-regulated learners, according to the study, were those students who had the greatest amount of interaction on forums, in terms of responding, viewing and adding discussions, quizzes, reading and writing messages, and accessing online learning resources. The k-means algorithm gives accurate results for similar experiments in the area of modelling student learning behaviors (Valdiviezo et al., 2013). Finally, Kizilcec et al. (2013) used k-means to identify groups of learners based on engagement behaviors as measured from trace data collected on a MOOC platform.
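The iterative partitioning described above can be sketched as follows. This is a minimal illustration of Lloyd's algorithm, assuming a deterministic initialization and a hypothetical two-feature dataset (logins and forum posts per student); the function name `kmeans` and all values are illustrative and do not come from the reviewed studies.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with a deterministic (assumed) initialization."""
    # Assumption: k points evenly spaced through the list serve as initial centroids.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = None

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels, centroids


# Hypothetical (logins, forum posts) pairs for nine students.
engagement = [(2, 1), (3, 0), (1, 1), (10, 5), (11, 6), (9, 4), (25, 15), (27, 14), (26, 16)]
labels, centroids = kmeans(engagement, k=3)
print(labels, centroids)
```

The value of k must be supplied by the analyst, which is why studies such as Zheng et al. (2020) pair the algorithm with a separate statistic for choosing the number of clusters.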
The agglomerative hierarchical algorithm, which helps to identify an unknown number of clusters given variables of interest from datasets, was also identified in the review. For example, Sun et al. (2016) investigated the effect of SRL on performance trajectory in a flipped classroom using the agglomerative hierarchical clustering algorithm. Six trajectory groups based on students' performance and trace data from interactions on the LMS were identified. The agglomerative hierarchical algorithm has also been used in other studies to identify distinct groups of learners based on their SRL variables as reported using the MSLQ self-report tool (Pardo et al., 2016, 2017). The groups were then used to investigate the association between the students' online activity interactions and academic performance. Additionally, agglomerative hierarchical clustering, based on Ward's method, was used to identify profiles of learners from trace data (Cicchinelli et al., 2018).
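A minimal sketch of agglomerative clustering with Ward's criterion follows. It is a naive implementation written for readability rather than speed, and the function name `ward_clusters` and the data points are hypothetical. Each merge joins the two clusters whose union increases the within-cluster sum of squares the least.

```python
def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering that merges by Ward's criterion."""
    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members)
                for d in range(len(points[0]))]

    def ward_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > n_clusters:
        # Find the pair whose merge increases the variance the least.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    # Convert the list of member indices into one label per observation.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Hypothetical two-feature observations for eight students.
observations = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9), (20, 1), (21, 2)]
print(ward_clusters(observations, n_clusters=3))
```

Because the merge history forms a tree, the same run can be cut at any number of clusters, which is what allows the algorithm to be used when the number of groups is not known in advance. For larger datasets, a library routine such as SciPy's `linkage` with `method="ward"` would normally be preferred over this quadratic loop.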

Temporal Data Mining
Temporal data mining encompasses two main techniques: process mining and sequence mining. A process mining algorithm is used to describe the paths followed by learners in an online learning environment (Rodriguez et al., 2014). Sequence mining, on the other hand, is used to identify sequences of learning activities using learner interaction logs. The objective is to determine the path followed by online students and the frequency of the activities carried out by the students (Wong et al., 2019). Sequence mining and process mining have been used to identify learning paths, especially on MOOC platforms. Process mining is usually carried out before sequence mining. This helps generate process models that are based on students' time-stamped actions captured during the learning process. The sequence of learning actions that students perform during a learning episode helps in understanding the path followed by learners. The output is exploited for cluster analysis (Matcha et al., 2019).

In the review, several studies used process and sequence mining to investigate the presence of SRL strategies detected in trace data from both MOOCs and LMSs. For example, Cerezo et al. (2020) used process mining to measure the SRL process from students' interaction data generated from the Moodle LMS. The inductive miner algorithm was used to produce process models that demonstrated students' learning behaviors. The process models reproduced students' interaction on the LMS. In that study, the highly regulated students were found to have followed the learning paths suggested by the instructor. This group of learners also performed activities related to forum discussions. In a related study, Kinnebrew et al. (2013) used differential sequence mining to identify and classify learners into groups based on their behaviors. Sequence mining requires that the trace data, which contain student interaction logs that indicate students' learning patterns, are first transformed into a sequence of actions. In this study, sequence mining was used to identify frequent patterns from a set of sequences. The indicators captured by Betty's Brain, a software agent, included read, edit, query, explain, and quiz. The algorithm analyzed the sequence of actions and classified learners into three groups: high, low, and medium engagers. Likewise, Maldonado-Mahauad et al. (2018), in their study whose main objective was to identify learning interaction sequences, clustered students with similar behavioral characteristics. Process mining was used to first identify the learning paths followed by learners in a MOOC course. The interaction sequences that were used for exploratory analysis were later used for clustering of learners into profiles. For clustering, agglomerative hierarchical was used to cluster learners according to the interaction sequences they followed. Three groups were identified: sampling learners (low level SRL), as well as comprehensive learners and targeting learners, who exhibited similar SRL behaviors.
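The transformation of time-stamped trace data into action sequences, followed by a search for frequent patterns, can be sketched as below. This is a simplified frequent n-gram count, not the differential sequence mining of Kinnebrew et al. (2013) or the inductive miner used by Cerezo et al. (2020). The log rows, the student identifiers, and the function names are hypothetical; only the action names echo the Betty's Brain indicators mentioned above.

```python
from collections import Counter, defaultdict


def build_sequences(log_rows):
    """Group time-stamped log rows into one ordered action sequence per student."""
    by_student = defaultdict(list)
    for student, timestamp, action in log_rows:
        by_student[student].append((timestamp, action))
    # Sort each student's events by time and keep only the action names.
    return {s: [a for _, a in sorted(events)] for s, events in by_student.items()}


def frequent_ngrams(sequences, n=2, min_support=2):
    """Return n-grams of consecutive actions found in at least min_support sequences."""
    support = Counter()
    for actions in sequences.values():
        grams = {tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)}
        support.update(grams)  # count each pattern once per student
    return {gram: count for gram, count in support.items() if count >= min_support}


# Hypothetical rows: (student id, timestamp, action).
log = [
    ("s1", 1, "read"), ("s1", 2, "quiz"), ("s1", 3, "edit"),
    ("s2", 2, "quiz"), ("s2", 1, "read"), ("s2", 3, "read"),
    ("s3", 1, "query"), ("s3", 2, "read"), ("s3", 3, "quiz"),
]
sequences = build_sequences(log)
patterns = frequent_ngrams(sequences, n=2, min_support=2)
print(patterns)
```

The per-student support counts produced this way are the kind of output that can then be passed to a clustering algorithm, as in the studies that combined sequence mining with agglomerative hierarchical clustering.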

Other EDM Techniques
Other machine learning algorithms and statistical modeling were also applied on multimodal data to measure the SRL of online learners (Di Mitri et al., 2016, 2017; Trevors et al., 2016). Likewise, statistical modeling, such as association techniques, along with other techniques, such as confirmatory factor analysis, was applied. For example, Crossley et al. (2016) used natural language processing (NLP) tools to complement trace data with language properties in understanding learner behavior, especially from forum posts. The indices of NLP that were used included text length, social collaboration, sentiment analysis, text cohesion, syntactic complexity, lexical sophistication, and quality of writing. Classification techniques have also been used to categorize learners according to their learning patterns. For example, logistic regression was used to classify learners into different demographic and underrepresented groups based on trace data collected from an LMS (Bosch et al., 2018). Statistical modeling and frequency analysis of learning activities were also performed so as to better understand various online learning behaviors. For example, Jansen et al. (2020) investigated the levels of compliance to the SRL interventions that were provided to learners by the MOOC. Neural network techniques have also been used to determine the extent to which students' learning paths conform to the pre-determined course structure. The page clickstream data was used, including the sequence of video interactions, assignment and quiz navigations, welcome page views, and discussion sessions (Yu et al., 2018).

Sources of Data and Feature Sets for Measuring SRL
As presented in Table 2, the sources of datasets and the feature sets used for profiling learners based on behavior patterns were also investigated. A majority of the studies used trace data collected from LMSs such as Moodle (Cerezo et al., 2020; Jo et al., 2016; Manzanares et al., 2017; Montgomery et al., 2019; Sun et al., 2016; Valdiviezo et al., 2013), and Canvas (Park et al., 2018; Rodriguez et al., 2019). In measuring SRL and identifying SRL profiles, some studies relied on trace data in MOOCs such as those offered at the Coursera website (Crossley et al., 2016; Jansen et al., 2020; Kizilcec et al., 2013; Maldonado-Mahauad et al., 2018; Wong et al., 2019). Other online learning environments included Energy 3D (Zheng et al., 2020), Betty's Brain (Kinnebrew et al., 2013), and LON-CAPA (Bosch et al., 2018). Moreover, datasets collected from agent-based software applications such as MetaTutor, an agent-based system purposely developed to promote SRL, were used to profile and cluster learners according to students' interaction behaviors in a virtual learning environment (VLE; Bouchet et al., 2013).
The findings reveal that the dataset features used for profiling and measuring SRL in online learning are determined by the type of e-learning environment from which the data was collected. For example, for studies that used LMS data, the indicators include forum-related activities such as posting and updating forums, viewing, and replying to other students' posts. Other learning activities considered are quiz events such as quiz completion status and submission time in relation to the set deadlines, course module views, writing and reading messages, and the frequency and regularity of student logins (Jo et al., 2016; Montgomery et al., 2019; Valdiviezo et al., 2013). For trace data from MOOCs, learning activities related to video interactions such as video views and reviews, quiz events, assignment attempts and reviews, and course completion status were considered (Jansen et al., 2020; Kizilcec et al., 2013; Maldonado-Mahauad et al., 2018; Wong et al., 2019). Some researchers used multimodal data to measure SRL (Di Mitri et al., 2016, 2017; Trevors et al., 2016).

Discussion on the Systematic Review
The main objective of the systematic review was to identify the EDM techniques that are currently being used to measure SRL using trace data from online learning environments. The results reveal that clustering algorithms are more commonly used as compared to temporal data mining and classification algorithms.
From the review, it can be established that SRL dataset features from online learning environments could potentially be influencing the type of algorithm used to profile learners based on their SRL skills. For example, it can be observed that process and sequence mining were mostly applied on datasets collected from MOOCs and PLEs where the feature sets considered were the video interaction events, quiz, and assignment type and submissions timelines (Kinnebrew et al., 2013; Matcha et al., 2019; Rodriguez et al., 2014; Wong et al., 2019). On the other hand, clustering algorithms were mostly applied on LMS data where the feature sets such as module and page views, login frequency and regularity, and assignment and quiz views and scores were mostly considered (Cicchinelli et al., 2018; Jo et al., 2016; Manzanares et al., 2017; Montgomery et al., 2019; Park et al., 2018; Sun et al., 2016; Valdiviezo et al., 2013).
From the review, it can be argued that there is no empirical evidence that shows which EDM algorithm for profiling SRL using online learning datasets is optimal. The experimental evaluation carried out in the next section was therefore conducted with the objective of establishing the optimal EDM algorithm for profiling learners according to their course interaction behaviors.

Experimental Evaluation
In this section, we describe the experiment carried out to compare the clustering algorithms identified from the systematic review. The algorithms identified from the literature review were compared to determine the best performing algorithm and the optimal number of clusters it forms. For research questions two and three, the identified algorithms were applied to a dataset collected from a virtual learning environment at the Open University in the UK to profile learners into clusters and also to test for any association between SRL profiles and academic performance.

Dataset Description and Preprocessing
The dataset collected from the Open University in the UK was used to identify the optimal clustering algorithm and the optimal number of clusters in online learning. The Open University Learning Analytics Dataset (OULAD) was chosen for this study as it represents students' actual behaviors in an online LMS as compared to other sets of data (Jha et al., 2019). The dataset contains three categories of student information: demographic, interactions in the form of logs, and assessments. The dataset is organized in tabular form with seven files. The data represents 22 courses and 32,593 students, their assessment results, and their interactions with a virtual learning environment (VLE) (Kuzilek et al., 2017). The current study used the dataset extracted from the studentInfo, vle, and studentVle tables (N = 735). The dataset represents students' interactions in one course offered in two semesters. The interactions are represented by the number of clicks/visits to specific learning resources and activities, such as course notes in the form of HTML pages and pdf files, and learning activities in the form of discussion forums and quizzes (Kuzilek et al., 2015). According to Kuzilek et al. (2017), the resources that were being accessed by the students included the course homepage, external and internal URLs, course subpages, resources, discussion forums, a glossary, collaboration tools, and course content. Table 3 presents a summary of the OULAD dataset and its features (Kuzilek et al., 2017). After feature extraction, which was done using id_student, code_module, and code_presentation as unique identifiers from three files that included studentVle, studentInfo, and courses, one file was generated containing 5 columns and 735 rows. The extracted file contained one course named AAA, which was offered in two separate semesters to two separate cohorts, one in 2013 and another in 2014, represented by 2013J and 2014J. Table 4 presents a summary of the sample dataset obtained for experimental evaluation. The sum of clicks captured students' interactions with various resources stored on the VLE. The clickstream data, which is also referred to as number/sum of clicks in this study, represents the number of interactions students made when accessing various learning activities and resources. The preprocessed data was then imported to a Python environment where various clusters were formed using the three algorithms: k-means, expectation-maximization, and agglomerative hierarchical. The algorithms were implemented for clustering and visualization in the RStudio environment where the statistical evaluations were computed.
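The feature extraction step can be sketched as a grouped sum followed by a join on the three identifiers. The rows below are hypothetical stand-ins shaped like the OULAD studentVle and studentInfo tables, and the choice of output columns (module, presentation, student, total clicks, final result) is an assumption, since the exact five columns of the extracted file are not listed in the text.

```python
from collections import defaultdict

# Hypothetical rows shaped like the OULAD studentVle table:
# (code_module, code_presentation, id_student, sum_click)
student_vle = [
    ("AAA", "2013J", 101, 4),
    ("AAA", "2013J", 101, 7),
    ("AAA", "2013J", 102, 1),
    ("AAA", "2014J", 201, 12),
    ("AAA", "2014J", 201, 3),
]

# Hypothetical rows shaped like the OULAD studentInfo table:
# (code_module, code_presentation, id_student, final_result)
student_info = [
    ("AAA", "2013J", 101, "Pass"),
    ("AAA", "2013J", 102, "Withdrawn"),
    ("AAA", "2014J", 201, "Distinction"),
]


def extract_features(vle_rows, info_rows):
    """Sum clicks per student and presentation, then attach the final result."""
    total_clicks = defaultdict(int)
    for module, presentation, student, clicks in vle_rows:
        total_clicks[(module, presentation, student)] += clicks

    features = []
    for module, presentation, student, result in info_rows:
        key = (module, presentation, student)
        # Students with no logged interactions get a click total of zero.
        features.append((module, presentation, student, total_clicks.get(key, 0), result))
    return features


rows = extract_features(student_vle, student_info)
for row in rows:
    print(row)
```

Keying on all three identifiers matters because the same student identifier can appear in more than one presentation of a course, and the click totals must not be pooled across cohorts.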

Experimental Procedure
First, the Python programming language was used to visualize scatterplots for the clusters formed by the three algorithms being compared, where the number of clusters was varied from 3 to 10 for each algorithm. Second, the clusters formed were compared using internal validation indices provided by the clValid (Brock et al., 2008) and NbClust (Charrad et al., 2014) R packages. The functions were used to compare the algorithms based on the internal information of the data by evaluating the "goodness" and quality of the clusters formed. The outputs from the evaluations were used to determine the optimal number of clusters and the best performing algorithm (Rodriguez et al., 2019; Van-Craenendonck & Blockeel, 2015). The clValid package uses the Dunn index, Connectivity, and the Silhouette index to establish the optimal number of clusters and the best performing algorithms (Brock et al., 2008). The NbClust package, on the other hand, determines the optimal number of clusters in the dataset using the results of 30 inbuilt indices (Charrad et al., 2014).
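To make the internal validation indices concrete, the following Python sketch computes two of them, the Silhouette index and the Dunn index, for one-dimensional data such as click totals. It is a dependency-free illustration, not the clValid implementation; Connectivity is omitted, and the points and labels are invented.

```python
def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette(points, labels):
    """Mean silhouette width; higher values indicate better-separated clusters."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(point - q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(abs(point - q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def dunn(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    groups = list(group_by_label(points, labels).values())
    min_separation = min(
        abs(p - q)
        for i, g in enumerate(groups)
        for h in groups[i + 1:]
        for p in g
        for q in h
    )
    max_diameter = max(max(g) - min(g) for g in groups)
    return min_separation / max_diameter if max_diameter else float("inf")


points = [1, 2, 3, 10, 11, 12, 30, 31, 32]
good_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # compact, well-separated clusters
bad_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2]    # clusters that interleave

good_silhouette = silhouette(points, good_labels)
bad_silhouette = silhouette(points, bad_labels)
good_dunn = dunn(points, good_labels)
print(good_silhouette, bad_silhouette, good_dunn)
```

Both indices reward compact, well-separated clusters, which is why the well-separated labeling scores higher than the interleaved one.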

Experimental Evaluation Results
In this section, experimental results for the three clustering algorithms are discussed. First, we examine the results of the three clustering algorithms. Second, the clustering evaluation carried out to determine the most appropriate algorithm with the optimal number of clusters is described. Last, we present the results of the test for independence between the optimal clusters and students' final academic achievement.
As presented in Figures 1, 2, and 3, the scatterplots demonstrate the clusters formed by the k-means, expectation-maximization, and agglomerative hierarchical algorithms while varying the number of clusters from 3 to 10.
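As an illustration of how one of the three algorithms forms clusters for a varying number of clusters, the sketch below implements a small agglomerative hierarchical procedure in plain Python. The linkage criterion is not stated in the text, so average linkage is an assumption, and the click totals are invented. It is a teaching sketch, not the implementation used in the study.

```python
def average_distance(a, b):
    """Average pairwise distance between two clusters of 1-D values."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def agglomerative(points, k):
    """Start with singleton clusters and merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented click totals with four visibly separate groups.
clicks = [100, 200, 300, 1500, 1600, 1700, 3500, 3600, 3700, 7000, 7100, 7200]

# Vary the number of clusters from 3 to 10, as in the experimental procedure.
results = {k: agglomerative(clicks, k) for k in range(3, 11)}
four_clusters = results[4]
print(four_clusters)
```

Because each run restarts from singleton clusters, the cost grows quickly with the number of points; a library implementation would be preferable for a dataset of several hundred students.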

Evaluation of Clustering Results
After the clusters were formed by the algorithms, an evaluation was carried out using the clValid R package, which compared the cluster results and gave the optimal scores for the best performing algorithm (Brock et al., 2008). The results are presented in Table 5.
Note. Optimal scores are in bold. For Connectivity, the lowest score identifies the optimal number of clusters; for the Dunn index and Silhouette, the highest score identifies the optimal number of clusters (Brock et al., 2008).
The results indicate that the agglomerative hierarchical algorithm is the best performing, with optimal scores of 8.4552 for Connectivity and 0.7111 for Silhouette at 3 clusters. However, the Dunn index proposes 4 clusters, with an optimal score of 0.0609. We also evaluated the clusters using the NbClust function. The NbClust function provides 30 internal validation indices that allow simultaneous evaluation of algorithms in order to determine the optimal number of clusters for a given dataset (Charrad et al., 2014). Of these 30 indices, seven proposed 3 as the optimal number of clusters, fifteen proposed 4 clusters, and two proposed 5 clusters. The rest of the indices, such as the Dindex and Hubert, gave graphical results, which also indicated that 4 clusters would be optimal. Based on the majority rule, we conclude that the best number of clusters in the dataset is 4.
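The majority rule amounts to a simple tally of the votes reported above. A minimal Python sketch, using only the counts given in the text (indices that produced graphical output are left out of the tally, and the variable names are illustrative):

```python
from collections import Counter

# Number of NbClust indices proposing each cluster count, as reported in the text.
votes = Counter({3: 7, 4: 15, 5: 2})

best_k, n_votes = votes.most_common(1)[0]
print(f"{best_k} clusters proposed by {n_votes} of {sum(votes.values())} voting indices")
```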

Self-Regulated Learning Profiles Identified from Students' Interaction Data
After the experimental evaluation of the clusters formed by agglomerative hierarchical clustering, it was revealed that the students' interaction data could optimally be categorized into four distinct clusters. The clusters seen in the dataset included:
a) Cluster 0: This cluster represented students whose number of clicks was 5,000 and over.
b) Cluster 1: This cluster represented students who had the second highest number of clicks. The range was approximately 2,500 to 5,000.
c) Cluster 2: This cluster denoted total clicks that ranged from 1,000 to 2,500.
d) Cluster 3: This cluster seemed to have similar characteristics to cluster 2 in general, and contained the lowest number of clicks, ranging from 0 to 1,000.
The classification of students into four profiles was based on behavioral activities that represented the number of resources accessed. The resources included the homepage, subpages, external and internal URLs, discussion forums, assignments, and course content. The SRL profiles were identified using the agglomerative hierarchical clustering algorithm. Using exploratory data analysis, the clusters formed were mapped onto four SRL profiles: exemplary self-regulators, good self-regulators, intermediate self-regulators, and poor self-regulators. These are illustrated on the scatterplot in Figure 4.
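The approximate click ranges reported for the four clusters can be summarized as a lookup. In the study the groups were derived by clustering, not by fixed cut-offs, so the thresholds in this Python sketch only approximate the reported ranges; how boundary values (exactly 1,000, 2,500, or 5,000) are assigned is an assumption.

```python
def srl_profile(total_clicks):
    """Map a student's total clicks to the SRL profile with the matching range."""
    if total_clicks >= 5000:
        return "exemplary self-regulator"      # Cluster 0: 5,000 and over
    if total_clicks >= 2500:
        return "good self-regulator"           # Cluster 1: about 2,500 to 5,000
    if total_clicks >= 1000:
        return "intermediate self-regulator"   # Cluster 2: 1,000 to 2,500
    return "poor self-regulator"               # Cluster 3: 0 to 1,000


for clicks in (350, 1800, 3200, 6400):
    print(clicks, srl_profile(clicks))
```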

Figure 4
Clusters Mapped on to SRL Profiles

The original dataset included the final results of the students. Based on these results, it was possible to identify the distribution of clusters among students who had passed with distinction, passed, failed, or withdrawn. The exemplary and good self-regulators had the highest number of clickstream interactions and performed the best in terms of final grades. The students in these two profiles generally had either a distinction or a pass in their final results. As presented in Figure 5A, among students who passed, 35.11% were intermediate self-regulators while 16.81% and 4.89% were good and exemplary self-regulators respectively. Among the students who passed with distinction, good and exemplary self-regulators represented the highest percentages, at 20.93% and 32.56% respectively, as illustrated in Figure 5B. The number of poor and intermediate self-regulators found among the students who had passed with distinction reveals that there could be other factors contributing to their academic performance. As shown in Figures 5B and 5C, the poor and intermediate self-regulators had a low to medium number of clickstream interactions. The majority of the students in these groups exhibited similar academic results: they either failed or withdrew from the course. It can also be observed that some students who were classified as good or exemplary self-regulators withdrew from or failed the course. This implies that there could be external factors that influenced their academic performance. Lastly, as shown in Figure 5D, among the students who withdrew, 73.33% were poor self-regulators while 19.05% were intermediate self-regulators.

Relationship Between the SRL Profiles and the Students' Final Results
The chi-square test was carried out to establish the relationship between the SRL profiles formed by the agglomerative hierarchical clustering algorithm and students' final results. A contingency table was computed from the distribution of students among the four clusters of SRL profiles and the four categories of the students' final results: passed with distinction, passed, failed, and withdrew. The computed p-value was approximately 8.99 × 10⁻²² (8.988568648725134e-22), which is 0.00 when rounded to two decimal places. Since this p-value is below the alpha value of 0.05, we can conclude that there is a significant relationship between the SRL profiles and the students' final results.
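The mechanics of the test can be sketched in plain Python. The contingency table below is invented for illustration and does not reproduce the study's counts. The helper names are hypothetical, and the p-value uses the standard closed-form series for the chi-square upper tail with integer degrees of freedom. For a 4 × 4 table, the degrees of freedom are (4 − 1)(4 − 1) = 9.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution for integer df."""
    if x <= 0:
        return 1.0
    if df % 2:  # odd df: normal tail plus a finite series
        chi = math.sqrt(x)
        density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
        total = math.erfc(chi / math.sqrt(2))
        term = chi
        for r in range(1, (df - 1) // 2 + 1):
            if r > 1:
                term *= x / (2 * r - 1)
            total += 2 * density * term
        return total
    # even df: exp(-x/2) times a finite series
    series, term = 1.0, 1.0
    for r in range(1, (df - 2) // 2 + 1):
        term *= x / (2 * r)
        series += term
    return math.exp(-x / 2) * series


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, p-value) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df, chi2_sf(statistic, df)


# Invented counts. Rows: poor, intermediate, good, exemplary self-regulators.
# Columns: distinction, pass, fail, withdrawn.
illustrative_table = [
    [2, 40, 30, 60],
    [8, 90, 20, 15],
    [10, 50, 5, 4],
    [12, 20, 2, 1],
]

statistic, df, p_value = chi_square_independence(illustrative_table)
print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.3g}")
print("significant at alpha = 0.05" if p_value < 0.05 else "not significant")
```

A significant result indicates that profile membership and final result are not independent; it does not by itself show the direction or cause of the association.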

General Discussion
In this research, two related studies were carried out. First, a review of the literature describing EDM techniques for identifying profiles in SRL was undertaken. The results from the review indicate that clustering is the most appropriate technique, preferred over other techniques such as temporal data mining, natural language processing, neural networks, and classification. In particular, clustering was most often the most appropriate technique when using online educational datasets from LMSs. These findings led us to conduct the second study, an experiment comparing three clustering algorithms: k-means, agglomerative hierarchical clustering, and expectation-maximization. The clustering algorithms were evaluated using internal validation measures to identify the optimal algorithm and number of clusters. The findings demonstrate that agglomerative hierarchical clustering is the best performing algorithm. These findings align with results from previous studies (Çebi & Güyer, 2020; Gašević et al., 2017). Cluster evaluation was carried out to establish the optimal algorithm with an optimal number of clusters. Using the NbClust function, where 30 inbuilt indices were used to simultaneously compare the clusters, fifteen indices proposed 4 clusters while seven indices proposed 3 clusters. Based on the majority rule, we concluded that the optimal number of clusters is four (Charrad et al., 2014). Furthermore, an exploration and analysis of the clusters formed by the optimal clustering algorithm, agglomerative hierarchical, indicates that four SRL profiles existed in the online dataset collected from a virtual learning environment. The four clusters were further examined and mapped onto four SRL profiles based on the learners' behaviors as inferred from the OULAD dataset.
The SRL profiles identified include exemplary self-regulators, good self-regulators, intermediate self-regulators, and poor self-regulators. The SRL clusters differed from each other in terms of the sum of clicks, which represents the clickstream interactions students had with online learning resources such as the course homepage, external and internal URLs, course subpages, resources, discussion forums, glossary, collaboration tools, and course content. Additionally, since the OULAD dataset included students' final results, it was possible to identify the distribution of each of the profiles among the students who had passed with distinction, passed, failed, or withdrawn. It was observed that the exemplary and good self-regulators had the highest number of clickstream interactions, i.e., above 2,500. The intermediate self-regulators had a medium number of clicks that ranged from 1,000 to 2,500, while poor self-regulators had the lowest number, i.e., below 1,000. The distribution of students in the various profiles also indicates that a majority of the poor and intermediate self-regulators either failed or withdrew from the course.
Finally, a test of independence to establish the relationship between the SRL profiles and the students' final results revealed a significant relationship between the two categorical variables. Profiling students according to their SRL skills helps instructors identify learners with similar interaction behaviors. These SRL profiles may be helpful in developing and providing customized and targeted interventions based on each group's characteristics.

Conclusion
Online learners differ in terms of the behaviors they exhibit during online learning. Identifying existing behavior groups will help educators provide targeted SRL interventions instead of offering one-size-fits-all treatments to students. While any algorithm can be applied to determine the number of clusters available in a given dataset, any algorithm may fail to identify the optimal number of clusters given differences in datasets. For example, datasets from educational environments differ from datasets obtained from other industries. Additionally, our review of literature revealed little knowledge exists about the most appropriate algorithm to use with datasets from online learning environments such as LMSs. This study sought to solve this problem from three perspectives: (a) the most appropriate EDM techniques being applied in identifying SRL profiles, (b) the best performing algorithm, and (c) the optimal number of SRL profiles available in trace data collected from an online learning environment.
The current study has provided insights into the identification of SRL profiles using EDM techniques such as clustering algorithms in online learning environments. The OULAD dataset was used for the experimental comparison of the algorithms. The findings revealed that it is now possible for SRL interventions to be targeted to the right groups, based on learners' behavioral characteristics. This will enhance students' SRL skills, which have been found to be poor in most online learners (Goda et al., 2020). Moreover, given the large number of students enrolling in online learning and the limited number of instructors, it will be necessary to use EDM techniques to identify SRL profiles, which can then be used to establish the nature and level of student interactions in online learning environments such as an LMS (Goda et al., 2020).
The findings from this study imply that EDM techniques offer great opportunities for researchers to use trace data collected from online learning environments to explore ways of supporting SRL. Profiling learners according to their SRL strategies will be a first step in providing targeted SRL interventions. The findings from this study offer insights into two areas. First, EDM techniques can be used to identify learner profiles in terms of SRL skills in open and distributed learning environments. Second, clustering students based on their levels of self-regulation provides a means of understanding where online learners are situated, so that guidance and support can be aligned to learners' needs and instructors can provide targeted interventions for each of the formed clusters. The results from this study also contribute to the measuring of SRL in online learning environments by giving insights into how to build machine learning models that can ultimately be used to provide SRL interventions.
The findings concerning the association between SRL profiles and students' final results were based on correlation analysis. The results may therefore have failed to reveal all the intervening factors that could have contributed to the success or failure of the online learners. It would therefore be interesting for future studies to consider variables other than clickstream interaction behavior that could affect the clusters.
Given that the current study did not consider specific SRL strategies such as time management, help-seeking, elaboration, and rehearsal, and how they could be inferred from the trace data, an empirical study could be carried out to profile learners based on specific SRL strategies and to examine how these strategies could be measured, monitored, and even promoted in an actual online learning environment (Araka et al., 2021). Finally, we propose that future studies could examine how targeted interventions could be designed to promote SRL strategies based on learner needs in each SRL profile. For example, it would be interesting

used EM to identify three clusters of learners from trace data derived from learner behaviors. Similarly, Manzanares et al. (2017) used EM to group learners into three clusters. Since the EM algorithm involves predetermining the number of clusters, Manzanares et al. (2017) used the bi-stage cluster node to determine the value of k. Additionally, Matcha et al. (

Figure 1

Table 1
Summary of SRL Profiles Identified From Previous Studies

Table 2
Algorithms Used to Measure SRL in Online Learning Environments

Table 4
Summary of the Preprocessed Sample OULAD Dataset for Module AAA

Table 5
Optimal Algorithm and Cluster Evaluation Results