Using Survival Analysis to Identify Populations of Learners at Risk of Withdrawal: Conceptualization and Impact of Demographics

High dropout rates constitute a major concern for higher education institutions, due to their economic and academic impact. The problem is particularly relevant for institutions offering online courses, where withdrawal ratios are reported to be higher. Both the impact and these high rates motivate the implementation of interventions oriented to reduce course withdrawal and overall institutional dropout. In this paper, we address the identification of populations of learners at risk of withdrawing from higher education online courses. This identification is oriented to design interventions and is carried out using survival analysis. We demonstrate that the method’s longitudinal approach is particularly suited for this purpose and provides a clear view of risk differences among learner populations. Additionally, the method quantifies the impact of underlying factors, either alone or in combination. Our practical implementation used an open dataset provided by The Open University. It includes data from more than 30,000 students enrolled in different courses. We conclude that low-income students and those who report a disability comprise risk groups and are thus feasible intervention targets. The survival curves also reveal differences among courses and show the detrimental effect of early dropout on low-income students, worsened throughout the course for disabled students. Intervention strategies are proposed as a result of these findings. Extending the entire refund period and giving greater academic support to students who report disability are two proposed strategies for reducing course withdrawal.


Academic withdrawal constitutes one of the biggest challenges in education, in particular for online higher education (OHE) institutions, where withdrawal ratios are reported to be higher (Bawa, 2016; Simpson, 2010). Aside from its macroeconomic impact, withdrawal frustrates students' expectations and, from their perspective, wastes time and money (Lee & Choi, 2013; Simpson, 2010).
These facts motivate institutions to design targeted interventions aimed at reducing withdrawal.
A critical first step towards a successful intervention is the accurate and reliable identification of learners at risk (Rienties et al., 2016). This identification is mostly understood in terms of prediction. Most research works focus on determining individual risk and on increasing prediction ratios rather than on understanding the reasons behind the risk. While determining if a particular student is at risk can be valuable, the essential issue when considering an intervention is identifying a common risk factor behind a group of learners who may constitute an intervention target.
Furthermore, timely execution is essential. Time plays a particularly relevant role when designing and implementing interventions oriented to reduce course withdrawal and overall university dropout. The moment when a student decides to abandon a course is critical in terms of the intervention design. At the course level, Simpson (2010) showed that 40% of new students at the Open University withdraw from courses before the first assignment. At the university level, Grau-Valldosera et al. (2019) showed that periods of non-enrolment could result in dropout, despite the intention to continue at the time of the break.
In both cases, it would be inefficient to implement interventions after the student has effectively dropped out.
When added to the relevance of time, the concept of population at risk (rather than individual at risk) makes us consider survival analysis as a suitable technique. However, a literature review revealed that research using this technique mainly focused on analysing university dropout (Cobre et al., 2019) or attrition in MOOCs (Rizvi et al., 2022; Xing et al., 2019) and was not linked to interventions. Our article focuses on the use of survival analysis as part of the intervention process, detecting populations of learners at risk of withdrawal at the course level in regular OHE courses. The method described will determine the significance and influence of a set of variables on course withdrawal, providing information to select intervention targets and coherent strategies. Survival curves will provide additional insight to help the intervention design.
Besides setting the conceptual framework, we performed a practical implementation based on an open dataset from a world-leading online university: The Open University Learning Analytics Dataset (OULAD; Kuzilek et al., 2017). This dataset contains data from more than 30,000 students enrolled in 22 online course editions from different disciplines, including the withdrawal date for students who abandon different courses. Based on these data, we analysed the impact of students' demographics on withdrawal, determining risk factors and quantifying their impact. Demographics have been identified as some of the causes behind withdrawal (Hachey et al., 2022; Muljana & Luo, 2019) and constitute key features for early dropout prediction in online environments (Radovanovic et al., 2021). Nonetheless, the proposed method is applicable to any other variable of interest, such as academic performance, background, or psychological features/traits that may impact it.

Literature Review

The Concept of Withdrawal
The analysis of dropout has long been present in educational literature, with 1900-1950 being considered the age of early development, broadening horizons in the 1990s, and showing rising interest in recent years.
Works by Tinto (1975) expounded upon one of the most relevant initial models explaining dropout in traditional education. The core of this theory is the student integration model, where persistence is explained by a student's motivation and ability to match the social and academic characteristics of the institution where she is studying. Years later, Bean (1985) introduced the student attrition model, which relies on the concept of behavioural intention, where dropout is conditioned by a mixture of academic, social-psychological, environmental, and socialisation factors.
These two theories and their combination in Cabrera et al. (1992) and later in Rovai (2003) have been at the core of subsequent studies on the topic. According to Rovai (2003), academic performance and dropout are a combination of student characteristics, student skills, external factors, and internal factors. These four make up the composite persistence model (CPM) and reflect the multivariate nature of dropout.
The term university dropout is commonly used to describe situations where students leave the university before obtaining a formal degree (Larsen et al., 2013). Behind this definition lies a complex phenomenon, evidenced by the list of related terms such as dropout, departure, withdrawal, failure, non-continuance or non-completion (Xavier & Meneses, 2020). Dropout is the opposite of retention, defined as "continued student participation in a learning event to completion, which in higher education is a course, program, institution, or system" (Berge & Huang, 2004, p. 3).
At the course level, most papers dealing with withdrawal do not provide a formal definition (77.78% according to a recent scoping review; Xavier & Meneses, 2020). In our research, we used the definition provided by the Open University as "cease studying a module without the intention to resume the study of that module" (Open University, 2022, p. 6).

Approaches for the Identification of Populations at Risk: Survival Analysis
The first stage of a correct intervention design is an accurate and reliable identification of learners at risk (Rienties et al., 2016). Surveys and different data mining techniques are typical approaches used in this identification. Prevalent techniques include decision trees and random forest (Behr et al., 2020), but a whole set of methods can be found in the literature (Xing et al., 2019). However, only a low percentage of studies make use of longitudinal data approaches and, in particular, survival analysis. Ameri et al. (2016) indicated that "there is only a limited attempt at using these methods in student retention problems" (p. 904). Xing et al. (2016) also showed that the performance of classical techniques used to predict dropout could be improved by accommodating temporal modelling approaches.
The use of survival analysis at the course level in the literature is focused on MOOC scenarios (recently Moreno-Marcos et al., 2019; Rizvi et al., 2022; Xing et al., 2019). The existing studies covering survival analysis in OHE all focus on analysing the semesters when students drop out from the university rather than withdrawal from within courses (Ameri et al., 2016; Cobre et al., 2019; Villano et al., 2018). Two of the studies (Ameri et al., 2016; Villano et al., 2018) focused more on comparing the prediction capability of survival methods to existing techniques. On the other hand, Cobre et al. (2019) tried to identify in which semesters students are most likely to drop out, applied in two different academic programmes in Brazil.
Although some studies (Ameri et al., 2016; Villano et al., 2018) highlighted its interpretation of results and its suitability for analysing underlying student issues and helping the design of interventions, none of the studies examined survival analysis itself. Moreover, to the best of our knowledge, none of the studies examined within-course withdrawal. Considering the importance of the moment of withdrawal as well as the method's longitudinal approach and interpretability, we consider it a suitable approach to designing targeted actions oriented to reducing withdrawal.

Influence of Demographics
Rovai's model indicates the relevance of a student's personal factors linked to dropout in online studies.
Focusing on online education, different compilations (Hachey et al., 2022; Lee & Choi, 2011; Muljana & Luo, 2019) investigated the relevance of these factors and showed a lack of consensus among the studies analysed. As noted by Lee and Choi (2011), "findings of many studies were incompatible with one another regarding the relationship between demographics and online students' persistence in online courses" (p. 603).
In particular, the correlation between gender and course withdrawal is unclear. Some works have indicated a relation, which can even depend on the field of study (Cochran et al., 2014). This work indicated that males showed higher withdrawal rates in courses linked to disciplines such as education or health, but lower in those related to business and math. A large number of studies, however, did not establish a correlation between gender and withdrawal (James et al., 2016; Strang, 2017).
Regarding age, OHE students are older than those in face-to-face learning environments. Once enrolled, older students would have a lower dropout rate (James et al., 2016). Other research, however, did not identify any age-related effects (Strang, 2017).
Prior academic achievement is linked to persistence in online learning (Lee & Choi, 2011) and can even be used for prediction (Hachey et al., 2014). Regarding socioeconomic status, it is considered a relevant factor (Hachey et al., 2022). When considering re-enrolment, having a full-time job and cost factors have a negative impact on retention (Grau-Valldosera et al., 2019). Specifically, students requiring financial aid to re-enrol show higher dropout (Cochran et al., 2014).
Few references can be found to the impact of disability. However, in a few studies, disability is cited by some students as a reason for withdrawal (Shah & Cheng, 2019).
Although several research works have used the OULAD dataset, none has been found covering demographics' role in withdrawal. The closest analysis found (Rizvi et al., 2019) considered the impact of these factors on academic outcomes in terms of pass-fail. This study reported that region, neighbourhood poverty level, and prior education constitute strong predictors of failure.

Research Questions
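Considering the lack of studies that analyse withdrawal at the course level in regular OHE with a longitudinal approach, the relevance of reliable identification of learners at risk, and the potential of survival analysis, we formulated this research question:
RQ1: How can survival analysis be used to identify populations of learners at risk of withdrawal at the course level, providing insight into the factors behind that withdrawal?
Additionally, considering both the relevance of time and the potential impact of demographics on withdrawal, we posed a second research question, addressing practical implementation:
RQ2: What is the specific impact of demographic factors over time on course withdrawal? Which of these factors impact the withdrawal regardless of the course itself?
Specifically, we decided to analyse the impact of these demographic characteristics based on the OULAD dataset: (a) age, (b) gender, (c) disability, (d) region, (e) previous academic background, and (f) student's economic situation. As mentioned, the OULAD dataset includes data from 22 editions of 6 different courses. Detailed information on the dataset is provided in the next section.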

Method

Survival Analysis
Survival analysis is "a collection of statistical procedures for data analysis where the outcome variable of interest is time until an event occurs" (Clark et al., 2003a, p. 237). The method is particularly used in medical research, where survival time or time to relapse is under consideration (Bradburn et al., 2003a, 2003b; Clark et al., 2003a, 2003b). The portability of the method to other disciplines has been suggested in recent studies (Emmert-Streib & Dehmer, 2019).
Kaplan-Meier (KM) estimates and, specifically, KM curves are common in most survival analyses when the goal is to compare two populations. They are the simplest way to compute survival over time (Clark et al., 2003a). KM estimates help to establish whether life expectancy differs between populations with different characteristics, or whether a specific treatment is more advisable than others. Linked to this estimation, the hazard function indicates the instantaneous risk of the event at a given time for those who have survived up to that time.
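To make the KM estimate concrete, the following is a minimal sketch in plain Python of the product-limit computation. The function name `kaplan_meier` and the four toy enrolments are illustrative assumptions, not part of the original study or of any library.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    durations: observed time for each subject (e.g., days enrolled)
    events:    True if the event (e.g., withdrawal) occurred at that time,
               False if the subject was censored (still enrolled)
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # every subject leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = event_counts.get(t, 0)
        if d:
            survival *= 1 - d / at_risk  # product-limit step
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical data: two withdrawals, two learners censored at day 250.
curve = kaplan_meier([30, 120, 250, 250], [True, True, False, False])
print([(t, round(s, 3)) for t, s in curve])  # [(30, 0.75), (120, 0.5)]
```

Censored learners still count in the risk set until they leave it, which is what distinguishes this estimate from a simple final withdrawal ratio.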
The statistical significance of the resulting curves can be checked with the log-rank test (Clark et al., 2003a).
This test compares the estimates of the hazard functions of the two groups at each observed event time under the null hypothesis that both groups share the same hazard functions. The original test assigns equal weight to early and late events. Modified versions use weighted functions. In particular, Peto-Peto's log-rank test (Peto & Peto, 1972) assigns weights depending on the estimated percentile of the failure time distribution, giving higher weight to earlier events, and is commonly used within this group.
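The two-group comparison described above can be sketched in plain Python as follows. This is an illustration under stated assumptions: the function name `weighted_logrank` is hypothetical, the Peto-Peto weight is taken as a pooled survival estimate with an n + 1 denominator (one common form; implementations differ in detail), and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def weighted_logrank(durations, events, groups, peto=False):
    """Two-group log-rank test; groups holds 0/1 labels.

    Returns (chi-square statistic, p-value) on 1 degree of freedom.
    With peto=True, earlier event times receive larger weights.
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    s_tilde = 1.0  # pooled survival estimate, used as the Peto-Peto weight
    for t in event_times:
        n = sum(1 for u in durations if u >= t)  # at risk, both groups
        n1 = sum(1 for u, g in zip(durations, groups) if u >= t and g == 1)
        d = sum(1 for u, e in zip(durations, events) if e and u == t)
        d1 = sum(1 for u, e, g in zip(durations, events, groups)
                 if e and u == t and g == 1)
        s_tilde *= 1 - d / (n + 1)
        w = s_tilde if peto else 1.0
        obs_minus_exp += w * (d1 - d * n1 / n)  # observed minus expected in group 1
        if n > 1:
            variance += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2))  # chi-square(1) upper tail


# Hypothetical data: group 1 withdraws early, group 0 late.
durations = [5, 6, 7, 8, 9, 50, 60, 70, 80, 90]
events = [True] * 10
groups = [1] * 5 + [0] * 5
print(weighted_logrank(durations, events, groups))
print(weighted_logrank(durations, events, groups, peto=True))
```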
However, KM estimates cannot quantify the impact of a given parameter, particularly when dealing with different variables, i.e., the covariates. When this is required, parametric methods must be used. Fully parametric methods need to assume a statistical distribution in the data. If this distribution is known, they can provide more precise models. Semi-parametric methods have the advantage of being able to quantify the impact without assuming a specific distribution. The most used semi-parametric method is the Cox proportional hazards model (Bradburn et al., 2003b). This model is based on a proportional hazard assumption and computes a baseline time-dependent hazard associated with a reference group. This hazard is modified based on the multiplicative effect of the values of the different covariates, whose individual influence is considered constant over time. Once the model is computed, the assumptions need to be checked.
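In its standard textbook form, the Cox model described above can be written as follows. The symbols are notation assumed here for illustration and do not appear in the original text: h_0(t) is the baseline hazard of the reference group, x_1, ..., x_p are the covariate values, and beta_1, ..., beta_p are the fitted coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\left(\beta_1 x_1 + \dots + \beta_p x_p\right),
\qquad
\mathrm{HR}_k = \frac{h(t \mid x_k = 1)}{h(t \mid x_k = 0)} = e^{\beta_k}
```

For a dummy covariate x_k, the hazard ratio HR_k does not depend on t, which is the proportional hazards assumption. A value above 1 indicates a higher risk of the event than in the reference group, and a value below 1 a lower risk.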

Porting Survival Analysis to Withdrawal Analysis
Approaching a generic problem through survival analysis requires a precise mapping of three concepts: the lifespan, the event under consideration, and the period of observation (Clark et al., 2003a).
In the case of withdrawal, the number of days a student remains enrolled after the specific course starts constitutes the lifespan. The event under consideration is the withdrawal decision. The analysis would also need to monitor on a periodical basis whether the student has withdrawn. To set up a common reference among courses, the course start date would be considered as t = 0. Negative values indicate days before the course starts. Survival curves reflect how a population survives after a certain time. Figure 1 depicts these concepts in a hypothetical course lasting 250 days with four students enrolled. KM plots provide a graphical view of the individual impact of specific covariates. To aggregate and quantify the impact of those found relevant, we used the Cox proportional hazards model, due to its simplicity compared to parametric methods.
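The mapping of lifespan, event, and observation period can be sketched as a small conversion step. This is an illustration only: the function name `to_survival_record` is hypothetical, the withdrawal day is assumed to be stored as days relative to the course start (as in the OULAD `date_unregistration` field), and treating learners who never withdraw as censored at the course end is an assumption about the setup, not a description of the authors' code.

```python
def to_survival_record(withdrawal_day, course_length):
    """Map one enrolment to a (duration_in_days, withdrew) pair.

    withdrawal_day: day of withdrawal relative to course start (t = 0),
                    negative if before the start, None if the learner
                    never withdrew
    course_length:  length of the observation period in days
    """
    if withdrawal_day is None or withdrawal_day > course_length:
        # No withdrawal observed within the course: censored at course end.
        return course_length, False
    return withdrawal_day, True


# Hypothetical 250-day course with four enrolments.
enrolments = [None, 40, -5, 180]
records = [to_survival_record(day, 250) for day in enrolments]
print(records)  # [(250, False), (40, True), (-5, True), (180, True)]
```

Pairs of this shape are the usual input for KM estimates and for fitting a Cox model.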

Dataset
These concepts were translated into practice using the public dataset offered by The Open University (OU; Kuzilek et al., 2017). This dataset provides information about 22 editions (presentations in the dataset nomenclature). A total of 32,593 students are enrolled in these courses. The typical presentation length is around nine months.
Courses included in the dataset were offered via a virtual learning environment (VLE), and each had over 500 students. Although the courses are part of the OU course portfolio, students without a previous academic background could also enrol. The high withdrawal ratios in these courses may be explained by the fact that they are regular OU courses with high academic standards that, at the same time, require no prior qualification for enrolment. All courses share a common framework for evaluation, including a set of tutor-marked assignments and optionally some computer-marked assignments. Also, there is usually a final exam at the end of each course.
With respect to those students withdrawing, the dataset includes information regarding the date of withdrawal. This date is either the date on which the student notified the university of her withdrawal or the date on which the student's participation in the module ceased, whichever came first. The Open University actively seeks to reduce withdrawal and may monitor online student activity to detect it. Students considering withdrawal are advised to contact the module instructor and, if their decision is final, formally report their decision (Open University, 2022).
The dataset also includes some personal information. Table 3 summarises those characteristics in the dataset considered relevant to our study. This information was required to approach RQ2. Age, gender, and disability are available directly in the dataset. Previous academic background is expressed as the highest educational level the student achieved before the module started. Region indicates the area where the student lives. Student economic situation is expressed by the index of multiple deprivation (IMD) used in the UK (Kuzilek et al., 2017; Rizvi et al., 2019).
The dataset presents IMD figures in bands ranging from 0%-10% to 90%-100%; 0%-10% means that a student lives in the most deprived UK areas, while 90%-100% points to the least deprived areas.

Results
The preceding section identifies two main steps for practical implementation:
1. Use KM estimates to determine populations at risk and the impact of individual covariates on withdrawal.
2. Analyse the combined impact, quantifying the simultaneous effect through the Cox proportional hazards model.
The Cox model requires a prior setup of reference values for the covariates. For the categorical variables shown in Table 3, we generated dummy variables and considered the values shown in Table 4 as reference values.

Disability No
Note. IMD = index of multiple deprivation.
When the number of possible values was high, we selected reference values that reflected a more central position (e.g., IMD band = 50%-60%).For the specific IMD scales, we grouped low IMD scales (0%-30%) and high IMD scales (above 80%) to reduce the overall number of values.
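The grouping and reference-value setup described above can be sketched as follows. The helper names `group_imd` and `dummy_code`, the band label format, and the `imd_` column prefix are assumptions made for illustration. Only the grouping rule (0%-30% as low, above 80% as high) and the 50%-60% reference band come from the text.

```python
LOW_BANDS = {"0-10", "10-20", "20-30"}   # 0%-30%, grouped as "low"
HIGH_BANDS = {"80-90", "90-100"}         # above 80%, grouped as "high"
REFERENCE_BAND = "50-60"                 # reference value for the Cox model


def group_imd(band):
    """Collapse an IMD band label such as '0-10%' into low / high / itself."""
    key = band.replace("%", "").strip()
    if key in LOW_BANDS:
        return "low"
    if key in HIGH_BANDS:
        return "high"
    return key


def dummy_code(values, reference):
    """One 0/1 column per level except the reference level."""
    levels = sorted(set(values) - {reference})
    return [{f"imd_{level}": int(v == level) for level in levels} for v in values]


# Hypothetical band labels for four students.
bands = ["0-10%", "50-60%", "90-100%", "30-40%"]
grouped = [group_imd(b) for b in bands]
rows = dummy_code(grouped, REFERENCE_BAND)
print(grouped)  # ['low', '50-60', 'high', '30-40']
print(rows[1])  # {'imd_30-40': 0, 'imd_high': 0, 'imd_low': 0}
```

A student in the reference band has all dummy columns set to 0, so each fitted coefficient is read relative to that band.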

Significant Differences Based on IMD Band, Prior Education, and Declared Disability
Covariates to perform KM estimates were extracted from Table 3. Using Peto-Peto log-rank tests, we computed p-values. Data in Table 2 reflect that different courses show differences in withdrawal ratios. For this reason, we also performed a per-course analysis to determine whether covariates were significant both at the global and individual course levels. The results are shown in Table 5. Prior highest education level and IMD band have a clear impact when considering either the global data set or individual courses. At the course level, more data would be needed for course AAA to provide statistically significant results. While age looks relevant globally, its effect disappears in most courses when analysed individually. Thus, no definite conclusion can be extracted at this stage. More data would also be needed, as there is a low ratio of students in one of the scales considered in the dataset.
KM plots help to visualise differences. As an example, we show the global impact of two covariates: gender (not significant according to the test) and the IMD band, which is significant. For clarity, in the case of IMD plots, we compared the high (> 80%) and low (< 30%) groups. The results are shown in Figure 2. Figure 2 shows minor differences based on gender. Regarding IMD bands, this figure reflects higher withdrawal ratios for the low IMD group, with a higher impact of early withdrawal.

Previous Risk Factors also Present when Considering Simultaneous Effect
We used the Cox model to evaluate and quantify the simultaneous effect of the different covariates. The final Cox models were developed with two strata variables (course and disability) and a set of dummy variables linked to IMD band, region, gender, and previous higher education. To understand the impact of disability, we compared baseline survival functions for the different strata generated at the course level. Table 7 reflects the withdrawal increase ratio for individual courses when disability was a factor. As a final check, we performed a graphical comparison of withdrawal differences based on the findings above. We generated populations based on the combination of IMD differences (high versus low group) and disability. The results are shown in Figure 3, where a reference group based on data in Table 4 (no disability, IMD band 50%-60%) is also reflected.

Discussion
Two research questions were addressed in this work. The first one, regarding the use of survival analysis, aimed to detect populations at risk of withdrawal and the factors behind it. The second one aimed to translate these concepts into practice, determining the relevance and impact of demographics.
From a methodological perspective, the basics behind the answer to RQ1 are covered in the subsection covering the portability of survival analysis to learning analytics scenarios. Identifying students at risk of withdrawal through survival analysis has required the mapping of three concepts: the event under consideration (the withdrawal decision), the period of observation (a course), and the lifespan (the time the student remains in the course). This mapping allows us to identify both at-risk populations and the associated risk factors. Figure 1 concentrates on the basics behind this mapping.
Survival curves provide a graphical insight into the differences among populations based on a set of factors (Figures 2 and 3 offer clear examples). These curves constitute a relevant difference from other data mining techniques. They not only provide information on final withdrawal ratios, but also show when withdrawal occurs. Statistical relevance of a given factor can also be determined (see Table 5 for examples), and for those factors considered relevant, the impact can also be quantified (Tables 6 and 7 serve as examples for this point). These facts make survival analysis a particularly suitable technique for analysing withdrawal.
All in all, Figure 1 (from a theoretical perspective) and Figures 2 and 3 (from a practical approach), combined with the data in Tables 6 and 7, demonstrate the potential of survival analysis for identifying populations of learners at risk.
The second question (RQ2) translates methodology into practice. The application of the method indicates that certain demographic characteristics have an impact on course withdrawal and that this impact is dependent on the course itself. Specifically, three analysed factors increase withdrawal risk: a previous level of education below the reference group (A level), a low IMD band, and a declared disability (see Table 5).
As mentioned, the specific impact is different depending on the course (see Table 7). We can compare these findings with previous literature regarding the impact of demographics on withdrawal.

Withdrawal and Demographics: Comparison with the Literature
From a global perspective, the influence of these personal background factors is consistent with the theoretical models (Bean, 1985; Cabrera et al., 1992; Rovai, 2003; Tinto, 1975) and justifies the interest that literature compilations put on them, in particular in OHE (recently, Hachey et al., 2022; Muljana & Luo, 2019).
Our results have shown a different impact for age and gender across the analysed courses, supporting the inconclusive findings reflected in Hachey et al. (2022) and Muljana and Luo (2019). We have not found that being male reduces risk in some courses, while increasing it in others (as found in Cochran et al., 2014).
However, we agree that the relevance and the specific impact of a factor depend on the course under analysis.
Regarding the economic situation, our results at course level are aligned with those indicating the impact of financial hardship at course and university level (Cochran et al., 2014; Grau-Valldosera et al., 2019). Our work indicates a direct relationship between socioeconomic inequality and educational disadvantage, as shown in the lower panel in Figure 2.
The impact of a poor academic background on withdrawal is consistent with earlier research linking lower previous achievement to higher university dropout (Cochran et al., 2014; Lee & Choi, 2011).
Disability is one of the potential reasons behind some dropouts according to Shah and Cheng (2019). Our results confirm this fact and quantify its impact on course withdrawal. Our findings indicate that students with disabilities taking online courses would be more likely to withdraw from these courses, in particular those students in low IMD bands. Due to our concerns about equity, we believe more studies on this topic should take specific care of anonymisation and ethical issues.

Implications
The intersectionality and the reliable estimation of risk allow us to identify two points for a potential intervention with targeted populations. First, less affluent students could be contacted, even before the start of the course, and offered options regarding financing. Second, disabled students coming from more deprived areas might benefit from continuous support, which might reduce the slowly increasing difference in withdrawal rates when compared with non-disabled students reporting the same economic condition.
While these demographic factors affect all courses analysed, their impact on dropout is different for each course. This difference needs to be considered when evaluating the outcomes of specific interventions.
We can also find factors that show statistical significance in only some courses. To mention just a couple of examples, gender for course DDD or region for course BBB (see Table 5) warrant investigation. For these cases, we encourage a closer look that considers course-specific details which may explain why.
We also remark on the potential of survival analysis to detect situations that would otherwise remain hidden. Figure 2 (lower panel) and Figure 3 reveal a sudden drop around the second week of the course, particularly affecting low-income students. In fact, this week corresponds to the end of the full-refund period for a given course. A potential intervention aimed at reducing withdrawal would consider extending the period of full refund for low-income students. It is important to highlight that this kind of finding would remain hidden if using techniques which focus only on final ratios and not on temporal evolution.
Our detailed analysis also reveals potential failures in intervention designs that do not include a proper identification stage. We can consider, for instance, prior level of education. It is noticeable that course GGG shows a higher risk of withdrawal for students with a previous higher level of education. While this could be shocking at first glance, this course constitutes a propaedeutic course. Those students who already have this knowledge simply abandon the course. Besides this example, and considering potential interventions, the method reveals that analysis at both a global level and at the level of the individual course is critical to properly identify populations at risk.
Finally, survival analysis provides a clear view of the impact of the different factors analysed. For the case of demographics, IMD band, prior educational level, and declared disability have emerged as the most relevant factors in dropout. It is worth noting that these factors emerge as relevant both at the aggregate level and when considering individual courses.
survival analysis, we formulated this research question: RQ1: How can survival analysis be used to identify populations of learners at risk of withdrawal at the course level, providing insight into the factors behind that withdrawal? Additionally, considering both the relevance of time and the potential impact of demographics on withdrawal, we posed a second research question, addressing practical implementation: RQ2: What is the specific impact of demographic factors over time on course withdrawal? Which of these factors impact the withdrawal regardless of the course itself? Specifically, we decided to analyse the impact of these demographic characteristics based on the OULAD dataset: (a) age, (b) gender, (c) disability, (d) region, (e) previous academic background, and (f) student's economic situation.

Figure 1
Graphical View of a Hypothetical Course and Associated Survival Curve

Figure 2
Survival Curves for Different Groups Based on Gender and IMD Band

Figure 3
Survival Curves for Groups With and Without Declared Disability in High and Low IMD Bands

Figure 3 clearly shows withdrawal risk differences among groups. Besides final survival expectancy, with differences around 35.88% by the end of the course, low-income students drop out earlier. Also, the multiplicative effect of disability and low IMD is clearly displayed. Being in the high IMD group does not significantly reduce withdrawal rates when compared to the reference group. The impact of these findings on potential intervention designs will be addressed in the next section.

Table 3
Characteristics in the OU Dataset Linked to the Research Questions
disability: Indicates whether the student has declared a disability
Note. Variable names used match those in the OULAD dataset.

Table 4
Reference Groups for the Computation of the Cox Model
Table 6 summarises those variables that appear relevant at either the global or individual course level.

Table 6
Hazard Risk Factor Relative to the Reference Group Based on the Values of Covariates
Note. Only covariates significant in at least one course shown. IMD = index of multiple deprivation.

Table 7
Impact of Disability on Withdrawal Risk: Individual Course Level