Data Analysis and Classification of Student Completion

Jun 24, 2021

Student admissions is a bad business proposition for universities. Why? Because recruiting students costs a lot of money upfront in marketing, with no guarantee of a return. Once a student is recruited, the university is expected to provide placement testing, academic advising, orientation, etc. Other services follow: housing, financial planning, enrollment into special service programs (SPSERV), etc. All of this requires serious manpower and financial resources well before any tuition is paid.

Universities invest this money not only to get students in the door, but also in the hope that those students will complete their coursework and graduate. If a student transfers out after the first year, the university has spent more on that student than the student has paid. Overall, the problem boils down to this: identifying the prospective students who will graduate rather than transfer out.

In this analysis we focus on student completion rates (completed = graduated, incomplete = transferred out). The goal is to identify students who are more likely to graduate so that we can focus our resources on them. Enrollment and completion data from previous years already yield a few useful observations:

Completion rates of students based on special programs (SPSERV), major and housing.

These graphs show which categories of students have the highest completion rates. From left to right in the top row:

The first graph depicts the overall student completion rate.
The FIS program has the highest completion rate among the special service programs (SPSERV).
STEM majors have the highest completion rate among majors.
Commuter students have higher completion rates compared to students who live on campus.

These observations are useful because they point the university to areas where changes could raise completion rates. For instance, students who dorm have lower completion rates, which suggests that residential life on campus has room for improvement. Commerce majors could use more support, such as tutoring and a lower student-to-instructor ratio.
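As a rough illustration of how per-category completion rates like these can be computed, here is a minimal pandas sketch. The table, its column names (`housing`, `major`, `completed`) and its values are made up for the example and are not the project's data.

```python
import pandas as pd

# Made-up toy records; column names and values are illustrative assumptions.
students = pd.DataFrame({
    "housing":   ["commuter", "commuter", "dorm", "dorm", "commuter", "dorm"],
    "major":     ["STEM", "Commerce", "STEM", "Commerce", "STEM", "Commerce"],
    "completed": [1, 1, 1, 0, 1, 0],  # 1 = graduated, 0 = transferred out
})

def completion_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of students who completed, for each category in `column`."""
    # The mean of a 0/1 column is the fraction of ones in each group.
    return df.groupby(column)["completed"].mean().sort_values(ascending=False)

for col in ["housing", "major"]:
    print(completion_rate_by(students, col), end="\n\n")
```

Because `completed` is coded as 0/1, the group mean is the completion rate itself, so the same helper works for any categorical column.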

Similarly, the graph above shows that the difference in completion rates between students based on scholarship is negligible, which suggests that lowering first-year scholarships to a certain extent might be feasible.

We can also rank the features by how informative they are for predicting completion. Below are the top 5 features from our dataset; together they account for roughly 76% of the total importance:

Feature ranking:
1 SAT 0.21442193594644499
2 Scholarship amount 0.18714307445903064
3 Median Income 0.17662729315926007
4 dist 0.09385720822622844
5 Zip_Count 0.08501271173836272
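A ranking of this kind can be produced from a tree-based model's `feature_importances_` attribute in scikit-learn. The sketch below assumes a random forest and uses synthetic data with placeholder feature names; the model actually used for the ranking above may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; feature names are placeholders, not the real columns.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across all features.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important
top5 = order[:5]

print("Feature ranking:")
for rank, idx in enumerate(top5, start=1):
    print(rank, feature_names[idx], importances[idx])
print("Top-5 share of total importance:", importances[top5].sum())
```

Since the importances sum to 1, the last line gives the combined share of the top 5 features directly.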

Furthermore, we can use a supervised learning method (a decision tree classifier) to solve this binary classification problem. We compared the accuracy of our classification across several techniques:

Hyperparameter tuning of our decision tree model
Ensemble methods (random forest)
Recursive feature elimination
Meta-estimator (bagging classifier)
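The techniques listed above can be compared side by side with cross-validated accuracy. This is a sketch under stated assumptions, not the project's pipeline: the data is synthetic, and the parameter grid, number of selected features and estimator counts are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student dataset.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

def tree():
    """Fresh decision tree with a fixed seed for repeatable results."""
    return DecisionTreeClassifier(random_state=0)

candidates = {
    "decision tree (baseline)": tree(),
    # Hyperparameter tuning: the grid values are illustrative assumptions.
    "decision tree (grid search)": GridSearchCV(
        tree(), {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}, cv=5
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Recursive feature elimination keeps 5 features, then fits a tree on them.
    "RFE + decision tree": make_pipeline(RFE(tree(), n_features_to_select=5), tree()),
    # Bagging wraps many trees, each trained on a bootstrap sample.
    "bagging (decision trees)": BaggingClassifier(tree(), n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {score:.3f}")
```

Scoring every candidate with the same folds and metric keeps the comparison fair; wrapping feature elimination in a pipeline means the features are selected inside each training fold rather than on the full dataset.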

With our final model we were able to identify students who are likely to graduate with around 90% accuracy, as seen in the confusion matrix plot above. For more details, check out my GitHub repo for this project.
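For readers unfamiliar with how an accuracy figure is read off a confusion matrix, here is a small plain-Python sketch. The labels are made up and the helper names `confusion_counts` and `accuracy` are hypothetical; it does not reproduce the model's results.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs for a binary problem with labels 0 and 1."""
    pairs = Counter(zip(y_true, y_pred))
    return {
        "tn": pairs[(0, 0)],  # transferred out, predicted to transfer out
        "fp": pairs[(0, 1)],  # transferred out, predicted to graduate
        "fn": pairs[(1, 0)],  # graduated, predicted to transfer out
        "tp": pairs[(1, 1)],  # graduated, predicted to graduate
    }

def accuracy(counts):
    """Correct predictions (the matrix diagonal) divided by all predictions."""
    total = sum(counts.values())
    return (counts["tn"] + counts["tp"]) / total

# Made-up labels: 1 = graduated, 0 = transferred out.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)
print("accuracy:", accuracy(counts))
```

The off-diagonal counts are worth reading on their own: a false positive here is a student predicted to graduate who transferred out, which is the costly case described at the start of this post.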

What else can we do in relation to this project?

We can use students' high school coursework and grades to assess their preparedness for college, which directly impacts completion rate.
We can do a separate analysis on identifying university services that contribute towards completion rates.
We can use students' college GPAs to estimate dropout risk.
We can identify geographic hotspots to recruit students using clustering.

I hope this post provides some insight into how to approach a binary classification problem, in this case identifying students who will graduate. The same approach can be extended to more complex business problems with multiple classes.

Thanks for reading!

Written by Sachin Naik