Learn Regression: Predict University Graduate Admissions Using Regression

Problem Statement

Predict Graduate University Admissions Using Multiple Linear Regression

Download the dataset from Kaggle: https://www.kaggle.com/mohansacharya/graduate-admissions

Dataset Description:

This dataset is created for the prediction of Graduate Admissions from an Indian perspective. It contains the following attributes:

  • GRE Scores (out of 340)
  • TOEFL Scores (out of 120)
  • University Rating (out of 5)
  • Statement of Purpose (out of 5)
  • Letter of Recommendation Strength (out of 5)
  • Undergraduate GPA (out of 10)
  • Research Experience (either 0 or 1)
  • Chance of Admit (ranging from 0 to 1)

Tasks:

  • Download the Dataset from Dropbox
  • Import Required Libraries
  • Load and Analyze the Dataset
  • Data Visualization
  • Create Training and Testing Split
  • Train & Evaluate a Linear Regression Model
  • Train & Evaluate a Multiple Linear Regression Model

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Task 1: Download the dataset from Dropbox

!wget https://www.dropbox.com/s/f1x86l7xkdkz6ke/Admission_Predict.csv

Task 2: Import Required Libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import plotly.express as px

Task 3: Load and Analyze the Dataset

df = pd.read_csv('Admission_Predict.csv')

df.head()

Output

Analyzing the dataset using Pandas Profiling

!pip install pandas-profiling==2.7.1

#Generating a Pandas Profiling Report

import pandas_profiling

from pandas_profiling import ProfileReport

prof = ProfileReport(df)

prof.to_file(output_file='output.html')

Output

Analyzing the data using Sweetviz

Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. The output is a fully self-contained HTML application.

The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.

Refer to this link (https://pypi.org/project/sweetviz/) to learn more about Sweetviz

#Installing Sweetviz

!pip install sweetviz

# Importing sweetviz

import sweetviz as sv

#Analyzing the dataset

report = sv.analyze(df)

#Display the report

report.show_html('Admissions.html')

Output

#Please refer to the HTML file created by the name of Admissions.html

#Let's drop the Serial No. column as it is of no use to us

df.drop('Serial No.', inplace = True, axis = 1) #Dropping the Serial No. column

df.shape #Checking the shape of the data frame

Output

df.describe()

Output

df.isnull().sum() #Checking the Null Values

Output

Cool! We do not have any null values in the dataset. We are lucky this time, but that is often not the case with real-world data.
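When a dataset does contain missing values, one common option is to fill each numeric column with its median. Here is a minimal sketch on a small made-up frame; the values and the names `toy_na` and `toy_filled` are invented for illustration and are not taken from the admissions data:

```python
import numpy as np
import pandas as pd

# Toy frame with two missing entries (values invented for illustration)
toy_na = pd.DataFrame({
    "GRE Score": [320, np.nan, 310, 330],
    "CGPA": [8.5, 9.0, np.nan, 9.5],
})

# Count the missing values per column, as done above with df.isnull().sum()
missing_before = toy_na.isnull().sum()

# Fill every numeric column with its own median
toy_filled = toy_na.fillna(toy_na.median(numeric_only=True))

missing_after = toy_filled.isnull().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

Whether median filling is appropriate depends on the column; dropping rows or using a model-based imputer are other options.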

#Grouping by University Rating

df_new = df.groupby(by = 'University Rating').median()

df_new

Output

Task 4: Data Visualization

#Plotly creates its own figure for every column, so no matplotlib figure is needed here

for i in df.columns:

    fig = px.histogram(df, x = i)

    fig.show()

Output

and many more plots.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Observations:

  • Most applicants have a GRE Score between 310 and 327
  • Very few applicants have a GRE Score greater than 330 or less than 300
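Observations like these can be checked numerically rather than read off the histogram. The sketch below uses a short list of hypothetical GRE scores (invented for illustration) and computes the share of values inside and outside the ranges mentioned above:

```python
import pandas as pd

# Hypothetical GRE scores, invented for illustration only
gre_toy = pd.Series([295, 305, 312, 315, 318, 320, 322, 325, 327, 335])

# Fraction of scores in the closed range [310, 327]
share_mid = gre_toy.between(310, 327).mean()

# Fraction of scores above 330 or below 300
share_tails = ((gre_toy > 330) | (gre_toy < 300)).mean()

print(share_mid, share_tails)
```

On the real data frame the same expressions would be applied to `df['GRE Score']`, assuming the column is named as in the rest of this walkthrough.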

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

sns.pairplot(df)

plt.show()

Output

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Observations:

  • Applicants with a higher GRE Score, TOEFL Score, SOP, LOR, CGPA, or Research Experience tend to have a higher Chance of Admit

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Simple Linear Regression

We obtain a relationship between 2 variables x & y by predicting the value of y based on x

x — Independent Variable

y — Dependent Variable (Target Variable/ Output Variable)

It is called Simple Linear Regression because it examines the relationship between 2 variables only

Why Linear?

The model assumes that the dependent variable changes at a constant rate as the independent variable increases or decreases, so the relationship can be drawn as a straight line: y = b0 + b1 * x
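To make the straight-line idea concrete, here is a minimal sketch that fits a line to four made-up points using the closed-form least-squares formulas for one feature. The data and the names `x_toy`, `y_toy`, `slope`, and `intercept` are illustrative assumptions:

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1 (invented for illustration)
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

# Closed-form ordinary least squares for a single feature:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
x_dev = x_toy - x_toy.mean()
y_dev = y_toy - y_toy.mean()
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = y_toy.mean() - slope * x_toy.mean()

print(slope, intercept)
```

sklearn's `LinearRegression`, used later in this walkthrough, solves the same least-squares problem and generalizes it to many features.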

x = df['GRE Score']

y = df['Chance of Admit'] #Target Variable

print(x.shape, y.shape)

Output

#Converting x & y into NumPy Arrays

x = np.array(x)

y = np.array(y)

x = x.reshape(-1,1)

x.shape

Output

y = y.reshape(-1,1)

y.shape

Output

Scaling the Data

#Scaling the Data

from sklearn.preprocessing import StandardScaler

#Separate scalers for x and y, so each one keeps its own mean and standard deviation

scaler_x = StandardScaler()

scaler_y = StandardScaler()

x = scaler_x.fit_transform(x)

y = scaler_y.fit_transform(y)
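For reference, `StandardScaler` subtracts the column mean and divides by the column standard deviation. The sketch below checks this by hand on a few made-up values and shows that `inverse_transform` recovers the original units; the array `values` and the name `scaler_demo` are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up scores in a single column (invented for illustration)
values = np.array([[300.0], [310.0], [320.0], [330.0], [340.0]])

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(values)

# Same result computed by hand: (value - mean) / population standard deviation
manual = (values - values.mean()) / values.std()

# inverse_transform maps the scaled values back to the original units
restored = scaler_demo.inverse_transform(scaled)

print(scaled.ravel())
```

Because a fitted scaler stores the mean and standard deviation it learned, a scaler that is refit on a second array no longer remembers the first one.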

Splitting the Dataset Using train_test_split from the sklearn library

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

Output
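A common refinement, not used in this walkthrough, is to split first and fit the scaler on the training rows only, so that no statistics from the test rows leak into preprocessing. A minimal sketch on synthetic data; all names and numbers below are assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for one feature and the target
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=315, scale=10, size=(100, 1))
y_demo = rng.uniform(0, 1, size=(100, 1))

# Split the raw data first
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only
split_scaler = StandardScaler().fit(xd_train)

# Apply the same training statistics to both parts
xd_train_s = split_scaler.transform(xd_train)
xd_test_s = split_scaler.transform(xd_test)

print(xd_train_s.shape, xd_test_s.shape)
```

For a plain linear regression the test R² comes out the same either way, since scaling is only a linear change of units, but the habit matters for models that are sensitive to feature scale.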

Building a Simple Linear Regression Model

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

lr_model = LinearRegression()

lr_model.fit(x_train, y_train)

Output

Evaluating the Model

#For regression models, score() returns R² (the coefficient of determination), not a classification accuracy

r2_lr = lr_model.score(x_test, y_test)

print(r2_lr)

Output

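Oh. The Simple Linear Regression Model reaches an R² score of only about 0.586 on the testing data. In other words, GRE Score alone explains roughly 58.6% of the variance in Chance of Admit, which leaves a lot of room for improvement.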
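The number returned by `score()` on a regression model is R², which can be reproduced by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on synthetic data (the generating formula and all variable names are assumptions for illustration) and also reports RMSE, which is expressed in the units of the target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic single-feature data with some noise (invented for illustration)
rng = np.random.default_rng(42)
x_eval = rng.uniform(290, 340, size=(200, 1))
y_eval = 0.01 * (x_eval[:, 0] - 290) + 0.3 + rng.normal(0, 0.05, size=200)

eval_model = LinearRegression().fit(x_eval, y_eval)
pred_eval = eval_model.predict(x_eval)

# R² by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_eval - pred_eval) ** 2)
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Root mean squared error, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_eval, pred_eval))

print(r2_manual, eval_model.score(x_eval, y_eval), rmse)
```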

Let us check the correlation between the variables to understand how they affect the target variable (i.e., Chance of Admit)

#Pandas df.corr() is used to find the pairwise correlation of all columns in the data frame

plt.figure(figsize=(12,8))

sns.heatmap(df.corr(), annot=True)

plt.show()

Output

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Observations:

  • Students who have a high GRE Score also tend to have a high TOEFL Score, i.e., the two are positively correlated
  • CGPA and TOEFL Score are also highly correlated with Chance of Admit, which suggests that they are very important factors
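Besides reading the heatmap, the correlations with the target can be listed and sorted directly. A minimal sketch on a tiny made-up frame; the values are invented, and only the column names follow the ones used in this walkthrough:

```python
import pandas as pd

# Tiny made-up frame (values invented for illustration)
toy_corr = pd.DataFrame({
    "GRE Score": [300, 310, 320, 330, 340],
    "Research": [0, 1, 0, 1, 1],
    "Chance of Admit": [0.50, 0.60, 0.70, 0.80, 0.90],
})

# Correlation of every feature with the target, strongest first
corr_with_target = (
    toy_corr.corr()["Chance of Admit"]
    .drop("Chance of Admit")
    .sort_values(ascending=False)
)

print(corr_with_target)
```

On the real data frame, replacing `toy_corr` with `df` would give the same ranking that the last row of the heatmap shows.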

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Now, let us try using multiple features (e.g., GRE Score, TOEFL Score, SOP, LOR, CGPA) to predict the Chance of Admit using Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression examines the relationship between more than 2 variables: several independent variables and one dependent variable

Whoa! This is what we are going to use, because we have many independent variables (GRE Score, TOEFL Score, etc.) and one dependent (or target) variable, Chance of Admit
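With several features the model becomes y = b0 + b1 * x1 + b2 * x2 + ..., with one coefficient per feature. The sketch below fits such a model on noise-free synthetic data, so the known coefficients are recovered, and cross-checks sklearn against a plain least-squares solve. The data, the coefficients, and all names are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data with three features and known coefficients
rng = np.random.default_rng(1)
X_multi = rng.normal(size=(50, 3))
true_coef = np.array([0.5, -0.2, 0.1])
y_multi = X_multi @ true_coef + 0.7

# Fit with sklearn: one coefficient per feature plus an intercept
mlr_demo = LinearRegression().fit(X_multi, y_multi)

# Same fit via least squares, using a column of ones for the intercept
A = np.column_stack([np.ones(len(X_multi)), X_multi])
beta, *_ = np.linalg.lstsq(A, y_multi, rcond=None)

print(mlr_demo.intercept_, mlr_demo.coef_)
```

After fitting on the admissions data, the entries of `coef_` line up with the feature columns in the order they appear in x.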

x = df.drop(columns = ['Chance of Admit'])

y = df['Chance of Admit'] #Target Variable

print(x.shape, y.shape)

Output

#Converting x & y into NumPy Arrays

x = np.array(x)

y = np.array(y)

y = y.reshape(-1,1)

y.shape

Output

Scaling the Data

#Scaling the Data

from sklearn.preprocessing import StandardScaler

#Separate scalers for x and y, so each one keeps its own mean and standard deviation

scaler_x = StandardScaler()

scaler_y = StandardScaler()

x = scaler_x.fit_transform(x)

y = scaler_y.fit_transform(y)

Splitting the data using the train_test_split function from the sklearn library

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

Output

Train and Evaluate a Multiple Linear Regression Model

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

lr_model = LinearRegression()

lr_model.fit(x_train, y_train)

Output

#For regression models, score() returns R² (the coefficient of determination), not a classification accuracy

r2_mlr = lr_model.score(x_test, y_test)

print(r2_mlr)

Output

Okay. Good. The Multiple Linear Regression Model reaches an R² score of about 0.81 on the testing data, a clear improvement over the single-feature model
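Because the target was standardized before training, the model's predictions come out in standardized units rather than as a Chance of Admit between 0 and 1. One way to get back to the original scale is to keep a dedicated scaler for the target and call `inverse_transform` on the predictions. A minimal sketch on noise-free synthetic data; the two features, the coefficients, and every name below are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features and a target that is an exact linear function
rng = np.random.default_rng(7)
X_raw = rng.uniform(0, 10, size=(60, 2))
y_raw = (0.05 * X_raw[:, 0] + 0.03 * X_raw[:, 1] + 0.1).reshape(-1, 1)

# One scaler for the features and a separate one for the target
sx, sy = StandardScaler(), StandardScaler()
X_s = sx.fit_transform(X_raw)
y_s = sy.fit_transform(y_raw)

reg_demo = LinearRegression().fit(X_s, y_s)

# Scale a new row with the feature scaler, predict, then undo the target scaling
new_applicant = np.array([[8.0, 6.0]])
pred_scaled = reg_demo.predict(sx.transform(new_applicant))
pred_original = sy.inverse_transform(pred_scaled)

print(pred_original[0, 0])
```

Here the prediction returns to the units of `y_raw`, which for this synthetic formula is 0.05 * 8 + 0.03 * 6 + 0.1 = 0.68.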