import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Assumed setup: load the iris dataset (here via seaborn) if it is not already in scope
iris = sns.load_dataset("iris")

# Fit a simple linear regression of petal width on petal length
reg = linear_model.LinearRegression()
X = iris[["petal_length"]]
y = iris["petal_width"]
reg.fit(X, y)
print("y = x *", reg.coef_, "+", reg.intercept_)
# Evaluate on the same data the model was trained on
predicted = reg.predict(X)
mse = ((np.array(y) - predicted) ** 2).sum() / len(y)  # mean squared error
r2 = r2_score(y, predicted)
print("MSE:", mse)
print("R Squared:", r2)
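Since matplotlib is already imported, we can also visualize the fit. The snippet below is a minimal sketch, assuming the reg, X, y, and predicted objects from above; the styling choices are illustrative only.

# Minimal sketch: plot the observed data and the fitted regression line
# (assumes X, y, predicted from the cells above)
plt.scatter(X["petal_length"], y, alpha=0.5, label="observed")
plt.plot(X["petal_length"], predicted, color="red", label="fitted line")
plt.xlabel("petal_length")
plt.ylabel("petal_width")
plt.legend()
plt.show()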
Training and Testing Data
So far we have trained and tested the model on the same data. This is poor practice: it tells us nothing about how well the model would perform on new data. Better practice is to split the data into two sets, training and testing. We build the model on the training data and evaluate it on the test data.
Sklearn provides the function train_test_split for this task. Given a single input, it returns the data split into two subsets, one for training and one for testing. Here we ask for 20% of the data in the test set.
train, test = train_test_split(iris, test_size=0.2, random_state=142)
print(train.shape)
print(test.shape)
We can now repeat the above procedure, but this time train the model on the training data and evaluate it on the test data. Do the MSE and R² values change?
Report the MSE and R² values on both the training and test sets, and interpret the results. Based on the values on the training and testing data, comment on whether the model is overfitting.
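As a sketch of the pattern (assuming the train and test splits above and the same feature and target columns), fitting on the training set and scoring both sets might look like this:

# Sketch: fit on the training split, then evaluate on both splits
# (assumes train and test from the split above)
X_train, y_train = train[["petal_length"]], train["petal_width"]
X_test, y_test = test[["petal_length"]], test["petal_width"]

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

for name, X_part, y_part in [("train", X_train, y_train),
                             ("test", X_test, y_test)]:
    pred = reg.predict(X_part)
    mse = ((np.array(y_part) - pred) ** 2).mean()
    print(name, "MSE:", mse, "R Squared:", r2_score(y_part, pred))

Roughly, if the training and test scores are similar, the model generalizes well; a training score that is much better than the test score is a sign of overfitting. A one-feature linear model has little capacity to overfit, so the two sets of scores are usually close.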