Polynomial Regression

What if the simple linear regression model can’t find any relationship between the target and the predictor variable?🤔

Deep Patel
5 min read · Jun 13, 2021

What if your linear regression model cannot establish a relationship between the target variable and the predictor variable? In other words, what if they don't have a linear relationship at all? This blog answers those questions. It is the third post in my Regression series and requires prior knowledge of Linear Regression.

What is Polynomial Regression?

Polynomial regression is a special case of linear regression in which we fit a polynomial equation to data that have a curvilinear relationship between the target variable and the independent variables.

As we can see in the above image, the straight line doesn't fit the data at all. This gives rise to underfitting. To overcome it, we need to increase the model's complexity. Thus, the need for this kind of regression arises whenever the relationship is clearly non-linear.

The implementation of polynomial regression is a two-step process:

  • First, we transform our data into polynomial features using the PolynomialFeatures transformer from sklearn, and
  • Then use linear regression to fit the parameters (a short sketch of both steps follows the pipeline figure below).
Complete Pipeline
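
Here is a minimal sketch of that two-step process as a single sklearn pipeline. It assumes 2-D training data x_train of shape (n_samples, 1) and a target vector y_train, neither of which is defined in this post:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

#Step 1 (transform) and step 2 (fit) chained into one estimator
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x_train, y_train)
y_pred = poly_model.predict(x_train)

make_pipeline simply names and chains the two steps, so the polynomial transform is re-applied automatically at prediction time.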

In a curvilinear relationship, the value of the target variable changes in a non-uniform manner with respect to the predictor(s). In polynomial regression, we have a polynomial equation of degree n represented as:

y = 𝜃0 + 𝜃1x + 𝜃2x² + … + 𝜃nxⁿ

Here, 𝜃0 is the bias; 𝜃1, 𝜃2, …, 𝜃n are the weights of the polynomial regression equation; and n is the degree of the polynomial. The number of higher-order terms grows with n, and hence the equation becomes more complicated. Note that n must be greater than 1, because n=1 is just simple linear regression.

With multiple predictors, the equation includes all possible combinations of terms up to degree n, including cross-terms between the predictors. This is known as multi-dimensional polynomial regression.
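
To make that concrete, here is a small illustrative example (the sample values are made up) of how sklearn's PolynomialFeatures expands two predictors into every term up to degree 2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  #one sample with two predictors a=2, b=3
poly = PolynomialFeatures(degree=2)
#Expands to [1, a, b, a^2, ab, b^2]
print(poly.fit_transform(X))         #[[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out())  #['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']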

But there is a major issue with multi-dimensional polynomial regression: multicollinearity. Multicollinearity occurs when the independent variables in a regression model are correlated with one another. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it causes trouble when you fit the model and interpret the results, and it keeps the model from fitting the dataset properly.
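
A quick toy check (the input range here is arbitrary) shows why this bites polynomial features in particular: on a strictly positive range, a feature and its square are almost perfectly correlated.

import numpy as np

x = np.linspace(1, 10, 100)
#Correlation between x and x^2 is close to 1 on this range,
#so the two polynomial features are far from independent
print(np.corrcoef(x, x ** 2)[0, 1])  #roughly 0.98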

As the degree of the polynomial increases, so does the complexity of the model. Therefore, the value of n must be chosen carefully: if it is too low, the model won't be able to fit the data properly, and if it is too high, the model will easily overfit the data. Below is a simple piece of code that performs polynomial regression.

#Import necessary libraries
import operator
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
#Change degree to reduce the error
polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x)  #x must be 2-D: shape (n_samples, 1)
#Fit a linear model on the polynomial features
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
#Evaluate the fit on the training data
rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
r2 = r2_score(y, y_poly_pred)
print(rmse)
print(r2)
#Plotting the polynomial line
plt.scatter(x, y, s=10)
#Sort the values of x before the line plot
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x, y_poly_pred), key=sort_axis)
x, y_poly_pred = zip(*sorted_zip)
plt.plot(x, y_poly_pred, color='m')
plt.show()

Finally, after trying a few different degrees, you will get something like this:

Visualizing the fit with different degrees
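
Rather than eyeballing plots, one way to run that trial systematically (reusing the x and y arrays from the snippet above, which the post itself doesn't define) is to fit several degrees and compare train vs. test RMSE on a held-out split:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)
for degree in (1, 2, 5, 15):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_tr, y_tr)
    rmse_tr = np.sqrt(mean_squared_error(y_tr, model.predict(x_tr)))
    rmse_te = np.sqrt(mean_squared_error(y_te, model.predict(x_te)))
    #A degree that is too high drives train RMSE down while test RMSE climbs
    print(f"degree={degree:2d}  train RMSE={rmse_tr:.3f}  test RMSE={rmse_te:.3f}")

The degree with the lowest test RMSE is usually a reasonable pick.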

Things to keep in mind for better results

The Bias vs Variance trade-off

Bias refers to the error due to a model's overly simplistic assumptions in fitting the data. High bias means that the model is unable to capture the patterns in the data, which results in under-fitting.

Variance refers to the error due to an overly complex model trying to fit the data. High variance means the model passes through most of the data points, which results in over-fitting the data. The picture below summarizes these two failure modes.

Illnesses of a model

From the picture below, we can observe that as model complexity increases, the bias decreases and the variance increases, and vice-versa. Ideally, a machine learning model should have low variance and low bias, but in practice it is impossible to have both at once. Therefore, to achieve a good model that performs well on both the training data and unseen data, a trade-off is made: the trade-off is the tension between the error introduced by the bias and the error introduced by the variance.

Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
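
To see this trade-off numerically rather than from a picture, one option (again assuming the x and y arrays from the earlier snippet) is sklearn's validation_curve, which cross-validates the same pipeline over a range of degrees:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

degrees = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    x, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
    scoring="neg_mean_squared_error",
)
#Training error keeps falling as the degree grows (bias shrinks);
#validation error eventually rises again (variance grows)
for d, tr, va in zip(degrees, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"degree={d:2d}  train MSE={tr:.3f}  validation MSE={va:.3f}")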

This was all about polynomial regression. Head to the other posts in this series for the remaining regression models.

CONCLUSION:

We can clearly observe that polynomial regression is better at fitting complex data than linear regression. Also, because of the better fit, the RMSE of polynomial regression is far lower than that of linear regression.

It is used in many experimental procedures to model outcomes via its polynomial equation, and it provides a well-defined relationship between the independent and dependent variables; one application is studying isotopes in sediments. It is one of the trickier regression techniques compared to other regression methods, so in-depth knowledge of the approach and algorithm will help you achieve better results.

I hope that you like this post. Thanks for reading😊

