Regression analysis

Regression analysis is a form of supervised learning that estimates the relationships among variables from the data available.


Excel has a built-in tool for regression analysis.

  • Simple regression involves one independent variable (which is controlled within the experiment) and one dependent variable, which is predicted from the values of the independent variable.
  • Multiple regression analysis uses more than one independent variable. The method of least squares is used to estimate the model's coefficients, and significance tests on those coefficients indicate whether each independent variable makes a significant contribution to the model.
  • Linear regression analysis applies where the relationship between variables can be approximated by a straight line.
  • Simple linear regression is where the relationship between one independent variable and one dependent variable can be approximated by a straight line with a slope and an intercept.
  • Multiple linear regression uses two or more independent variables to build a linear equation for the dependent variable.
  • Curvilinear relationships arise where the variables are related, but not in a straight line, so a curve describes the data better than a line does.
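
The difference between the simple and multiple linear forms in the list above can be sketched as two prediction functions. The function names and the coefficient values below are illustrative assumptions, not taken from any dataset.

```python
def predict_simple(x, slope, intercept):
    """Simple linear regression: one independent variable."""
    return intercept + slope * x


def predict_multiple(xs, coefficients, intercept):
    """Multiple linear regression: one coefficient per independent variable."""
    if len(xs) != len(coefficients):
        raise ValueError("need one coefficient per independent variable")
    return intercept + sum(c * x for c, x in zip(coefficients, xs))


# Hypothetical coefficients, chosen only to show the shape of each model.
print(predict_simple(4.0, slope=2.0, intercept=1.0))            # 9.0
print(predict_multiple([4.0, 3.0], [2.0, 0.5], intercept=1.0))  # 10.5
```

In both cases the prediction is a weighted sum plus an intercept. The multiple form simply adds one term per extra independent variable.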

Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) summarise the size of the residuals (the differences between predicted and actual values) in the units of the label being analysed.


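Relative Absolute Error (RAE) and Relative Squared Error (RSE) are relative measures of error: they compare the model's error with that of a baseline that always predicts the mean of the actual values. The closer their values are to zero, the more accurate the model's predictions.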
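
As a minimal sketch, the four error measures can be computed from lists of actual and predicted values as follows. The function name `error_metrics` and the sample numbers are made up for illustration. RAE and RSE are taken here relative to a baseline that always predicts the mean of the actual values, which is the usual definition but is an assumption as far as these notes go.

```python
import math


def error_metrics(actual, predicted):
    """Return MAE, RMSE, RAE and RSE for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    residuals = [a - p for a, p in zip(actual, predicted)]

    abs_error = sum(abs(r) for r in residuals)
    sq_error = sum(r * r for r in residuals)

    # Errors of the baseline that always predicts the mean (assumed baseline).
    baseline_abs = sum(abs(a - mean_actual) for a in actual)
    baseline_sq = sum((a - mean_actual) ** 2 for a in actual)

    return {
        "mae": abs_error / n,             # in the units of the label
        "rmse": math.sqrt(sq_error / n),  # in the units of the label
        "rae": abs_error / baseline_abs,  # unitless, 0 is perfect
        "rse": sq_error / baseline_sq,    # unitless, 0 is perfect
    }


# Illustrative values only.
metrics = error_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 8.0])
print(metrics)
```

For this sample the MAE is 0.5 and the RAE is 0.25, meaning the model's total absolute error is a quarter of the baseline's.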


R Squared (also referred to as the Coefficient of Determination) measures the proportion of the variance in the dependent variable that the regression model explains. A value of 1 indicates that the model fits the data perfectly.
Example: An R Squared value of 0.96 for a model that predicts sales from price would indicate that 96% of the variance in sales is explained by the price.
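
A short sketch of how R Squared can be calculated, assuming the common definition of one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the sample values are illustrative, not from a real dataset.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


# Illustrative values only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(round(r_squared(actual, predicted), 3))  # 0.948
```

Here roughly 95% of the variance in the actual values is accounted for by the predictions.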



Line of best fit

The process of finding the line of best fit on a chart or plot, that is, the line that best represents the relationship between the variables, is known as regression.


Residuals are the vertical distances, measured along the y-axis, between the data points and the line. Data points above the line have positive residual values and data points below the line have negative residual values. The residual value = Data – Fit.
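
The Data – Fit calculation can be sketched as below. The line's slope and intercept and the data points are hypothetical values chosen to give one point above the line, one on it and one below it.

```python
def residuals(xs, ys, slope, intercept):
    """Residual for each point: observed y minus the y value on the line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical line: fit = 1 + 2x, so the fitted values are 3, 5 and 7.
xs = [1.0, 2.0, 3.0]
ys = [3.5, 5.0, 6.0]
print(residuals(xs, ys, slope=2.0, intercept=1.0))  # [0.5, 0.0, -1.0]
```

The first point sits above the line (positive residual), the second lies on it, and the third sits below it (negative residual).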



Least squares regression

A linear relationship can be modelled by fitting a least squares regression line to the data, which can then be used to make predictions. The least squares regression line is the line that minimises the sum of the squared residuals. Squaring means that positive residuals (points above the line) and negative residuals (points below the line) cannot cancel each other out, so it doesn’t matter how many points fall on each side, only that the total of the squared residuals is the lowest possible value. Often referred to simply as ‘the regression line’, it always passes through the point (mean of x, mean of y).
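
A minimal sketch of fitting a least squares line for one independent variable, using the standard closed-form result that minimises the sum of squared residuals. The function name `fit_line` and the data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, about their means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # Choosing the intercept this way forces the line through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative values only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))  # 0.6 2.2
print(round(intercept + slope * 3.0, 3))     # 4.0, the mean of ys at the mean of xs
```

The last line checks the property noted above: evaluating the line at the mean of x gives the mean of y.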


Regression lines are only useful where a straight line is appropriate to represent the data, i.e. where the residuals are small and show no systematic pattern. Otherwise, a curved line is more appropriate. You can use a residual plot to help decide whether your line fits the data well: if the residual plot has no pattern, with points scattered randomly above and below zero, then the line fits well. If a regression line does provide a reasonable model for a set of data, it can be used for predicting and forecasting future outcomes.
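
To illustrate what a patterned residual plot looks like, the sketch below fits a straight line to data that actually follows a curve (y = x squared, chosen purely as an example) and prints the sign of each residual in x order. The helper name `fit_line` is an assumption for this sketch.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Curved data: y = x ** 2, so a straight line is the wrong model.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
res = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
signs = ["+" if r > 0 else "-" if r < 0 else "0" for r in res]

print(res)    # [2.0, -1.0, -2.0, -1.0, 2.0]
print(signs)  # ['+', '-', '-', '-', '+']
```

The residuals are positive at both ends and negative in the middle. That U shape is the kind of pattern that suggests a curved model would suit the data better than a straight line.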



Min-Max Normalisation

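A technique, sometimes used to prepare data for a regression experiment, which rescales the observations in your range linearly so that the minimum value becomes zero and the maximum value becomes one. The order of the observations is not changed.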
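
A minimal sketch of Min-Max Normalisation using the usual formula (value - min) / (max - min). The function name and sample values are illustrative. Returning zeros when every value is identical is an assumption made here to avoid dividing by zero.

```python
def min_max_normalise(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest = min(values)
    highest = max(values)
    span = highest - lowest
    if span == 0:
        # All values identical: no spread to scale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


# Illustrative values only; the original order is preserved.
print(min_max_normalise([10.0, 20.0, 15.0, 30.0]))  # [0.0, 0.5, 0.25, 1.0]
```

Because every feature ends up on the same zero-to-one scale, features measured in large units do not dominate those measured in small units.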