Least-squares regression is a methodology for finding the equation of a best-fit line through a set of data points. It also provides a means of describing how well the data fits a linear relationship. An example of data with a general linear trend is seen in the above graph. First we will go over the derivation of the formulas from theory; Scilab code implementing the algorithm is appended at the end of this post.
The equation of a line through a data point can be written as:
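Using m for the slope and b for the y-intercept (matching the 'slope' and 'y_int' variables in the code below):

```latex
y = m x + b
```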
A data point that does not lie exactly on the line, but is in its proximity, can be described by:
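Writing (x_i, y_i) for the i-th data point and e_i for its error:

```latex
y_i = m x_i + b + e_i
```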
Where e is the vertical error between the y-value predicted by the line and the actual y-value of the data point. The goal is to find the line that minimizes this error. In least-squares regression, this is accomplished by minimizing the sum of the squares of the errors, which is given by:
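Summing the squared errors over all n data points, and calling the total S:

```latex
S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2
```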
In order to minimize this value, the minimum-finding techniques of differential calculus will be used: take the partial derivative with respect to each unknown and set it equal to zero. First, take the derivative with respect to the slope:
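Setting this partial derivative equal to zero:

```latex
\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right) = 0
```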
Then taking the derivative with respect to the y-intercept yields:
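Setting this derivative to zero and solving for b, where the bars denote the averages of the x- and y-values:

```latex
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right) = 0
\quad \Longrightarrow \quad
b = \bar{y} - m \bar{x}
```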
This expression can be substituted into the previous equation to solve for the slope:
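Carrying out the substitution and rearranging gives the familiar closed form:

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}
```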
The y-intercept is then:
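In terms of the averages of the x- and y-values:

```latex
b = \bar{y} - m \bar{x}
```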
Notice that these last two formulas require only the coordinates of the data points and the number of points; from them, the equation of the least-squares regression line can be found.
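As a quick sanity check of these formulas, here is a small sketch that evaluates the sums directly (written in Python rather than Scilab purely for illustration; the function name linear_regression is my own):

```python
import math

def linear_regression(x, y):
    """Closed-form least-squares fit: returns slope, y-intercept, and r^2."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x_sq = sum(a * a for a in x)
    sum_y_sq = sum(b * b for b in y)
    # slope from the closed-form formula derived above
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x_sq - sum_x ** 2)
    # y-intercept from b = y_bar - m * x_bar
    y_int = sum_y / n - slope * (sum_x / n)
    # coefficient of determination (square of the correlation coefficient)
    r = (n * sum_xy - sum_x * sum_y) / (
        math.sqrt(n * sum_x_sq - sum_x ** 2) * math.sqrt(n * sum_y_sq - sum_y ** 2)
    )
    return slope, y_int, r ** 2

m, b, r2 = linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
# For points lying exactly on y = 2x + 1: slope 2, intercept 1, r^2 = 1
```

For data that lies exactly on a line, the formulas recover the line's slope and intercept and give a coefficient of determination of 1.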
Finally, below is the Scilab code implementation.
//The Linear_Regression function takes the x-values and
//y-values of the data in the column vectors 'X' and 'Y' and finds
//the best fit line through the data points. It returns
//the slope and y-intercept of the line as well as the
//coefficient of determination ('r_sq').
//The function call should be of the form:
//'[m,b,r2]=Linear_Regression(x,y)'
function [slope, y_int, r_sq]=Linear_Regression(X, Y)
    //determine the number of data points
    n=size(X,'r');
    //initialize each summation
    sum_x=0; sum_y=0; sum_xy=0; sum_x_sq=0; sum_y_sq=0;
    //calculate each sum required to find the slope, y-intercept and r_sq
    for i=1:n
        sum_x=sum_x+X(i);
        sum_y=sum_y+Y(i);
        sum_xy=sum_xy+X(i)*Y(i);
        sum_x_sq=sum_x_sq+X(i)*X(i);
        sum_y_sq=sum_y_sq+Y(i)*Y(i);
    end
    //determine the average x and y values for the
    //y-intercept calculation
    x_bar=sum_x/n;
    y_bar=sum_y/n;
    //calculate the slope, y-intercept and r_sq and return the results
    slope=(n*sum_xy-sum_x*sum_y)/(n*sum_x_sq-sum_x^2);
    y_int=y_bar-slope*x_bar;
    r_sq=((n*sum_xy-sum_x*sum_y)/(sqrt(n*sum_x_sq-sum_x^2)*sqrt(n*sum_y_sq-sum_y^2)))^2;
    //determine the appropriate axes size for plotting the data and
    //linear regression line
    axes_size=[min(X)-0.1*(max(X)-min(X)), min(Y)-0.1*(max(Y)-min(Y)), max(X)+0.1*(max(X)-min(X)), max(Y)+0.1*(max(Y)-min(Y))];
    //plot the provided data
    plot2d(X,Y,style=-4,rect=axes_size);
    //plot the calculated regression line
    plot2d(X,slope*X+y_int);
endfunction
I hope this proves helpful. Let me know in the comments if you have any questions.