Blog Archives

Least Squares Linear Regression

Least-squares regression is a method for finding the equation of the best-fit line through a set of data points. It also provides a measure of how well the data follow a linear relationship. An example of data with a general linear trend is seen in the above graph. First we will derive the formulas from theory; the Scilab code implementing the algorithm is appended at the end of this post.

The equation of a line through a data point can be written as:
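Assuming the usual slope-intercept notation, with m for the slope and b for the y-intercept (the names used in the code's call signature further down), this is:

```latex
y = m x + b
```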

The value of any data points that are not directly on the line but are in the proximity of the line can be given by:
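In the same assumed notation (m for the slope, b for the y-intercept), with the index i labelling the i-th data point (x_i, y_i):

```latex
y_i = m x_i + b + e_i
```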

Where e is the vertical error between the y-value given by the line and the actual y-value of the data. The goal is to come up with a line that minimizes this error. In least-squares regression, this is accomplished by minimizing the sum of the squares of the errors. The sum of the squares of the errors is given by:
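Writing S for this sum over the n data points (S is a symbol chosen here for illustration, as are m for the slope and b for the y-intercept), and using e_i = y_i - m x_i - b for the error at each point:

```latex
S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2
```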

In order to minimize this value, the minimum finding techniques of differential calculus will be used. First take the derivative with respect to the slope.
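With S denoting the sum of squared errors, m the slope and b the y-intercept (assumed symbols), the chain rule gives the following, which is set to zero at the minimum:

```latex
\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} x_i y_i = m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i
```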

Taking the derivative with respect to the y-intercept and setting it to zero yields:
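In the same assumed notation (S for the sum of squared errors, m for the slope, b for the y-intercept, n for the number of points):

```latex
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right) = 0
\quad\Longrightarrow\quad
b = \frac{\sum_{i=1}^{n} y_i - m \sum_{i=1}^{n} x_i}{n}
```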

This expression for the y-intercept can be substituted into the previous equation to solve for the slope.
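Carrying out that substitution and collecting the terms in m (the assumed symbol for the slope) gives the same expression that the code below evaluates as 'slope':

```latex
m = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}
```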

The y-intercept is then:
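Written with the averages of the x- and y-values (x_bar and y_bar in the code below), and with m and b as the assumed symbols for slope and y-intercept:

```latex
b = \bar{y} - m \bar{x},
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```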

These last two formulas require only the coordinates of the data points and the number of points, so the equation of the least-squares regression line can be computed directly from the data.

Finally, below is the Scilab code implementation.

//the linear regression function takes x-values and
//y-values of data in the column vectors 'X' and 'Y' and finds
//the best fit line through the data points. It returns
//the slope and y-intercept of the line as well as the
//coefficient of determination ('r_sq').

//the function call for this should be of the form:
//'[m,b,r2]=Linear_Regression(x,y)'
function [slope, y_int, r_sq]=Linear_Regression(X, Y)
    //determine the number of data points
    n=size(X,'r');

    //initialize each summation
    sum_x=0;
    sum_y=0;
    sum_xy=0;
    sum_x_sq=0;
    sum_y_sq=0;

    //calculate each sum required to find the slope, y-intercept and r_sq
    for i=1:n
        sum_x=sum_x+X(i);
        sum_y=sum_y+Y(i);
        sum_xy=sum_xy+X(i)*Y(i);
        sum_x_sq=sum_x_sq+X(i)*X(i);
        sum_y_sq=sum_y_sq+Y(i)*Y(i);
    end

    //determine the average x and y values for the
    //y-intercept calculation
    x_bar=sum_x/n;
    y_bar=sum_y/n;

    //calculate the slope, y-intercept and r_sq and return the results
    slope=(n*sum_xy-sum_x*sum_y)/(n*sum_x_sq-sum_x^2);
    y_int=y_bar-slope*x_bar;
    r_sq=((n*sum_xy-sum_x*sum_y)/(sqrt(n*sum_x_sq-sum_x^2)*sqrt(n*sum_y_sq-sum_y^2)))^2;

    //determine the appropriate axes size for plotting the data and
    //linear regression line
    axes_size=[min(X)-0.1*(max(X)-min(X)),min(Y)-0.1*(max(Y)-min(Y)),max(X)+0.1*(max(X)-min(X)),max(Y)+0.1*(max(Y)-min(Y))];

    //plot the provided data
    plot2d(X,Y,style=-4,rect=axes_size);

    //plot the calculated regression line
    plot2d(X,(slope*X+y_int));
endfunction
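As a quick sanity check on the slope and y-intercept formulas, here is a minimal sketch that evaluates them with sum() instead of the explicit loop and compares the result with Scilab's built-in reglin. The sample data are hypothetical (not from this post) and lie exactly on y = 2x + 1, so both approaches should report a slope of about 2 and an intercept of about 1.

```scilab
// Hypothetical sample data (not from the post): points exactly on y = 2x + 1
X = [1; 2; 3; 4; 5];
Y = 2*X + 1;
n = size(X, 'r');

// closed-form least-squares slope and y-intercept, using sum()
// in place of the explicit for-loop
slope = (n*sum(X.*Y) - sum(X)*sum(Y)) / (n*sum(X.^2) - sum(X)^2);
y_int = mean(Y) - slope*mean(X);

// cross-check with the built-in reglin, which expects the samples
// as row vectors and fits y = a*x + b
[a_ref, b_ref] = reglin(X', Y');

mprintf("slope = %f (reglin: %f)\n", slope, a_ref);
mprintf("y_int = %f (reglin: %f)\n", y_int, b_ref);
```

The vectorized sums are only a convenience; the loop in the function above computes the same quantities.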

I hope this proves helpful. Let me know in the comments if you have any questions.


Calculus Applied to Manufacturing Design

Suppose you are in the can manufacturing industry. Your job is to create cans according to the specifications of canned goods suppliers who will put their products in your cans and distribute them for sale. Now suppose one of those canned goods suppliers comes to you and says, “We need a can that can contain a 300 mL volume of our product. Joe Shmoe says he can provide us these cans for $X. What can you do?” If you want the canned goods supplier’s business you will have to underbid Joe. (Provided you can’t sell them on some other facet of your company, e.g. service, quality, etc.) How will you do it? Where can you cut cost? What is the most inexpensive can you can make to satisfy the spec? In the real world this is a very complex question because it involves labor, facility, process, and material expenses along with several others. For the sake of this exercise, however, we will use the principle of the derivative to optimize our can design in order to minimize material cost.

Now let’s assume you are set up to produce a standard cylindrical can with a top and a bottom. (I realize the canned goods supplier will have to add their product before the top is sealed and I realize that actual cans have a small lip on the top and bottom, but for the sake of simplicity…) To minimize the cost of the cans we want to minimize the amount of material used. But what shape of cylinder will contain the required 300 mL with minimal material? A short fat can or a tall thin can? How “tall” or how “short”?

First, realize the total material for the can can be determined by finding the area of the metal. For a cylinder this is the area of the bottom plus the area of the top plus the area of the side.
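In symbols (the subscripted names are chosen here for illustration):

```latex
A = A_{\text{bottom}} + A_{\text{top}} + A_{\text{side}}
```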

We can rewrite the area of the top and bottom as:
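With r as the radius of the can (an assumed symbol), each end is a circle of area \pi r^2, so together:

```latex
A_{\text{top}} + A_{\text{bottom}} = 2 \pi r^2
```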

If you imagine removing the top and bottom, cutting the side along a vertical line, and laying it out flat you can see that the area of the side is the area of a rectangle with the length of one side equal to the height of the can and the length of the other side equal to the circumference of the top or bottom. So its area is:
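With r as the radius and h as the height of the can (assumed symbols), the circumference is 2\pi r and the rectangle has area:

```latex
A_{\text{side}} = 2 \pi r h
```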

Putting this all together, the total area of a can is given by:
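Again taking r for the radius and h for the height (assumed symbols), the two ends plus the side give:

```latex
A = 2 \pi r^2 + 2 \pi r h
```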

Now the volume of the cylinder is given by:
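This is the area of the base times the height, with r the radius, h the height and V the volume (assumed symbols):

```latex
V = \pi r^2 h
```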

If we rearrange this expression for the height, h we get:
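Dividing the volume V = \pi r^2 h by \pi r^2 (r being the assumed symbol for the radius):

```latex
h = \frac{V}{\pi r^2}
```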

Now we substitute this into our expression for the area to eliminate h and write the whole thing in terms of one variable, r. (The volume is not a variable here because it is given to us in the spec.)
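Starting from A = 2\pi r^2 + 2\pi r h and replacing h with V / (\pi r^2), in the assumed notation of r for radius, h for height and V for volume:

```latex
A(r) = 2 \pi r^2 + 2 \pi r \cdot \frac{V}{\pi r^2} = 2 \pi r^2 + \frac{2V}{r}
```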

This equation gives the surface area of a can of the prescribed volume for any radius. With this in hand we can now use the derivative to optimize (here, minimize) the can design: find the radius that gives the smallest area and thus the least expensive (as far as material costs go) cylindrical can that holds the prescribed volume. Taking the derivative of the area equation using the power rule we have:
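Treating the volume V as a constant and writing the area as A(r) = 2\pi r^2 + 2V r^{-1} (assumed notation, r being the radius), the power rule gives:

```latex
\frac{dA}{dr} = 4 \pi r - \frac{2V}{r^2}
```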

Now we set this equal to zero to find the minimum area. Solving for r we get:
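With the derivative written as dA/dr = 4\pi r - 2V/r^2 (assumed notation, r for radius and V for volume), the steps are:

```latex
4 \pi r - \frac{2V}{r^2} = 0
\quad\Longrightarrow\quad
r^3 = \frac{V}{2\pi}
\quad\Longrightarrow\quad
r = \left( \frac{V}{2\pi} \right)^{1/3}
```

The second derivative, 4\pi + 4V/r^3, is positive for any positive radius, so this stationary point is a minimum rather than a maximum.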

Knowing that the volume is 300 mL our radius is:
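Taking 1 mL = 1 cm^3, so that the radius comes out in centimetres, and using r = (V / 2\pi)^{1/3} with V = 300:

```latex
r = \left( \frac{300}{2\pi} \right)^{1/3} \approx 3.63 \ \text{cm}
```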

Putting this back in our equation for h we get a height of:
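Using h = V / (\pi r^2) with V = 300 cm^3 and the unrounded optimal radius (assumed notation as before). At the optimum V = 2\pi r^3, so the height is exactly twice the radius:

```latex
h = \frac{V}{\pi r^2} = 2r \approx 7.26 \ \text{cm}
```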

These are the dimensions for our can of minimum material. Below I have tabulated and graphed some other values for the radius and associated heights that yield the necessary volume and show that the resulting surface area is larger than our optimized design.
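A comparison of this kind can be regenerated with a few lines of Scilab. This is a minimal sketch, not the original table: the trial radii are hypothetical values picked for illustration, with the optimal radius included as the third entry, and 1 mL is taken as 1 cm^3.

```scilab
// required volume in mL, treated as cm^3
V = 300;

// hypothetical trial radii in cm; the third entry is the optimal radius
r = [2; 3; (V/(2*%pi))^(1/3); 4; 5];

// height that gives the required volume at each radius
h = V ./ (%pi * r.^2);

// total surface area (two ends plus the side) at each radius
A = 2*%pi*r.^2 + 2*V ./ r;

// columns: radius (cm), height (cm), area (cm^2)
disp([r, h, A]);
```

The smallest value in the area column should appear in the third row, the one for the optimal radius.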

One other beautiful thing about this technique is that I have a general formula for an optimized cylindrical can for any volume. I can just put in whatever the specification says and get my design.
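That general formula can be wrapped in a small Scilab function. The sketch below is an illustration only: optimal_can is a hypothetical name, not something from this post, and it assumes the volume is given in mL (taken as cm^3) so that lengths come out in cm.

```scilab
// Hypothetical helper (not from the post): dimensions of the
// minimum-area closed cylinder for a given volume V.
// With V in mL (= cm^3), r and h are in cm and A is in cm^2.
function [r, h, A] = optimal_can(V)
    r = (V/(2*%pi))^(1/3);       // radius at which dA/dr = 0
    h = V/(%pi*r^2);             // height from the volume constraint
    A = 2*%pi*r^2 + 2*%pi*r*h;   // total surface area
endfunction

[r, h, A] = optimal_can(300);
mprintf("r = %.2f cm, h = %.2f cm, A = %.1f cm^2\n", r, h, A);
```

For the 300 mL spec this should report a radius of about 3.63 cm, a height of about 7.26 cm and an area of about 248.1 cm^2.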

Although this is greatly simplified, I hope this provides another demonstration of the real world applicability of these techniques.

Maximizing and Minimizing

In the last post about the derivative we emphasized its utility with some examples about how it can be used to find some relationships between position, velocity, and acceleration or to maximize profit and minimize cost in numerous different practical situations. Well now I want to discuss that in more depth and provide an example. Last time we used the following definition for the derivative (or slope) using the limit:
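Assuming the usual notation, with h for the small step in x (the earlier post may have written it as \Delta x instead):

```latex
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
```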

With this we can find the exact slope at any point on a curve. Now consider the following graph. Each of the red lines is a tangent line showing the slope of the curve at that point.

Consider what the slope of such a tangent line would be at a minimum or maximum on a curve. That’s right, the line would be flat and horizontal like this:

Therefore the slope at these points is zero. With this geometrical understanding of a minimum or maximum we can find the derivative and then determine where the derivative (or slope) is equal to zero.

Now let’s consider an example function that relates the production level of a manufacturer to the potential revenue per unit. Obviously we want to maximize revenue so the key will be to find the maximum. Here is the function:

Now you might ask, “Well how in the world did we know this tells us about this relationship?” The short answer is that the company experimented at various production levels and measured the resulting revenue; the corresponding data points were then plotted, and the function above was obtained mathematically as a best-fit approximation to those data points. (More on this in a future post.) So assuming this is the correct function, let’s first find the derivative.

Now this derivative gives us the slope at any point along the curve of the original function above. Where does this new function equal zero? This is the same as finding the roots of the equation which is a typical problem in algebra. Here is how it’s done based on our example.

Now if we put these two production level values into the original function it will tell us the revenue produced. Here are the results.

Therefore production level 2 provides the maximum revenue potential. (The root zero is an interesting point called an inflection point. But that’s another topic.) To demonstrate that this process has in fact located the maximum, here is a table of revenue values at other production levels and a graph of the function.
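To see the whole procedure in one place, here is a purely hypothetical stand-in, not the function used in this post: R(x) = 8x^3 - 3x^4, chosen only because its derivative has the same two roots, 0 and 2. The symbols R and x are likewise assumptions made for illustration.

```latex
R(x) = 8x^3 - 3x^4

R'(x) = 24x^2 - 12x^3 = 12x^2 (2 - x) = 0
\quad\Longrightarrow\quad
x = 0 \ \text{or} \ x = 2

R(0) = 0, \qquad R(2) = 8 \cdot 8 - 3 \cdot 16 = 16
```

For this stand-in the derivative is positive on both sides of x = 0, so the curve only flattens there without turning around. At x = 2 the derivative changes from positive to negative, which is what marks a maximum.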

Next time we will look more in depth at the derivative and some basic rules or short-cuts we can use so that we don’t have to go through the entire limit process every time we want to find the derivative.