Let’s talk about statistics, and particularly about statistical analysis of data, which is one of my favorite topics. Let’s say you are in a laboratory, doing research, and you obtain a lot of data. You may then be tasked with analyzing that data, and in particular, with using the data that you have obtained to predict how certain systems might behave in the future. How do you do this? There are a number of options, but one of the most straightforward is to use linear regressions.
What is a linear regression? Well, broadly speaking, it is a statistical method used to estimate straight-line relationships between variables. Sounds kind of confusing to me too, to be honest, so I like to simplify it in the following way:
Linear regressions take a lot of complex data (hard to analyze) and turn it into linear graphs (much easier to analyze). Once we have a linear relationship, if we know the independent variable (i.e., the x-value) of a given system, we can predict the dependent variable (i.e., the y-value).
How does this work?
Let’s say you have a scatterplot, which is composed of a lot of data that you have measured. A real-world example of data collected from a research laboratory is shown in the figure below:
It looks like there is sort of a linear relationship between the x- and y-values, since the y-values are increasing relatively consistently as the x-values increase. However, you can’t actually use this kind of visual approximation of a linear relationship for any further calculations.
What can you do with this data? You can use linear regressions to calculate the best linear fit for this data.
Sounds good, right? But how would you actually calculate the linear fit? Most commonly, you will use software to do the linear regression and find the best linear fit. My personal favorite software options are Microsoft Excel and OriginPro, and in the figure below, you can see that both programs were able to fit a line to the data.
Those look like very nice lines, but you may still have some questions. More specifically, you may be wondering:
- What kind of math was used to obtain these linear equations?
- How good is the linear fit? How do we even define what “good” and “bad” mean in this context?
- Is this the only way to do linear regression analysis? If there are other options available, what are they and how do they work?
Let’s address these questions one at a time.
Question 1: What kind of math was used to obtain these linear equations?
This linear regression was done using the “least squares method,” which involves the following steps:
- Step 1: Calculate the mean (i.e., average) of the x-values for all of your data
- Step 2: Calculate the mean (i.e., average) of the y-values for all of your data
- Step 3: Calculate the slope of your line by using the following equation:

m = Σ(xi − X̄)(yi − Ȳ) / Σ(xi − X̄)²

Where xi refers to a particular x-value (for i-values between 1 and n), yi refers to a particular y-value (for i-values between 1 and n), X̄ refers to the mean of the x-values, Ȳ refers to the mean of the y-values, and the Σ symbol means that you have to calculate the sum of all of the values for i-values between 1 and n.
- Step 4: Calculate the y-intercept for the equation, by using the following equation:

b = Ȳ − m·X̄

Where b refers to the y-intercept, and m refers to the slope (that you have already calculated).
- Step 5: Write the equation of the line, using the slope that you calculated in Step 3 and the y-intercept that you calculated in Step 4.
That’s it! And this is one of the simpler methods for calculating linear regressions.
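If you’d like to see these steps in action, here is a minimal sketch in Python that follows Steps 1 through 5 directly (the data values are made up for illustration):

```python
# A minimal sketch of the least squares method, following Steps 1-5 above.
# The data values here are made up for illustration.

def least_squares_fit(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n  # Step 1: mean of the x-values
    y_mean = sum(ys) / n  # Step 2: mean of the y-values
    # Step 3: slope = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean)^2)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    m = numerator / denominator
    b = y_mean - m * x_mean  # Step 4: y-intercept
    return m, b  # Step 5: the line is y = m*x + b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, b = least_squares_fit(xs, ys)
print(f"y = {m:.3f}x + {b:.3f}")
```

Once you have m and b, predicting the dependent variable for a new x-value is simply y = m·x + b.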
Question 2: How good is the linear fit? How do we even define what “good” and “bad” mean in this context?
When we talk about whether a fit is “good” or “bad,” we are essentially wondering how closely the line matches the data. In other words, we can draw straight lines whenever we want, but if the points on the line are far away from the actual data, then the line is not actually an accurate representation of the data that we are trying to analyze.
One way to quantify whether a linear regression has given us a “good” or “bad” fit is to calculate the R2 value for a particular linear regression. This value, which is also called the “coefficient of determination,” varies between 0 and 1. The closer that the R2 value is to 1, the better the linear fit, and the further away the R2 value is from 1, the worse the linear fit. A “good” linear regression generally has R2 values that are above 0.95.
How do you calculate the R2 value? Usually by using software that does this work for you! Using the data from the previous figures, we can ask Microsoft Excel to calculate the R2 value for this linear regression, and we see that it is pretty close to 1:
In contrast, the figure shown below has an example of a poor linear fit, resulting in a markedly lower R2 value:
For this data, when we asked the software to do linear regression, it still came up with the best linear fit that it could. Unfortunately, however, the data doesn’t seem to follow a linear trend, and therefore even the “best” linear fit displayed a pretty low R2 value.
What is the actual math that the software is using to calculate the R2 value?
The equation for this calculation can take many forms, and can even look deceptively simple, such as in the formulation shown below:
R2 = SSregression / SStotal
Where SSregression represents the sum of squares due to regression, and SStotal represents the total sum of squares.
What do these terms actually mean though? The “sum of squares due to regression” is a way of measuring how well the linear fit actually represents the data, and the “total sum of squares” is a way to measure how much variability exists in the data.
We can do this calculation manually, but honestly, I would not advise it.
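For the curious, though, here is roughly what that manual calculation looks like in Python, using the SSregression / SStotal formulation and some made-up data:

```python
from statistics import mean

# A sketch of the R2 (coefficient of determination) calculation for a
# least squares fit, using the SSregression / SStotal formulation.
def r_squared(xs, ys):
    # First, fit the least squares line (same math as in Question 1).
    x_bar, y_bar = mean(xs), mean(ys)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b = y_bar - m * x_bar
    # SStotal: total variability of the y-values around their mean.
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    # SSregression: the portion of that variability the fitted line captures.
    ss_regression = sum((m * x + b - y_bar) ** 2 for x in xs)
    return ss_regression / ss_total

# Made-up data with a strong linear trend, so R2 should be close to 1:
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(r_squared(xs, ys), 4))
```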
Question 3: Is this the only way to do linear regression analysis? If there are other options available, what are they and how do they work?
Our last question here has multiple parts. In short, no, this is not the only way to do linear regression analysis; many other options exist! In fact, there are far too many to list or cover fully here.
In short, however, other linear regression methods focus on obtaining good fits to the data via different criteria. For example, the method of “least absolute deviations” minimizes the sum of the absolute differences (rather than the squared differences) between the line and the actual data, which makes it less sensitive to outliers. Another method is called “ridge regression,” which adds a penalty term to the fit and is particularly useful in cases where there is correlation between the predictor variables (i.e., in situations where the “independent” variables may not truly be independent).
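To make ridge regression a little more concrete, here is a minimal Python sketch for the single-predictor case, solving the penalized normal equations (XᵀX + λI)w = Xᵀy by hand. The data and penalty values are made up, and for simplicity this version penalizes the intercept as well, which a real implementation would usually avoid:

```python
# A sketch of ridge regression for one predictor: solve the penalized
# normal equations (X^T X + lam*I) w = X^T y, where X has a column of 1s
# (for the intercept b) and a column of x-values (for the slope m).
# Note: for simplicity, this version penalizes the intercept too.

def ridge_fit(xs, ys, lam):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # 2x2 system: [[n + lam, sx], [sx, sxx + lam]] @ [b, m] = [sy, sxy]
    a11, a12 = n + lam, sx
    a21, a22 = sx, sxx + lam
    det = a11 * a22 - a12 * a21
    b = (sy * a22 - a12 * sxy) / det  # Cramer's rule: intercept
    m = (a11 * sxy - a21 * sy) / det  # Cramer's rule: slope
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m0, b0 = ridge_fit(xs, ys, lam=0.0)  # lam = 0 reduces to ordinary least squares
m1, b1 = ridge_fit(xs, ys, lam=1.0)  # a larger penalty shrinks the slope
print(f"lam=0: y = {m0:.3f}x + {b0:.3f}")
print(f"lam=1: y = {m1:.3f}x + {b1:.3f}")
```

With lam = 0 this reduces to the ordinary least squares fit; increasing lam shrinks the coefficients toward zero, which is what stabilizes the fit when predictors are correlated.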
In the meantime, I hope you’ve improved your understanding of linear regressions and how such regressions can be calculated. Enjoy!