A basic statistical method for finding relationships between variables, and a useful benchmark model

If you are here, you are either taking your first steps toward becoming a data scientist or just refreshing your memory. Either way, it's great to have you here!

So what are __linear regression models__ all about?
They are used to show or predict the relationship between two variables or factors.
What is a factor? Good question!
The factor being predicted (the factor that the equation *solves for*) is called the **dependent variable**.
The factors used to predict the value of the dependent variable are called the **independent variables**.

In simple linear regression, each observation consists of two values: one for the dependent variable and one for the independent variable. When the observations are plotted, a straight line approximates the relationship between the dependent variable and the independent variable.

One important thing to remember when applying this model to your data set: when two or more independent variables are used in regression analysis, the model is no longer a simple linear one. This is known as multiple regression.

## Formula For a Simple Linear Regression Model

In simple linear regression analysis there are two factors, designated $x$ and $y$, and the equation that describes how $y$ is related to $x$ is known as the **regression model**.

Simple linear regression model formula:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Here, LRM stands for linear regression model.
The LRM contains an error term that is represented by $\varepsilon$.
The error term is used to account for the variability in $y$ that cannot be explained by the linear relationship between $x$ and $y$. If $\varepsilon$ were not present, that would mean that knowing $x$ would provide enough information to determine the value of $y$.

There are also parameters that represent the population being studied. These parameters of the model are represented by $\beta_0$ and $\beta_1$.

The simple LRM is graphed as a straight line, where:

- $\beta_0$ is the y-intercept of the regression line.
- $\beta_1$ is the slope.
- $E(y)$ is the mean or expected value of $y$ for a given value of $x$.
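To make the roles of the intercept, the slope, and the error term concrete, here is a minimal sketch that simulates data from the model. The parameter values, the sample size, and the choice of a normally distributed error term with mean zero are all assumptions made for illustration; none of them come from a real data set.

```python
import numpy as np

# Hypothetical population parameters, chosen only for illustration.
beta0 = 2.0  # y-intercept
beta1 = 0.5  # slope

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
x = np.linspace(0.0, 10.0, 50)       # 50 values of the independent variable

# Error term: assumed here to be normal with mean 0 and standard deviation 1.
eps = rng.normal(loc=0.0, scale=1.0, size=x.size)

expected_y = beta0 + beta1 * x  # E(y): the straight line itself
y = expected_y + eps            # observations scatter around that line

# The gap between each observation and the line is the error term.
print(np.allclose(y - expected_y, eps))
```

Without `eps`, every observation would sit exactly on the line, which is the situation described above where knowing the independent variable fully determines the dependent one.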

A regression line can show a positive linear relationship, a negative linear relationship, or no relationship.

- **No relationship:** The graphed line in a simple linear regression is flat (not sloped), or the pattern in the data is nonlinear. In either case there is no linear relationship between the two variables.
- **Positive relationship:** The regression line slopes upward, with the lower end of the line at the y-axis (the y-intercept) and the upper end extending upward into the graph field, away from the x-axis. There is a positive linear relationship between the two variables: as the value of one increases, the value of the other also increases.
- **Negative relationship:** The regression line slopes downward, with the upper end of the line at the y-axis (the y-intercept) and the lower end extending downward into the graph field, toward the x-axis. There is a negative linear relationship between the two variables: as the value of one increases, the value of the other decreases.
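The three cases come down to the sign of the slope, which the sketch below checks on tiny made-up data sets. The function name `describe_relationship` and the `tolerance` value are hypothetical choices for this example, and `np.polyfit` is used only as a convenient way to get a fitted slope.

```python
import numpy as np

def describe_relationship(x, y, tolerance=1e-9):
    """Label the direction of the fitted line by the sign of its slope."""
    # A degree-1 polynomial fit returns [slope, intercept].
    slope, _intercept = np.polyfit(x, y, deg=1)
    if slope > tolerance:
        return "positive"
    if slope < -tolerance:
        return "negative"
    return "none"  # flat line: no linear relationship

x = [1, 2, 3, 4]
print(describe_relationship(x, [2, 4, 6, 8]))  # y rises with x
print(describe_relationship(x, [8, 6, 4, 2]))  # y falls as x rises
print(describe_relationship(x, [5, 5, 5, 5]))  # y does not change
```

With real, noisy data the fitted slope is almost never exactly zero, so a fixed tolerance like this one is only a teaching device. Deciding whether a slope differs meaningfully from zero takes a statistical test, which is beyond this sketch.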

## The Estimated Linear Regression Equation

If the parameters of the population were known, the simple linear regression equation (shown below) could be used to compute the mean value of $y$ for a known value of $x$.

$$E(y) = \beta_0 + \beta_1 x$$

The error term $\varepsilon$ does not appear here because its expected value is taken to be zero.

In practice, however, parameter values generally are not known, so they must be estimated by using data from a sample of the population.
The population parameters are estimated by using sample statistics.
The sample statistics are represented by $b_0$ and $b_1$.
When the sample statistics are substituted for the population parameters, the estimated regression equation is formed.

The estimated regression equation is:

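$$\hat{y} = b_0 + b_1 x$$

Note: $\hat{y}$ is pronounced "y hat". Here $b_0$ and $b_1$ are the sample estimates of $\beta_0$ and $\beta_1$, and there is no $\varepsilon$ term because the equation gives an estimate rather than an individual observation.

The graph of the estimated simple regression equation is called the estimated regression line.

- $b_0$ is the y-intercept of the estimated regression line.
- $b_1$ is the slope.
- $\hat{y}$ is the estimated value of $y$ for a given value of $x$.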
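As a rough illustration of how the estimates can be computed from a sample, the sketch below uses ordinary least squares. The text above does not name an estimation method, so treat that choice as an assumption. The function name `fit_simple_linear_regression` and the five data points are made up for the example, and the estimates are written `b0` and `b1` in the code.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return least-squares estimates (b0, b1) of the intercept and slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Small made-up sample, not real measurements.
x_sample = [1, 2, 3, 4, 5]
y_sample = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear_regression(x_sample, y_sample)
y_hat = b0 + b1 * np.asarray(x_sample, dtype=float)  # estimated value for each x

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

In day-to-day work you would more likely call an existing routine such as `np.polyfit(x, y, deg=1)`; writing the two formulas out by hand just makes it visible where the intercept and slope come from.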

## Limits of Simple Linear Regression

No matter how good your dataset is, it will not always tell the full story you are after...

In most cases, you will use regression analysis to establish that a correlation exists between variables.
**Important!**
Keep in mind that *correlation* is not the same as *causation*: a relationship between two variables does not mean one causes the other to happen. Even a line in a simple linear regression that fits the data points well *does not guarantee a cause-and-effect relationship*.

Using a linear regression model will allow you to discover whether a relationship between variables exists at all. To understand exactly what that relationship is, and whether one variable causes another, you will need additional research and statistical analysis.
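One way to see the gap between correlation and causation is a synthetic example in which a hidden third variable drives both `x` and `y`. Everything here is simulated under stated assumptions: the hidden variable `z`, the coefficients, and the noise levels are invented for illustration and do not describe any real data.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the run is repeatable
n = 500

z = rng.normal(size=n)                       # hidden common cause
x = z + rng.normal(scale=0.5, size=n)        # x depends on z only
y = 2.0 * z + rng.normal(scale=0.5, size=n)  # y depends on z only, never on x

# Fit a straight line of y on x and measure how strongly they move together.
slope, intercept = np.polyfit(x, y, deg=1)
r = np.corrcoef(x, y)[0, 1]

print(f"slope = {slope:.2f}, correlation = {r:.2f}")
```

The fitted line slopes clearly upward and the correlation is strong, even though `x` never enters the formula that generates `y`. The regression alone cannot tell this situation apart from one where `x` really does cause `y`.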
