Regression analysis is a statistical method for studying the dependence of a random variable on one or more other variables.

The goal of regression analysis is to measure the relationship between a dependent variable and one (pairwise regression analysis) or several (multiple regression analysis) independent variables. Independent variables are also called factor, explanatory, or determining variables, regressors, and predictors.

The dependent variable is sometimes called the determined, explained, or "response" variable. The extremely widespread use of regression analysis in empirical research is not only due to the fact that it is a convenient tool for testing hypotheses: regression, especially multiple regression, is also an effective technique for modeling and forecasting.

Let's start explaining the principles of regression analysis with the simpler of the two methods: the pairwise method.

Pairwise regression analysis

The first steps in regression analysis are almost identical to those we took when calculating the correlation coefficient. The three main conditions for the effectiveness of correlation analysis by the Pearson method - normal distribution of the variables, interval measurement of the variables, and a linear relationship between the variables - are also relevant for multiple regression. Accordingly, at the first stage scatterplots are constructed, statistical and descriptive analysis of the variables is carried out, and a regression line is computed. As in correlation analysis, regression lines are fitted by the method of least squares.

To illustrate the differences between the two methods of data analysis more clearly, let's turn to the example already considered with the variables "SPS support" and "rural population share". The original data are identical. The difference in the scatterplots is that in regression analysis it is required to plot the dependent variable - in our case "SPS support" - along the Y axis, whereas in correlation analysis the choice of axes does not matter. After removing outliers, the scatterplot looks like this:

The fundamental idea of regression analysis is that, given a general trend of the variables - in the form of a regression line - one can predict the value of the dependent variable from the values of the independent one.

Let's recall the ordinary mathematical linear function. Any line in Euclidean space can be described by the formula

y = a + bx,

where a is a constant that specifies the offset along the y-axis, and b is the coefficient that determines the slope of the line.

Knowing the slope and the constant, you can calculate (predict) the value of y for any x.

This simplest function forms the basis of the regression analysis model, with the caveat that we predict the value of y not exactly but within a certain confidence interval, i.e., approximately.

The constant is the point where the regression line crosses the y-axis (the Y-intercept, usually labeled "intercept" in statistical packages). In our example of voting for the SPS, its rounded value is 10.55. The slope coefficient b is approximately -0.1 (as in correlation analysis, the sign shows the type of relationship - direct or inverse). Thus, the resulting model is SPS = -0.1 x Rural pop. + 10.55.

So, for the case of the "Republic of Adygea", with a rural population share of 47%, the predicted value is 5.63:

SPS = -0.10 x 47 + 10.55 = 5.63

(the prediction uses the unrounded slope, approximately -0.105; with the coefficient rounded to -0.1 the result would be 5.85).

The difference between the observed and predicted values is called the residual (we have already encountered this term - fundamental for statistics - when analyzing contingency tables). So, for the case of the Republic of Adygea, the residual is 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the less accurately the value is predicted.

We calculate the predicted values and residuals for all cases:

Case                        Rural pop., %   SPS (observed)   SPS (predicted)   Residual
Republic of Adygea          47              3.92             5.63              -1.71
Altai Republic              76              5.40             2.59              2.81
Republic of Bashkortostan   36              6.04             6.78              -0.74
Republic of Buryatia        41              8.36             6.25              2.11
Republic of Dagestan        59              1.22             4.37              -3.15
Republic of Ingushetia      59              0.38             4.37              -3.99
Etc.
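For readers who want to reproduce these numbers, here is a minimal sketch (not from the original article). The slope is assumed to be the unrounded -0.1047 (the text rounds it to -0.1) so that the output matches the table.

```python
# Reproducing the predicted values and residuals of the pairwise model
# SPS = intercept + slope * rural_share for the cases listed above.
cases = {
    "Republic of Adygea":        (47, 3.92),
    "Altai Republic":            (76, 5.40),
    "Republic of Bashkortostan": (36, 6.04),
    "Republic of Buryatia":      (41, 8.36),
    "Republic of Dagestan":      (59, 1.22),
    "Republic of Ingushetia":    (59, 0.38),
}

INTERCEPT, SLOPE = 10.55, -0.1047  # constant a and (assumed unrounded) coefficient b

for name, (rural_share, observed) in cases.items():
    predicted = INTERCEPT + SLOPE * rural_share
    residual = observed - predicted   # residual = observed - predicted
    print(f"{name:27s} predicted={predicted:5.2f} residual={residual:+5.2f}")
```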

Analysis of the relationship between the initial and predicted values serves to assess the quality of the resulting model and its predictive power. One of the main indicators of regression statistics is the multiple correlation coefficient R - the correlation coefficient between the original and the predicted values of the dependent variable. In pairwise regression analysis it is equal to the usual Pearson correlation coefficient between the dependent and the independent variable, in our case 0.63. To interpret multiple R meaningfully, it must be converted into the coefficient of determination. This is done in the same way as in correlation analysis - by squaring. The coefficient of determination R-square (R²) shows the proportion of variation in the dependent variable that is explained by the independent variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable "rural population share" explains about 40% of the variation in the variable "SPS support". The larger the coefficient of determination, the higher the quality of the model.

Another measure of model quality is the standard error of estimate. It measures how widely the points are "scattered" around the regression line. The measure of spread for interval variables is the standard deviation; accordingly, the standard error of estimate is the standard deviation of the distribution of residuals. The higher its value, the greater the spread and the worse the model. In our case, the standard error is 2.18. By this amount our model will "err on average" when predicting the value of the variable "SPS support".
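These quality measures can be recomputed from the figures given in the text; a sketch follows (the full 85-case dataset is not reproduced here, so the ANOVA sums of squares from the table below are taken as given).

```python
import math

R = 0.63                     # multiple R (equals Pearson r in pairwise regression)
r_squared = R ** 2           # coefficient of determination, ~0.39

ss_residual, df_residual = 395.59, 83          # from the ANOVA table below
standard_error = math.sqrt(ss_residual / df_residual)   # ~2.18

print(r_squared, standard_error)
```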

Regression statistics also include analysis of variance. With its help we find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable is accounted for by the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). Variance statistics are especially important for sample studies: they show how likely it is that a relationship between the independent and dependent variables exists in the general population. However, analysis of variance is also informative for complete-coverage studies (as in our example). In this case it is checked whether the revealed statistical pattern is caused by a coincidence of random circumstances and how characteristic it is of the complex of conditions in which the surveyed population finds itself; that is, what is established is not that the result holds for some wider general population, but the degree of its regularity and its freedom from random influences.

In our case, the analysis of variance statistics are as follows:

              SS        df    MS        F        p-level
Regression    258.77    1     258.77    54.29    0.000000001
Residual      395.59    83    4.77
Total         654.36    84

The F-ratio of 54.29 is significant at the 0.000000001 level. Accordingly, we can safely reject the null hypothesis (that the relationship we found is due to chance).
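The F-ratio and its significance level can be recovered from the ANOVA table; a minimal sketch using SciPy's F distribution:

```python
from scipy import stats

ss_reg, df_reg = 258.77, 1
ss_res, df_res = 395.59, 83

F = (ss_reg / df_reg) / (ss_res / df_res)   # ratio of mean squares, ~54.3
p_value = stats.f.sf(F, df_reg, df_res)     # right-tail probability
print(F, p_value)
```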

A similar function is performed by the t-test, but with respect to the regression coefficients (the slope and the Y-intercept). Using the t-test, we check the hypothesis that the regression coefficients are equal to zero in the general population. In our case, we can again confidently reject the null hypothesis.

Multiple regression analysis

The multiple regression model is almost identical to the pairwise regression model; the only difference is that several independent variables are sequentially included in the linear function:

Y = b1X1 + b2X2 + …+ bpXp + a.

If there are more than two independent variables, we cannot obtain a visual representation of their relationship; in this respect, multiple regression is less "visual" than pairwise regression. When there are two independent variables, it can be useful to display the data in a 3D scatterplot. Professional statistical software packages (for example, Statistica) offer an option to rotate the three-dimensional chart, which gives a good visual representation of the data structure.

When working with multiple regression, unlike pairwise regression, it is necessary to choose an analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The stepwise algorithm sequentially includes (or excludes) independent variables based on their explanatory "weight". The stepwise method is good when there are many independent variables: it "cleanses" the model of frankly weak predictors, making it more compact and concise (a sketch of one such procedure follows).
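As an illustration, here is a hedged sketch of one common variant of the stepwise algorithm: forward selection by R-squared gain. The inclusion threshold `min_gain` is an assumption for illustration; real packages usually use F-to-enter or p-value criteria instead.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X (plus an intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_stepwise(X, y, min_gain=0.01):
    """Greedily add the predictor that most improves R-squared."""
    selected, r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        gains = [(r_squared(X[:, selected + [j]], y) - r2, j) for j in remaining]
        best_gain, best_j = max(gains)
        if best_gain < min_gain:          # stop when the gain is negligible
            break
        selected.append(best_j)
        remaining.remove(best_j)
        r2 += best_gain
    return selected, r2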

An additional condition for the correctness of multiple regression (along with the interval measurement, normality, and linearity conditions) is the absence of multicollinearity, i.e., of strong correlations between the independent variables.

The interpretation of multiple regression statistics includes all the elements that we have considered for the case of pairwise regression. In addition, there are other important components in the statistics of multiple regression analysis.

We will illustrate work with multiple regression using the example of testing hypotheses that explain differences in the level of electoral activity across the regions of Russia. Specific empirical studies have suggested that voter turnout is affected by:

National factor (variable "Russian population"; operationalized as the share of the Russian population in the constituent entities of the Russian Federation). It is assumed that an increase in the proportion of the Russian population leads to a decrease in voter turnout;

Urbanization factor (variable "urban population"; operationalized as the share of the urban population in the constituent entities of the Russian Federation, we have already worked with this factor as part of the correlation analysis). It is assumed that an increase in the proportion of the urban population also leads to a decrease in voter turnout.

The dependent variable, "intensity of electoral activity" ("turnout"), is operationalized through average turnout data for the regions in the federal elections from 1995 to 2003. The initial data table for the two independent variables and one dependent variable has the following form:

Case                          Turnout   Urban pop., %   Russian pop., %
Republic of Adygea            64.92     53              68
Altai Republic                68.60     24              60
Republic of Buryatia          60.75     59              70
Republic of Dagestan          79.92     41              9
Republic of Ingushetia        75.05     41              23
Republic of Kalmykia          68.52     39              37
Karachay-Cherkess Republic    66.68     44              42
Republic of Karelia           61.70     73              73
Komi Republic                 59.60     74              57
Mari El Republic              65.19     62              47

Etc. (after removing outliers, 83 cases out of 88 remain)

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-square = 0.38. Therefore, the national factor and the urbanization factor together explain about 38% of the variation in the variable "electoral activity".

2. The standard error of estimate is 3.38. This is how much "on average" the constructed model errs when predicting the level of turnout.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis that the revealed relationships are random is rejected.

4. The t-test for the constant and the regression coefficients of the variables "urban population" and "Russian population" is significant at the 0.0000001, 0.00005, and 0.007 levels, respectively. The null hypothesis that the coefficients are random is rejected.

Additional useful statistics for analyzing the relationship between the initial and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of a case (it shows how much the combination of values of all independent variables for a given case deviates from the mean values of all the independent variables simultaneously). The second is a measure of the influence of a case. Different observations affect the slope of the regression line to different degrees, and Cook's distance allows us to compare them on this indicator. This is useful when cleaning up outliers (an outlier can be thought of as an overly influential case); a computational sketch follows below.
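A hedged sketch of both measures using the usual textbook formulas (note that some packages report the squared Mahalanobis distance, so scaling may differ from any particular program's output):

```python
import numpy as np

def influence_measures(X, y):
    """Mahalanobis distance of each case in predictor space and Cook's
    distance for an OLS fit. X: (n, k) predictors, y: (n,) response."""
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    mse = resid @ resid / (n - k - 1)             # residual mean square

    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T      # hat matrix
    h = np.diag(H)                                # leverages
    cooks = resid**2 / ((k + 1) * mse) * h / (1 - h) ** 2

    d = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    mahal = np.sqrt(np.einsum("ij,jk,ik->i", d, S_inv, d))
    return mahal, cooks
```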

In our example, Dagestan is one of the unique and influential cases.

Case                        Observed   Predicted   Residual   Mahalanobis distance   Cook's distance
Adygea                      64.92      66.33       -1.40      0.69                   0.00
Altai Republic              68.60      69.91       -1.31      6.80                   0.01
Republic of Buryatia        60.75      65.56       -4.81      0.23                   0.01
Republic of Dagestan        79.92      71.01       8.91       10.57                  0.44
Republic of Ingushetia      75.05      70.21       4.84       6.73                   0.08
Republic of Kalmykia        68.52      69.59       -1.07      4.20                   0.00

The actual regression model has the following parameters: Y-intercept (constant) = 75.99; b(Urban pop.) = -0.1; b(Russian pop.) = -0.06. Final formula:

Turnout = -0.1 x Urban pop. - 0.06 x Russian pop. + 75.99.

(For Adygea, for example, -0.1 x 53 - 0.06 x 68 + 75.99 ≈ 66.6, close to the tabulated predicted value of 66.33; the small difference comes from coefficient rounding.)

Can we compare the "explanatory power" of the predictors based on the value of the coefficient b? In this case, yes, since both independent variables are measured in the same percentage format. However, multiple regression most often deals with variables measured on different scales (for example, income in rubles and age in years). Therefore, in the general case it is incorrect to compare the predictive capabilities of variables by their regression coefficients. Multiple regression statistics provide a special beta coefficient (β) for this purpose, calculated separately for each independent variable. It is the standardized coefficient of the factor (computed after the influence of all the other predictors is taken into account) and shows the independent contribution of the factor to the prediction of the response values. In pairwise regression analysis, the beta coefficient is, understandably, equal to the pairwise correlation coefficient between the dependent and the independent variable.

In our example, beta(Urban pop.) = -0.43 and beta(Russian pop.) = -0.28. Thus, both factors negatively affect the level of electoral activity, while the significance of the urbanization factor is noticeably higher than that of the national factor. The combined effect of both factors explains about 38% of the variation in the variable "electoral activity" (see the R-square value). A sketch of computing beta coefficients is given below.
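A minimal sketch, assuming the usual relation beta_j = b_j · s(x_j) / s(y), i.e., the slope one would get after z-standardizing all variables:

```python
import numpy as np

def beta_coefficients(X: np.ndarray, y: np.ndarray, b: np.ndarray) -> np.ndarray:
    """X: (n, k) predictor matrix; y: (n,) response; b: (k,) raw slopes.
    Returns the standardized (beta) coefficients."""
    return b * X.std(axis=0, ddof=1) / y.std(ddof=1)
```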

Regression analysis

Regression (linear) analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. Independent variables are otherwise called regressors or predictors, and dependent variables are called criterion variables. The terminology of dependent and independent variables reflects only the mathematical dependence of the variables (see Spurious correlation), not a cause-and-effect relationship.

Goals of regression analysis

  1. Determining the degree to which the variation of the criterion (dependent) variable is determined by the predictors (independent variables)
  2. Predicting the value of the dependent variable using the independent variable(s)
  3. Determining the contribution of individual independent variables to the variation of the dependent variable

Regression analysis cannot be used to determine whether there is a relationship between variables, since the existence of such a relationship is a prerequisite for applying the analysis.

Mathematical definition of regression

Strictly speaking, regression dependence can be defined as follows. Let Y, X1, X2, ..., Xp be random variables with a given joint probability distribution. If for each set of values x1, ..., xp a conditional expectation

y(x1, ..., xp) = E(Y | X1 = x1, ..., Xp = xp)   (general regression equation)

is defined, then the function y(x1, ..., xp) is called the regression of Y on X1, ..., Xp, and its graph the regression line of Y on X1, ..., Xp, or the regression equation.

The dependence of Y on X1, ..., Xp manifests itself in the change of the mean values of Y as x1, ..., xp change, although for each fixed set of values x1, ..., xp the quantity Y remains a random variable with a certain dispersion.

To clarify how accurately regression analysis estimates the change in Y as x1, ..., xp change, the average value of the variance of Y over different sets of values x1, ..., xp is used (in effect, a measure of the dispersion of the dependent variable around the regression line).

Least squares method (calculation of coefficients)

In practice, the regression line is most often sought as a linear function Y = b0 + b1X1 + ... + bpXp (linear regression) that best approximates the desired curve. This is done by the method of least squares, in which the sum of the squared deviations of the actually observed y_k from their estimates ŷ_k is minimized (meaning estimates obtained from a straight line that claims to represent the desired regression dependence):

Σ_{k=1..M} (y_k − ŷ_k)² → min

(M is the sample size). This approach is based on the well-known fact that the sum appearing in the above expression takes its minimum value precisely when ŷ coincides with the regression y(x1, ..., xp).

To solve the problem of regression analysis by the method of least squares, the concept of a residual function is introduced:

Φ(b0, b1, ..., bp) = Σ_{k=1..M} (y_k − ŷ_k)².

The condition for the minimum of the residual function:

∂Φ/∂b_j = 0,  j = 0, 1, ..., p.

The resulting system is a system of p + 1 linear equations with p + 1 unknowns b0, b1, ..., bp (the normal equations).

If we represent the free terms of the left-hand sides of the equations by a matrix B, and the coefficients of the unknowns on the right-hand sides by a matrix A, we obtain the matrix equation A × b = B, which is easily solved by the Gauss method. The resulting matrix b contains the coefficients of the regression line equation; in the usual design-matrix notation the same system is written as (XᵀX) b = Xᵀy.

To obtain the best estimates, the prerequisites of the least squares method (the Gauss–Markov conditions) must be fulfilled. In the English-language literature such estimates are called BLUE (Best Linear Unbiased Estimators).
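A minimal sketch of solving the normal equations directly. The data are the six illustrative cases from the pairwise example above, so the coefficients only approximate the full-sample values (10.55 and -0.1); production code would prefer `np.linalg.lstsq`, which uses a numerically stabler decomposition than explicit solution of (XᵀX) b = Xᵀy.

```python
import numpy as np

x = np.array([47.0, 76.0, 36.0, 41.0, 59.0, 59.0])   # rural population share
y = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])   # SPS support

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept column
b = np.linalg.solve(X.T @ X, X.T @ y)        # Gauss-type direct solve of the normal equations
print(b)                                     # [intercept, slope]
```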

Interpreting Regression Parameters

The parameters b_i are partial correlation coefficients; b_i² is interpreted as the proportion of the variance of Y explained by X_i with the influence of the remaining predictors fixed, that is, it measures the individual contribution of X_i to the explanation of Y. In the case of correlated predictors, there is a problem of uncertainty in the estimates, which become dependent on the order in which the predictors are included in the model. In such cases, it is necessary to apply correlation analysis and stepwise regression analysis methods.

Speaking about non-linear models of regression analysis, it is important to note whether we are talking about non-linearity in the independent variables (which, from a formal point of view, is easily reduced to linear regression) or about non-linearity in the estimated parameters (which causes serious computational difficulties). With non-linearity of the first type, from a substantive point of view it is important to single out the appearance in the model of terms of the form X₁², X₁X₂, indicating the presence of interactions between the features X₁, X₂, etc. (see Multicollinearity).


Links

  • www.kgafk.ru - Lecture on "Regression Analysis"
  • www.basegroup.ru - methods for selecting variables in regression models


As a result of studying the material of chapter 4, the student should:

know

  • basic concepts of regression analysis;
  • methods of estimation and properties of estimates of the method of least squares;
  • basic rules for significance testing and interval estimation of the equation and regression coefficients;

be able to

  • find estimates of the parameters of two-dimensional and multiple models of regression equations from sample data, analyze their properties;
  • check the significance of the equation and regression coefficients;
  • find interval estimates of significant parameters;

master

  • the skills of statistical estimation of the parameters of two-dimensional and multiple regression equations, and of checking the adequacy of regression models;
  • the skills of obtaining a regression equation with all coefficients significant using analytical software.

Basic concepts

After correlation analysis has been carried out, once the presence of statistically significant relationships between variables has been identified and the degree of their closeness assessed, one usually proceeds to a mathematical description of the type of dependency using regression analysis methods. For this purpose, a class of functions is selected that links the effective indicator y and the arguments x1, x2, ..., xk; estimates of the parameters of the constraint equation are calculated, and the accuracy of the resulting equation is analyzed.

A function f(x1, x2, ..., xk) describing the dependence of the conditional mean value of the effective feature y on the given values of the arguments is called a regression equation.

The term "regression" (from lat. regression- retreat, return to something) was introduced by the English psychologist and anthropologist F. Galton and is associated with one of his first examples, in which Galton, processing statistical data related to the question of the heredity of growth, found that if the height of the fathers deviates from the average height all fathers on X inches, then the height of their sons deviates from the average height of all sons by less than x inches The identified trend was called regression to the mean.

The term "regression" is widely used in the statistical literature, although in many cases it does not accurately characterize the statistical dependence.

For an exact description of the regression equation, it is necessary to know the conditional distribution law of the effective indicator y. In statistical practice such information usually cannot be obtained, so one confines oneself to finding suitable approximations for the function f(x1, x2, ..., xk), based on a preliminary substantive analysis of the phenomenon or on the original statistical data.

Within the framework of individual model assumptions about the type of distribution of the vector of indicators (y, x1, ..., xk), a general form of the regression equation can be obtained. For example, under the assumption that the studied set of indicators obeys the (k + 1)-dimensional normal distribution law with vector of mathematical expectations (μ_y, μ_{x1}, ..., μ_{xk}) and a covariance matrix containing the variance σ_y² of y and the covariances of y with the x's, the regression equation (conditional expectation) has the form

E(y | x1, ..., xk) = μ_y + Σ_{yx} Σ_{xx}⁻¹ (x − μ_x).

Thus, if the multivariate random variable (y, x1, ..., xk) obeys the (k + 1)-dimensional normal distribution law, then the regression equation of the effective indicator y on the explanatory variables is linear in x.

However, in statistical practice one usually has to confine oneself to finding suitable approximations for the unknown true regression function f(x), since the researcher does not have exact knowledge of the conditional probability distribution law of the analyzed effective indicator y for the given values of the arguments x.

Consider the relationship between the true, model, and estimated regression. Let the effective indicator y be related to the argument x by the relation

y = f(x) + ε,

where ε is a random variable with a normal distribution law and E(ε) = 0. The true regression function in this case is f(x) = E(y | x).

Suppose that we do not know the exact form of the true regression equation, but have nine observations of a two-dimensional random variable related by this relation, shown in Fig. 4.1.

Fig. 4.1. The relative position of the true f(x) and the theoretical ŷ(x) regression models

The location of the points in Fig. 4.1 allows us to confine ourselves to the class of linear dependencies of the form

ŷ(x) = b0 + b1x.

Using the least squares method, we find an estimate for the regression equation.

For comparison, Fig. 4.1 shows the graphs of the true regression function and of the theoretical approximating regression function. The estimate of the regression equation converges in probability to the latter, ŷ(x), as the sample size increases without limit (n → ∞).

Since we mistakenly chose a linear regression function instead of the true regression function (which, unfortunately, is quite common in the practice of statistical research), our statistical conclusions and estimates will not have the consistency property: no matter how much we increase the number of observations, our sample estimate ŷ(x) will not converge to the true regression function f(x).

If we had chosen the class of regression functions correctly, then the inaccuracy of the description using ŷ(x) would be explained only by the finiteness of the sample and, consequently, could be made arbitrarily small as n → ∞.

In order to best reconstruct the conditional value of the effective indicator y(x) and the unknown regression function f(x) from the initial statistical data, the following loss functions (adequacy criteria) are most often used.

1. The least squares method, according to which the sum of the squared deviations of the observed values of the effective indicator y_i from the model values ŷ_i = f(x_i, b) is minimized, where b = (b0, b1, ..., bk) are the coefficients of the regression equation and x_i is the value of the vector of arguments in the i-th observation:

Σ_{i=1..n} (y_i − f(x_i, b))² → min over b.

The problem of finding an estimate of the vector b is solved. The resulting regression is called mean-square regression.

2. The method of least modules, according to which the sum of the absolute deviations of the observed values of the effective indicator from the model values is minimized:

Σ_{i=1..n} |y_i − f(x_i, b)| → min over b.

The resulting regression is called mean-absolute (median) regression.

3. The minimax method reduces to minimizing the maximum modulus of the deviation of the observed value of the effective indicator y_i from the model value:

max_i |y_i − f(x_i, b)| → min over b.

The resulting regression is called minimax regression.
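A hedged sketch comparing the three criteria on synthetic data (the data-generating line 2 + 0.5x is an assumption for illustration). Nelder–Mead is used because the absolute-value and maximum criteria are non-smooth:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1, x.size)   # synthetic observations

def fit(loss):
    """Fit y ~ p[0] + p[1]*x by minimizing the given loss on residuals."""
    res = minimize(lambda p: loss(y - (p[0] + p[1] * x)),
                   x0=np.zeros(2), method="Nelder-Mead")
    return res.x

b_ls  = fit(lambda r: np.sum(r ** 2))       # 1. mean-square regression
b_lad = fit(lambda r: np.sum(np.abs(r)))    # 2. mean-absolute (median) regression
b_mm  = fit(lambda r: np.max(np.abs(r)))    # 3. minimax regression
print(b_ls, b_lad, b_mm)
```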

In practical applications, one often encounters problems in which a random variable y is studied that depends on some set of variables x1, ..., xk and unknown parameters β0, β1, ..., βk. We will consider (y, x1, ..., xk) as a (k + 1)-dimensional general population from which a random sample of size n is drawn, where (y_i, x_{i1}, ..., x_{ik}) is the result of the i-th observation, i = 1, ..., n. It is required to estimate the unknown parameters from the results of the observations. The task described above is a task of regression analysis.

Regression analysis is the method of statistical analysis of the dependence of a random variable y on variables x1, ..., xk, which are treated in regression analysis as non-random variables, regardless of their true distribution law.

In statistical modeling, regression analysis is a study used to evaluate the relationship between variables. This mathematical method encompasses many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps you understand how the typical value of the dependent variable changes when one of the independent variables changes while the other independent variables remain fixed.

In all cases, the estimation target is a function of the independent variables, called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Tasks of regression analysis

This statistical research method is widely used for forecasting, where it offers significant advantages, but it can sometimes lead to illusory or spurious relationships, so it is recommended to use it carefully: for example, correlation does not imply causation.

A large number of methods have been developed for performing regression analysis, such as linear and ordinary least squares regression, which are parametric. Their essence is that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression allows its function to lie in a certain set of functions, which can be infinite-dimensional.

As a statistical research method, regression analysis in practice depends on the form of the data-generating process and on how it relates to the regression approach. Since the true form of the data-generating process is typically unknown, regression analysis of data often depends to some extent on assumptions about this process. These assumptions are sometimes testable if enough data is available. Regression models are often useful even when the assumptions are moderately violated, although they may not perform at their best.

In a narrower sense, regression can refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable is also called metric regression to distinguish it from related problems.

History

The earliest form of regression is the well-known method of least squares. It was published by Legendre in 1805 and Gauss in 1809. Legendre and Gauss applied the method to the problem of determining from astronomical observations the orbits of bodies around the Sun (mainly comets, but later also newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821, including a variant of the Gauss-Markov theorem.

The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon. The bottom line was that the growth of descendants from the growth of ancestors, as a rule, regresses down to the normal average. For Galton, regression had only this biological meaning, but later his work was taken up by Udni Yoley and Karl Pearson and taken to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is considered to be Gaussian. This assumption was rejected by Fischer in the papers of 1922 and 1925. Fisher suggested that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this regard, Fisher's suggestion is closer to Gauss's 1821 formulation. Prior to 1970, it sometimes took up to 24 hours to get the result of a regression analysis.

Regression analysis methods continue to be an area of ​​active research. In recent decades, new methods have been developed for robust regression; regressions involving correlated responses; regression methods that accommodate various types of missing data; nonparametric regression; Bayesian regression methods; regressions in which predictor variables are measured with error; regressions with more predictors than observations; and causal inferences with regression.

Regression Models

Regression analysis models include the following variables:

  • Unknown parameters, denoted as beta, which can be a scalar or a vector.
  • Independent variables, X.
  • Dependent variables, Y.

In various fields of science where regression analysis is applied, different terms are used instead of dependent and independent variables, but in all cases the regression model relates Y to a function of X and β.

The approximation is usually formulated as E(Y | X) = F(X, β). To carry out regression analysis, the form of the function F must be determined. Sometimes it is based on knowledge about the relationship between Y and X that does not rely on the data. If such knowledge is not available, then a flexible or convenient form for F is chosen.

Dependent variable Y

Let us now assume that the vector of unknown parameters β has length k. To perform a regression analysis, the user must provide information about the dependent variable Y:

  • If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be carried out, since the system of equations defining the regression model is underdetermined: there is not enough data to recover β.
  • If exactly N = k points are observed and the function F is linear, then the equation Y = F(X, β) can be solved exactly rather than approximately. This reduces to solving a set of N equations with N unknowns (the elements of β), which has a unique solution as long as the components of X are linearly independent. If F is non-linear, a solution may not exist, or many solutions may exist.
  • The most common situation is N > k data points. In this case, there is enough information in the data to estimate a unique value for β that best fits the data, and the regression model applied to the data can be viewed as an overdetermined system in β.

In the latter case, regression analysis provides tools for:

  • Finding a solution for unknown parameters β, which will, for example, minimize the distance between the measured and predicted value of Y.
  • Under certain statistical assumptions, regression analysis uses excess information to provide statistical information about the unknown parameters β and the predicted values ​​of the dependent variable Y.

Required number of independent measurements

Consider a regression model that has three unknown parameters: β0, β1, and β2. Suppose the experimenter makes 10 measurements at the same value of the independent variable vector X. In this case, regression analysis does not give a unique set of values: the best one can do is estimate the mean and standard deviation of the dependent variable Y. Similarly, by measuring at two different values of X, one can get enough data for a regression with two unknowns, but not for three or more unknowns.

If the experimenter's measurements were taken at three different values ​​of the independent vector variable X, then the regression analysis would provide a unique set of estimates for the three unknown parameters in β.

In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX be invertible (a small demonstration follows below).
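A minimal sketch illustrating the invertibility requirement for a three-parameter model (intercept, x, x²): measurements at a single x value make XᵀX rank-deficient, while three distinct x values make it invertible.

```python
import numpy as np

# Ten measurements at the same x: columns of X are proportional,
# X^T X has rank 1 < 3, so no unique estimate of (beta0, beta1, beta2).
x_same = np.full(10, 5.0)
X = np.column_stack([np.ones(10), x_same, x_same**2])
print(np.linalg.matrix_rank(X.T @ X))   # 1 -> not invertible

# Measurements at three distinct x values: rank 3, unique estimates exist.
x_three = np.array([1.0, 2.0, 3.0] * 4)
X = np.column_stack([np.ones(12), x_three, x_three**2])
print(np.linalg.matrix_rank(X.T @ X))   # 3 -> invertible
```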

Statistical Assumptions

When the number of measurements N is greater than the number of unknown parameters k, and the measurement errors ε_i are random, then, as a rule, the excess information contained in the measurements is used for statistical predictions about the unknown parameters. This excess of information is called the degrees of freedom of the regression.

Underlying Assumptions

Classic assumptions for regression analysis include:

  • The sample is representative of the population for which inference and prediction are made.
  • The error is a random variable with a mean of zero conditional on the explanatory variables.
  • The independent variables are measured without error.
  • The independent variables (predictors) are linearly independent, that is, no predictor can be expressed as a linear combination of the others.
  • The errors are uncorrelated, that is, the error covariance matrix is diagonal and each non-zero element is the variance of the error.
  • The error variance is constant across observations (homoscedasticity). If not, weighted least squares or other methods can be used.

These are sufficient conditions for least squares estimates to have the required properties; in particular, these assumptions mean that the parameter estimates will be unbiased, consistent, and efficient, especially within the class of linear estimators. It is important to note that real data rarely satisfy all the conditions; that is, the method is used even when the assumptions are not exactly correct. Deviations from the assumptions can sometimes be used as a measure of how useful the model is. Many of these assumptions can be relaxed in more advanced methods. Reports of statistical analyses typically include tests of the assumptions against the sample data and assessments of the usefulness of the model.

In addition, variables in some cases refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions. Geographically weighted regression is one method that deals with such data.

A feature of linear regression is that the dependent variable Y_i is a linear combination of the parameters. For example, simple linear regression models n points using one independent variable x_i and two parameters, β0 and β1.

In multiple linear regression, there are several independent variables or their functions.

Random sampling from a population yields a sample from which the parameters of a linear regression model can be estimated.

In this respect, the least squares method is the most popular. It provides parameter estimates that minimize the sum of the squared residuals. This kind of minimization (which is typical of linear regression) leads to a set of normal equations, a set of linear equations in the parameters, which are solved to obtain the parameter estimates.

If we further assume that the population error is normally distributed, the researcher can use these estimates of standard errors to create confidence intervals and carry out hypothesis tests about its parameters.

Nonlinear Regression Analysis

When the function is not linear with respect to the parameters, the sum of squares has to be minimized by an iterative procedure. This introduces many complications, which define the differences between the linear and non-linear least squares methods. Consequently, the results of regression analysis using a non-linear method are sometimes unpredictable.

Calculation of power and sample size

As a rule, there are no generally agreed methods relating the number of observations to the number of independent variables in the model. One rule was proposed by Good and Hardin and looks like N = m^n, where N is the sample size, n is the number of explanatory variables, and m is the number of observations needed to achieve the desired accuracy if the model had only one explanatory variable. For example, a researcher builds a linear regression model using a dataset that contains 1000 patients (N). If the researcher decides that five observations are needed to accurately determine the line (m = 5), then the maximum number of explanatory variables the model can support is 4.
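A tiny sketch of the same rule rearranged: the largest number of explanatory variables n such that m^n ≤ N.

```python
import math

N, m = 1000, 5   # sample size; observations needed per explanatory variable
n_max = math.floor(math.log(N) / math.log(m))
print(n_max)     # 4, matching the example above (5**4 = 625 <= 1000 < 5**5)
```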

Other Methods

Although the parameters of a regression model are usually estimated using the least squares method, there are other methods that are used much less often. For example, these are the following methods:

  • Bayesian methods (for example, Bayesian linear regression).
  • Percentage regression, for situations where reducing percentage errors is considered more appropriate.
  • Least absolute deviations, which is more robust in the presence of outliers and leads to quantile regression.
  • Nonparametric regression, which requires a large number of observations and calculations.
  • Distance metric learning, in which a meaningful distance metric is learned in the given input space.

Software

All major statistical software packages can perform least squares regression analysis. Simple linear regression and multiple regression analysis can be carried out in some spreadsheet applications, as well as on some calculators. While many statistical software packages can perform various types of nonparametric and robust regression, these methods are less standardized: different software packages implement different methods. Specialized regression software has been developed for use in fields such as survey analysis and neuroimaging.

Regression analysis is one of the most popular methods of statistical research. It can be used to determine the degree of influence of independent variables on the dependent variable. Microsoft Excel includes tools designed to carry out this type of analysis. Let's take a look at what they are and how to use them.

But in order to use the functions that allow you to conduct regression analysis, you first need to activate the Analysis ToolPak add-in. Only then will the tools necessary for this procedure appear on the Excel ribbon.


Now, when we go to the "Data" tab, we will see a new button on the ribbon in the "Analysis" toolbox: "Data Analysis".

Types of regression analysis

There are several types of regressions:

  • parabolic;
  • power;
  • logarithmic;
  • exponential;
  • exponential with an arbitrary base;
  • hyperbolic;
  • linear regression.

We will talk in more detail about the implementation of the last type of regression analysis in Excel later.

Linear Regression in Excel

Below, as an example, is a table showing the average daily outdoor air temperature and the number of store customers for the corresponding working day. Let's use regression analysis to find out exactly how weather conditions, in the form of air temperature, can affect the attendance of a retail establishment.

The general linear regression equation looks like this: Y = a0 + a1x1 + ... + akxk. In this formula, Y is the variable whose behavior we are trying to study - in our case, the number of buyers. The x values are the various factors that affect this variable. The parameters a are the regression coefficients, which determine the significance of a particular factor. The index k denotes the total number of these factors.


Analyzing the results

The results of the regression analysis are displayed in the form of a table in the place specified in the settings.

One of the main indicators is R-square. It indicates the quality of the model. In our case, this coefficient is 0.705, or about 70.5%. This is an acceptable level of quality; a value below 0.5 indicates a poor fit.

Another important indicator is located in the cell at the intersection of the "Y-intercept" row and the "Coefficients" column. It indicates what value Y (in our case, the number of buyers) will take when all other factors are equal to zero. In this table, this value is 58.04.

The value at the intersection of the "Variable X1" row and the "Coefficients" column shows the level of dependence of Y on X - in our case, the dependence of the number of store customers on temperature. A coefficient of 1.31 is considered a fairly high indicator of influence.
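A sketch applying the coefficients reported in the Excel output above (Y-intercept 58.04, X1 coefficient 1.31); the temperature value used is an arbitrary illustration, not from the original table.

```python
def predicted_customers(temperature: float) -> float:
    """Predicted number of buyers from the fitted Excel model."""
    return 58.04 + 1.31 * temperature

print(predicted_customers(20))  # ~84 customers at an average temperature of 20
```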

As you can see, it is quite easy to create a regression analysis table using Microsoft Excel. But only a trained person can work with the output data and understand its essence.
