Chapter 17

Multivariate Analysis I: Multiple Regression Analysis

Learning Objectives

Upon completion of this chapter, you will be able to:

  • Understand the applications of the multiple regression model
  • Understand the concept of coefficient of multiple determination, adjusted coefficient of multiple determination, and standard error of the estimate
  • Understand and use residual analysis for testing the assumptions of multiple regression
  • Use statistical significance tests for the regression model and coefficients of regression
  • Test portions of the multiple regression model
  • Understand non-linear regression model and the quadratic regression model, and test the statistical significance of the overall quadratic regression model
  • Understand the concept of model transformation in regression models
  • Understand the concept of collinearity and the use of variance inflationary factors in multiple regression
  • Understand the conceptual framework of model building in multiple regression
STATISTICS IN ACTION: HINDUSTAN PETROLEUM CORPORATION LTD (HPCL)

Indian oil major Hindustan Petroleum Corporation Ltd (HPCL) secured the 336th rank in the Fortune 500 list of 2007. It operates two major refineries producing a wide variety of petroleum fuels and specialities, one in Mumbai and the other in Visakhapatnam. HPCL also owns and operates the largest lube refinery in the country, producing lube base oils of international standard. This refinery accounts for over 40% of India’s total lube base oil production.1

HPCL has a number of retail outlets launched on the platform of “outstanding customer and vehicle care” and branded as “Club HP” outlets. In order to cater to the rural market, HPCL operates through “Hamara Pump” outlets, which sell not only fuel but also seeds, pesticides, and fertilizers to farmers through “Kisan Vikas Kendras” set up at selected “Hamara Pump” outlets. The company remains firmly committed to meeting fuel requirements without compromising on quality and quantity, extending its refining capacity through brownfield and greenfield additions, and maintaining and improving its market share across segments through organic and inorganic growth along the value chain. With the growth in the Indian economy and rising income levels, the demand for petroleum products is expected to increase, presenting opportunities to companies in the petroleum and refining segment.2

Table 17.1 presents compensation paid to employees, marketing expenses, travel expenses, and the profit after tax for HPCL from 2000 to 2007. Suppose that a researcher wants to develop a model to find the impact of marketing expenses, travel expenses, and profit after tax on compensation paid to employees. How can this be done? This chapter provides the answer to this question. Additionally, residual analysis, statistical significance test for regression model and coefficients of regression, non-linear regression model, model transformation, collinearity, variance inflationary factors, and model building in multiple regression are also discussed in this chapter.

Table 17.1 Compensation to employees, marketing expenses, travel expenses, and profit after tax (in million rupees) from 2000–2007 for HPCL

tbl1

Source: Prowess (V. 3.1), Centre for Monitoring Indian Economy Pvt. Ltd, Mumbai, accessed September 2008, reproduced with permission.

17.1 Introduction

In Chapter 16, we discussed simple regression analysis in which one independent variable, x, was used to predict the dependent variable, y. A best-fit model can also be developed when there is more than one independent variable. Regression analysis with two or more independent variables, or with at least one non-linear predictor, is referred to as multiple regression analysis. In this chapter, we will discuss cases of multiple regression analysis where several independent or explanatory variables are used to predict one dependent variable.

Regression analysis with two or more independent variables or at least one non-linear predictor is referred to as multiple regression analysis

17.2 The Multiple Regression Model

In Chapter 16, we discussed that a probabilistic regression model for any specific dependent variable yi can be given as

yi = β0 + β1xi + ϵi

where β0 is the population y intercept, β1 the slope of the regression line, yi the value of the dependent variable for ith value, xi the value of the independent variable for ith value, and ϵi the random error in y for observation i (ϵ is the Greek letter epsilon).

In the case of multiple regression analysis, where more than one explanatory variable is used, the above probabilistic model can be extended to more than one independent variable and presented as the multiple probabilistic regression model:

Multiple regression model with k independent variables

yi = β0 + β1x1 + β2x2 + β3x3 + … + βkxk + ϵi

where yi is the value of the dependent variable for ith value, β0 the y intercept, β1 the slope of y with independent variable x1 holding variables x2, x3, …, xk constant, β2 the slope of y with independent variable x2 holding variables x1, x3, …, xk constant, β3 the slope of y with independent variable x3 holding variables x1, x2, x4, …, xk constant, βk the slope of y with independent variable xk holding variables x1, x2, x3, …, xk – 1 constant, and ϵi the random error in y for observation i (ϵ is the Greek letter epsilon).

In a multiple regression analysis, βi is the slope of y with independent variable xi holding all other independent variables constant. This is also referred to as a partial regression coefficient for the independent variable xi. βi indicates increase in dependent variable y, for unit increase in independent variable xi holding all other independent variables constant.

In order to predict the value of y, a researcher has to calculate the values of β0, β1, β2, β3, …, βk. As in simple regression analysis, the challenge lies in observing the entire population. So, sample data are used to develop a sample regression model, which can then be used to make predictions about the population parameters. An equation for estimating y with the sample information is given as

In a multiple regression analysis, βi is slope of y with independent variable xi holding all other independent variables constant. This is also referred to as partial regression coefficient for independent variable xi.

Multiple regression equation

ŷi = b0 + b1x1 + b2x2 + b3x3 + … + bkxk

where ŷi is the predicted value of dependent variable y, b0 the estimate of the regression constant, b1 the estimate of regression coefficient β1, b2 the estimate of regression coefficient β2, b3 the estimate of regression coefficient β3, bk the estimate of regression coefficient βk, and k the number of independent variables. Figure 17.1 shows the summary of the estimation process for multiple regression.

M16_NAVA_ISBN_F001.png

Figure 17.1 Summary of the estimation process for multiple regression
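The estimation process summarized in Figure 17.1 can also be illustrated with a short program. The following Python sketch, using a small, purely hypothetical data set invented for illustration, computes the least-squares estimates b0, b1, and b2 from sample data:

import numpy as np

# Hypothetical sample: 6 observations of two explanatory variables and y
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([4.1, 4.9, 8.0, 9.1, 12.2, 12.8])

# Prepend a column of 1s so that the first estimate is the intercept b0
Xd = np.column_stack([np.ones(len(X)), X])

# Least-squares estimates of beta0, beta1, beta2
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)

y_hat = Xd @ b        # predicted values of y
print(b)              # sample regression coefficients b0, b1, b2

The same routine extends to any number k of explanatory variables; each additional variable simply adds one more column to X.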

17.3 Multiple Regression Model with Two Independent Variables

Multiple regression model with two independent variables is the simplest multiple regression model where the highest power of any of the variables is equal to one. Multiple regression model with two independent variables is given as

Multiple regression model with two independent variables is the simplest multiple regression model where highest power of any of the variables is equal to one.

Multiple regression model with two independent variables:

yi = β0 + β1x1 + β2x2 + ϵi

where yi is the value of the dependent variable for ith value, β0 the y intercept, β1 the slope of y, with independent variable x1 holding variable x2 constant, β2 the slope of y with independent variable x2 holding variable x1 constant, and ϵi the random error in y, for observation i.

As in simple regression analysis, in a multiple regression analysis sample regression coefficients (b0, b1, and b2) are used to estimate the population parameters (β0, β1, and β2).

In a multiple regression analysis, sample regression coefficients (b0, b1, and b2) are used to estimate population parameters (β0, β1, and β2). Multiple regression equation with two independent variables is given as

Multiple regression equation with two independent variables:

ŷi = b0 + b1x1 + b2x2

where ŷi is the predicted value of dependent variable y, b0 the estimate of the regression constant, b1 the estimate of regression coefficient β1, and b2 the estimate of regression coefficient β2.

Example 17.1

A consumer electronics company has adopted an aggressive policy to increase the sales of a newly launched product. The company has invested in advertisements and employed salesmen to increase sales rapidly. Table 17.2 presents the sales, the number of salesmen employed, and the advertisement expenditure for 24 randomly selected months. Develop a regression model to predict the impact of advertisement and the number of salesmen on sales.

Table 17.2 Sales, number of salesmen employed, and advertisement expenditure for 24 randomly selected months of a consumer electronics company

tbl1

On the basis of the multiple regression model, predict the sales of a given month when the number of salesmen employed is 35 and the advertisement expenditure is 500 thousand rupees.

Solution

Figures 17.2 and 17.3 depict the three-dimensional graphs between sales, salesmen, and advertisement produced using Minitab. Recall that in a simple regression analysis, we obtained a regression line that was the best-fit line through the data points in the xy plane. In multiple regression analysis, the resulting model produces a response surface, and with two independent variables the regression surface is a response plane (shown in Figures 17.2 and 17.3).

In multiple regression analysis, the resulting model produces a response surface; with two independent variables, the regression surface is a response plane.

M16_NAVA_ISBN_F002.png

Figure 17.2 Three-dimensional graph connecting sales, salesmen, and advertisement (scatter plot) produced using Minitab

M16_NAVA_ISBN_F003.png

Figure 17.3 Three-dimensional graph between sales, salesmen, and advertisement (surface plot) produced using Minitab.

The process of using MS Excel for multiple regression is almost the same as that for simple regression analysis. In the case of multiple regression, instead of placing a single independent variable column in Input X Range, we place all the independent variable columns in Input X Range. The remaining process is the same as in the case of simple regression. Figure 17.4 is the MS Excel output (partial) for Example 17.1.

M16_NAVA_ISBN_F004.png

Figure 17.4 MS Excel output (partial) for Example 17.1

The process of using Minitab for multiple regression is also almost the same as that for simple regression analysis. When using Minitab for simple regression analysis, we place the dependent variable in the Response box and one independent variable in the Predictors box. In the case of multiple regression, however, we place the dependent variable in the Response box and all the independent (explanatory) variables in the Predictors box. The remaining process is the same as in the case of simple regression. Figure 17.5 is the Minitab output (partial) for Example 17.1.

M16_NAVA_ISBN_F005.png

Figure 17.5 Minitab output (partial) for Example 17.1

The method of using SPSS for conducting multiple regression analysis is analogous to the method used for simple regression analysis, with a slight difference. While performing multiple regression analysis through SPSS, we place the dependent variable in the Dependent box and the independent variables in the Independent box. The remaining process is the same as in the case of simple regression. Figure 17.6 is the SPSS output (partial) for Example 17.1.

M16_NAVA_ISBN_F006.png

Figure 17.6 SPSS output (partial) for Example 17.1

From Figures 17.4, 17.5, and 17.6, the regression coefficients are

b0 = 3856.69, b1 = –104.32, b2 = 24.60

So, the multiple regression equation can be expressed as

ŷi = b0 + b1x1 + b2x2

or

ŷi = 3856.69 – 104.32x1 + 24.60x2

or

Sales = 3856.69 – (104.32) Salesmen + (24.6) Advertisement.

Interpretation: The sample y intercept b0 is computed as 3856.69. This indicates the expected sales when no salesmen are employed and the expenditure on advertisement is also zero. In other words, this is the sales when x1 (number of salesmen employed) and x2 (advertisement expenditure) are both equal to zero. In general, the practical interpretation of b0 is limited.

b1 is the slope of y with independent variable x1 holding variable x2 constant. That is, b1 is the slope of sales (y) with independent variable salesmen (x1) holding advertisement expenditure (x2) constant. b1 is computed as –104.32. The negative sign of the coefficient b1 indicates an inverse relationship between the dependent variable, sales (y), and the independent variable, salesmen (x1). This means that holding advertisement expenditure (x2) constant, a unit increase in the number of salesmen employed (x1) will result in a predicted decline in sales of 104.32 thousand rupees, that is, ₹104,320.

b2 is the slope of y with independent variable x2 holding the variable x1 constant. In other words, b2 is the slope of sales (y) with independent variable advertisement (x2) holding the number of salesmen employed (x1) constant. In Example 17.1, the computed value of b2 is 24.6. This indicates that holding the number of salesmen employed (x1) constant, a unit increase in advertisement expenditure (thousand rupees) will result in a 24.6(1000), that is, ₹24,600 predicted increase in sales.

On the basis of the regression model developed above, the predicted sales of a given month when the number of salesmen employed is 35 and the advertisement expenditure is ₹500,000 can be calculated very easily. As explained earlier, the regression equation is

Sales = 3856.69 – (104.32) Salesmen + (24.6) Advertisement

Substituting Salesmen (x1) = 35 and Advertisement (x2) = 500 in the equation, the predicted sales of the month can be obtained as below:

ŷ = 3856.69 – 104.32(35) + 24.6(500)

= 3856.69 – 3651.20 + 12,300

= 12,505.49

Therefore, when the number of salesmen employed is 35 and the advertisement expenditure is ₹500,000, the sales of the consumer electronics company are predicted to be ₹12,505.49 thousand.
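This prediction can be verified with a few lines of Python. The sketch below simply codes the fitted equation reported above; the function name predict_sales is ours, used only for illustration.

def predict_sales(salesmen, advertisement):
    # Fitted coefficients for Example 17.1: b0, b1, b2
    return 3856.69 - 104.32 * salesmen + 24.60 * advertisement

print(predict_sales(35, 500))   # 12505.49 (thousand rupees)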

17.4 DETERMINATION OF COEFFICIENT OF MULTIPLE DETERMINATION (R2), ADJUSTED R2, AND STANDARD ERROR OF THE ESTIMATE

This section will focus on the concept of the coefficient of multiple determination (R2), adjusted R2, and the standard error of the estimate.

17.4.1 Determination of Coefficient of Multiple Determination (R2)

In Chapter 16, we discussed the coefficient of determination (r2). The coefficient of determination (r2) measures the proportion of variation in dependent variable y that can be attributed to the independent variable x. This is valid for one independent and one dependent variable in the case of simple linear regression. In multiple regression, there are at least two independent variables and one dependent variable. Therefore, in the case of multiple regression, the coefficient of multiple determination (R2) is the proportion of variation in the dependent variable y that is explained by the combination of independent (explanatory) variables. The coefficient of multiple determination is denoted by R²y.12 (for two explanatory variables). Therefore, the coefficient of multiple determination can be computed as

R²y.12 = SSR / SST = (regression sum of squares) / (total sum of squares)

From Figures 17.4, 17.5, and 17.6,

R²y.12 = 0.7390

In case of multiple regression, the coefficient of multiple determination (R2) is the proportion of variation in the dependent variable y that is explained by the combination of independent (explanatory) variables.

The coefficient of multiple determination R²y.12 is computed as 0.7390. This implies that 73.90% of the variation in sales is explained by the variation in the number of salesmen employed and the variation in the advertisement expenditure.

17.4.2 Adjusted R2

While computing the coefficient of multiple determination R2, we use the formula

R2 = SSR / SST

If we add independent variables to the regression analysis, the total sum of squares will not change. The inclusion of independent variables is, however, likely to increase SSR, which results in an increase in the value of R2. In some cases, an additional independent variable does not add any new information to the regression model even though it increases the value of R2. In this manner, we may sometimes obtain an inflated value of R2. This difficulty can be resolved by taking into account the adjusted R2, which considers both the additional information that an additional independent variable brings to the regression model and the changed degrees of freedom. The adjusted coefficient of multiple determination (adjusted R2) is given as

Adjusted R2 = 1 – [(1 – R2)(n – 1)/(n – k – 1)]

where n is the number of observations and k the number of independent variables.

For Example 17.1, the value of adjusted R2 can be computed as

Adjusted R2 = 1 – [(1 – 0.7390)(24 – 1)/(24 – 2 – 1)] = 1 – (0.2610)(23/21) ≈ 0.7142

Adjusted R2 is commonly used when a researcher wants to compare two or more regression models having the same dependent variable but different numbers of independent variables. If we compare the values of R2 and adjusted R2, we find that the value of R2 is 0.024, or 2.4 percentage points, more than the value of adjusted R2. This indicates that the adjusted R2 has reduced the overall proportion of the explained variation of the dependent variable attributed to independent variables by 2.4%. If more insignificant variables are added to the regression model, the gap between R2 and adjusted R2 tends to widen.

Adjusted R2 is commonly used when a researcher wants to compare two or more regression models having the same dependent variable but different number of independent variables.

If we analyse the formula of computing the adjusted R2, we find that it reflects both the number of independent variables and the sample size. For Example 17.1, the value of adjusted R2 is computed as 0.714214. This indicates that 71.42% of the total variation in sales can be explained by the multiple regression model adjusted for the number of independent variables and sample size.
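The adjustment is easy to verify by hand or in code. A minimal Python sketch, using the values reported for Example 17.1 (R2 = 0.7390, n = 24, k = 2):

def adjusted_r2(r2, n, k):
    # Adjusted coefficient of multiple determination
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.7390, 24, 2))   # ~0.7141; the chapter's 0.714214 uses the unrounded R2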

17.4.3 Standard Error of the Estimate

In Chapter 16, we discussed that in a regression model the residual is the difference between the actual values (yi) and the regressed (predicted) values (ŷi). Using statistical software programs such as MS Excel, Minitab, and SPSS, the regressed (predicted) values can be obtained very easily. Figure 17.7 is the MS Excel output showing y, predicted y, and residuals. Figure 17.8 is the partial regression output from MS Excel showing the coefficient of multiple determination, adjusted R2, and standard error. Figures 17.9 and 17.10 are partial regression outputs produced using Minitab and SPSS, respectively. Similarly, in Minitab and SPSS, using the storage dialog box (discussed in detail in Chapter 16), predicted y and residuals can be obtained easily.

M16_NAVA_ISBN_F007.png

Figure 17.7 MS Excel output showing y, predicted y, and residuals

Figure 17.8 Partial regression output from MS Excel showing coefficient of multiple determination, adjusted R2, and standard error

M16_NAVA_ISBN_F008.png
M16_NAVA_ISBN_F009.png

Figure 17.9 Partial regression output from Minitab showing coefficient of multiple determination, adjusted R2, and standard error

M16_NAVA_ISBN_F010.png

Figure 17.10 Partial regression output from SPSS showing coefficient of multiple determination, adjusted R2, and standard error

As discussed in Chapter 16, standard error can be understood as the standard deviation of errors (residuals) around the regression line. In a multiple regression model, the standard error of the estimate can be computed as

Standard error = √(SSE/(n – k – 1))

where n is the number of observations and k the number of independent (explanatory) variables.

For Example 17.1, with n = 24 and k = 2, the standard error is computed as

Standard error = √(SSE/(24 – 2 – 1)) = √(SSE/21)

where SSE is the error sum of squares from the ANOVA table; the resulting value is shown in the outputs in Figures 17.8, 17.9, and 17.10.
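Given the actual and predicted y values (such as those shown in Figure 17.7), the standard error of the estimate can be computed directly. A minimal Python sketch, assuming y and y_hat are arrays of equal length:

import numpy as np

def standard_error(y, y_hat, k):
    # Standard deviation of the residuals around the regression surface
    resid = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    n = resid.size
    sse = np.sum(resid ** 2)            # error sum of squares, SSE
    return np.sqrt(sse / (n - k - 1))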

Self-Practice Problems

17A1. Assume that x1 and x2 are the independent variables and y the dependent variable in the data provided in the table below. Determine the line of regression. Comment on the coefficient of multiple determination (R2) and the standard error of the model. Let α = 0.05.

tbl1

17A2. Assume that x1 and x2 are the independent variables and y the dependent variable in the data provided in the table below. Determine the line of regression. Comment on the coefficient of multiple determination (R2) and the standard error of the model. Let α = 0.10.

tbl1

17A3. Mahindra & Mahindra, the flagship company of the Mahindra group manufactures utility vehicles and tractors. Data relating to sales, compensation to employees and advertisement expenses of Mahindra & Mahindra from March 1990 to March 2007 are given in the following table. Taking sales as the dependent variable and compensation to employees and advertisement expenses as independent variables, determine the line of regression. Comment on the coefficient of multiple determination (R2) and the standard error of the model. Let α = 0.05.

tbl1

Source: Prowess (V. 3.1), Centre for Monitoring Indian Economy Pvt. Ltd, Mumbai, accessed September 2008, reproduced with permission.

17.5 Statistical significance test for the regression model and the coefficients of regression

After developing a regression model with a set of appropriate data, checking the adequacy of the regression model is of paramount importance. The adequacy of the regression model can be verified by testing the significance of the overall regression model and the coefficients of regression; by residual analysis for verifying the assumptions of regression; by examining the standard error of the estimate and the coefficient of determination; and by computing the variance inflationary factor (VIF), which will be discussed later in this chapter. In the previous sections, we have discussed residual analysis for verifying the assumptions of regression, the standard error of the estimate, and the coefficient of multiple determination. This section will focus on the statistical significance test for the regression model and the coefficients of regression.

17.5.1 Testing the Statistical Significance of the Overall Regression Model

Testing the statistical significance of the overall regression model can be performed by setting the following hypotheses:

H0 : β1 = β2 = β3 = … = βk = 0

H1 : At least one regression coefficient is ≠ 0

or

H0 : A linear relationship does not exist between the dependent and independent variables.

H1 : A linear relationship exists between dependent variable and at least one of the independent variables.

In the previous chapter (Chapter 16), we discussed that in regression analysis the F test is used to determine the significance of the overall regression model. More specifically, in the case of a multiple regression model, the F test determines whether at least one of the regression coefficients is different from zero. Most statistical software programs such as MS Excel, Minitab, and SPSS provide the F test as a part of the regression output in terms of the ANOVA table. For multiple regression analysis, the F statistic can be defined as

F statistic for testing the statistical significance of the overall regression model

F = MSR / MSE

where,

MSR = SSR / k and MSE = SSE / (n – k – 1)

where k is the number of independent (explanatory) variables in the regression model. The F statistic follows the F distribution with k and n – k – 1 degrees of freedom. Figures 17.11(a), 17.11(b), and 17.11(c) indicate the computation of the F statistic from MS Excel, Minitab, and SPSS, respectively. On the basis of the p value obtained from the outputs, it can be concluded that at least one of the independent variables (salesmen and/or advertisement) is significantly (at the 5% level of significance) related to sales.

M16_NAVA_ISBN_F018a.png

Figure 17.11(a) Computation of the F statistic using MS Excel (partial output for Example 17.1)

M16_NAVA_ISBN_F018b.png

Figure 17.11(b) Computation of the F statistic using Minitab (partial output for Example 17.1)

M16_NAVA_ISBN_F018c.png

Figure 17.11(c) Computation of the F statistic using SPSS (partial output for Example 17.1)
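Because MSR = SSR/k and MSE = SSE/(n – k – 1), the F statistic can equivalently be written as F = (R2/k) / ((1 – R2)/(n – k – 1)). The following Python sketch uses this identity with the R2 reported for Example 17.1; the F value in the software outputs may differ slightly because R2 is rounded here.

from scipy import stats

def overall_f_test(r2, n, k):
    # F statistic for H0: beta1 = beta2 = ... = betak = 0
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)   # right-tail p value
    return f, p

print(overall_f_test(0.7390, 24, 2))  # F ~ 29.7, p value far below 0.05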

17.5.2 t Test for Testing the Statistical Significance of Regression Coefficients

In the previous chapter, we examined the significant linear relationship between the independent variable x and the dependent variable y by applying the t test. The same concept can be applied in an extended form, for testing the statistical significance of regression coefficients for multiple regression. In a simple regression model, the t statistic is defined as

t = (b1 – β1) / Sb1

In case of multiple regression, this concept can be generalized and the t statistic can be defined as

The test statistic t for multiple regression

t = (bj – βj) / Sbj

where bj is the slope of the variable j with dependent variable y holding all other independent variables constant, Sbj the standard error of the regression coefficient bj, and βj the hypothesized population slope for variable j holding all other independent variables constant.

The test statistic t follows a t distribution with n – k – 1 degrees of freedom, where k is the number of independent variables.

The hypotheses for testing the regression coefficient of each independent variable can be set as

H0: β1 = 0

H1: β1 ≠ 0

H0: β2 = 0

H1: β2 ≠ 0

.

.

.

H0: βk = 0

H1: βk ≠ 0

Most statistical software programs such as MS Excel, Minitab, and SPSS provide the t test as a part of the regression output.

Figures 17.12(a), 17.12(b), and 17.12(c) illustrate the computation of the t statistic using MS Excel, Minitab, and SPSS, respectively. The p values obtained lead to the rejection of the null hypothesis and the acceptance of the alternative hypothesis for each coefficient. On the basis of the p values obtained from the outputs, it can be concluded that at the 95% confidence level, a significant linear relationship exists between salesmen and sales. Similarly, at the 95% confidence level, a significant linear relationship exists between advertisement and sales.

M16_NAVA_ISBN_F019a.png

Figure 17.12(a) Computation of the t statistic using MS Excel (partial output for Example 17.1)

M16_NAVA_ISBN_F019b.png

Figure 17.12(b) Computation of the t statistic using Minitab (partial output for Example 17.1)

M16_NAVA_ISBN_F019c.png

Figure 17.12(c) Computation of the t statistic using SPSS (partial output for Example 17.1)
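The t statistics reported by these packages can be reproduced from the raw data, since Sbj is the square root of the corresponding diagonal element of MSE(X′X)⁻¹. A minimal Python sketch testing H0: βj = 0 for every coefficient, where X is the n × k matrix of explanatory variables:

import numpy as np
from scipy import stats

def coefficient_t_tests(X, y):
    # t statistic and two-sided p value for each regression coefficient
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])        # prepend intercept column
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # b0, b1, ..., bk
    resid = y - Xd @ b
    mse = resid @ resid / (n - k - 1)            # SSE / (n - k - 1)
    s_b = np.sqrt(mse * np.diag(np.linalg.inv(Xd.T @ Xd)))   # S_bj
    t = b / s_b                                  # (b_j - 0) / S_bj
    p = 2 * stats.t.sf(np.abs(t), df=n - k - 1)  # two-tailed p values
    return b, s_b, t, p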

Self-Practice Problems

17B1. Test the significance of the overall regression model and the statistical significance of regression coefficients for Problem 17A1.

17B2. Test the significance of the overall regression model and the statistical significance of regression coefficients for Problem 17A2.

17B3. Test the significance of the overall regression model and the statistical significance of regression coefficients for Problem 17A3.

17.6 INDICATOR (DUMMY) VARIABLE MODEL

Regression models are based on the assumption that all independent variables (explanatory) are numerical in nature. There may be cases when some of the variables are qualitative in nature. These variables generate nominal or ordinal information and are used in multiple regression. These variables are referred to as indicator or dummy variables. For example, we have taken advertisement as the explanatory variable to predict sales in previous sections. A researcher may want to include one more variable “display arrangement of products” in retail stores as another variable to predict sales. In most cases, researchers collect demographic information such as gender, educational background, marital status, religion, etc. In order to include these in the multiple regression model, a researcher has to use indicator or dummy variable techniques. In other words, the use of the dummy variable gives a firm grounding to researchers for including categorical variables in the multiple regression model.

Regression models are based on the assumption that all the independent variables (explanatory) are numerical in nature. There may be cases when some of the variables are qualitative in nature. These variables generate nominal or ordinal information and are used in multiple regression. These variables are referred to as indicator or dummy variables.

Researchers usually assign 0 or 1 to code dummy variables in their study. Here, it is important to note that the assignment of code 0 or 1 is arbitrary and the numbers merely represent a place for the category.

Researchers usually assign 0 or 1 to code dummy variables in their study. It is important to note that the assignment of the codes 0 and 1 is arbitrary and the numbers merely represent a place for the category. In many situations, indicator or dummy variables are dichotomous (dummy variables have two categories such as male/female, graduate/non-graduate, married/unmarried, etc.). A particular dummy variable xd is defined as

xd = 0, if the observation belongs to category 1

xd = 1, if the observation belongs to category 2

Example 17.2 clarifies the use of dummy variables in regression analysis.

Example 17.2

A company wants to test the effect of age and gender on the productivity (in terms of units produced by the employees per month) of its employees. The HR manager has taken a random sample of 15 employees and collected information about their age and gender. Table 17.3 provides data about the productivity, age, and gender of 15 randomly selected employees. Fit a regression model considering productivity as the dependent variable and age and gender as the explanatory variables.

Table 17.3 Data about productivity, age, and gender of 15 randomly selected employees.

tbl1

 

Predict the productivity of male and female employees at 45 years of age.

Solution

We need to define a dummy variable for gender for Example 17.2. A dummy variable for gender can be defined as

x2 = 0 (For female)

x2 = 1 (For male)

After assigning code numbers 0 to females and 1 to males, the data obtained from 15 employees is rearranged, as shown in Table 17.4.

The multiple regression model is based on the assumption that the slope of productivity with age is the same for both genders, that is, for males and females. Based on this assumption, the multiple regression model can be defined as

Multiple regression model with two independent variables

yi = β0 + β1x1 + β2x2 + ϵi

where yi is the value of the dependent variable for the ith value, β0 the y intercept, β1 the slope of productivity with independent variable age holding the variable gender constant, β2 the slope of productivity with independent variable gender holding the variable age constant, and ϵi the random error in y, for employee i.

Table 17.4 Data about productivity, age, and gender of 15 randomly selected employees (after coding)

tbl1

 

After coding of the second explanatory variable, gender, the model takes the form of multiple regression with two explanatory variables—age and gender. The solution can be presented in the form of regression output using any of the software applications.

Note: Figures 17.13, 17.14, and 17.15 are the MS Excel, Minitab, and SPSS outputs, respectively, for Example 17.2. The procedure of using MS Excel, Minitab, and SPSS is exactly the same as that used for performing multiple regression analysis with two explanatory variables. In multiple regression with dummy variables, we take the coded column (the column with the 0 and 1 assignment) as the second explanatory variable. The remaining procedure is exactly the same as for multiple regression with two explanatory variables. The following procedure can be used to create a dummy variable column in MS Excel.

M16_NAVA_ISBN_F034.png

Figure 17.13 MS Excel output for Example 17.2

M16_NAVA_ISBN_F035.png

Figure 17.14 Minitab output for Example 17.2

M16_NAVA_ISBN_F036.png

Figure 17.15 SPSS output for Example 17.2
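The coding step itself is a one-line transformation. The Python sketch below uses a few hypothetical records in place of Table 17.4 (the actual data are not reproduced here), codes gender as 0/1, fits the model, and predicts productivity at age 45 for both genders; the two predictions differ exactly by the gender coefficient b2.

import numpy as np

# Hypothetical (age, gender, productivity) records standing in for Table 17.4
records = [(32, 'M', 1420), (45, 'F', 1510), (28, 'F', 1310),
           (50, 'M', 1630), (38, 'M', 1485), (41, 'F', 1450)]

age = np.array([r[0] for r in records], dtype=float)
gender = np.array([0.0 if r[1] == 'F' else 1.0 for r in records])  # dummy: F = 0, M = 1
y = np.array([r[2] for r in records], dtype=float)

Xd = np.column_stack([np.ones(len(y)), age, gender])
b0, b1, b2 = np.linalg.lstsq(Xd, y, rcond=None)[0]

female_45 = b0 + b1 * 45           # dummy = 0
male_45 = b0 + b1 * 45 + b2        # dummy = 1; differs from female_45 by b2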

17.7 COLLINEARITY

In multiple regression analysis, when two independent variables are correlated, it is referred to as collinearity and when three or more variables are correlated, it is referred to as multicollinearity.

A researcher may face problems because of the collinearity of independent (explanatory) variables while performing multiple regression. This situation occurs when two or more independent variables are highly correlated with each other. In a multiple regression analysis, when two independent variables are correlated, it is referred to as collinearity and when three or more variables are correlated, it is referred to as multicollinearity.

In situations when two independent variables are correlated, obtaining new information and measuring the separate effects of these variables on the dependent variable is very difficult. Additionally, collinearity can produce a regression coefficient with an algebraic sign opposite to the one expected for a particular explanatory variable. For identifying the correlated variables, a correlation matrix can be constructed with the help of statistical software programs. This correlation matrix identifies the pairs of variables that are highly correlated. In case of extreme collinearity between two explanatory variables, software programs such as Minitab automatically drop the collinear variable. For example, consider Table 17.5 with sales (in thousand rupees) as the dependent variable and advertisement (in thousand rupees) and number of showrooms as the independent variables. Figure 17.16 is the regression output produced using Minitab for the data given in Table 17.5. From the output (Figure 17.16), it can be seen that the number of showrooms, being a collinear variable, is identified and automatically dropped from the model. The final output contains only one independent variable, that is, advertisement.

M16_NAVA_ISBN_F060.png

Figure 17.16 Minitab output (partial) for sales versus advertisement, number of showrooms

Table 17.5 Sales as the dependent variable and advertisement and number of showrooms as the independent variables

tbl1

 

Collinearity is measured by the variance inflationary factor (VIF) for each explanatory variable. Variance inflationary factor (VIF) for an explanatory variable i can be defined as

Variance inflationary factor (VIF) is given as

Collinearity is measured by variance inflationary factor (VIF) for each explanatory variable.

VIFi = 1 / (1 – R²i)

where R²i is the coefficient of multiple determination of explanatory variable xi with all other x variables.

In a multiple regression analysis, if there are only two explanatory variables, R²1 is the coefficient of multiple determination of explanatory variable x1 with x2. Similarly, R²2 is the coefficient of multiple determination of explanatory variable x2 with x1 (the same as R²1). In the case of a multiple regression analysis with three explanatory variables, R²1 is the coefficient of multiple determination of explanatory variable x1 with x2 and x3; R²2 is the coefficient of multiple determination of explanatory variable x2 with x1 and x3; and R²3 is the coefficient of multiple determination of explanatory variable x3 with x1 and x2. Figures 17.17 and 17.18 are Minitab and SPSS outputs (partial), respectively, indicating the VIF for Example 17.1.

M16_NAVA_ISBN_F061.png

Figure 17.17 Minitab output (partial) indicating VIF for Example 17.1

M16_NAVA_ISBN_F062.png

Figure 17.18 SPSS output (partial) indicating VIF for Example 17.1

If explanatory variables are uncorrelated, the variance inflationary factor (VIF) is equal to 1. A variance inflationary factor (VIF) greater than 10 is an indication of serious multicollinearity problems. For example, suppose the correlation coefficient between two explanatory variables is –0.2679. The variance inflationary factor (VIF) can then be computed as

If explanatory variables are uncorrelated, then variance inflationary factor (VIF) will be equal to 1. Variance inflationary factor (VIF) being greater than 10 is an indication of serious multicollinearity problems.

VIF = 1 / (1 – (–0.2679)²) = 1 / (1 – 0.0718) = 1.077

This value of the variance inflationary factor (VIF) indicates that collinearity does not exist between the explanatory variables.
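The auxiliary-regression definition of the VIF translates directly into code. The Python sketch below regresses each explanatory variable on all the others to obtain R²i; with only two explanatory variables it reduces to 1/(1 – r²), reproducing the value of about 1.077 computed above.

import numpy as np

def vif(X):
    # Variance inflationary factor for each column of X
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)
        Xd = np.column_stack([np.ones(n), others])
        b, *_ = np.linalg.lstsq(Xd, xj, rcond=None)
        resid = xj - Xd @ b
        r2_j = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)  # R squared of x_j on the rest
        factors.append(1.0 / (1.0 - r2_j))
    return factors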

In multiple regression, collinearity is not very simple to handle. A solution to overcome the problem of collinearity is to drop the collinear variable from the regression equation. For example, let us assume that we are measuring the impact of three independent variables x1, x2, and x3 on a dependent variable y. During the analysis, we find that the explanatory variable x1 is highly correlated with the explanatory variable x2. By dropping one of these variables from the multiple regression analysis, we will be able to solve the problem of collinearity. How do we determine which variable should be dropped from the multiple regression analysis? This can be done by comparing R2 and adjusted R2 with and without one of these variables. For example, suppose that with all the three explanatory variables included in the analysis, R2 is computed as 0.95. When x1 is removed from the model, R2 drops to 0.89, and when x2 is removed from the model, R2 drops only to 0.93. In this situation, we can drop the variable x2 from the regression model, and the variable x1 should remain in the model. If the adjusted R2 increases after dropping an independent variable, we can certainly drop that variable from the regression model.

Collinearity is not very simple to handle in multiple regression. One of the best solutions to overcome the problem of collinearity is to drop collinear variables from the regression equation.

In some cases, owing to the importance of the concerned explanatory variable in the study, a researcher is not able to drop the variable from the study. In this situation, some other methods are suggested to overcome the problem of collinearity. One way is to form a new combination of explanatory variables that are uncorrelated with one another and then run the regression on this new uncorrelated combination instead of on the original variables. In this manner, the information content of the original variables is maintained while the collinearity is removed. Another method is to centre the data, which can be done by subtracting the means from the variables and then running the regression on the newly obtained variables.

SUMMARY

Multiple regression analysis is a statistical tool in which several independent or explanatory variables can be used to predict one dependent variable. In multiple regression, the sample statistics b0, b1, b2, …, bk provide the estimates of the population parameters β0, β1, β2, …, βk. The coefficient of multiple determination (R2) is the proportion of variation in the dependent variable y that is explained by the combination of independent (explanatory) variables. Adjusted R2 is used when a researcher wants to compare two or more regression models with the same dependent variable but different numbers of independent variables. The standard error is the standard deviation of errors (residuals) around the regression line.

For residual analysis, in a multiple regression model, we test the linearity of the regression model, constant error variance (homoscedasticity), independence of errors, and normality of errors. The adequacy of the regression model can be verified by testing the significance of the overall regression model and the coefficients of regression. The contribution of an independent variable can be determined by applying the partial F criterion. This provides a platform to estimate the contribution of each explanatory (independent) variable in the multiple regression model. The coefficient of partial determination measures the proportion of variation in the dependent variable that is explained by each independent variable holding all other independent (explanatory) variables constant.

In case a non-linear relationship exists between the dependent variable and an explanatory variable, we have to consider the next option in terms of a quadratic relationship (the most common non-linear relationship) between the two variables. There are also cases when some of the variables are qualitative in nature. These variables generate nominal or ordinal information and are used in multiple regression. Such variables are referred to as indicator or dummy variables, and the dummy variable technique is adopted for using them in the multiple regression model.

In many situations in regression analysis, the assumptions of regression are violated or researchers find that the model is not linear. In both cases, either the dependent variable y, or the independent variable x, or both variables are transformed to avoid the violation of the regression assumptions or to make the regression model linear. Many transformation techniques are available, such as the square root transformation and the log transformation.

In multiple regression, when two independent variables are correlated, it is referred to as collinearity and when three or more variables are correlated, it is referred to as multicollinearity. Collinearity can be identified either by a correlation matrix or by variance inflationary factors (VIF).

A search procedure is used for model development in multiple regression. In this procedure, more than one regression model is developed for a given database. These models are compared on the basis of different criteria depending upon the procedure opted for. Various search procedures, including all possible regressions, stepwise regression, forward selection, and backward elimination, are available in multiple regression.

Key terms

Adjusted R2

Coefficient of multiple determination

Dummy variables

Variance inflationary factors

NOTES
  1. www.hindustanpetroleum.com/aboutsus.htm, accessed October 2008.
  2. Prowess (V. 3.1), Centre for Monitoring Indian Economy Pvt. Ltd, Mumbai, accessed October 2008, reproduced with permission.
Discussion Questions
  1. Explain the concept of multiple regression and explain its importance in managerial decision making.
  2. Explain the use and importance of coefficient of multiple determination (R2) in interpreting the output of multiple regression.
  3. Discuss the concept of the adjusted coefficient of multiple determination (adjusted R2) and standard error in multiple regression.
  4. How can we test the significance of regression coefficients and the overall significance of the regression model?
  5. When does a researcher use the dummy variable technique in multiple regression analysis?
  6. What is collinearity in multiple regression analysis? Explain variance inflationary factor (VIF) and its use in diagnosing collinearity in multiple regression analysis.
Numerical Problems

1. A consultancy wants to examine the relationship between the income of employees and their age and experience. The company takes a random sample of 15 employees and the data collected from these 15 employees are presented in the table below:

tbl1

Taking income as the dependent variable and age and experience as the independent variables, develop a regression model based on the data provided.

2. A cement manufacturing company has discovered that the sales turnover of cement is largely dependent on advertisements on hoardings and wall paintings and not on advertisements in the print media. The company has invested heavily in the first two modes of advertisement. The company’s research team wants to study the impact of these two modes of advertisement on sales. The research team has collected a random sample of the sales for 22 days (given in the table below). Develop a regression model to predict the impact of the two different modes of advertising, hoardings and wall paintings, on sales.

tbl1

On the basis of the regression model, predict the sales on a given day when advertisement expenditure on hoardings and wall paintings are 130 thousand and 70 thousand rupees, respectively.

3. A company wants to predict the demand for a particular product by using the price of the product, the income of households, and the savings of households as related factors. The company has collected data for 15 randomly selected months (given in the table below). Fit a multiple regression model for the data and interpret the results.

tbl1

4. The sales data of a fast food company for 20 randomly selected weeks are given below. Fit an appropriate regression model taking sales as the dependent variable and sales executives as the independent variable and justify the model based on these data.

tbl1

5. The Vice President (Sales) of a computer software company wants to know the relationship between the generation of sales volumes and the age of employees. He believes that some of the variation in sales may be owing to differences in gender. He has randomly selected 12 employees and collected the following data.

tbl1

Fit a regression model, considering the generation of sales volume as the dependent variable and the age of employees and gender as the explanatory variables.

6. A consumer electronics company has 150 showrooms across the country. The Vice President (Marketing) wants to predict sales by using four explanatory variables—show room space (in square feet), electronic display boards in showrooms, number of salesmen, and showroom age. He has taken a random sample of 15 showrooms and collected data with respect to the four explanatory variables. Develop an appropriate regression model using the data given below.

tbl1
CASE STUDY

Case 17: Maruti Udyog Ltd—The Wheels of India

Introduction

The passenger car industry in India was characterized by limited production owing to limited demand before the entry of Maruti Udyog Ltd. The real transformation of the industry took place after Maruti Udyog started operations in 1981. After liberalization, global players such as General Motors, Ford, Suzuki, Toyota, Mitsubishi, Honda, Fiat, Hyundai, Mercedes, and Skoda entered the passenger car segment in India. Sales volumes in the passenger car segment are estimated to touch 2,235,000 units by 2014–2015.1

Many factors have contributed to the increase in demand in the Indian passenger car industry. In India, car penetration is low at 7 cars per 1000 persons as compared to developed countries. This has opened a host of opportunities for car manufacturers. Increasing disposable incomes, possible upgradation from a two wheeler to a four wheeler because of the launch of low priced cars, and the aspirations of Indians to have a better lifestyle are factors that have expanded demand in the passenger car segment. The challenges before the industry include high fuel prices and interest rates, increasing input costs, and the growth of mass transit systems.2 Despite these challenges, the overall scenario seems to be positive for the Indian passenger car industry.

Maruti Suzuki—A Leader in the Passenger Car Segment

Maruti Suzuki, earlier known as Maruti Udyog, is one of India’s leading automobile manufacturers and is the market leader in the passenger car segment. The company was established in February 1981 through an Act of Parliament, as a government company in technological collaboration with Suzuki Motor Corporation of Japan. In its initial years, the government held the controlling stake. In the post-liberalization era, the Indian government divested its stake in the company and exited it completely in May 2007. Maruti’s first product—the Maruti 800 was launched in India in December 1983. After its humble beginning, the company dominated the Indian car market for over two decades and became the first Indian car company to mass produce and sell more than a million cars by 1994. Till March 2007, the company had produced and sold over six million cars.2

Unique Maruti Culture

Maruti strongly believes in the strength of its employees and, on account of this underlying philosophy, has moulded its workforce into teams with common goals and objectives. Maruti’s employee-management relationship is characterized by participative management, team work and kaizen, communication and information sharing, and an open office culture for easy accessibility. Maruti has also taken steps to implement a flat organizational structure. There are only three levels of responsibility in the company’s structure—board of directors, division heads, and department heads. As a firm believer in this philosophy, Maruti has an open office, a common uniform (at all levels), and a common canteen for all.3

On the Road to Becoming Global

Maruti Suzuki India is a major contributor to Suzuki’s global turnover and profits and has ambitious plans to capture the global automobile market. Maruti Suzuki India’s Managing Director and CEO, Mr Shinzo Nakanishi, said, “When we exported 53,000 cars in 2007–2008 that was the highest ever in our history. But we now want to take it to 2,00,000 cars annually by 2010–2011.”4

Maruti is aware that the passenger car market in India is highly competitive. The changing lifestyles and increasing incomes of Indian customers have attracted world players to the Indian market. These MNCs are widening their product range in order to expand the market. Confident of its strategies, Chairman of Maruti Suzuki India R. C. Bhargava said, “The car market is growing increasingly competitive. This is not surprising as global manufacturers are bound to come where they see a growing market. Maruti has a strategy for the future.”5

Table 17.01 presents the sales turnover, advertising expenses, marketing expenses, and distribution expenses of the company from 1989–2007. Fit a regression model considering sales turnover as the dependent variable and advertising expenses, marketing expenses, and distribution expenses as the explanatory variables.

Table 17.01 Sales turnover, advertising expenses, marketing expenses, and distribution expenses of Maruti Udyog from 1989–2007

tbl1

Source: Prowess (V. 3.1), Centre for Monitoring Indian Economy Pvt. Ltd, Mumbai, accessed September 2008, reproduced with permission.

Notes
  1. www.indiastat.com, accessed September 2008, reproduced with permission.
  2. Prowess (V. 3.1), Centre for Monitoring Indian Economy Pvt. Ltd, Mumbai, accessed September 2008, reproduced with permission.
  3. www.maruti.co.in/ab/careers.asp?ch=1&ct=9&sc=1, accessed September 2008, reproduced with permission.
  4. www.hinduonnet.com/businessline/blnus/02201806.htm, accessed September 2008.
  5. www.thehindubusinessline.com/2008/08/21/stories/2008082152240200.htm, accessed September 2008.