JMP Analysis

Descriptive Analysis

A univariate descriptive analysis for the variables in the data set appears in the case “Building a Simple Predictive Model for Health Care Costs for Newborns in Adirondack Hospitals.” When conducting a multiple regression, a simple correlation analysis is a useful preparatory step that can help identify independent variables that may be good predictors and independent variables that are highly correlated with each other. Including highly correlated independent variables in a multiple regression can be problematic, as will be discussed subsequently.

The Pearson correlation coefficient describes the degree of linear association between two continuous variables. Correlation is measured on a scale of -1 to 1, where -1 represents a perfect inverse linear relationship, 0 represents no linear relationship, and 1 represents a perfect direct linear relationship. There are three continuous variables of interest in the data set: total costs, length of stay, and birthweight (lbs). To compute the correlation between all pairs, select Analyze > Multivariate Methods > Multivariate. From the drop-down menu choose Pairwise Correlations. The results are shown in Figure 13.1 Correlation Analysis.
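To make the definition concrete, the following is a minimal Python sketch of the Pearson correlation coefficient. It is not part of the JMP workflow; the function name pearson_r and the data values are hypothetical and are not taken from the case data set.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only, NOT the case data
length_of_stay = [1, 2, 2, 3, 4, 5]
total_costs = [900, 1500, 1700, 2600, 4100, 5200]

r = pearson_r(length_of_stay, total_costs)
print(round(r, 3))
```

A value near 1 indicates a strong direct linear association, and a value near -1 a strong inverse one.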

Figure 13.1 Correlation Analysis

The correlation matrix shows the estimated correlation coefficients for each pair of variables. The matrix is symmetric about the diagonal. The Scatterplot Matrix shows scatterplots for each pair of variables. These graphs are useful to identify outliers. The Pairwise Correlation table shows the estimated correlation coefficient, 95% confidence bounds, and the p-value (Signif Prob) associated with the test of hypothesis that the correlation is zero (meaning no linear association) against an alternative that the correlation is not equal to zero (meaning significant linear association).
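The test of zero correlation described above is commonly carried out with a t statistic. A sketch of the standard form follows, where r is the sample correlation and n is the number of pairs; these symbols are introduced here for illustration, and whether JMP computes the p-value in exactly this way is an assumption.

```latex
H_0:\ \rho = 0 \quad \text{versus} \quad H_a:\ \rho \neq 0,
\qquad
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad \text{df} = n - 2
```

A large absolute value of t gives a small p-value, which leads to rejecting the hypothesis of no linear association.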

Total Costs and Length of Stay have a significant correlation of 0.66. The correlation between Total Costs and Birthweight (lbs) is not significant. The correlation between Birthweight (lbs) and Length of Stay is not significant at the 5% level. The low correlation between these two predictors suggests that including both of them in a multiple regression will not be problematic.

Simple Regression Analysis

Another preparatory step prior to conducting a multiple regression analysis is to examine simple regression results for key independent variables. In the case “Building a Simple Predictive Model for Health Care Costs for Newborns in Adirondack Hospitals,” simple regression analysis found that, taken alone, length of stay was a significant predictor of total costs but birthweight was not. The results are summarized in Table 13.1 Summary of Simple Regression Analysis for Total Costs.

Table 13.1 Summary of Simple Regression Analysis for Total Costs
Independent Variable	Slope Significance	R²	RMSE
Length of stay	<0.0001	0.44	$1607
Birthweight (lbs)	0.1527	0.002	$2136

As discussed in “Building a Simple Predictive Model for Health Care Costs for Newborns in Adirondack Hospitals,” while the length of stay is a significant predictor of total costs, there are likely other factors that influence total cost such as procedures and treatments associated with birth complications.

Multiple Regression Analysis

The objective of multiple regression analysis is to find a set of independent variables from those available that predict the dependent variable well. There is not necessarily one best set of predictors; there may be several different sets of predictors that perform well.

Fitting the Regression Model

Since birthweight alone is not a significant predictor of total costs, it is tempting to not include it in a multiple regression model. But to satisfy the problem statement, we will construct a multiple regression equation with length of stay and birthweight (lbs) as predictors.

From the JMP menu select Analyze > Fit Model and enter the variables Total Costs ($), Length of Stay (Days), and Birthweight (lbs) as shown in Figure 13.2 Fit Model Dialog for Multiple Regression.

Figure 13.2 Fit Model Dialog for Multiple Regression

Be sure to check “Keep dialog open.” This makes it easy to change predictors in the regression. The basic multiple regression output is shown in Figure 13.3 Basic Multiple Regression Results from Fit Model. From this output, the Regression Reports option can be invoked from the drop-down menu to select output to display. Non-essential portions of the output have been hidden using the gray toggles at the left of each table or graph.

Figure 13.3 Basic Multiple Regression Results from Fit Model

The Parameter Estimates table gives the regression coefficients and the t Ratio and p-value (Prob>|t|) associated with the significance tests. The estimated multiple regression equation is:

Total Costs = -2696.06 + 1425.86 × Length of Stay + 179.38 × Birthweight (lbs).

The interpretations of the regression coefficients are:

When Length of Stay and Birthweight (lbs) are zero, the estimated average Total Costs are -$2696.06. Clearly, such an observation is not possible and the intercept serves as a fitting constant for the model.
For each increase in one day of hospitalization there is an estimated average increase of $1425.86 in total cost holding birthweight constant.
For each increase in one pound of birthweight, there is an estimated average increase of $179.38 in total cost holding length of stay constant.

As with simple linear regression, the method of least squares is used to obtain the regression coefficients.
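For readers who want to see what least squares does outside of JMP, here is a minimal Python sketch that solves the normal equations for a two-predictor model. The function name fit_two_predictor_ols and the data are hypothetical; the data are constructed to follow y = 100 + 50·x1 + 10·x2 exactly so that the recovered coefficients are easy to check. This is not a description of JMP's internal algorithm.

```python
def fit_two_predictor_ols(x1, x2, y):
    """Least-squares estimates (b0, b1, b2) for y = b0 + b1*x1 + b2*x2."""
    # Design-matrix rows: [1, x1, x2]
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    # Augmented normal equations [X'X | X'y]
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(3):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))

# Hypothetical data generated from y = 100 + 50*x1 + 10*x2 (NOT the case data)
x1 = [1, 2, 3, 4, 5]
x2 = [7, 6, 8, 5, 9]
y = [220, 260, 330, 350, 440]

b0, b1, b2 = fit_two_predictor_ols(x1, x2, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

With real data the fit will not be exact, and the estimates are the values that minimize the sum of squared residuals.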

The Parameter Estimates table shows the significance test of each regression coefficient, with the null hypothesis that the regression coefficient is equal to zero against the alternative that the regression coefficient is not equal to zero. Both length of stay and birthweight are significant predictors (at the 5% level) of total costs since the p-values associated with the significance tests for the regression coefficients are less than 0.05. The significance of a predictor in a multiple regression equation depends on the other predictors in the equation. While birthweight was not significant as a predictor of total costs in a simple regression, it is significant in a multiple regression that includes length of stay. For this reason, independent variables should be added or removed one at a time as you seek a good predictive equation. The goal is to have a multiple regression that includes only significant predictors of the dependent variable. However, there are special circumstances where non-significant variables should be included, such as when required by a regulatory agency.

To predict the total costs for specific values of length of stay and birthweight, the values are substituted into the regression equation. For example, a two-day hospital stay for a seven-pound newborn would have a predicted total cost of $1411.32. Predictions can be made from the estimated regression in JMP by selecting Save Columns > Prediction Formula from the drop-down menu. This will create a new column in the JMP data table that contains the formula for the regression equation. Add a new row to the data table and fill in the desired values for the predictors; the predicted value of total costs will appear in the new column. Evaluating a regression model for values of the independent variables beyond the observed ranges is referred to as extrapolation and is not recommended. We do not know if the relationships between the dependent and independent variables estimated by the regression equation are valid outside of the observed ranges.
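The substitution in the worked example can be checked with a short Python sketch. The coefficients are copied from the estimated regression equation; the function and constant names are hypothetical.

```python
# Coefficients copied from the estimated regression equation in the text
INTERCEPT = -2696.06
SLOPE_LENGTH_OF_STAY = 1425.86  # dollars per day
SLOPE_BIRTHWEIGHT = 179.38      # dollars per pound

def predict_total_costs(length_of_stay_days, birthweight_lbs):
    """Predicted total costs in dollars from the fitted two-predictor equation."""
    return (INTERCEPT
            + SLOPE_LENGTH_OF_STAY * length_of_stay_days
            + SLOPE_BIRTHWEIGHT * birthweight_lbs)

# Two-day stay, seven-pound newborn (the worked example in the text)
predicted = predict_total_costs(2, 7)
print(f"${predicted:.2f}")  # $1411.32
```

As with the JMP prediction formula, this function should only be evaluated for predictor values within the observed ranges.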

Assessing the Model Fit – R² and RMSE

Similar to simple regression, R² and the RMSE are used to assess the adequacy of the model fit. The R² and RMSE can be found in the Summary of Fit table shown in Figure 13.3 Basic Multiple Regression Results from Fit Model. Table 13.2 Comparison of Simple Regressions and a Two-Variable Multiple Regression summarizes the goodness-of-fit measures for the two simple regressions and the multiple regression containing both length of stay and birthweight.

Table 13.2 Comparison of Simple Regressions and a Two-Variable Multiple Regression
Independent Variables	Slope Significance	R²	RMSE
Length of stay	<0.0001	0.436	$1607
Birthweight (lbs)	0.1527	0.002	$2136
Length of stay, Birthweight (lbs)	<0.0001, 0.0006	0.443	$1596

While both length of stay and birthweight are significant in the multiple regression, there is only slight improvement in the R² and RMSE compared to the simple regression with length of stay.
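The two goodness-of-fit measures can be sketched in Python as follows, assuming the usual definitions: R² is one minus the ratio of the error sum of squares to the total sum of squares, and RMSE is the square root of the error sum of squares divided by its degrees of freedom (n minus the number of predictors minus one). The function name and the data are hypothetical and are not the case data.

```python
import math

def r_squared_and_rmse(observed, predicted, n_predictors):
    """Return (R-squared, RMSE) for a fitted regression with an intercept."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Error (residual) sum of squares and total sum of squares
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - sse / sst
    # Error degrees of freedom: n minus the number of estimated coefficients
    rmse = math.sqrt(sse / (n - n_predictors - 1))
    return r2, rmse

# Hypothetical observed and predicted total costs, for illustration only
observed = [1000, 2000, 3000, 4000, 5000]
predicted = [1100, 1900, 3000, 4200, 4800]

r2, rmse = r_squared_and_rmse(observed, predicted, n_predictors=2)
print(round(r2, 3), round(rmse, 2))
```

Because the RMSE divisor shrinks as predictors are added, a new predictor lowers the RMSE only if it reduces the error sum of squares enough to offset the lost degree of freedom.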

Assessing the Model Fit – Multicollinearity

An additional concern in multiple regression is multicollinearity, which occurs when there are highly correlated independent variables in a regression equation. This can affect regression coefficients, causing them to have the wrong sign, have decreased precision, or have very different magnitudes depending on the other variables in the model. Examining pairwise correlations can suggest such highly correlated predictors; however, there are situations where multicollinearity is not revealed through the correlation coefficients. Variance Inflation Factors (VIF) are used to diagnose multicollinearity. To obtain the VIFs, right-click anywhere in the Parameter Estimates table and from the resulting menu select Columns > VIF as shown in Figure 13.4 Dialog to Request Variance Inflation Factors.

Figure 13.4 Dialog to Request Variance Inflation Factors

The Parameter Estimates table is expanded to include the VIFs as shown in Figure 13.5 Parameter Estimates Table with Variance Inflation Factors.

Figure 13.5 Parameter Estimates Table with Variance Inflation Factors

The following are general guidelines for interpreting VIFs. When a VIF is less than five, there is no cause for concern. When the VIF ranges from 5 to 10, there is potential for multicollinearity, and a VIF greater than 10 means there is multicollinearity in the model. When VIFs are greater than five, the analyst should remove one of the correlated variables and assess the effect on the regression coefficients and the goodness-of-fit measures. In this example, both VIFs are less than five and hence there is no concern with multicollinearity.
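The guidelines above can be sketched in Python. The VIF for a predictor is commonly defined as 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors; with only two predictors, that R² is the squared correlation between them. The function names and the correlation value of 0.1 are hypothetical and are not taken from the case output.

```python
def vif_from_r_squared(r_squared_j):
    """VIF for a predictor, given the R-squared from regressing it on the others."""
    return 1.0 / (1.0 - r_squared_j)

def interpret_vif(vif):
    """Apply the general guidelines stated in the text."""
    if vif < 5:
        return "no cause for concern"
    if vif <= 10:
        return "potential multicollinearity"
    return "multicollinearity present"

# Hypothetical correlation between two predictors (NOT from the case output)
r_between_predictors = 0.1
vif = vif_from_r_squared(r_between_predictors ** 2)
print(round(vif, 4), "->", interpret_vif(vif))
```

A weak correlation between two predictors gives a VIF close to 1, which is the smallest value a VIF can take.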

Assessing the Model Fit with Residual Plots

Residual plots are used to assess the goodness-of-fit of the multiple regression. A multiple regression with two predictors can be visualized as a plane, and a three-dimensional plot allows comparison of the observations to the regression plane. When there are more than two predictors, the model cannot be visualized. Residual plots are used instead to visually assess the model fit. Patterns observed in the residual plots suggest a systematic effect that should be included in the model as either additional predictors or non-linear terms. Randomly scattered residuals are indicative of a good model fit.
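A residual is the observed value minus the predicted value. The Python sketch below computes the (predicted, residual) pairs that a residual-by-predicted plot displays, using the estimated regression equation from the text. The three rows of observations are hypothetical and are not case data.

```python
def predict_total_costs(los_days, birthweight_lbs):
    # Coefficients from the estimated regression equation in the text
    return -2696.06 + 1425.86 * los_days + 179.38 * birthweight_lbs

# Hypothetical rows: (length of stay, birthweight in lbs, observed total costs)
rows = [(2, 7.0, 1500.00), (3, 6.5, 2400.00), (2, 8.0, 1300.00)]

residual_points = []
for los, bw, observed in rows:
    predicted = predict_total_costs(los, bw)
    # Residual = observed - predicted; positive means the model underestimates
    residual_points.append((predicted, observed - predicted))

for predicted, residual in residual_points:
    print(f"predicted={predicted:.2f}  residual={residual:.2f}")
```

Plotting the residuals against the predicted values gives the kind of display used to look for patterns.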

The Fit Model platform offers an option to save the residuals to a new column in the data table as shown in Figure 13.6 Dialog to Request Residuals to be Saved to Data Table. This will allow a variety of residual plots to be created.

Figure 13.6 Dialog to Request Residuals to be Saved to Data Table

Figure 13.7 Residual Plot for Multiple Regression shows one of the default residual plots that can be obtained from the drop-down menu by selecting Row Diagnostics > Plot Residual by Predicted.

Figure 13.7 Residual Plot for Multiple Regression

In this plot we see three groupings. Two groups show decreasing residuals as predicted total costs increase. The third group has positive residuals at higher predicted total costs, meaning that the multiple regression underestimates total costs when they are high. This pattern did not emerge in the comparable residual plot for the simple linear regression with length of stay as the predictor, which is shown in Figure 13.8 Residual Plot for Simple Linear Regression with Length of Stay as obtained from the Fit Y by X platform.

Figure 13.8 Residual Plot for Simple Linear Regression with Length of Stay

Notice that in the simple regression, length of stay takes only integer values, so there is a limited number of distinct predicted values. This is not the case in the multiple regression.

The residual groupings in Figure 13.7 Residual Plot for Multiple Regression suggest additional effects are present that are not included in the model. When a good-fitting model is obtained, residual plots will show a random scatter, free of noticeable patterns. To better understand these groupings, we will use the lasso tool to select groups, create individual data tables, and then examine other variables in these groups using the Distribution platform. For example, Figure 13.9 Histograms of CCS Procedure Aggregated for Lower and Upper Residual Groups shows the histograms of CCS Procedure Aggregated for the lower and upper groups of linearly decreasing residuals. The Clinical Classifications Software (CCS) is a widely used system for grouping diagnoses and procedures.

Figure 13.9 Histograms of CCS Procedure Aggregated for Lower and Upper Residual Groups

Notice that for the upper residual group circumcision is the dominant procedure, while in the lower residual group inoculations are the dominant procedure. It makes sense that circumcision, a surgical procedure, would have higher costs than inoculations.

The group of residuals in the higher total cost region is associated with newborns that have complications based on the APR DRG Description. This result is similar to what was discovered from the residual plots of the simple regression analysis with length of stay.

This residual analysis suggests that the model should include predictors that account for different procedures and/or complications.

Assessing Regression Assumptions with Residual Plots

To assess the normality of the residuals, a Normal quantile plot and histogram of the multiple regression residuals were obtained from the Distribution platform as shown in Figure 13.10 Normal Quantile Plot and Histogram of Residuals.

Figure 13.10 Normal Quantile Plot and Histogram of Residuals

Notice the departure from normality in the large positive residuals. In Figure 13.7 Residual Plot for Multiple Regression we see evidence of non-constant variance. These results are similar to those observed with the simple regression with length of stay.
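The idea behind a Normal quantile plot can be sketched in Python using only the standard library. Each sorted residual is paired with the standard normal quantile at its plotting position. The plotting positions (i - 0.5) / n used here are one common convention and are an assumption; JMP may use a different one. The function name and residual values are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_points(residuals):
    """Pair each sorted residual with a theoretical standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5) / n: an assumed convention, see the lead-in
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    standard_normal = NormalDist()
    return [(standard_normal.inv_cdf(p), r) for p, r in zip(probs, ordered)]

# Hypothetical residuals with one large positive value (NOT the case data)
residuals = [-120.0, 35.0, -40.0, 10.0, 900.0]

points = normal_quantile_points(residuals)
for quantile, residual in points:
    print(f"{quantile:7.3f}  {residual:8.1f}")
```

If the residuals were approximately normal, the points would fall close to a straight line; a large positive residual such as the last one pulls the upper end of the plot away from that line.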
