Simple Regression Analysis: Total Costs and Length of Stay

Fitting the Regression Model

A good practice prior to developing a bivariate predictive model is to visualize the relationship on a scatterplot. From the Analyze menu select the Fit Y by X platform and enter Total Costs in the Y, Response field and Length of Stay in the X, Factor field. The resulting scatterplot is shown in Figure 12.5 Scatterplot of Total Costs and Length of Stay.
Figure 12.5 Scatterplot of Total Costs and Length of Stay
As expected, as length of stay increases, total costs increase. The correlation between Total Costs and Length of Stay is 0.66, as found in the correlation matrix obtained by selecting Analyze > Multivariate Methods > Multivariate. A simple linear regression analysis will model this relationship as a straight line and allow us to quantify the relationship. It is good practice to start with a simple model, assess the adequacy of the model, and, if necessary, proceed to developing more complex models.
To find the best fitting line select Fit Line from the drop-down menu. The fitted line and data are shown graphically in Figure 12.6 Linear Fit for Total Costs and Length of Stay along with the associated numerical output.
Figure 12.6 Linear Fit for Total Costs and Length of Stay
The estimated regression equation is:
Total Costs = -1355.061 + 1441.227*Length of Stay
This best fit line is found using the ordinary least squares method, where the estimated slope and intercept are chosen to minimize the sum of the squared vertical distances from the observations to the fitted line. The intercept of -1355.06 is the estimated average total cost when length of stay is zero. Since it does not make sense to have an inpatient hospital stay of 0 days, the intercept is not interpreted in the problem context, but serves as a fitting constant. The slope indicates that for each increase of one day in length of stay there is an estimated average increase of $1441.23 in total costs. The slope is an estimate of the daily cost of hospitalization for newborns. Always assess the slope coefficient for reasonableness in the problem context. Do both the sign and magnitude make sense? The slope is positive, which says that as length of stay goes up, total costs go up. A quick internet search for average daily hospital costs will assist in determining the reasonableness of the magnitude of the slope. It appears that $1441.23 is a plausible daily charge.
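The least squares calculation described above can be sketched in a few lines of Python. This is only an illustration of the closed-form slope and intercept formulas, not the JMP implementation; the function name fit_line and the small data set are hypothetical and are not the CVPH data.

```python
# Minimal ordinary least squares sketch for a single predictor.
# The data below are hypothetical and are NOT the CVPH data.

def fit_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared vertical residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

length_of_stay = [1, 2, 2, 3, 4, 6]                 # days (hypothetical)
total_costs = [900, 1500, 1800, 3100, 4200, 8000]   # dollars (hypothetical)

intercept, slope = fit_line(length_of_stay, total_costs)
print(f"Total Costs = {intercept:.2f} + {slope:.2f} * Length of Stay")
```

The fitted line always passes through the point of means, which is why the intercept is computed from the slope and the two averages.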
To establish whether length of stay is a significant predictor of total costs, a test of hypothesis should be conducted for the slope coefficient. The Parameter Estimates table in Figure 12.6 Linear Fit for Total Costs and Length of Stay shows the t-ratio and p-value (Prob > |t|) associated with a test of the hypothesis that the slope is equal to zero versus the alternative that the slope is different from zero. The p-value is <0.0001, which indicates that length of stay is a statistically significant predictor of total costs for infants born at Champlain Valley Physicians Hospital in 2014.

Assessing the Model Fit

Once the regression line is determined to be significant, an assessment of the goodness-of-fit of the model to the data is warranted. The coefficient of determination, R², is a measure of goodness-of-fit and gives the proportion of the variation in total costs explained by length of stay. This is found in the JMP output of Figure 12.6 Linear Fit for Total Costs and Length of Stay in the Summary of Fit table. The Rsquare is 0.44 for this simple linear regression. There is no general rule of thumb for what constitutes a good R², but since it is unitless it is useful for comparing different models. R² is sensitive to outliers and the range of the independent variable.
The root mean squared error (RMSE) is the standard deviation of the observations about the regression line. The RMSE is in the units of the dependent variable and can be found in the Summary of Fit table in Figure 12.6 Linear Fit for Total Costs and Length of Stay. Comparing the RMSE to the standard deviation of the dependent variable is useful in assessing the goodness-of-fit of the linear regression. An RMSE that is less than the standard deviation of the dependent variable (total costs) indicates that the model has explained some of the variability in the dependent variable. For the CVPH regression the RMSE is $1607 and the standard deviation of total costs (see Figure 12.3 Descriptive Analysis for Total Costs, Length of Stay, and Birthweight) is $2137, which indicates the linear regression has explained some of the variability in total costs.
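As a rough consistency check, the reported correlation, RMSE, and standard deviation can be tied back to the Rsquare value with simple arithmetic. The sketch below uses only the three figures quoted in the text; the variable names are illustrative. Note that one minus the squared ratio of RMSE to standard deviation is the adjusted R², which differs from R² only by a degrees-of-freedom factor and is close to it in large samples.

```python
# Cross-check of summary statistics quoted in the text (values as reported, rounded).
r = 0.66        # correlation between Total Costs and Length of Stay
rmse = 1607.0   # root mean squared error of the linear fit, in dollars
sd_y = 2137.0   # standard deviation of Total Costs, in dollars

# In simple linear regression, R-squared equals the squared correlation.
r_squared_from_r = r ** 2

# 1 - (RMSE / SD)^2 is the adjusted R-squared: RMSE^2 divides the error sum of
# squares by n - 2 and SD^2 divides the total sum of squares by n - 1.
r_squared_from_rmse = 1 - (rmse / sd_y) ** 2

print(f"R-squared from correlation:  {r_squared_from_r:.4f}")
print(f"Approximation from RMSE/SD:  {r_squared_from_rmse:.4f}")
```

Both quantities land close to the reported Rsquare of 0.44, with the small gap attributable to rounding of the inputs and the degrees-of-freedom adjustment.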
Assessing the RMSE in the problem context helps determine if the model is adequate for prediction. For example, an RMSE on the order of tens of dollars would indicate that the model is useful for predicting total costs associated with childbirth, while an RMSE on the order of thousands of dollars is not sufficiently precise to be of much practical value.
Finally, assess the line fit visually. In the scatterplot shown in Figure 12.6 Linear Fit for Total Costs and Length of Stay we see that the line underestimates total costs for longer lengths of stay. Intuitively, we would expect longer lengths of stay for a newborn when there are complications. The JMP Lasso tool allows groups of points to be selected by “drawing” a region on a JMP graph. Some of the longer length of stay observations have been selected from the scatterplot using the Lasso tool as shown in Figure 12.7 Longer Length of Stay Observations Selected with the Lasso Tool. This will allow us to look for patterns in other variables that may help explain why the model underestimates total costs for these observations.
Figure 12.7 Longer Length of Stay Observations Selected with the Lasso Tool
The selected observations appear darker than those that are not selected. The corresponding rows in the JMP data table are highlighted, and by applying Tables > Subset a new data table can be created to facilitate examination of other variables. Most of these newborns had complications such as infections or respiratory conditions.

Assessing Regression Assumptions with Residual Plots

Evaluating the regression equation at a given length of stay yields a predicted total cost. For example, the predicted total cost for a five day length of stay is $5851.07. The difference between an observed and predicted value at a given length of stay is referred to as a residual. Residuals capture the variation in total cost that is not explained by the linear model. Plotting residuals allows the analyst to assess the regression model fit and to verify regression assumptions.
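The prediction and residual calculations can be reproduced directly from the estimated regression equation. In this sketch the coefficients are those reported in the text, while the function names and the observed cost of $7000 are hypothetical and chosen only for illustration.

```python
# Coefficients from the estimated regression equation reported in the text.
INTERCEPT = -1355.061
SLOPE = 1441.227

def predict_total_cost(length_of_stay_days):
    """Predicted total cost (dollars) at a given length of stay (days)."""
    return INTERCEPT + SLOPE * length_of_stay_days

def residual(observed_cost, length_of_stay_days):
    """Observed minus predicted total cost at the given length of stay."""
    return observed_cost - predict_total_cost(length_of_stay_days)

predicted = predict_total_cost(5)
print(f"Predicted cost for a 5-day stay: ${predicted:.2f}")  # $5851.07

# Hypothetical observation, for illustration only (not from the CVPH data).
hypothetical_observed = 7000.0
print(f"Residual: ${residual(hypothetical_observed, 5):.2f}")
```

A positive residual means the line underestimates the observed cost; a negative residual means it overestimates.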
To obtain residual plots select Plot Residuals from the Linear Fit drop-down menu as shown in Figure 12.8 Requesting Residual Plots from Fit Y by X Linear Fit.
Figure 12.8 Requesting Residual Plots from Fit Y by X Linear Fit
Figure 12.9 Selected Residual Plots shows two of the resulting residual plots that assist in assessing regression assumptions.
Figure 12.9 Selected Residual Plots
The Residual by X Plot is useful for assessing the assumption of constant variation of the residuals about the regression line over the range of values for the independent variable. In the Residual by X Plot in Figure 12.9 Selected Residual Plots we see increased variation in the residuals as length of stay increases suggesting that this assumption is not satisfied. Another regression assumption is that the residuals are normally distributed. The Normal Quantile plot is shown in Figure 12.9 Selected Residual Plots for the CVPH regression residuals where we see a serious departure from the linear pattern that would be consistent with a Normal distribution. In light of these issues, data transformations or different models should be considered.
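A crude numerical companion to the Residual by X Plot is to compare the spread of the residuals at low and high values of the independent variable. The sketch below is an informal check, not a formal test of constant variance and not a JMP feature; the function name spread_by_half and the data are hypothetical and are not taken from the CVPH regression.

```python
from statistics import stdev

def spread_by_half(x, residuals):
    """Order observations by x, split them in half, and return the
    standard deviation of the residuals in the (low-x, high-x) halves."""
    pairs = sorted(zip(x, residuals))
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return stdev(low), stdev(high)

# Hypothetical lengths of stay (days) and residuals (dollars), for illustration only.
los = [1, 1, 2, 2, 3, 3, 5, 6, 8, 10]
res = [-100, 120, -150, 90, 200, -180, -900, 1100, -1500, 2000]

sd_low, sd_high = spread_by_half(los, res)
print(f"Residual SD, shorter stays: {sd_low:.0f}")
print(f"Residual SD, longer stays:  {sd_high:.0f}")
```

A much larger spread in the high-x half than in the low-x half is the numerical counterpart of the fan shape seen when the constant variance assumption fails.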

Exploring Relationships with Total Costs: Birthweight

Repeating the simple regression process with birthweight (in pounds) as the independent variable will address the second research question posed in the problem statement. The results are shown in Figure 12.10 Simple Regression Analysis for Total Costs and Birthweight.
Figure 12.10 Simple Regression Analysis for Total Costs and Birthweight
The fitted line has a very shallow slope and there is considerable scatter about the regression line. The slope coefficient quantifies the linear relationship: an increase of one pound in birthweight is associated with an estimated average increase in total costs of $99.05. However, the slope coefficient is not significantly different from zero (using a 5% significance level) since the p-value (Prob > |t| from the Parameter Estimates table) is 0.1527. This means that birthweight is not a significant predictor of total costs. When the test of hypothesis for the slope is not significant, no further analysis, such as assessing goodness-of-fit, should be conducted.
Last updated: October 12, 2017