Chapter 3: Dirty Data

Introduction

Data Set

Error Detection

Outlier Detection

Approach 1

Approach 2

Missing Values

Statistical Assumptions of Patterns of Missing

Conventional Correction Methods

The JMP Approach

Example Using JMP

General First Steps on Receipt of a Data Set

Exercises

Introduction

Dirty data refers to fields or variables within a data set that are erroneous. Possible errors range from spelling mistakes and incorrect values associated with fields or variables to missing or blank values. Most real-world data sets have some degree of dirty data. As shown in Figure 3.1, dealing with dirty data is one of the multivariate data discovery steps.

In some situations (for example, when the original data source can be obtained), you can be 100% certain of the proper value for the variable. Typically, you cannot be 100% certain, so you must put the data through various programmed “cleaning” processes in which you do the best you can to make sure that the values are correct or at least reasonable.

Realize that the best, most unbiased solution to dirty data is not to have any bad data in the first place. So, starting at the initial stages of data input, every possible measure should be taken to guarantee the quality of the data. However, many times this guarantee of quality is out of your control. For example, you might obtain the data from an outside source. Nonetheless, the goal should be that the data over which you have control is as clean as possible—it will save you money in the long run!

Figure 3.1: A Framework to Multivariate Analysis

image

Cleaning the data, manipulating the data, creating and deriving variables (as mentioned in Chapter 1 and discussed to a certain degree further in Chapter 17), and arranging the data in a suitable format for building models take about 75% of the time. The entire data cleaning or scrubbing process, also known as ETL (extraction, transformation, loading), is very data set dependent and is beyond the scope of this book.

This chapter will focus on the major steps that you can take to clean the data when you do not have access to the raw data. The first part of the chapter addresses how JMP can assist you with descriptive statistics and data visualization methods in detecting and removing major errors, inconsistencies, and outliers. If these values remain in the data set, the parameter estimates from statistical models might be significantly biased. The remainder of the chapter will address missing values. Missing values can be a serious problem since most standard statistical methods presume complete information (no blank fields) for all the variables in the analysis. If one or more fields are missing, the observation is not used by the statistical technique.

Data Set

To provide context to our discussion of these topics, you will use the data set file hmeq.jmp.1 The data set contains 5,960 records of bank customers who received a home equity loan, whether they defaulted on the loan, and some of their attributes. The variables included are as follows:

Default

Loan status (1 = defaulted on loan; 0 = paid loan in full)

Loan

Amount of loan requested

Mortgage

Amount due on existing mortgage

Value

Current value of property

Reason

Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation)

Job

Six occupational categories

YOJ

Years at present job

Derogatories

Number of major derogatory reports

Delinquencies

Number of delinquent credit lines

CLAge

Age of oldest credit line in months

Inquiries

Number of recent credit inquiries

CLNo

Number of existing credit lines

DEBTINC

Debt-to-income ratio

Error Detection

When you are given a new data set, one of the first steps is to perform descriptive statistics on most of the variables. You have two major goals during this exploratory process: (1) to check on the quality of the data, and (2) to start to get a basic understanding of the data set.

First, examine the data set’s categorical variables, Default, Reason and Job:

1.   Select Analyze ▶ Distribution.

2.   In the Distribution dialog box (Figure 3.2), select Default, Reason, and Job. Then click Y, Columns. (If you want to bring all three variables over at one time, hold down the Ctrl key and click each variable).

3.   Click OK.

Figure 3.2: Distribution Dialog Box

image

Figure 3.3 displays the distribution results for these three categorical variables. The data set contains 5,960 records. You can see that there are no missing values for Default (look toward the bottom of the output, where it says “N Missing 0”). On the other hand, the variable Reason has 252 missing values, and Job has 279 missing values. You will return to address these missing values later in the chapter in the discussion about missing values. For now, just note this occurrence.

Figure 3.3: Distribution Output for Variables Default, Reason, and Job

image

Besides the 279 missing values, the variable Job looks fine. However, it appears that the variable Reason has a few typographical errors—Debtcon, DebtConn and Homeimp. One way to change these errors is to scan through the data table to find each of them. Because there are few typographical errors in this case, you could use this approach. But, with a large data set and many errors, this approach would be very tedious. A more general and very powerful approach is to use the JMP Recode tool:

1.   Click on the column heading Reason, as shown in Figure 3.4.

2.   Select Cols ▶ Recode. The Recode dialog box (Figure 3.5) will appear.

3.   Highlight Debtcon, DebtCon and DebtConn by clicking Debtcon, holding down the Shift key, and clicking DebtConn.

4.   Right-click to open the option list and select Group To DebtCon. This groups the variable names together.

5.   Similarly, group Homeimp and HomeImp to HomeImp.

6.   Click the down arrow in the Done box.

7.   You can have your changes replace the values for that column/variable or create a new column/variable with the changes. Now change the current variable by clicking In Place.2

8.   To check your changes, rerun the distribution function to see that the typographical errors have been corrected. You now have two categories of Reason: DebtCon and HomeImp.

Figure 3.4: Data Table for hmeq.jmp

image

Figure 3.5: Recode Dialog Box

image
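As an aside, the same recoding can be done programmatically if you ever need to clean a column outside JMP. Below is a minimal sketch in Python/pandas on a toy column (it is not part of the JMP workflow); the correction mapping simply mirrors the groupings made with the Recode tool above.

import pandas as pd

# Toy column containing the same kinds of typographical errors found in Reason
reason = pd.Series(["DebtCon", "Debtcon", "DebtConn", "HomeImp", "Homeimp", None])

# Map each misspelled level to its corrected category (mirrors the Recode grouping)
corrections = {"Debtcon": "DebtCon", "DebtConn": "DebtCon", "Homeimp": "HomeImp"}
reason_clean = reason.replace(corrections)

print(reason_clean.value_counts(dropna=False))   # two categories plus the missing values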

Outlier Detection

Here are two approaches to identifying outliers.

Approach 1

Examine the continuous variables, Loan, Mortgage, and Value:

1.   Select Analyze ▶ Distribution.

2.   In the Distribution dialog box, select Loan, Mortgage and Value. Then select Y, Columns. (Since all three variables are listed consecutively, click Loan first, hold down the Shift key, and click Value. As a result, all three should be highlighted).

3.   Click OK.

Figure 3.6 displays the distribution results for these three variables.

Figure 3.6: Distribution Output for Variables Loan, Mortgage, and Value

image

If you want more statistics or to customize what summary statistics are displayed, click the red triangle next to Summary Statistics:

1.   Click Customize Summary Statistics.

2.   Click the check boxes for the summary statistics that you want displayed.

3.   Click OK.

Examining the summary statistics and histograms in Figure 3.6, you will see that they look plausible. All three distributions are right-skewed. However, notice that the variable Value seems to have several rather large observations (called outliers) around the value of 850,000. You are not sure how many observations are out there. Complete the following steps:

1.   Hold down the left mouse button and drag a rectangle around these large values. In the Rows panel of the data table window, you should see 4 Selected. That is, 4 observations of Value have these large numbers.

2.   Click Selected.

3.   Right-click the mouse to open the list of options and click Data View.

A new data table similar to Figure 3.7 will appear with the 4 outlier observations. The variable Value represents the current value of the property. Suppose you happen to know that the values of the homes in the bank’s region are not greater than $600,000. So you know that these four values of $850,000, $854,112, $854,114, and $855,909 are not possible. You will change these outlier values later, after you look at another approach to identifying outliers. Close this new data table and return to the hmeq data table.

Figure 3.7: Data Table of Outliers for Variable Value

image

Approach 2

Another useful approach in identifying extreme values, as well as discovering missing values or error codes in the data, is to use the Quantile Range Outliers report. You produce this report by doing the following:

1.   Select Analyze ▶ Screening ▶ Explore Outliers. The Explore Outliers dialog box will appear.

2.   Under Select Columns, click Value and then click Y, Columns.

3.   Click OK. The Explore Outliers report window will appear.

4.   Click Quantile Range Outliers.

The Quantile Range Outlier report will appear as shown in Figure 3.8.

Figure 3.8: Quantile Range Outliers Report for Variable Value

image

Here, the quantiles divide the data into 10 equal parts; that is, each quantile contains 10% of the observations. (Quantiles are similar to quartiles, which divide the data into four equal parts.) To obtain the value at which each quantile occurs, you must first sort the data in ascending order. The 1st quantile value, or the lower 10% value, is the value that is greater than the lowest 10% of values for that variable. Conversely, the 9th quantile is the value where 90% of the values are less than or equal to it; only 10% of the values are above this 9th quantile value.

The Tail Quantile probability (Figure 3.8), whose default value is 0.1, defines the interquantile range, which runs from the Tail Quantile probability to (1 − Tail Quantile probability). So with a Tail Quantile probability equal to 0.1, the interquantile range is between the 0.1 and 0.9 quantiles. The corresponding quantile values for this data set are 48,800 and 175,166, respectively (Figure 3.8). The interquantile range (the difference) is 175,166 − 48,800 = 126,366.

Q is a multiplier used to determine outliers for the chosen variable or variables (Figure 3.8). Its default value is 3. Outliers are defined as values more than Q times the interquantile range below the lower Tail Quantile value or above the upper Tail Quantile value. So, with these data and for the variable Value, as shown in Figure 3.8, you observe the following:

   Q * Interquantile range = 3 * 126,366 = 379,098.

   So the Low Threshold = 10% quantile − (Q * Interquantile range) = 48,800 − 379,098 = −330,298.

   And the High Threshold = 90% quantile + (Q * Interquantile range) = 175,166 + 379,098 = 554,264. (Both numbers are off by one, due to rounding.)
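The same arithmetic can be verified in a few lines of code. The sketch below simply hard-codes the two quantile values reported in Figure 3.8; with the raw Value column available, you could compute them with a quantile function instead.

# 10% and 90% quantiles of Value as reported in Figure 3.8
q_low, q_high = 48_800, 175_166
Q = 3                                   # JMP's default multiplier

interquantile_range = q_high - q_low    # 126,366
spread = Q * interquantile_range        # 379,098
low_threshold = q_low - spread          # -330,298
high_threshold = q_high + spread        # 554,264

# Any Value below low_threshold or above high_threshold is flagged as an outlier
print(low_threshold, high_threshold)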

Looking at the Quantile Range Outlier report (Figure 3.8), you can see that your four outliers were identified. Now complete the following steps:

1.   Click the Value variable. The row in the report should now be highlighted.

2.   In the middle of the Quantile Range Outlier report, click the Select rows tab.

3.   Go back to the data table. In the Rows panel, Selected should now equal 4.

4.   As discussed earlier, you can click Selected, right-click the mouse, and then click Data View to look at the same 4 outlier observations.

Now, what do you do, given that you have identified these observations as outliers? One of two situations could be true:

   The actual value of the outlier is correct. If so, you might want to examine this observation further to try to understand why such a large value occurred.

   Or the value is incorrect. If possible, go and find out what the actual value is.

With this data set, it is not possible to verify these values. Yet, you know they are incorrect because you know (you assumed earlier in the chapter) that the largest value for Value must be less than 600,000.

However, look closely at these four outlier values in Table 3.1. Each contains a repeated digit, so you might suspect that whoever entered these values accidentally pressed a key more times than was correct. Assume that this is what actually happened. So you want to make the changes shown in Table 3.1.

Table 3.1: Outlier Values for Variable Value and the Suggested Corrections

Current Corrected
850,000 85,000
854,112 85,412
854,114 85,414
855,909 85,909

One way to make these changes is to search through the data table until you find them. However, even with this small data set of 5,960 records, that would take time. Another approach would be to sort the data by descending values of Value:

1.   Click the column heading Value.

2.   Right-click and select Sort ▶ Descending. The four outliers should be the first four observations.

Another approach would be to use the Recode tool.

Missing Values

The remainder of this chapter covers observations that have missing values. Many statistical techniques will ignore, delete, or not use an observation if any values are missing. If your data set has many variables and each variable has a small percentage of missing values, the number of usable observations can drop drastically. If you are using 10 variables in your model, for example, and each has 5% missing values (assume independently distributed), then you will be able to use only about 60% of the observations in the data set (.95^10 = 0.5987). It seems a shame to “throw away” an observation with 9 good pieces of information because you are missing 1 value out of 10. If you can “fix” the missing values, in a reasonable way, then you can use more observations.
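To see the roughly 60% figure another way, the short simulation below (an illustration only, with made-up data) generates 10 variables that are each missing 5% of their values independently and counts the rows with no missing values; the proportion lands near 0.95^10 = 0.5987.

import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, p_missing = 100_000, 10, 0.05

# True where a value is missing; each cell is missing independently with probability 5%
missing = rng.random((n_rows, n_cols)) < p_missing

complete_rows = (~missing).all(axis=1).mean()
print(complete_rows, 0.95 ** 10)        # both are roughly 0.60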

Data fields or variables with missing values are a rather common occurrence when you are dealing with real-world data sets. Consequently, a given technique’s results can be misleading or biased. The problem of missing values and approaches for dealing with them have only been addressed and developed since the late 1970s.

The reason for the missing values might or might not be logically justifiable. An example of a frequent, logically justifiable, and legitimate reason for missing data occurs with surveys. Suppose in a survey there is a question about whether you own a house. Subsequent questions might concern details about your homeownership, such as whether you have a mortgage, the amount of the mortgage, the amount of yearly real estate taxes, and so on. If you do not own a house, those data fields should be blank or missing. Justifiable missing values can also occur with government data sets. For example, suppose the data set contains information about census tracts, and there are data fields pertaining to land use. Census tract observations that happen to be in urban areas would most likely contain blank fields for data pertaining to agricultural use, and, equally likely, rural census tract areas could have blank fields for commercial use.

Care should be taken in making adjustments to variables with justifiable missing values if any of those values will be used as a denominator in a later calculation. For example, referring back to the homeownership survey, if you plan on reporting the percentage of people who own homes, the denominator should be the total number sampled. However, if the question is what percentage of homeowners have mortgages, the denominator should be the number of respondents that own a house.

When examining possibly unjustifiable missing values, you would like to know why the data are missing. The underlying mechanisms for the missing values can range from completely random to purposeful patterns of missingness. For example, people often underreport socially unacceptable behaviors, such as drug use, and many high-income people do not report salary information.

You cannot be 100% certain about the probability of causes of missing values since you do not actually know the values of the missing values and cannot establish their true relationships. As a second-best solution, you look at the pattern of missingness. To illustrate, Table 3.2 shows an example of the patterns of missingness for 100 observations. From the patterns of missingness, you can see the following: 60% of the observations were complete and had no missing values; 20% had the Y variable missing, and 15 of those observations had only Y missing, while 5 observations had Y and X1 missing, and so on.

Table 3.2: An Example of Patterns of Missingness for 100 Observations (0 Implies Missing)

Y x1 x2 N
1 1 1 60
1 1 0 20
0 1 1 15
0 0 1 5
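A pattern-of-missingness summary like Table 3.2 is just a count of the distinct missing/non-missing combinations across rows. Here is a rough pandas sketch on a toy data frame; JMP produces this kind of summary with its Missing Data Pattern tool, demonstrated later in this chapter.

import numpy as np
import pandas as pd

# Toy data frame with a few missing values
df = pd.DataFrame({
    "Y":  [1.0, np.nan, 2.0, np.nan, 3.0],
    "x1": [0.5, 1.5, np.nan, 2.5, 3.5],
    "x2": [10, 20, 30, np.nan, 50],
})

# 1 = present, 0 = missing, matching the convention in Table 3.2
pattern = df.notna().astype(int)
pattern_counts = pattern.value_counts().rename("N").reset_index()
print(pattern_counts)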

Statistical Assumptions of Patterns of Missing

There are three important statistical assumptions that you can make about the reason that data are missing (called the missing data mechanism) and their possible impact on statistical analysis: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Missing Completely at Random

A variable Y is considered to be missing completely at random (MCAR) if the probability of its being missing does not depend on the values of Y or X. In essence, P(Y is missing | X, Y) = P(Y is missing).

MCAR implies that data are missing independently of both observed and unobserved data. That is, some values were deleted randomly. MCAR’s only effect is that the sample size is smaller. It does not introduce any biases to the results. However, many times in practice, this is an unrealistically strong assumption.

A violation of MCAR would occur if, for example, the people who did not report their mortgage amount were on average older than the people who did report it.

Missing at Random

The missing at random (MAR) assumption is a weaker assumption than MCAR. For example, suppose some data are missing for Y. The missing data for Y are considered MAR if the probability that Y is missing does not depend on Y after controlling for the other variables in the analysis. That is, P(Y is missing | X, Y) = P(Y is missing | X).

MAR implies that, given the observed data, data are missing independently of the unobserved data. The missingness is not random but systematic; that is, it is conditional on other variables in the analysis. The MAR assumption is satisfied if, for example, the probability of missing data on Y (mortgage) depends on an individual’s age, but within age groups the probability of a missing mortgage is not related to the mortgage amount itself. However, you cannot test whether missingness on Y (mortgage) depends on Y. That is, you cannot compare the Y values of those with and without missing data to see whether they systematically differ on Y, since you do not know the Y values of the missing data.

The underlying reason for the missing values of MAR data might or might not be important in understanding the relationships of the data.

Missing Not at Random

When the data are assumed to be neither MCAR nor MAR, the data are classified as MNAR. Suppose some data are missing for Y. Y is considered to be MNAR if the probability that Y is missing does depend on the values of Y. MNAR implies that the missing values are related to the true value of the variable. In the mortgage example, the amount of a mortgage would be more likely to be missing for certain mortgage amounts (for example, very large mortgages). The fact that the data are missing contains information about the response. The observed data are a biased sample, and the missing data cannot be ignored. In this case, the missing data mechanism should be modeled as part of the estimation process.

If the MAR assumption is satisfied (that is, MCAR or MAR), then the missing data mechanism is said to be ignorable, which implies that there is no need to model the missing data mechanism as part of the analysis. Most commercial statistical software, including JMP, handles missing data as if the missing data were based on this assumption of ignorability.

If the MAR assumption is not satisfied (that is, if the missing data mechanism is assumed MNAR), the missing value issue must be addressed and corrected to produce correct estimates of the parameters of interest. Both of the model-based methods (discussed later in this chapter) can produce valid estimates in the MNAR case if a model for the missing data mechanism is correctly specified.

Conventional Correction Methods

The four primary methods used for dealing with missing values are as follows:

   Listwise deletion method. Remove rows with missing values.

   Variable removal method. Eliminate any variable/column with missing values.

   Conventional estimate value methods. Replace missing values with estimates produced by typical statistical methods.

   Model-based methods. Impute missing values by using model-based correction methods.

The first two methods remove or discard information (either a row or a column), while the last two methods estimate the missing values. One concern with simply ignoring missing values is that, when you later use a predictive model, you will not be able to score a new observation that has missing values. (Another option is to use a predictive modeling technique, such as decision trees or regression trees, that can handle missing values. See Chapter 10.)

Listwise Deletion Method

The most common method for handling missing values (and the default of many statistical techniques) is merely to refrain from using any observations when one or more of the X and Y variables used in the analysis has a missing value. This method is called listwise deletion (or complete case analysis). Advantages of the listwise deletion method are that it is simple (usually requires the analyst to do nothing) and that the analyses across statistical techniques are comparable in that they use the same observations.

On the other hand, not including these observations can produce biased estimates (if the data are not MCAR). Also, by reducing the number of observations being used, the statistical power of the statistical test is reduced. The listwise deletion method often works well, particularly if the number of rows with any missing data is limited to a small (less than 5%) number of observations. In this case, any biases or loss of statistical power is likely inconsequential.

A special case of the listwise deletion method is the pairwise deletion (or available case analysis) method. This method keeps as many observations as possible for each analysis. For example, using the small example in Table 3.3, Observations 2 and 6 would not be used in analyzing Gender and Mortgage, and Observations 1, 6, 8, and 11 would not be used in analyzing Gender and Equity Loan. The pairwise deletion method does use all the information possible for each analysis. However, you cannot compare the analyses because there is a different sample for each model.

Table 3.3: A Small Example of Missing Data

Observation Gender Mortgage Equity Loan
1 M 450,000  
2 F   9,900
3 M 550,000 8,600
4 M 285,000 8,800
5 M 280,000 5,700
6   381,000 28,000
7 F 475,000 8,000
8 M 395,000  
9 M 286,000 9,000
10 F 570,000 7,500
11 F 485,000  
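In code, listwise deletion corresponds to dropping every row that has any missing value before the analysis, while pairwise deletion uses whatever rows are available for each pair of variables. A minimal sketch, using a toy data frame shaped like the small example table above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Mortgage":    [450_000, np.nan, 550_000, 285_000, 280_000, 381_000],
    "Equity Loan": [np.nan, 9_900, 8_600, 8_800, 5_700, 28_000],
})

# Listwise deletion: only rows complete on every variable are kept for every analysis
listwise = df.dropna()
print(len(listwise), "rows used under listwise deletion")

# Pairwise deletion: each correlation uses all the rows available for that pair
print(df.corr())                        # pandas computes correlations pairwise, skipping NaNs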

Variable Removal Method

A similar approach might be taken in dropping some variables/fields from the model if a substantial proportion of observations have missing values. For example, in a survey, a great deal of missing values in one variable might be the result of a poorly worded question. A non-survey example could be problems in collecting or reporting the data for those variables. If it is believed that the variables with a large proportion of missing values have an insignificant effect on Y, there is no great loss in not including them.

Conventional Estimate Value Methods

The following subsections give the methods that historically have been used to estimate missing values.

Mean, Median, or Mode Substitution

The idea with this method is that, without any other information, the mean, median, or mode provides the best estimate to replace the missing value. This approach artificially reduces the variance of the variable and could yield lower covariance and correlation estimates.
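The variance-shrinking effect of mean substitution is easy to demonstrate on made-up data. The sketch below is illustrative only and does not use the hmeq values.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.normal(loc=100, scale=15, size=1_000))
x[x.sample(frac=0.20, random_state=1).index] = np.nan    # knock out 20% of the values

print("variance of the observed values:", x.var())       # NaNs are skipped
x_imputed = x.fillna(x.mean())
print("variance after mean substitution:", x_imputed.var())   # noticeably smaller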

Dummy Variable

As in the mean/median/mode substitution method above, the missing values are replaced with a constant such as the mean, median, or mode. In addition, a dummy/indicator variable is created (1 = value is missing; 0 = otherwise). If the missing values are legitimately missing, the estimates from this method are unbiased. However, if the missing values are not legitimately missing, this method produces biased estimates.

Regression Imputation

Also known as conditional mean imputation, this method models the missing value based on the other variables. Suppose that you are examining a regression model with multiple independent X variables. Further suppose that one independent variable, Xi, has missing values. You select all the observations with complete information, use Xi as the dependent variable (that is, you regress Xi on the other variables), and use all the other Xs as independent variables to predict the missing values of Xi. This method reduces the variance of Xi and could produce higher covariance between the X variables and increase correlation values.

In general, the major limitations of the regression imputation methods are that they lead to underestimates of the variance and can produce biased test statistics. These issues are caused by the fact that the imputed values are determined by the observed data (hence, they contain no error).
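A hedged sketch of the idea on toy data: the rows that are complete on Xi are used to fit a regression of Xi on the other variables, and the fitted (conditional mean) values fill in the missing entries. The column names and the use of scikit-learn here are illustrative assumptions, not part of the chapter's JMP workflow.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2})
df.loc[df.sample(frac=0.15, random_state=2).index, "x2"] = np.nan   # x2 has missing values

observed = df["x2"].notna()
model = LinearRegression().fit(df.loc[observed, ["x1"]], df.loc[observed, "x2"])

# Fill the missing x2 values with their conditional means given x1
df.loc[~observed, "x2"] = model.predict(df.loc[~observed, ["x1"]])

# The imputed values carry no error term, which is why the variance of x2 shrinks
print(df["x2"].var())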

Model-Based Methods

Model-based methods for imputing missing values explicitly model the missingness of the data and provide confidence intervals for the parameter estimates. These methods are a significant improvement over the conventional correction methods, and as a result they are today the most common approaches to handling missing data. The two major model-based methods are maximum likelihood (ML) and multiple imputation (MI). Both approaches assume that the joint distribution of the data is multivariate normal and that the missing data mechanism is ignorable (MCAR or MAR). The major advantage of the multivariate normal distribution assumption is that any independent variable follows a normal distribution conditional on all the other variables, including the dependent/outcome variable.3

Maximum Likelihood Methods

The prominent ML method uses an expectation-maximization (EM) algorithm. The EM algorithm produces maximum likelihood estimates through an iterative process. This iterative process has two steps:

   Expectation Step. If all the values in an observation are present, the sums, the sums of squares, and the sums of cross-products are incremented. Otherwise, if one or more of the data fields are missing, the missing values are estimated using regression with all the variables, and these estimated values are added to the sums. If only one value is missing, the sums of squares and cross-products are incremented. If more than one value is missing, a random correction error (from a multivariate normal distribution) is added to the sums of squares and cross-products. This step assumes that you have an estimate of the population mean and variance/covariance matrix, so you have a specification of the population distribution. From this population distribution, you can compute the expected values of the sums.

   Maximization Step. Based on the current expected sums, you estimate the sums of squares and sums of cross-products and produce the population mean and variance/covariance matrix. Using this variance/covariance matrix, new regression equations are calculated for each variable using all the other variables. These new regression equations are then used to update the best guesses for the missing values during the Expectation Step of the next iteration.

This two-stage process continues back and forth until the estimates do not change substantially. A disadvantage of this EM algorithm is that it does not provide standard errors (to be used to generate confidence intervals or perform hypothesis testing). So an additional step is necessary, such as a bootstrapping procedure, to produce the standard errors.
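The alternation between the two steps can be pictured as a simple loop: fill in the current best guesses, re-estimate the regressions, refill, and repeat until the guesses stop changing. The sketch below is a deliberately simplified illustration of that loop; it iterates conditional-mean fills only and omits the random correction to the sums of squares and cross-products, as well as the standard-error step.

import numpy as np
from sklearn.linear_model import LinearRegression

def iterative_impute(X, max_iter=20, tol=1e-6):
    """Repeatedly replace NaNs with regression predictions from the other columns."""
    X = X.astype(float).copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])      # start from mean fills

    for _ in range(max_iter):
        X_old = X.copy()
        for j in range(X.shape[1]):
            if not mask[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            reg = LinearRegression().fit(others[~mask[:, j]], X[~mask[:, j], j])
            X[mask[:, j], j] = reg.predict(others[mask[:, j]])   # update the guesses
        if np.max(np.abs(X - X_old)) < tol:
            break
    return X

# Toy example: two correlated columns with a few holes in the second one
rng = np.random.default_rng(3)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.3, size=200)
data = np.column_stack([a, b])
data[rng.choice(200, 20, replace=False), 1] = np.nan
print(iterative_impute(data)[:5])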

Multiple Imputation Methods

The MI methods use a simulation-based approach. Unlike the ML method, the goal of the MI methods is to obtain estimates of the missing values rather than the expected values.

The MI methods also have a two-stage process:

   Impute Estimates Stage. A set of possible values for the missing observations is generated based on the multivariate normal distribution. Estimates of the missing values are produced by simulating random draws from the distribution of the missing values given the observed values.

   Generate Complete Data Sets Stage. Repeat the Impute Estimates Stage m times to produce m complete data sets.

Averages and standard errors of the parameter estimates are produced from the m complete data sets. The resulting parameter estimates and confidence intervals have been shown to provide more robust estimates than the conventional methods.
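A minimal sketch of the multiple-imputation idea on toy data, under simplifying assumptions: each imputation adds a random draw from the residual distribution to a regression prediction, a model is fit on each completed data set, and the m estimates are averaged. (A complete implementation would also pool the within- and between-imputation variances to obtain the standard errors.)

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n, m = 300, 5                                     # m = number of imputed data sets

x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=1.0, size=n)
x_obs = x.copy()
x_obs[rng.choice(n, 60, replace=False)] = np.nan  # x has missing values

miss = np.isnan(x_obs)
imp_model = LinearRegression().fit(y[~miss].reshape(-1, 1), x_obs[~miss])
resid_sd = np.std(x_obs[~miss] - imp_model.predict(y[~miss].reshape(-1, 1)))

slopes = []
for _ in range(m):
    x_filled = x_obs.copy()
    # Imputation = conditional mean plus a random draw, so each data set differs
    x_filled[miss] = (imp_model.predict(y[miss].reshape(-1, 1))
                      + rng.normal(scale=resid_sd, size=miss.sum()))
    fit = LinearRegression().fit(x_filled.reshape(-1, 1), y)
    slopes.append(fit.coef_[0])

print("pooled slope estimate:", np.mean(slopes))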

The JMP Approach

When there is a high percentage of missing data, no method can reliably estimate the parameters. Even in such cases of a high rate of missingness, however, the model-based methods provide the most reasonable results. So, in general, a model-based method is the best approach to handling missing data. Both model-based methods, ML and MI, have been shown to produce parameter estimates superior to those produced by the conventional methods. The ML method has been shown to provide more robust parameter estimates than the MI methods. However, because the ML method requires an extra step to estimate standard errors, the MI methods have been found to be easier to implement, and they are the methods most often found in commercial statistical software applications.

JMP provides two MI methods: multivariate normal imputation and multivariate singular value decomposition (SVD) imputation. Multivariate normal imputation replaces the missing values based on a multivariate normal distribution and provides an option to improve the estimation of the covariance matrix with a shrinkage estimator. The multivariate SVD method is the suggested method for wide data sets, that is, data sets with a large number of columns/variables (hundreds or thousands of variables).

Because of the multivariate normal distribution assumption, the two MI methods provided in JMP, like model-based methods in general, do not work with categorical data. Historically, a popular approach to replacing missing categorical data was to replace the missing categorical values with a new category called Missing. Recently, studies have shown that this approach can produce biased estimates. An alternative approach has been to create dummy variables and change their data type to continuous. This approach can produce imputed values outside the range of 0 to 1. If the categorical variables are being used as predictor variables in, for example, a regression model, the imputed values are used.

Example Using JMP

Now return to the hmeq.jmp data set that you were using earlier in this chapter.

To get a high-level view of the degree of missingness in the data set, complete the following steps:

1.   At the top menu, select Tables ▶ Missing Data Pattern. The Missing Data Pattern dialog box will appear, as in Figure 3.9.

2.   Select all the columns, from Default to DEBTINC and click Add Columns.

3.   Click OK.

A Missing Data Pattern data table will appear (Figure 3.10). This new data table contains all the different patterns of missingness.

Figure 3.9: Missing Data Pattern Dialog Box

image

Figure 3.10: Missing Data Pattern

image

As you can see in Figure 3.10, there are 117 rows, so there are 117 different patterns of missingness. Each row represents a particular pattern of missingness. Notice in the first row that the count is 3,364, and the number of columns missing is 0. So, out of the original 5,960 rows, 3,364 have complete data. In the Missing Data Pattern window, in the upper-leftmost panel, click the green triangle next to Treemap. Do the same for Cell Plot. In each case, you get a visualization of the patterns of missingness in the data set.

In the Missing Data Pattern data table, look at the categorical variables Default, Reason, and Job by examining their distributions (use Analyze ▶ Distribution). As shown earlier in the chapter in Figure 3.3, Default does not have any missing values, but Reason and Job have 252 and 279 missing values, respectively.

Complete the following steps:

1.   Go back to the hmeq.jmp data table. In the Columns panel, select Reason and Job. Then from the top menu, select Cols ▶ Utilities ▶ Make Indicator Columns.

2.   In the new dialog box, click the Append Column Name box.

3.   Click OK.

The data table will change as shown in Figure 3.11. This procedure created an indicator variable for each level of a categorical variable. For example, for the variable Reason, you now have two indicator variables, Reason_DebtCon and Reason_HomeImp.

Figure 3.11: New Data Table after Creating Indicator Columns

image

If there are missing values for the categorical variable, such as Reason, you now want these new indicator variables to show missing values. As shown in Figure 3.11, you first want to sort the column:

1.   Click on the column name Reason. Make sure that the entire column is highlighted. With your cursor still on the column heading, right-click, and select Sort ▶ Ascending. The first 252 rows should now be blank for the Reason column, and there should be zeros in the Reason_DebtCon and Reason_HomeImp cells.

2.   You would like all those zeros in the first 252 records for Reason_DebtCon and Reason_HomeImp to be missing instead. To accomplish this, first highlight all the data in Reason_DebtCon and Reason_HomeImp for the first 252 records in the data table. Then press Delete on your keyboard. All those cells now show a dot, indicating a missing value.

Similarly, follow this same process of sorting the data table and deleting zeros for Job and its indicator variables.
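The sort-and-delete procedure above is one way to make the indicator columns reflect missingness. For comparison, the same end result can be sketched programmatically (outside JMP) on a toy column: rows with a missing Reason get missing indicators rather than zeros.

import numpy as np
import pandas as pd

reason = pd.Series(["DebtCon", "HomeImp", None, "DebtCon", None], name="Reason")

# One 0/1 indicator column per level, with the level appended to the column name as in JMP
indicators = pd.get_dummies(reason, prefix="Reason").astype(float)

# Where Reason itself is missing, the indicators should be missing rather than 0
indicators[reason.isna()] = np.nan

print(pd.concat([reason, indicators], axis=1))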

Another approach to examining the missingness in the data set is to do as follows:

1.   On the top menu, select Analyze ▶ Screening ▶ Explore Missing Values. The Explore Missing Values dialog box will appear.

2.   Highlight all the variables and click Y, Columns.

3.   Click OK.

The Explore Missing Values report box will appear as shown in Figure 3.12. The report lists by variable the number of missing observations; for example, DEBTINC is missing in 1267 rows.

Figure 3.12: Explore Missing Values Report

image

Two visualization tools are provided. Just click their respective boxes to examine the pattern of missingness of the data:

   Missing Value Clustering. This tool provides a hierarchical clustering tree of the missing values within rows and also within columns (notice the hierarchical tree at the bottom). Missing values are shown in red, and nonmissing values are shown in blue. You can explore to see which variables or observations tend to be missing together (click anywhere on the tree).

   Missing Value Snapshot. This tool shows the missing data for each observation. Exploring the snapshot patterns could help identify patterns of missing values.

Notice that none of the categorical variables, Default, Reason, and Job, are listed, but the indicator variables for Reason and Job are listed, because they are now continuous.

Now go back and select Analyze ▶ Screening ▶ Explore Missing Values. This time, do not highlight all the variables. Instead, highlight all the continuous variables. A new Explore Missing Values report box will appear as shown in Figure 3.13.

Figure 3.13: Explore Missing Values Report with Only Continuous Variables

image

In Figure 3.13, you can see that you have two imputation options. You will use the first one:

1.   Click the Multivariate Normal Imputation box.

2.   Click Yes, Shrinkage, to improve the estimation of the covariances. (A JMP Alert will appear, suggesting that you use Save As because the data table will be changed.)

3.   Click OK and close the Explore Missing Value report window.

All the missing values are replaced by imputed values, and the data set is ready for further statistical analysis. If you check the data table, the only remaining missing values are in the categorical columns Reason and Job.

Finally, the next step would be to run a statistical model, such as a regression, with the missing data and to compare it to the regression with the imputed values.

This raises the question of whether cleaning the data will improve your (regression) model. The answer is that you cannot be sure: a clean data set could improve the model, or it could make it worse. So why clean the data set?

A cleaned data set represents your population most accurately. Thus, statistics, inferences, and models based on it accurately represent your population of interest. If the data set is not clean, you could produce significantly biased estimates and inferences.

As discussed earlier in the chapter, if the MAR assumption is satisfied (that is, MCAR or MAR), then the missing data mechanism is said to be ignorable. This implies that there is no need to model the missing data mechanism as part of the analysis. However, if the MAR assumption is not satisfied (that is, if the missing data mechanism is assumed MNAR), the missing value issue must be addressed/corrected to produce correct estimates of the parameters of interest.

General First Steps on Receipt of a Data Set

Here is a list of recommended initial steps to clean a data set:

1.   Understand all the columns/variables, their definitions, and coding. Also of interest might be who entered the data and how the data were entered. Identify any variables that are not being used or should not be used.

2.   Perform univariate descriptive statistics on all the variables. Create some visualizations of the data.

3.   Try to identify any typographical errors and correct them.

4.   Try to identify outliers. If there is a typographical error, correct it if possible. Otherwise, note the degree of outliers.

5.   Note the degree of missing values. If a significant percentage of the values of a variable are missing, you might want to disregard that variable. There is no set rule for what counts as significant; it depends on the situation and the data set. In general, if 40% or more of a variable's values are missing, you might want to disregard that variable and also explore why so much is missing.

6.   Understand the basic distribution of the data for each variable.

7.   Explore the pattern of missingness. If you believe that the MAR assumption is satisfied, the missing data mechanism is considered ignorable, and you do not have to impute the missing values. On the other hand, if you feel this MAR assumption is not satisfied, then you should impute the missing values.

Exercises

1.   In the Titanic Passengers_new.jmp data set, identify if there are any outliers.

2.   In the Promotions_new.jmp data set, identify if there are any outliers.

3.   In the Sub_CC Churn.jmp data set, identify if there are any outliers.

4.   In the Sub_Enrollment.jmp data set, identify if there are any outliers.

5.   In the Titanic Passengers_new.jmp data set, do the following:

a.   Examine Sex and Passenger Class. Are there any typographical errors? If so, correct them.

b.   Examine Lifeboat: Are there any missing values? If so, how many?

c.   Examine Age and Midpoint Age. Are there any missing values? If so, how many? Impute them.

6.   In the Promotions_new.jmp data set, do the following:

a.   Examine Race and Position. Are there any typographical errors? If so, correct them.

b.   Are there any missing values for all the variables? If so, impute them.

7.   In the Sub_CC Churn.jmp data set, do the following:

a.   Examine Martial Status, LTV Group and Gender. Are there any typographical errors? If so, correct them.

b.   Examine Gender. Are there any missing values? If so, how many?

c.   Examine Age of Account (Months) and Age Group. Are there any missing values? If so, how many? Impute them.

8.   In the Sub_Enrollment.jmp data set, do the following:

a.   Examine State Province, Nation Description, Citizenship Description and Ethnicity. Are there any missing values? If so, how many?

b.   Examine SAT Verbal, SAT Mathematics and SAT Total Score. Are there any missing values? If so, how many? Impute them.

1 Thanks to Tom Bohannon of SAS for providing the data set.

2 Although it is generally considered bad practice to change the raw data, you may want to do a Save As. If you do change the raw data, consider documenting your changes.

3 This multivariate normal distribution assumption is different from the assumption used in linear regression. In linear regression, you assume that the dependent variable is univariate normal conditional on the values of the fixed independent variables.
