Explore Outliers Utility
Exploring and understanding outliers in your data is an important part of analysis. Outliers can arise from mistakes in data collection or reporting, measurement system failures, or the inclusion of error or missing value codes in the data set. The presence of outliers can distort estimates and bias any analyses toward those extreme values; outliers also inflate the sample variance. Sometimes, however, retaining outliers is necessary, because removing legitimate extreme values can underestimate the sample variance and bias results in the opposite direction.
Whether you remove or retain outliers, you must first locate them. There are many ways to visually inspect for outliers. For example, box plots, histograms, and scatter plots can often reveal these extreme values. See the Visualizing Your Data chapter in the Discovering JMP book for more information.
The Explore Outliers tool provides four different options to identify, explore, and manage outliers in your univariate or multivariate data.
Quantile Range Outliers
Uses the quantile distribution of each column to identify outliers as extreme values. This tool is useful for discovering missing value or error codes within the data. This is the recommended method to begin exploring outliers in your data. See “Quantile Range Outliers”.
Robust Fit Outliers
Finds robust estimates of the center and spread of each column and identifies outliers as those far from those values. See “Robust Fit Outliers”.
Multivariate Robust Outliers
Uses the Multivariate platform with Robust option to find outliers based on the Mahalanobis distance from the estimated robust center. See “Multivariate Robust Outliers”.
Multivariate k-Nearest Neighbor Outliers
Finds outliers as values far from their k-nearest neighbors. See “Multivariate k-Nearest Neighbor Outliers”.
Example of the Explore Outliers Utility
The Probe.jmp sample data table contains 387 characteristics (the Responses column group) measured on 5800 semiconductor wafers. The Lot ID and Wafer Number columns uniquely identify the wafer. You are interested in identifying outliers within a select group of columns of the data set. Use the Explore Outliers utility to identify outliers that can then be examined using the Distribution platform.
1. Select Help > Sample Data Library and open the Probe.jmp sample data table.
2. Select Analyze > Screening > Explore Outliers.
3. Select columns VDP_M1 through VDP_SICR and click Y, Columns. You should have 14 columns selected (see Figure 3.2).
Figure 3.2 Explore Outliers Launch Window
4. Click OK.
5. Click Quantile Range Outliers.
The Quantile Range Outliers report shows each column and lists the number and identity of the outliers found.
6. In the Quantile Range Outliers report, select the check box named Show only columns with outliers. This limits the list of columns to only those that contain outliers.
Note that several columns contain outlier values of 9999. Many industries use nines as a missing value code.
7. In the Nines report, select each column.
8. Click Add Highest Nines to Missing Value Codes.
A JMP Alert indicates that you should use the Save As command to preserve your original data.
9. Click OK.
10. In the Quantile Range Outliers report, click Rescan.
11. Select the check box named Restrict search to integers.
When data are continuous, integer values are often error codes or other coded data values. Notice that no additional error codes are included in this set of columns.
12. Deselect Restrict search to integers.
Examine the Data
1. Select all of the remaining columns in the Quantile Range Outliers report.
2. Click Select Rows.
3. Select Analyze > Distribution.
4. Assign the selected columns to the Y, Columns role. Because you selected these column names in the Quantile Range Outliers report, they are already selected in the Distribution launch window.
5. Click OK.
Figure 3.3 shows a simplified version of the report.
Figure 3.3 Distribution of Columns with Outliers Selected
In columns VDP_M1 and VDP_PEMIT, notice that the selected outliers are somewhat close to the majority of data. For the rest of the columns, the selected outliers appear distant enough to exclude them from your analyses.
Refine Excluded Outliers
1. In the Quantile Range Outliers report, hold Ctrl and deselect columns VDP_M1 and VDP_PEMIT.
2. With the remaining columns selected in the report, click Exclude Rows.
3. Change Q to 20.
4. Click Rescan.
5. Select columns VDP_M1 and VDP_PEMIT in the report. Click Select Rows.
Reexamine the Data
1. Examine the Distributions report again. Notice the selected outliers are now separate enough from the majority of the data to select and exclude them from your analyses.
2. In the Quantile Range Outliers report, click Exclude Rows.
3. In the Distributions report, click the red triangle menu next to Distributions.
4. Select Redo > Redo Analysis.
Figure 3.4 shows a simplified version of the report.
Figure 3.4 Distributions of Columns with Outliers Excluded
The displays of the distributions of the data are now more informative without the outliers.
Launch the Explore Outliers Utility
Note: The Explore Outliers commands only analyze columns with a Continuous modeling type. Other columns can be entered in the launch window but are ignored.
To launch Explore Outliers, select Analyze > Screening > Explore Outliers. The launch window appears.
Figure 3.5 Explore Outliers Utility Launch Window
In the launch window, select the analysis columns as Y, Columns. You can also specify a By variable. After you click OK, the Explore Outliers report appears. You are presented with the following four outlier analysis commands:
Quantile Range Outliers
The Quantile Range Outliers method of outlier detection uses the quantile distribution of the values in a column to locate extreme values. Quantiles are useful for detecting outliers because they carry no distributional assumption: the data are simply sorted from smallest to largest. For example, the 0.2 quantile (20th percentile) is the value below which 20% of the values fall. Extreme values are found using a multiplier of the interquantile range, the distance between two specified quantiles. For more details about how quantiles are computed, see the “Distributions” chapter in the Basic Analysis book.
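As a rough illustration of this rule, the following Python sketch flags any value more than Q interquantile ranges outside the tail quantiles. This is a conceptual sketch only, not JMP's implementation, and the data and values shown are invented.

```python
import numpy as np

def quantile_range_outliers(x, tail_quantile=0.1, q_mult=3):
    """Flag values more than q_mult interquantile ranges outside the tail quantiles."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    lower = np.quantile(x, tail_quantile)            # Tail Quantile
    upper = np.quantile(x, 1 - tail_quantile)        # 1 - (Tail Quantile)
    iqr = upper - lower                              # interquantile range
    low_threshold = lower - q_mult * iqr
    high_threshold = upper + q_mult * iqr
    return x[(x < low_threshold) | (x > high_threshold)]

# A 9999 missing value code stands out as an extreme value among values near 0.45.
values = np.append(np.random.default_rng(1).normal(0.45, 0.02, 500), 9999)
print(quantile_range_outliers(values, tail_quantile=0.1, q_mult=3))
```

With the defaults (Tail Quantile = 0.1 and Q = 3), the sketch typically flags only the injected 9999 code.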
The Quantile Range Outliers utility is also useful for identifying missing value codes stored within the data. As noted earlier, in some industries missing values are entered as nines (such as 999 and 9999). This utility flags any all-nines value greater than the upper quantile as a suspected missing value code. The utility then enables you to add those missing value codes as a column property in the data table.
Quantile Range Outliers Options
The Quantile Range Outliers panel enables you to specify how outliers are to be calculated and how you want to manage them. Figure 3.6 shows the default Quantile Range Outliers window.
Figure 3.6 Quantile Range Outliers Window
An outlier is considered any value more than Q times the interquantile range from the lower and upper quantiles. You can adjust the value of Q and the size of the interquantile range.
Tail Quantile
The probability for the lower quantile that is used to calculate the interquantile range. The probability for the upper quantile is 1 − (Tail Quantile). For example, a Tail Quantile value of 0.1 means that the interquantile range is between the 0.1 and 0.9 quantiles of the data. The default value is 0.1.
Q
The multiplier that determines which values are considered outliers. Outliers are values more than Q times the interquantile range beyond the Tail Quantile and 1 − (Tail Quantile) quantiles. Large values of Q provide a more conservative set of outliers than small values. The default is 3.
Restrict search to integers
Restricts outlier values to only integer values. This setting limits the search for outliers in order to find industry-specific missing value codes and error codes.
Show only columns with outliers
Limits the list of columns in the report to those that contain outliers.
After the report is displayed using your specifications, there are many ways to act on these extreme values. You can select the outliers in a column by selecting the specified column in the Quantile Range Outliers report.
Select Rows
Selects the rows of outliers in the selected columns in the data table.
Exclude Rows
Turns on the exclude row state for the selected rows. Click Rescan to update the Quantile Range Outliers report.
Color Cells
Colors the cells of the selected outliers in the data table.
Color Rows
Colors the rows containing outliers for the selected columns in the data table.
Add to Missing Value Codes
Adds the selected outliers to the missing value codes column property. Use this option to identify known missing value or error codes within the data. Missing value and error codes are often integers and are sometimes either a positive or negative series of nines. Click Rescan to update the Quantile Range Outliers report.
Change to Missing
Changes the outlier value to a missing value in the data table. Use caution when changing values to missing. Change values to missing only if the data are known to be invalid or inaccurate. Click Rescan to update the Quantile Range Outliers report.
Rescan
Rescans the data after outlier actions have been taken.
Close
Closes the Quantile Range Outliers panel.
Quantile Range Outliers Report
The Quantile Range Outliers report lists all columns with the outliers found using the specified options. The report shows values for the upper and lower quantiles along with their low and high thresholds. Values outside of these threshold limits are considered outliers. The number of outliers in each column is indicated. The values of each outlier are listed in the last column of the report. Outliers that occur more than once in a column are listed with their count in parentheses. To remove columns without outliers from the report, select Show only columns with outliers.
There are several things to look for when reading this report.
Error codes. For some continuous data, suspiciously high integer values are likely to be error codes. For example, if your upper and lower quantile values are all less than 0.5, outliers such as 1049 or -777 are likely to be error codes.
Zeros. Sometimes zeros can indicate missing values. If the majority of your data is reasonably large and you notice zeros as outliers, they are likely to be due to missing data.
Nines Report
The Nines report within the Quantile Range Outliers window lists columns that contain probable missing value codes. A probable missing value code is the highest value in a column that consists entirely of nines (usually 9999) and is also higher than the upper quantile. If the count of such values is high, they are likely to be missing value codes. If the count is very low, you should explore further to determine whether the value is an outlier or a missing value code. The Nines report also shows the upper quantile value.
This report is displayed only when probable missing value codes are identified.
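This detection rule can be sketched in a few lines of Python. The sketch is an illustration of the description above, not JMP's code, and the example values are invented.

```python
import numpy as np

def probable_nines_code(x, tail_quantile=0.1):
    """Return the highest all-nines value above the upper quantile, if any."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    upper = np.quantile(x, 1 - tail_quantile)
    all_nines = [v for v in x
                 if v > upper and float(v).is_integer() and v > 0
                 and set(str(int(v))) == {"9"}]
    return max(all_nines) if all_nines else None

values = [0.44, 0.45, 0.46, 0.47, 0.48] * 4 + [9999]
print(probable_nines_code(values))   # 9999.0
```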
Add Highest Nines to Missing Value Codes
Adds the selected outlier values to the missing value codes column property. You must click Rescan to update the Quantile Range Outliers report.
Change Highest Nines to Missing
Replaces the selected outlier values with missing values in the data table.
Note: The first time you choose an action (such as Change to Missing or Exclude Rows) that changes your data, an alert window warns you to use the Save As command to save your data table as a new file and preserve a copy of your original data. When this window appears, click OK. If you decide to save your new data file, select File > Save As and save the file with a new name.
Robust Fit Outliers
Robust estimates of parameters are less sensitive to outliers than non-robust estimates. Robust Fit Outliers provides several types of robust estimates of the center and spread of your data to determine those values that can be considered extreme. Figure 3.7 shows the default Robust Fit Outliers window.
Figure 3.7 Robust Fit Outliers Window
Robust Fit Outliers Options
Given a robust estimate of the center and spread, outliers are defined as those values that are more than K times the robust spread from the robust center. The Robust Fit Outliers window provides several options for calculating the robust estimates and the multiplier K, as well as tools to manage the outliers that are found.
Huber
Uses Huber M-Estimation to estimate center and spread. This option is the default. See Huber and Ronchetti (2009).
Cauchy
Assumes a Cauchy distribution to calculate estimates of the center and spread. Cauchy estimates have a high breakdown point and are typically more robust than Huber estimates. However, if your data separate into clusters, the Cauchy fit tends to base its estimates on the tighter cluster that contains about half of the data and to ignore the rest.
Quartile
Uses the interquartile range (IQR) to estimate the spread. The estimate of the center is the median, and the estimate of the spread is the IQR divided by 1.34898. Dividing the IQR by this factor makes the spread correspond to one standard deviation when the data are normally distributed. A sketch of this rule appears after this list of options.
K
The multiplier that determines outliers as K times the spread away from the center. Large values of K provide a more conservative set of outliers than small values. The default is 4.
Show only columns with outliers
Limits the list of columns in the report to those that contain outliers.
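For example, the Quartile option described above amounts to the following rule. This is a conceptual Python sketch, not JMP's implementation, and the Huber and Cauchy estimates are not shown.

```python
import numpy as np

def quartile_robust_outliers(x, k=4):
    """Flag values more than k robust spreads from the robust center (Quartile option)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    center = np.median(x)                      # robust center
    q1, q3 = np.quantile(x, [0.25, 0.75])
    spread = (q3 - q1) / 1.34898               # IQR scaled to match a normal standard deviation
    return x[np.abs(x - center) > k * spread]

rng = np.random.default_rng(7)
data = np.append(rng.normal(10, 1, 200), [25.0, -4.0])
print(quartile_robust_outliers(data, k=4))     # typically just the two injected extremes
```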
After the report is displayed using your specifications, there are many ways to act on these extreme values. You can select the outliers for a column by selecting that column's row in the Robust Estimates and Outliers report.
Select Rows
Selects the rows containing outliers for the selected columns in the data table.
Exclude Rows
Sets the Exclude Row state for outliers in the selected columns in the data table. Click Rescan to update the Robust Estimates and Outliers report.
Color Cells
Colors the cells of the selected outliers in the data table.
Color Rows
Colors the rows containing outliers for the selected columns in the data table.
Add to Missing Value Codes
Adds the selected outliers to the missing value codes column property for the selected columns. Use this option to identify known missing value or error codes within the data. Click Rescan to update the Robust Estimates and Outliers report.
Change to Missing
Changes the outlier value to a missing value in the data table. Click Rescan to update the Robust Estimates and Outliers report.
Rescan
Rescans the data after outlier actions have been taken.
Close
Closes the Robust Fit Outliers panel.
Multivariate Robust Outliers
The Multivariate Robust Fit Outliers tool uses the Robust option in the Multivariate platform to examine the relationships between multiple variables. For more information about how the Multivariate platform works, see Correlations and Multivariate Techniques in the Multivariate Methods book.
Outlier Analysis
The Outlier Analysis calculates the Mahalanobis distance from each point to the center of the multivariate normal distribution. This measure relates to contours of the multivariate normal density with respect to the correlation structure. The greater the distance from the center, the more likely the point is an outlier. For more information about the Mahalanobis distance and other distance measures, see the Correlations and Multivariate Techniques chapter in the Multivariate Methods book.
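Conceptually, the calculation resembles the following sketch, in which scikit-learn's Minimum Covariance Determinant estimator stands in for a robust estimate of the center and covariance. This is an analogue for illustration only, not the estimator that JMP uses, and the data are simulated.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], np.eye(3), size=300)
X[:3] += 8                                     # three gross outliers

robust = MinCovDet(random_state=0).fit(X)      # robust center and covariance (MCD)
distance = np.sqrt(robust.mahalanobis(X))      # Mahalanobis distance from the robust center
print(np.argsort(distance)[-3:])               # indices of the most extreme points
```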
After the rows are excluded, you are given the option to either rerun the analysis or close the utility. Rerunning the analysis recalculates the center of the multivariate distribution without those excluded rows. Note that unless you hide the excluded rows in the data table, they still appear in the graph.
You can save the distances to the data table by selecting the Save option from the Mahalanobis Distances red triangle menu.
Figure 3.8 Multivariate Robust Outliers Mahalanobis Distance Plot
Figure 3.8 shows the Mahalanobis distances of 16 different columns. The plot contains an upper control limit (UCL) of 4.82. This UCL is meant to be a helpful guide to show where potential outliers might be. However, you should use your own discretion to determine which values are outliers. For more details about this upper control limit (UCL), see Mason and Young (2002).
Multivariate with Robust Estimates Options
The red triangle menu for Multivariate with Robust Estimates contains numerous options to analyze your multivariate data. For a list and description of these options, see the Correlations and Multivariate Techniques chapter in the Multivariate Methods book.
Multivariate k-Nearest Neighbor Outliers
The basic approach of outlier detection is to consider points that are distant from other points as outliers. One way of determining the distance from a point to other clusters of points is to explore the distances to its nearest neighbors. For each value of K, the Multivariate k-Nearest Neighbor Outliers utility displays a plot of the Euclidean distance from each point to its Kth nearest neighbor. You specify the largest value of K, denoted as k. Plots are provided for K = 1, 2, 3, 5, 8, and so on up to k, skipping values according to the Fibonacci sequence to avoid displaying too many plots.
This approach is sensitive to the specified value of k. A small value of k can miss identifying points as outliers and a large value of k can falsely classify points as outliers:
Suppose that the specified k is small, so that you are only studying a few neighbors. If there is a cluster of more than k points that is far from the rest of the points, then the points within the cluster will have small distances to their nearest neighbors. You may be unable to detect the cluster of outliers.
Suppose that the specified k is large, so that you are studying a large number of neighbors. If there are clusters with fewer than k data points, then the points within these clusters may appear to be outliers. You may overlook the fact that the points form a cluster, interpreting the individual cluster members as outliers instead.
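The distance calculation can be sketched as follows using scikit-learn. This is an illustration, not JMP's code; the data and the list of K values are assumptions based on the description above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_distances(X, k_max=8):
    """Distance from each row to its Kth nearest neighbor for Fibonacci-style values of K."""
    ks = [k for k in (1, 2, 3, 5, 8, 13, 21, 34) if k <= k_max]
    nn = NearestNeighbors(n_neighbors=max(ks) + 1).fit(X)  # +1 because each point is its own closest neighbor
    dist, _ = nn.kneighbors(X)
    return {k: dist[:, k] for k in ks}          # column k skips the self-distance in column 0

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(12, 0.1, (3, 4))])  # small, distant cluster
d = knn_outlier_distances(X, k_max=8)
print(np.argsort(d[8])[-5:])                    # rows with the largest distance to their 8th neighbor
```

In this sketch, the three-point cluster far from the main group stands out at K = 8 but would be missed at K = 1 or K = 2, which illustrates the sensitivity to k described above.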
K-Nearest Neighbor Report
When you select Multivariate k-Nearest Neighbor Outliers from the list of commands, you are asked to specify the value of k to use as an upper bound for the furthest neighbor to be considered. Notice that the default value is set to 8.
The report shows plots for selected values of K up to the specified value k. The value of K for each plot is displayed in its vertical axis label, which is of the form Distance to Neighbor K = <a>, where a is an integer denoting the ath closest neighbor. Each plot shows, for each row, the distance from that row's point to its ath nearest neighbor. Points that have large distances from their neighbors across multiple values of K are likely to be outliers.
The buttons above the plots do the following:
Exclude Selected Rows
Excludes rows corresponding to selected points from further analysis. The rows are assigned the Excluded row state in the data table. You are asked if you want to rerun or close the K Nearest Neighbors report. Rerunning the analysis identifies new nearest neighbors. The plots are updated and the excluded points are not shown.
Scatterplot Matrix
Opens a separate window containing a scatterplot matrix for all columns in the analysis. You can explore potential outliers by selecting them in the K Nearest Neighbors plots and viewing them in the scatterplot matrix.
Close
Closes the K Nearest Neighbors report.
Explore Outliers Utility Options
See the JMP Reports chapter in the Using JMP book for more information about the following options:
Redo
Contains options that enable you to repeat or relaunch the analysis. In platforms that support the feature, the Automatic Recalc option immediately reflects the changes that you make to the data table in the corresponding report window.
Save Script
Contains options that enable you to save a script that reproduces the report to several destinations.
Save By-Group Script
Contains options that enable you to save a script that reproduces the platform report for all levels of a By variable to several destinations. Available only when a By variable is specified in the launch window.
Additional Examples of the Explore Outliers Utility
Multivariate k-Nearest Neighbor Outliers Example
The Water Treatment.jmp data table contains daily measurement values from 38 sensors in an urban waste water treatment plant. You are interested in exploring these data for potential outliers, which could be due to sensor failures, storms, and other unusual situations.
1. Select Help > Sample Data Library and open Water Treatment.jmp.
2. Select Analyze > Screening > Explore Outliers.
3. Select the Sensor Measurements column group and click Y, Columns.
4. Click OK.
5. Select Multivariate k-Nearest Neighbor Outliers.
6. Enter 13 for k-nearest neighbors.
7. Click OK.
Figure 3.9 Outliers in Multivariate k-Nearest Neighbor Outliers Example
Notice the three extreme outliers selected in the K Nearest Neighbors plots in Figure 3.9. Each of these three rows corresponds to a date when the secondary settler in the water treatment plant was reported as malfunctioning. Because these three data points are due to faulty equipment, exclude them from future analyses.
8. Select the three extreme outliers and click Exclude Selected Rows.
You are prompted to Rerun the utility or Close the window.
9. Click Rerun.
10. Type 13 for k-nearest neighbors.
11. Click OK.
Figure 3.10 Outliers in Multivariate k-Nearest Neighbors Example
Now locate the two light-green outliers close to row 400. Notice how they tend to stay close to each other as k increases. These two rows correspond to dates when solids overloads were experienced by the water treatment plant. Even though these data points have a relatively high Distance to Neighbor K=13, because they are due to a situation that you want to include in your study, you do not exclude them. Instead, you keep them in mind as you conduct further analyses.
Explore Missing Values Utility
The presence of missing values in a data set can affect the conclusions made using the data. If, for example, several healthy participants drop out of a longitudinal study and their subsequent data are recorded as missing, the results of the study can be biased toward the unhealthy individuals who remain. Missing data values must not only be identified, they must also be understood before further analysis can be conducted.
The Explore Missing Values utility provides several ways to identify and understand the missing values in your data. It also provides methods for conducting multivariate normal imputation for missing values. These methods assume that data are missing at random, which means that the probability that an observation is missing depends only on the values of the other variables in the study. If you suspect that missing values are not missing at random, then consider using the Informative Missing procedure, which is available in a number of platforms. For more information, see the Model Specification chapter in the Fitting Linear Models book.
Example of the Explore Missing Values Utility
The Arrhythmia.jmp sample data table contains information from 452 patient electrocardiograms (ECGs). The data was originally collected to classify different patterns of ECGs as cardiac arrhythmia. However, there are missing values in this data table. You are primarily interested in exploring these missing values and imputing them when necessary. Since you can only conduct missing value imputation for columns that have a continuous modeling type, you will conduct your analysis in two stages.
Examine Missing Values
1. Select Help > Sample Data Library and open Arrhythmia.jmp.
2. Select Analyze > Screening > Explore Missing Values.
3. Select all columns (280 in all) and click Y, Columns.
4. Click OK. Select the Show only columns with missing checkbox.
Figure 3.11 Missing Value Report
The Missing Columns report shown in Figure 3.11 indicates that only five columns have missing data. Out of a total of 452 rows, Column J has 376 missing values. Because it is largely missing, it is not useful for data analysis, even with imputed values. However, it might be useful to model Column J using the Informative Missing option in a platform that supports this option to see if values are perhaps not missing at random.
Note that the two imputation commands, Multivariate Normal Imputation and Multivariate SVD Imputation, are not shown. A message indicates that imputation is disabled because some columns included in the analysis are categorical. The data table contains several columns that are numeric but have a nominal modeling type; these cannot be used for imputation.
Impute Missing Values
The five columns that have missing values are continuous. You proceed to impute values for the four columns other than Column J using multivariate imputation for the continuous columns in your data table. By doing so, you tacitly assume that the probabilities that values are missing depend only on the values of the continuous variables and not on the values of excluded nominal variables. To conduct this new analysis, you need to launch the Explore Missing Values utility again.
1. Select Analyze > Screening > Explore Missing Values.
2. In the launch window, click the red triangle next to 280 Columns.
You will use the columns filter menu to view only the columns with a Continuous modeling type in the Select Columns list.
3. Select Modeling Type > Uncheck All.
This removes all columns from the Select Columns list.
4. Select Modeling Type > Continuous.
The Select Columns list now contains only the 207 columns that are Continuous.
5. Select all 207 columns. Then Ctrl-click the J column (to deselect it) and click Y, Columns.
6. Click OK.
7. Click Multivariate Normal Imputation.
A window appears and asks whether you want to use a Shrinkage estimator for covariances.
8. Click Yes Shrinkage.
A JMP Alert appears, informing you that you should use the Save As command to preserve your original data.
9. Click OK.
Figure 3.12 Imputation Report
The Imputation Report in Figure 3.12 indicates how many missing values were imputed and the specific imputation details. No missing data remain in the four columns that had missing values.
Launch the Explore Missing Values Utility
Launch the Explore Missing Values modeling utility by selecting Analyze > Screening > Explore Missing Values. Enter the columns of interest into the Y, Columns list.
Note: You can enter only columns that have a Numeric data type in the Explore Missing Values utility.
The Missing Value Report
After you click OK in the launch window, the report opens to show a Commands outline and a Missing Columns report. The commands are the following:
“Multivariate Normal Imputation” (Not available if you entered a Numeric column with a Nominal or Ordinal modeling type in the launch window.)
“Multivariate SVD Imputation” (Not available if you entered a Numeric column with a Nominal or Ordinal modeling type in the launch window.)
Missing Value Report
The Missing Value Report command opens the Missing Columns report, which lists the name of each column and the number of missing values in that column.
Show only columns with missing
Removes columns from the list that do not have missing values.
Close
Closes the Missing Columns report.
Select Rows
Selects the rows in the data table that contain missing values for the column(s) that you select in the Missing Columns report.
Exclude Rows
Applies the excluded row state for rows in the data table that contain missing values for the column(s) that you select in the Missing Columns report.
Color Cells
Colors the cells in the data table that contain missing values for the column(s) that you select in the Missing Columns report.
Color Rows
Colors the rows in the data table that contain missing values for the column(s) that you select in the Missing Columns report.
Missing Value Clustering
Missing Value Clustering provides a hierarchical clustering analysis of the missing data.
The dendrogram to the right of the plot shows clusters of missing data pattern rows. These are the rows that you would obtain by using Tables > Missing Data Pattern.
The dendrogram beneath the plot shows clusters of variables.
Use this report to determine if certain groups of columns tend to have similar patterns of missing values.
The rows of the plot are defined by the missing data patterns; there is a row for each pattern. The columns correspond to the variables. Each red cell indicates a group of missing values for the column listed beneath the plot. Place your cursor in a cell to see the list of values represented. Click in the plot to select missing data pattern rows. Vertical bars appear to indicate the selected patterns.
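A comparable view can be sketched outside JMP with a missingness indicator matrix and hierarchical clustering. This is a conceptual analogue of the report described above, not the JMP implementation; the small data frame and the Ward linkage choice are assumptions.

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical table with missing values scattered across columns A through D.
df = pd.DataFrame({"A": [1, None, 3, None], "B": [None, 2, 3, None],
                   "C": [1, 2, None, 4], "D": [1, 2, 3, 4]})

miss = df.isna().astype(int)                         # 1 where a value is missing
patterns = miss.drop_duplicates()                    # the missing data pattern rows
col_linkage = linkage(miss.T.values, method="ward")  # cluster columns by their missingness
print(patterns)
print(dendrogram(col_linkage, labels=miss.columns.tolist(), no_plot=True)["ivl"])
```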
Missing Value Snapshot
The Missing Value Snapshot shows a cell plot for the missing values. The columns represent the variables. Black cells indicate a missing value. This plot is especially useful in understanding missingness for longitudinal data, where subjects may withdraw from a study before the end of the data collection period.
Multivariate Normal Imputation
The Multivariate Normal Imputation utility imputes missing values based on the multivariate normal distribution. The procedure requires that all variables have a Continuous modeling type. The algorithm uses least squares imputation. The covariance matrix is constructed using pairwise covariances. The diagonal entries (variances) are computed using all non-missing values for each variable. The off-diagonal entries for any two variables are computed using all observations that are non-missing for both variables. In cases where the covariance matrix is singular, the algorithm uses minimum norm least squares imputation based on the Moore-Penrose pseudo-inverse.
Multivariate Normal Imputation allows the option to use a shrinkage estimator for the covariances. The use of shrinkage estimators is a way of improving the estimation of the covariance matrix. For more information about shrinkage estimators, see Schafer and Strimmer (2005).
Note: If a validation column is specified, the covariance matrices are computed using observations from the Training set.
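The least squares (conditional mean) step can be sketched as follows. This is a simplified illustration of least squares imputation under a multivariate normal model, not JMP's implementation; the pairwise covariance construction and the shrinkage option are not reproduced, and the data are simulated.

```python
import numpy as np

def ls_impute_row(x, mean, cov):
    """Impute missing entries of x by their conditional mean given the observed entries."""
    x = np.array(x, dtype=float)
    miss = np.isnan(x)
    if not miss.any():
        return x
    obs = ~miss
    cov_mo = cov[np.ix_(miss, obs)]                # covariance of missing vs. observed entries
    cov_oo = cov[np.ix_(obs, obs)]                 # covariance among observed entries
    # The Moore-Penrose pseudo-inverse handles a singular observed block.
    x[miss] = mean[miss] + cov_mo @ np.linalg.pinv(cov_oo) @ (x[obs] - mean[obs])
    return x

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0, 0], [[1, .8, .2], [.8, 1, .3], [.2, .3, 1]], size=500)
mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
print(ls_impute_row([np.nan, 1.0, 0.5], mean, cov))  # first entry replaced by its predicted value
```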
Multivariate Normal Imputation Report
The imputation report explains the results of the multivariate imputation process. Results include the following:
Method of imputation (either least squares or minimum-norm least squares)
How many values were replaced
Shrinkage estimator on/off
Factor by which the off-diagonals were scaled
How many rows and columns were affected
How many different missing value patterns there were
Once the imputation is complete, the cells corresponding to imputed values in the data table are colored in light blue. If the Missing Columns report is open, it is updated to show no missing values.
Click Undo to undo the imputation and replace the imputed data with missing values.
Multivariate SVD Imputation
The Multivariate SVD Imputation utility imputes missing values using the singular value decomposition (SVD). This utility is useful for data with hundreds or thousands of variables. Because SVD calculations do not require calculation of a covariance matrix, the SVD method is recommended for wide problems that contain large numbers of variables. The procedure requires that all variables have a Continuous modeling type.
The singular value decomposition represents a matrix of observations X as X = UDV′, where U and V are orthogonal matrices and D is a diagonal matrix.
The SVD algorithm used by default in the Multivariate SVD Imputation utility is the sparse Lanczos method, also known as the implicitly restarted Lanczos bidiagonalization method (IRLBA). See Baglama and Reichel (2005). The algorithm does the following:
1. Each missing value is replaced with its column’s mean.
2. An SVD decomposition is performed on the matrix of observations, X.
3. Each cell that had a missing value is replaced by the corresponding element of the UDV′ matrix obtained from the SVD decomposition.
4. Steps 2 and 3 are repeated until the SVD converges to the matrix X.
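A plain NumPy sketch of this loop follows, using a dense truncated SVD in place of the sparse Lanczos method. The rank, tolerance, and example matrix are illustrative assumptions, not JMP's settings.

```python
import numpy as np

def svd_impute(X, rank=2, max_iter=50, tol=1e-6):
    """Iteratively impute missing cells from a rank-limited SVD reconstruction."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)          # step 1: start from column means
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)  # step 2: SVD of the working matrix
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # rank-limited U D V' reconstruction
        new = np.where(miss, approx, X)                        # step 3: replace only the missing cells
        if np.max(np.abs(new - filled)) < tol:                 # step 4: stop when the imputations settle
            return new
        filled = new
    return filled

X = np.array([[1.0, 2.0, 3.0], [2.0, np.nan, 6.0], [3.0, 6.0, np.nan], [4.0, 8.0, 12.0]])
print(svd_impute(X, rank=1))
```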
Imputation Method Window
When you click Multivariate SVD Imputation, the Imputation Method window opens to show recommended settings.
Number of Singular Vectors
Number of singular vectors that are computed and used in the imputation.
Note: It is important not to specify too many singular vectors; otherwise, the SVD reproduces the working matrix exactly and the imputations do not change from iteration to iteration.
Maximum Iterations
The number of iterations used in imputing the missing values.
Show Iteration Log
Opens a Details report that shows the number of iterations and gives details on the criteria.
For large problems, a progress bar shows how many dimensions the SVD has completed. You can stop the imputation and use that number of dimensions at any time.
Multivariate SVD Imputation Report
The imputation report summarizes the results of the SVD imputation process. Results include the following:
Method of imputation
How many values were replaced
How many rows and columns were affected
Once the imputation is complete, the Missing Columns report is automatically shown indicating no missing values in the columns that were imputed. Imputed values are displayed in light blue.
Click Undo to undo the imputation and replace the imputed data with missing values.
Explore Missing Values Utility Options
See the JMP Reports chapter in the Using JMP book for more information about the following options:
Redo
Contains options that enable you to repeat or relaunch the analysis. In platforms that support the feature, the Automatic Recalc option immediately reflects the changes that you make to the data table in the corresponding report window.
Save Script
Contains options that enable you to save a script that reproduces the report to several destinations.
Save By-Group Script
Contains options that enable you to save a script that reproduces the platform report for all levels of a By variable to several destinations. Available only when a By variable is specified in the launch window.
Make Validation Column Utility
Validation is the process of using part of a data set to estimate model parameters and using another part to assess the predictive ability of a model. With complex data, this can reduce the risk of model overfitting.
A validation column partitions the data into two or three parts.
The training set is used to estimate the model parameters.
The validation set is used to help choose a model with good predictive ability.
The testing set checks the model’s predictive ability after a model has been chosen.
A validation column can be used as a validation method in the Fit Model platform.
Example of the Make Validation Column Utility
The Lipid Data.jmp data table contains blood measurements, physical measurements, and questionnaire data from 95 subjects at a California hospital. You are interested in creating a validation column to use for validation in future analyses.
1. Select Help > Sample Data Library and open Lipid Data.jmp.
2. Select Analyze > Distribution.
3. Assign Gender to the Y, Columns role. Click OK.
Figure 3.13 Distribution of Gender in Lipid Data.jmp
Figure 3.13 illustrates the distribution of Gender in the data set. Notice that there is not an equal proportion of males and females represented. Because there is a scarcity of females within the data, you want to be sure to balance the genders across the validation and training sets.
4. Select Analyze > Predictive Modeling > Make Validation Column.
5. Click Stratified Random.
6. Select Gender as the column used for validation holdback.
7. Click OK.
A Validation column is added to the data table. You can explore the distribution of the validation and training sets by creating a Mosaic Plot.
8. Select Analyze > Fit Y by X.
9. Assign Validation to the Y, Response role and Gender to the X, Factor role.
10. Click OK.
Figure 3.14 Distribution of Gender across Validation and Training Sets
Figure 3.14 illustrates the distribution of Gender across each of the validation and training sets. Note that about 75% of both females and males are in the training set and about 25% of both females and males are in the validation set.
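Outside JMP, this kind of stratified 75/25 allocation can be sketched with pandas. This is a conceptual illustration of the Stratified Random method described in the next section, not JMP's algorithm; the gender counts are invented stand-ins for Lipid Data.jmp.

```python
import numpy as np
import pandas as pd

def stratified_validation_column(df, strat_col, train=0.75, seed=1234):
    """Label rows Training or Validation so that each level of strat_col is split about train/(1-train)."""
    rng = np.random.default_rng(seed)
    labels = pd.Series("Validation", index=df.index)
    for _, idx in df.groupby(strat_col).groups.items():
        idx = rng.permutation(np.asarray(idx))       # shuffle rows within this level
        n_train = int(round(train * len(idx)))
        labels.loc[idx[:n_train]] = "Training"
    return labels

# Invented stand-in for Lipid Data.jmp: 95 subjects with unbalanced genders.
df = pd.DataFrame({"Gender": ["M"] * 73 + ["F"] * 22})
df["Validation"] = stratified_validation_column(df, "Gender")
print(pd.crosstab(df["Gender"], df["Validation"], normalize="index"))  # about 75/25 in each gender
```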
Launch the Make Validation Column Utility
You can launch the Make Validation Column utility in two ways:
Select Analyze > Predictive Modeling > Make Validation Column. See “Make Validation Column Window”.
Click Validation in a platform launch window. See “Click Validation in a Platform Launch”.
Make Validation Column Window
In the Make Validation Column window, you specify the proportion or number of rows for each of your holdback sets and then you select a method for constructing the holdback sets.
Figure 3.15 Make Validation Column Window
Next to Training Set, Validation Set, and Test Set, enter values that represent the proportions or numbers of rows that you would like to include in each of these sets. The default values construct a training set that contains about 75% of the rows and a validation set that contains about 25% of the rows.
Enter a name for your validation column next to New Column Name.
There are five methods available to create the holdback sets.
Formula Random
Partitions the data into sets based on the allocations entered. For example, if the default values are entered, each row has a 0.75 probability of being included in the training set and a 0.25 probability of being included in the validation set. The formula is saved to the column. To see it, click the plus icon to the right of the column name in the Columns panel.
Fixed Random
Partitions the data into sets based on the allocations entered. For example, if the default values are entered, each row has a 0.75 probability of being included in the training set and a 0.25 probability of being included in the validation set. You can specify a random seed that enables you to reproduce the allocation in the future. No formula is saved to the column.
Stratified Random
Partitions the data into balanced sets based on levels of columns that you specify. Use this option when you want a balanced representation of a column’s levels in each of the training, validation, and testing sets.
When you click Stratified Random, a window appears that enables you to select one or more columns by which to stratify the data. When you click OK, the validation column is added to the data table. As in the Fixed Random case, rows are randomly assigned to the holdback sets based on the specified allocations. However, this is done at each level or combination of levels of the stratifying columns.
A column is added to the data table with a Notes property that gives the stratifying variables.
Grouped Random
Partitions the data into sets in such a way that entire levels of a specified column or combinations of levels of two or more columns are placed in the same holdback set. Use this option when splitting levels across holdback sets is not desirable.
When you click Grouped Random, a window appears that enables you to select one or more columns to be grouping columns. When you click OK, the levels are randomly assigned to holdback sets. When a level contains more rows than the proportion or number that you specify, the entire level still stays in its assigned holdback set, which can leave fewer rows for the training set. Because of this, the sizes of the resulting sets can vary slightly from the sizes that you specified.
Cutpoint
Partitions the data into sets based on time series cutpoints. Use this option when you want to assign your data to holdback sets based on time periods.
When you click Cutpoint, a window appears that enables you to select one or more columns to define time periods. When you click OK, a JMP Alert appears that shows the assigned cutpoints. A column that reflects this assignment is added to the data table. The training set consists of rows between the first cutpoint and the second cutpoint. The validation set consists of rows between the second and third cutpoints. The test set consists of the remaining rows. These sets are chosen to reflect the proportions or numbers of rows that you specified.
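A sketch of the Cutpoint idea on a time-ordered table follows. It is illustrative only, not JMP's algorithm; the Date column, the 75/25 split, and the use of row ranks to place the cutpoints are assumptions.

```python
import numpy as np
import pandas as pd

def cutpoint_validation(df, time_col, train=0.75, valid=0.25):
    """Assign contiguous blocks of time-ordered rows to Training, Validation, and Test sets."""
    order = df[time_col].rank(method="first")          # 1 = earliest row
    n = len(df)
    cut1 = round(train * n)                            # cutpoint ending the training block
    cut2 = round((train + valid) * n)                  # cutpoint ending the validation block
    labels = np.where(order <= cut1, "Training",
                      np.where(order <= cut2, "Validation", "Test"))
    return pd.Series(labels, index=df.index)

df = pd.DataFrame({"Date": pd.date_range("2024-01-01", periods=20, freq="D")})
print(cutpoint_validation(df, "Date").value_counts())  # 15 Training, 5 Validation with the defaults
```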
Click Validation in a Platform Launch
Use this method if you are in a platform launch window and need to construct a validation column quickly. Note the following:
The platform must support a Validation column.
No columns should be selected in the Select Columns list.
Click the Validation button in the platform launch window. A Make Validation Column window appears with default settings of 0.7 for the Training Set, 0.3 for the Validation Set, and 0.0 for the Test Set.
1. Enter your desired proportions or numbers next to Training Set, Validation Set, and Test Set.
2. Type a name for the new column next to New Column Name.
3. Click OK.
The new column appears in the data table with a formula. In the launch window, the new column is assigned to the Validation role.
Note: Launching the Make Validation Column utility through a platform launch window is equivalent to selecting the Formula Random method from Analyze > Predictive Modeling > Make Validation Column. The Fixed Random, Stratified Random, Grouped Random, and Cutpoint methods are not available.