Hypothesis testing

Hypothesis testing is a statistical tool used to determine whether a result is statistically significant, that is, whether the result you obtained occurred by chance or reflects a real effect.

In this regard, the Oracle developers at https://docs.oracle.com/cd/A57673_01/DOC/server/doc/A48506/method.htm suggest a workflow for better performance tuning. The typical steps they suggest are as follows:

  • Set clear goals for tuning
  • Create minimum repeatable tests
  • Test the hypothesis
  • Keep all records
  • Avoid common errors
  • Stop tuning when the objectives are achieved

Usually, the observed value t_obs of the test statistic T is computed first. Then the probability of obtaining a value at least as extreme under the null hypothesis, called the p-value, is calculated. Finally, the null hypothesis is rejected in favor of the alternative hypothesis if and only if the p-value is less than the significance level (the selected probability threshold).
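
As a minimal illustration of this workflow, the following sketch computes the p-value of a hypothetical observed chi-squared statistic and applies the decision rule. It assumes the Apache Commons Math library (commons-math3, already a Spark dependency) is on the classpath, and the numbers are made up for illustration:

import org.apache.commons.math3.distribution.ChiSquaredDistribution;

double tObs = 12.5;          // hypothetical observed value of the test statistic T
int degreesOfFreedom = 5;    // hypothetical degrees of freedom
double alpha = 0.05;         // significance level

// p-value: probability of observing a statistic at least as extreme
// as tObs under the null hypothesis (right tail of the chi-squared distribution)
ChiSquaredDistribution dist = new ChiSquaredDistribution(degreesOfFreedom);
double pValue = 1.0 - dist.cumulativeProbability(tObs);

if (pValue < alpha) {
  System.out.println("Reject the null hypothesis (p = " + pValue + ")");
} else {
  System.out.println("Fail to reject the null hypothesis (p = " + pValue + ")");
}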

To find out more, refer to the following publication: R. A. Fisher et al., Statistical Tables for Biological, Agricultural and Medical Research, 6th ed., Table IV, Oliver & Boyd, Ltd., Edinburgh.

Here are two rules of thumb (although these may vary depending on your data quality and types):

  • If the p-value is p > 0.05, accept your hypothesis; the deviation is small enough that chance alone could account for it. A p-value of 0.6, for example, means that there is a 60% probability that a deviation at least this large would arise by chance alone, which is within the range of an acceptable deviation.
  • If the p-value is p < 0.05, reject your hypothesis and conclude that some factor other than chance is responsible for the deviation. A p-value of 0.01, for example, means that there is only a 1% chance that the deviation is due to chance alone, so other factors must be involved, and these need to be addressed.

However, these two rules might not be applicable to every hypothesis test. In the next subsection, we will show an example of hypothesis testing using Spark MLlib.

Hypothesis testing using ChiSqTestResult of Spark MLlib

According to the API documentation provided by Apache at http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing, the current implementation of Spark MLlib supports Pearson's chi-squared (χ2) tests for goodness of fit and independence:

  • Whether a goodness-of-fit test or an independence test is conducted is determined by the input data type
  • The goodness-of-fit test requires an input of type Vector (mostly dense vectors, although it also works with sparse vectors), whereas the independence test requires a Matrix as input

In addition to these, Spark MLlib supports the input type RDD[LabeledPoint] to enable feature selection via chi-squared independence tests, for example for SVM- or regression-based models; here the Statistics class provides the necessary methods to run Pearson's chi-squared tests.

Additionally, Spark MLlib provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test for equality of probability distributions. It also provides online implementations of some tests to support use cases such as A/B testing. These tests may be performed on a Spark Streaming DStream[(Boolean, Double)], where the first element of each tuple indicates the control group (false) or treatment group (true) and the second element is the value of an observation.

However, for brevity and due to page limitations, these two techniques will not be discussed here in detail. The following example demonstrates how to run and interpret hypothesis tests through ChiSqTestResult. The example shows three tests: a goodness-of-fit test on a dense vector created from the breast cancer diagnosis dataset, an independence test on a randomly created matrix, and finally an independence test on a contingency table built from the cancer dataset itself.

Step 1: Load required packages

Here is the code to load the required packages:

import java.io.BufferedReader; 
import java.io.FileReader; 
import java.io.IOException; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.mllib.linalg.DenseVector; 
import org.apache.spark.mllib.linalg.Matrices; 
import org.apache.spark.mllib.linalg.Matrix; 
import org.apache.spark.mllib.linalg.Vector; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.stat.Statistics; 
import org.apache.spark.mllib.stat.test.ChiSqTestResult; 
import org.apache.spark.rdd.RDD; 
import org.apache.spark.sql.SparkSession; 
import com.example.SparkSession.UtilityForSparkSession; 

Step 2: Create a Spark session

The following code helps us to create a Spark Session:

static SparkSession spark = UtilityForSparkSession.mySession(); 

The implementation of the UtilityForSparkSession class is as follows:

public class UtilityForSparkSession {
  public static SparkSession mySession() {
    SparkSession spark = SparkSession
        .builder()
        .appName("JavaHypothesisTestingOnBreastCancerData")
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "E:/Exp/")
        .getOrCreate();
    return spark;
  }
}

Step 3: Perform the goodness of fit test

First, we need to prepare a dense vector from a categorical dataset such as the Wisconsin Breast Cancer Diagnosis dataset. As we have already provided many examples on this dataset, we will not discuss the data exploration any further in this section. The following line of code collects the vector that we created using the myVector() method:

Vector v = myVector(); 

The implementation of the myVector() method is as follows:

public static Vector myVector() throws NumberFormatException, IOException {
    // 'path' points to the breast cancer dataset (declared in step 5)
    BufferedReader br = new BufferedReader(new FileReader(path));
    String line = null;
    Vector v = null;
    while ((line = br.readLine()) != null) {
      String[] tokens = line.split(",");
      double[] features = new double[30];
      for (int i = 2; i < features.length; i++) {
        features[i - 2] = Double.parseDouble(tokens[i]);
      }
      // each iteration overwrites v, so the dense vector
      // built from the last row of the file is returned
      v = new DenseVector(features);
    }
    br.close();
    return v;
  }

Now let's compute the goodness of fit. Note that if a second vector to test against is not supplied as a parameter, the test automatically runs against a uniform distribution:

ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(v); 

Now let's print the result of the goodness-of-fit test using the following:

System.out.println(goodnessOfFitTestResult + "\n");

Note that the summary of the test includes the p-value, the degrees of freedom, the test statistic, the method used, and the null hypothesis. We got the following output:

Chi squared test summary: 
method: pearson 
degrees of freedom = 29  
statistic = 4528.611649568829  
pValue = 0.0  

There is a very strong presumption against a null hypothesis: the observed follows the same distribution as expected.

Since the p-value is essentially zero, far below the 0.05 significance level, we reject the null hypothesis that the observed data follows the expected (uniform) distribution.
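
Rather than parsing the printed summary, the individual values can also be read programmatically from the ChiSqTestResult object, which is handy for automating the decision. Here is a minimal sketch (the accessor names below are the ones exposed by the MLlib API):

double pValue = goodnessOfFitTestResult.pValue();
double statistic = goodnessOfFitTestResult.statistic();
int degreesOfFreedom = goodnessOfFitTestResult.degreesOfFreedom();
String method = goodnessOfFitTestResult.method();

// automate the accept/reject decision at the 5% significance level
if (pValue < 0.05) {
  System.out.println("Rejected: " + goodnessOfFitTestResult.nullHypothesis());
}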

Step 4: An independence test on the contingency matrix

First, let's create a 4x3 contingency matrix randomly. Since Matrices.dense() expects its entries in column-major order, the following call produces a matrix whose three columns are (1.0, 3.0, 5.0, 2.0), (4.0, 6.0, 1.0, 3.5), and (6.9, 8.9, 10.5, 12.6):

Matrix mat = Matrices.dense(4, 3, new double[] { 1.0, 3.0, 5.0, 2.0,
    4.0, 6.0, 1.0, 3.5, 6.9, 8.9, 10.5, 12.6 });

Now let's conduct Pearson's independence test on the input contingency matrix:

ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat); 

Now let's print the summary of the test result, including the p-value and the degrees of freedom:

System.out.println(independenceTestResult + "\n");

We got the following summary:

Chi squared test summary: 
method: pearson 
degrees of freedom = 6  
statistic = 6.911459343085576  
pValue = 0.3291131185252161  
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.  

Step 5: An independence test on the contingency table

First, let's prepare the (label, feature) pairs as an RDD of LabeledPoint from the cancer dataset; Spark will construct the contingency table from these pairs internally:

static String path = "breastcancer/input/wdbc.data";

RDD<String> lines = spark.sparkContext().textFile(path, 2);
JavaRDD<LabeledPoint> linesRDD = lines.toJavaRDD().map(new Function<String, LabeledPoint>() {
    public LabeledPoint call(String line) {
      String[] tokens = line.split(",");
      double[] features = new double[30];
      for (int i = 2; i < features.length; i++) {
        features[i - 2] = Double.parseDouble(tokens[i]);
      }
      Vector v = new DenseVector(features);
      if (tokens[1].equals("B")) {
        return new LabeledPoint(1.0, v); // benign
      } else {
        return new LabeledPoint(0.0, v); // malignant
      }
    }
  });

Statistics.chiSqTest() constructs a contingency table from the raw (feature, label) pairs and uses it to conduct the independence test. Now let's run the test, which returns a ChiSqTestResult for every feature against the label:

ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(linesRDD.rdd()); 

Now let's observe the test result for each column (that is, for each of the 30 features) using the following code segment:

int i = 1;
for (ChiSqTestResult result : featureTestResults) {
  System.out.println("Column " + i + ":");
  System.out.println(result + "\n");
  i++;
}
 
Column 1: 
Chi-squared test summary: 
method: Pearson 
degrees of freedom = 455  
statistic = 513.7450859274513  
pValue = 0.02929608473276224  
Strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent. 
 
Column 2: 
Chi-squared test summary: 
method: Pearson 
degrees of freedom = 478  
statistic = 498.41630331377735  
pValue = 0.2505929829141742  
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent. 
 
Column 3: 
Chi-squared test summary: 
method: Pearson 
degrees of freedom = 521  
statistic = 553.3147340697276  
pValue = 0.1582572931194156  
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent. 
. 
. 
Column 30: 
Chi-squared test summary: 
method: Pearson 
degrees of freedom = 0  
statistic = 0.0  
pValue = 1.0  
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.  

From this result, we can see that for some features (that is, columns) we have a large p-value compared to the others. Readers are therefore advised to select a proper dataset and perform the hypothesis test before applying hyperparameter tuning. We give no concrete recipe in this regard, since the result will vary with the dataset you have.
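
To hint at how these results can drive feature selection, the following sketch (our own illustration, not part of the original example) keeps only the indices of the features whose independence test rejects the null hypothesis at the 5% level; such columns are significantly dependent on the label and are the more promising candidates for training:

java.util.List<Integer> selectedFeatures = new java.util.ArrayList<>();
for (int col = 0; col < featureTestResults.length; col++) {
  // keep the feature only if it is significantly dependent on the label
  if (featureTestResults[col].pValue() < 0.05) {
    selectedFeatures.add(col + 1); // 1-based column index, as printed above
  }
}
System.out.println("Columns dependent on the label: " + selectedFeatures);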

Hypothesis testing using the Kolmogorov–Smirnov test from Spark MLlib

Since Spark release 1.1.0, Spark has also provided the facility of performing hypothesis testing through the Kolmogorov-Smirnov (KS) test, which can also be applied to real-time streaming data. The p-value it reports is the probability of obtaining a test statistic at least as extreme as the one actually observed, under the assumption that the null hypothesis is true.
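
For a quick flavor, here is a minimal sketch of the 1-sample, 2-sided KS test against a standard normal distribution N(0, 1); the sample values are made up, and the JavaSparkContext is derived from the session created earlier:

import java.util.Arrays;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.1, 0.15, 0.2, 0.3, 0.25));

// test whether the sample is drawn from N(0, 1)
KolmogorovSmirnovTestResult testResult =
    Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0);
System.out.println(testResult);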

Tip

For more details, interested readers should refer to the Java class (JavaHypothesisTestingKolmogorovSmirnovTestExample.java) in the Spark distribution under the following directory: spark-2.0.0-bin-hadoop2.7/examples/src/main/java/org/apache/spark/examples/mllib.

Streaming significance testing of Spark MLlib

Other than the Kolmogorov-Smirnov test, Spark also supports streaming significance testing, which is an online implementation of hypothesis testing, for example for A/B testing. These tests can be performed on a Spark Streaming DStream (a more technical discussion of this topic will be carried out in Chapter 9, Advanced Machine Learning with Streaming and Graph Data).

The MLlib-based streaming significance testing supports the following two parameters (a usage sketch follows the list):

  • peacePeriod: This is the number of initial data points from the stream to ignore. It is used to mitigate novelty effects at the start of the stream.
  • windowSize: This is the number of past batches over which to perform hypothesis testing. If you set its value to 0, it will perform cumulative processing using all the prior batches received and processed.
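
A minimal sketch of how these two parameters are wired together through the StreamingTest class is shown below; the DStream named observations is assumed to have been created elsewhere from your streaming source:

import org.apache.spark.mllib.stat.test.BinarySample;
import org.apache.spark.mllib.stat.test.StreamingTest;
import org.apache.spark.mllib.stat.test.StreamingTestResult;
import org.apache.spark.streaming.api.java.JavaDStream;

// observations: JavaDStream<BinarySample>, where each sample carries
// the group flag (false = control, true = treatment) and the observed value
StreamingTest streamingTest = new StreamingTest()
    .setPeacePeriod(10)      // ignore the first 10 data points
    .setWindowSize(0)        // 0 = cumulative processing over all past batches
    .setTestMethod("welch"); // Welch's t-test

JavaDStream<StreamingTestResult> results = streamingTest.registerStream(observations);
results.print();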

Interested readers should refer to the Spark API documentation at http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing.
