Identically distributed

Let's move on to the second sub-requirement, the one regarding the distribution of the independent variables. First, we need to understand what distribution means here: we are talking about a frequency distribution or probability distribution. You can grasp it easily through the classic example of a dice.

You take a dice and ask yourself, what are the possible outcomes of rolling it? For the sake of simplicity, we will exclude the possibility of the dice getting lost or eaten by your dog, and therefore take as the list of possible outcomes the numbers from one to six. We formally define this as follows: the random process of throwing a dice has a sample space composed of the integers from one to six.

You now ask yourself, is it more probable to get a one or a three? As you already know, there is an equal probability of getting a one or a three, and there is actually the same probability of getting every number from one to six when rolling a (fair) dice. What is the probability distribution of this random process? Let's first compute the probability associated with every possible outcome. Considering that the total probability of a sample space is, by definition, equal to one, and that every possible outcome has the same probability, we can compute this as one divided by the total number of possible outcomes, that is one divided by six, and therefore a number around 0.167. We can then describe the probability distribution as follows:

Number  Probability
1       0.167
2       0.167
3       0.167
4       0.167
5       0.167
6       0.167
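
The table above can be reproduced with a few lines of code. What follows is only a minimal sketch in Python (our choice for illustration, not something the text prescribes), and the names `faces` and `distribution` are made up for the example.

```python
from fractions import Fraction

# Sample space of a fair six-sided dice: the integers from one to six.
faces = range(1, 7)

# Every outcome is equally likely, so each one gets 1 / (number of outcomes).
distribution = {face: Fraction(1, len(faces)) for face in faces}

print("Number  Probability")
for face, probability in distribution.items():
    print(f"{face:<7} {float(probability):.3f}")

# The probabilities over the whole sample space add up to exactly one.
total = sum(distribution.values())
print("Total:", total)
```

Using `Fraction` keeps each probability as the exact value 1/6, so the total is exactly one even though the rounded figures printed in the table are approximations.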

 

You now have a sense of what a probability distribution is. Let's apply it to our context. What is the probability distribution of our independent variable? There are two possible ways to answer this question:

  • You already know the answer from some outside knowledge of the variable. Take the sex variable, for example: it is roughly equally distributed between males and females overall, and therefore has a probability of about 0.5 for each of its two possible values
  • You don't know the answer and you try to get it from your data 

This latter answer is certainly the easiest one to reach for; nevertheless, it requires you to answer some more questions:

  • Is your data a sample from an originating population, or is it the population itself?
  • If a sample, is your sample of the independent variable representative of the population from which it was drawn?

While the first question should usually get an easy answer, since we should know how our data was generated, the second question requires a careful analysis of the sampling methodology applied and an overall evaluation of the size of the sample, usually based on some formula defining the minimum sample size required to obtain a representative sample. I will point you to some resources on this topic.
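
To give an idea of what such a formula can look like, here is a minimal sketch based on Cochran's formula for estimating a proportion, with an optional finite population correction. This is one commonly cited option and not necessarily the formula the resources mentioned above rely on; the function name `minimum_sample_size` and its default values are assumptions made for illustration.

```python
import math

def minimum_sample_size(margin_of_error, z=1.96, p=0.5, population=None):
    """Cochran's formula for estimating a proportion.

    z=1.96 corresponds to a 95% confidence level and p=0.5 is the most
    conservative guess for the unknown proportion. Both defaults are
    illustrative assumptions, not values taken from the text.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    if population is not None:
        # Finite population correction: small populations need fewer units.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(minimum_sample_size(0.05))
print(minimum_sample_size(0.05, population=1000))
```

Keep in mind that a large enough sample is only part of the story: a sample drawn with a biased sampling method will not be representative, whatever its size.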

OK then, let's imagine that you now know what the probability distribution of your independent variable should be. What should you look for now? Well, you should compare the probability distribution you actually observe in your data with the one you know to be the right one.

Let's go back to our dice example and imagine we now have a sample of outcomes drawn from a population of 1,000 rolls. Imagine that we have already checked that the sample is representative; what should you look for now? You should definitely look at the probability distribution of your sample, computing the relative frequency of each outcome, that is, the number of occurrences divided by the size of the sample. The desired final output of this computation should resemble the table we have seen above, showing an equal probability of occurrence for each possible outcome.

Should they be exactly equal? No. We are not dealing with physics or chemistry here; a reasonably similar probability distribution will do the job.
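
The comparison just described can be sketched as follows. This is only an illustration: the rolls are simulated rather than real data, the helper names `observed_distribution` and `max_deviation` are made up, and the tolerance of 0.05 is an arbitrary assumption, since what counts as reasonably similar remains a judgment call.

```python
import random
from collections import Counter

def observed_distribution(outcomes, sample_space):
    """Relative frequency of each value of the sample space in `outcomes`."""
    counts = Counter(outcomes)
    return {value: counts[value] / len(outcomes) for value in sample_space}

def max_deviation(observed, theoretical):
    """Largest absolute gap between observed and theoretical probabilities."""
    return max(abs(observed[value] - theoretical[value]) for value in theoretical)

faces = range(1, 7)
theoretical = {face: 1 / 6 for face in faces}

# Simulated stand-in for real data: 1,000 rolls of a fair dice.
rng = random.Random(42)  # fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(1000)]

observed = observed_distribution(rolls, faces)
for face in faces:
    print(f"{face}  observed={observed[face]:.3f}  expected={theoretical[face]:.3f}")

# Hypothetical tolerance: an assumption for this sketch, not a standard value.
TOLERANCE = 0.05
print("Reasonably similar:", max_deviation(observed, theoretical) <= TOLERANCE)
```

A formal goodness-of-fit test would be a more rigorous way to make the same comparison, but a simple look at the largest gap is often enough to spot a distribution that is clearly off.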

Two final caveats:

  • We are not assuming here that all outcomes should have the same probability of occurrence. If we know that our dice is loaded and will therefore give a higher probability of occurrence to two as an outcome, we will look for a probability distribution within our sample with a higher probability for two, and this will be completely correct.
  • We are checking this assumption by comparing the observed probability distribution with the theoretical one, considering that if all observations are drawn from the same theoretical distribution and the sample is representative of the population, the final observed distribution should be close to the theoretical one. What if there are some observations that are not drawn from the same theoretical distribution, such as some records coming after a structural break in the process? We have two answers here. One, if your sample is large enough, you should draw smaller sub-samples and compare the distribution observed in each of them with the theoretical one. Two, if these anomalous records are few in number, they should not lead you to conclude that the assumption is not satisfied, and you should go on with your modeling activity.
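
Both caveats can be illustrated with one small sketch. It assumes a hypothetical loaded dice that favours two, uses made-up records with a structural break halfway through, and splits the sample into consecutive sub-samples (a choice made here because a structural break follows the order of the records; random sub-samples are another option). The helper `sub_sample_deviations` is an illustrative name, not an existing library function.

```python
from collections import Counter

def observed_distribution(outcomes, sample_space):
    """Relative frequency of each value of the sample space in `outcomes`."""
    counts = Counter(outcomes)
    return {value: counts[value] / len(outcomes) for value in sample_space}

def sub_sample_deviations(outcomes, theoretical, n_chunks):
    """Split `outcomes`, in order, into `n_chunks` consecutive sub-samples and
    return, for each one, the largest gap from the theoretical distribution."""
    size = len(outcomes) // n_chunks
    deviations = []
    for i in range(n_chunks):
        chunk = outcomes[i * size:(i + 1) * size]
        observed = observed_distribution(chunk, theoretical)
        deviations.append(max(abs(observed[v] - theoretical[v]) for v in theoretical))
    return deviations

# Theoretical distribution of a hypothetical loaded dice that favours two.
loaded = {1: 0.1, 2: 0.5, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1}

# Made-up records: the first 600 follow the loaded distribution, the last 600
# look like a fair dice, mimicking a structural break halfway through.
before_break = [1, 2, 2, 2, 2, 2, 3, 4, 5, 6] * 60
after_break = [1, 2, 3, 4, 5, 6] * 100
records = before_break + after_break

print(sub_sample_deviations(records, loaded, n_chunks=4))
```

In this constructed example the first two sub-samples match the loaded distribution, while the last two show a large gap on the probability of two, which is the kind of signal that points to records generated by a different process.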