Summary

Outliers must be considered when exploring real data. This chapter has presented techniques for spotting them as part of a systematic process that lets you determine the root cause of each outlier. Outlier handling can also be automated, but giving a system complete autonomy is risky because it may delete perfectly good data. It is often better to automate only the checking, so that possible outliers in unseen data are highlighted and a human makes the final decision.
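
As a minimal sketch of automated checking that highlights values for review instead of deleting them, the example below flags values in a new batch that fall outside limits learned from earlier data. The 1.5 × IQR rule is used here only as an assumed stand-in for whichever detection technique you prefer, and the function names and sample values are hypothetical.

```python
from statistics import quantiles


def iqr_fences(reference, k=1.5):
    """Return (low, high) limits from reference data using the k * IQR rule.

    Assumption: the IQR rule is an illustrative choice, not the only option.
    """
    q1, _, q3 = quantiles(reference, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, fences):
    """Return (index, value) pairs outside the fences; nothing is removed."""
    low, high = fences
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical data: limits are learned from data already reviewed...
reference = [10, 12, 11, 13, 12, 11, 10, 14, 12, 13]
fences = iqr_fences(reference)

# ...and applied to unseen data, which is left intact for a human to inspect.
new_batch = [12, 11, 95, 13, -40]
flagged = flag_outliers(new_batch, fences)
print(flagged)  # [(2, 95), (4, -40)]
```

Because the function only reports positions and values, the decision to correct, keep, or drop each flagged point stays with the person reviewing it.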

Bear in mind that real data never behaves as well as fake data. What matters is being able to quickly identify which values could be outliers, and then to work out whether they really are. This chapter has given you some tools to help with this.

Another big issue with real data is missing values. As we shall see in the next chapter, it is important to determine some rules to handle these.
