Chapter 11. Taking Stock

Let us take stock of where we are. One of the aims of this book was to give an overview of the process of data exploration and to demonstrate that there is a lot to it. This need not be daunting, especially if the explorer is armed with sharp yet flexible tools and has the ability and confidence to use them.

The methodology first mentioned in Chapter 1, Setting the Scene, shows how a typical data mining activity is an iterative process, but one with a general direction that starts from the requirements and ends with the benefits. By its nature, a book has to present things in a linear order, but I hope it is clear that the chapter ordering will not generally apply when real-world data is handled. Indeed, not all the stages have to happen; for example, the requirements could be met simply by importing and visualizing data.

Another aim was to provide real examples in sufficient detail, with the processes available to download. This allows them to be re-used without having to invent them first. While it is not necessarily time consuming to create RapidMiner Studio processes (the clue is in the name), a helping hand or a Hello World example to start from can sometimes save a lot of time. The product is huge, so knowing everything is a challenge, and knowing how all the operators fit together is another. What is perhaps missing is context about what to do when a certain type of activity needs to be performed. This context, in the form of the exploration of data, is one route into the book, and the processes and techniques that are shown should allow easier re-use.

The final aim was to show what could be possible. Sometimes seeing something done in a different or unexpected way leads to new ideas and can certainly save time. I am certain that new examples and new approaches will be invented, since I am by no means the only RapidMiner Studio practitioner; there are plenty of creative people out there. This is in fact one of the very desirable things about RapidMiner. Once you have arrived at a certain state of knowledge and confidence, building processes becomes extremely easy and almost second nature. This frees the mind from having to worry about whether something is possible and allows the problem itself to be solved. This is the best bit about data exploration and data mining: no two problems are the same, and there are tremendous opportunities to be creative.

Exploring new techniques

Of course, there is more to RapidMiner in general than this book has covered, and there is certainly more to data exploration. The interested reader is encouraged to keep finding out more because, in my experience, new techniques lead to new insights and results in a self-propelling virtuous circle. If only there were more time in the day.

The following sections give you a short list of areas that are well worth looking into.

Time series

There are many examples of time series in the real world, including stock prices, tree ring data, temperature records, sunspots, and audio files. RapidMiner Studio has an extension for series data; in fact, the Window operator is a part of it. This book has only scratched the surface of time series.

Web mining

Text mining was briefly touched upon, but there is a great deal more that could be done to explore data derived from web pages or feed APIs. There will never be a shortage of data from the web.

Using R

RapidMiner Studio integrates with R, the de facto standard package for statistical analysis in the academic world and increasingly outside it. R has a fantastic range of packages covering a huge array of subject areas, with new ones right at the leading edge being added all the time. An R script can easily be integrated into RapidMiner Studio, so if there is something missing from RapidMiner Studio (and there sometimes is), it is almost certainly available in R.

R also has very good graphics, and there is no reason not to use it as part of an exploration process. There is one downside: R has a steep learning curve. The value is so great, however, that you might as well start now.

Java or Groovy

RapidMiner Studio is built using Java and, given that the community edition is open source, it would be entirely possible to change the software to solve a particular problem and to give the changes back to the community so that everyone benefits. This book has deliberately not looked at Java, partly because I am not an expert but mostly because it would have put people off.

Having said that, this book does have some Groovy examples, and there will be times when only Groovy will do. So, I do encourage you to make Groovy one of your areas of knowledge.

Third-party components

Rapid-I, the company behind RapidMiner, operates a marketplace where third-party extensions are available. The marketplace is integrated with the RapidMiner Studio GUI, and it is worth visiting to see if there is anything there that could prove useful.

RapidMiner Server

Rapid-I also produces RapidMiner Server (formerly known as RapidAnalytics). This is a server-based solution that provides a repository location and an environment for remote execution of processes, and it integrates with the RapidMiner Studio GUI. This can be very useful if you have a powerful server available. RapidMiner Server allows a long-running process to be initiated on the server, where it should complete more quickly, and while you are waiting, you can work on something else.

There is a lot more to RapidMiner Server, such as scheduled process execution, custom web-based reports, and user management, but it is beyond the scope of this book to go into detail.
