Data sources

As seen earlier, this is where everything begins: the data. R is well known for its ability to handle many kinds of data coming from a wide variety of data sources. This follows from the flexibility we described in the first chapter: the R language can be extended in every direction by means of its packages. When starting a new data mining project, you should therefore ask yourself: what kind of data am I going to handle for this project?

Is the data already available on the web? Is it still stored on good old reliable paper? Does it consist of recorded sounds or images? In terms of the CRISP-DM methodology, answering these questions is part of our business understanding phase. Once this point is clear, we can browse one of the most useful pages on the R website, the CRAN Task View, at: cran.r-project.org/web/views.

On this page, you can find a list of pages, each of which relates to a specific task that can be performed with R. For instance, you will find pages for natural language processing, medical image analysis, and many other tasks.

Once you have found the page related to the kind of data you are going to bring into your data mining architecture, you can open it and discover the available packages developed to perform the given task. These pages are really useful because they are well maintained and their content is clearly organized.
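The same list can also be explored from within an R session. What follows is only a minimal sketch, assuming the ctv package is installed and a CRAN mirror is reachable; the helper name list_task_views and the mirror URL are illustrative choices, not something prescribed by this chapter.

```r
# Minimal sketch: list the names of the CRAN Task Views from R.
# Assumptions: the 'ctv' package is installed (install.packages("ctv"))
# and the mirror below is reachable. 'list_task_views' is a hypothetical helper.
list_task_views <- function(repos = "https://cloud.r-project.org") {
  views <- ctv::available.views(repos = repos)
  # Each element describes one task view; keep only its name
  vapply(views, function(v) v$name, character(1), USE.NAMES = FALSE)
}

# Usage (requires network access), left commented out on purpose:
# list_task_views()
# Install every package listed in a given view, for example:
# ctv::install.views("NaturalLanguageProcessing",
#                    repos = "https://cloud.r-project.org")
```

Installing a whole view can pull in a large number of packages, so it may be preferable to read the view first and install only the packages you actually need.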

What if no page exists for your specific kind of data?

You have at least three more options, ordered by increasing effort:

  • Look for task views that do not address exactly the data you will be facing, but something close to it, and could therefore be useful (at the price of a small amount of customization)
  • Look outside the CRAN Task View for packages that were recently developed or are still under development, and are therefore not yet included in CRAN or in the CRAN Task View
  • Develop the code required for data acquisition by yourself
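To give an idea of what the last option can look like, here is a minimal sketch in base R. It assumes a purely hypothetical source that delivers text lines of the form timestamp|sensor|value; both the format and the function name parse_sensor_lines are invented for illustration.

```r
# Hypothetical helper: turn raw lines such as "2017-01-01 10:00|s1|21.5"
# into a data frame. The record format is an assumption for illustration.
parse_sensor_lines <- function(lines) {
  lines  <- lines[nzchar(trimws(lines))]          # drop blank lines
  fields <- strsplit(lines, "|", fixed = TRUE)    # split on the literal pipe
  fields <- fields[lengths(fields) == 3]          # keep well-formed records only
  data.frame(
    timestamp = vapply(fields, `[`, character(1), 1),
    sensor    = vapply(fields, `[`, character(1), 2),
    value     = as.numeric(vapply(fields, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Example input; in practice the lines might come from readLines() on a file
raw <- c("2017-01-01 10:00|s1|21.5",
         "",
         "malformed line",
         "2017-01-01 10:05|s2|19.0")
readings <- parse_sensor_lines(raw)
print(readings)
```

Blank and malformed lines are silently dropped in this sketch; a real acquisition routine would more likely log or report them.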

Now that we have an idea of how to find a way to import our data into R, we can move on to the data warehouse step, that is, where to store the data once it has been acquired.
