Data mining is the process of extracting relevant information from the large volumes of data held in a database; relevance is always defined with respect to the problem statement of a particular project. Data Miner in SQL Developer 4.1 ships with enhanced features, and in this chapter we will discuss the new features of the Data Miner tool as well as the general enhancements in and around it. One of the most significant additions to the data mining capability responds to the growing popularity of JSON data and its use in Big Data configurations: Data Miner now provides an easy-to-use JSON Query node. In this chapter, we will start off with data source node preparation.
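The JSON Query node projects JSON documents into relational columns so that downstream workflow nodes can treat them like ordinary tables. Under the covers this corresponds to Oracle's SQL/JSON support; the following is a minimal sketch using the standard JSON_TABLE function (the table CUST_JSON, its column DOC, and the attribute names are hypothetical):

```sql
-- Hypothetical table CUST_JSON with a JSON document column DOC.
-- JSON_TABLE (available from Oracle Database 12c onward) flattens
-- each document into rows and columns that a workflow can consume.
SELECT jt.cust_id,
       jt.ltv_bin
FROM   cust_json c,
       JSON_TABLE(c.doc, '$'
         COLUMNS (
           cust_id VARCHAR2(30) PATH '$.CUSTOMER_ID',
           ltv_bin VARCHAR2(20) PATH '$.LTV_BIN'
         )) jt;
```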
As a first step, we invoke the Data Miner tool from within SQL Developer; the next screenshot illustrates the Data Miner architecture diagrammatically. Data Miner has been integrated into SQL Developer by default since version 3.0, and when invoked from Tools | Data Miner, it checks for the Data Miner repository.
As a prerequisite, Oracle Database Enterprise Edition is required; the Data Miner repository is held in a schema called ODMRSYS.
When invoked, SQL Developer prompts for the installation of the Data Miner repository, if the repository is not already built. In my case, my database did not have a repository and I was prompted to create one, as shown in the following screenshot:
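Whether the repository already exists can also be checked directly from a SQL worksheet, assuming the ODMRSYS account name used by recent Data Miner releases; if the query returns no rows, SQL Developer will offer to create the repository:

```sql
-- Check for the Data Miner repository account (assumes the
-- ODMRSYS schema name; requires a user with access to DBA_USERS).
SELECT username, account_status
FROM   dba_users
WHERE  username = 'ODMRSYS';
```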
A Data Source node supplies the data for a data mining project and specifies the build data for a model. We will use the examples that were installed along with the Data Miner repository. The following is the sequence we will follow to add the Data Source node:
We can have the Workflow Jobs tab also open while we start creating our first project. To do this, go to View | Data Miner | Workflow Jobs.
Before you begin working on a Data Miner workflow, you need to create a Data Miner project, which serves as a container for one or more workflows. We created the data mining user, DMUSER, during the installation. In the Data Miner tab, right-click on the data mining user connection that you previously created and select New Project, as shown in the following screenshot:
A Data Miner workflow is a collection of connected nodes that describe data mining processes and provide directions for the Data Mining server. The workflow actually emulates all phases of a process designed to solve a particular business problem.
The workflow enables us to interactively build, analyze, and test a data mining process within a graphical environment, as shown in the following screenshot:
Immediately after creating a new workflow, we see a blank workflow screen, ready for us to build the workflow. The workflow is built graphically by dragging nodes from the Components pane and dropping them into the workflow area.
The components palette shows all the available types of nodes that we can use to build our workflow, but we will only use a couple of nodes as examples in this chapter.
For reference only, the following table lists all the available node types.
Remember that we installed the sample data along with the Data Miner repository. For the rest of the chapter, we will use these sample tables to demonstrate the Data Miner concepts. As shown in the following screenshot, we will be using the table called INSUR_CUST_LTV_SAMPLE, owned by DMUSER, to mine the data and exhibit the analytics capability of Data Miner:
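Before building the workflow, it can be reassuring to peek at the sample table directly from a SQL worksheet. A quick sketch, run as the data mining user:

```sql
-- A quick look at the sample table used throughout this chapter.
SELECT COUNT(*) AS total_rows
FROM   insur_cust_ltv_sample;

-- First few rows, just to see the shape of the data.
SELECT *
FROM   insur_cust_ltv_sample
WHERE  ROWNUM <= 5;
```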
In the Define Data Source dialog box, select the said table, click on Next, and then click on Finish to create the data source. Explore Data is another node that we will add; it helps us validate the data source. To link the two, right-click on the Data Source node, select Connect, and drag the arrow up to the Explore Data node.
Once all the nodes are placed in the required order, the next step would be to link the nodes in a meaningful and correct way. In the following example, you can see how the Data Source node is connected to the Explore Data node by right-clicking and selecting the Connect option.
After connecting the nodes in a meaningful fashion, we can run a node, which submits a workflow job for it. Right-click on the Explore Data node and select the Run option to submit the workflow job. The status of the related workflow job is displayed at runtime in the Workflow Jobs pane.
Once the nodes are defined, we are ready to run the nodes, which in turn submits the workflow job. The workflow job pane displays the submitted job and its status. A completed job is shown with a green tick (√) under the status column. We are now ready to generate the statistics report for the Explore Data node.
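Data Miner executes workflow nodes as database jobs, so a completed run should also be visible in the scheduler's run history. The view below is standard Oracle; exactly how your release names Data Miner jobs is an assumption, so inspect the JOB_NAME values returned:

```sql
-- Recent job runs for the current user; Data Miner workflow jobs
-- appear here because they are executed via DBMS_SCHEDULER.
SELECT job_name, status, actual_start_date
FROM   user_scheduler_job_run_details
ORDER  BY actual_start_date DESC;
```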
By clicking on View Data, Data Miner displays statistics about each attribute in the dataset, including a histogram, the number of distinct values, mode, average, minimum and maximum values, standard deviation, variance, skewness, and kurtosis.
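Most of these per-attribute statistics can be reproduced with ordinary SQL aggregates, which makes for a useful sanity check against the Explore Data output. A sketch for a single numeric column (the column name AGE is an assumption about the sample table; skewness and kurtosis are omitted because dedicated functions for them only appear in later database releases):

```sql
-- Summary statistics for one attribute of the sample table,
-- mirroring much of what the Explore Data node computes.
SELECT COUNT(*)            AS n,
       COUNT(DISTINCT age) AS distinct_values,
       STATS_MODE(age)     AS mode_value,
       AVG(age)            AS average,
       MIN(age)            AS min_value,
       MAX(age)            AS max_value,
       STDDEV(age)         AS std_deviation,
       VARIANCE(age)       AS variance
FROM   insur_cust_ltv_sample;
```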
The display enables you to visualize and validate the data, and also to manually inspect the data for patterns or structure.